Re: [xml] htmlParseFile vs htmlParseDoc



Yes, here is a little program showing the problem. 
Test it with the data at:

http://vivisimo.com/~pesenti/pubmed.xml

Jerome

###################################################


#include <stdlib.h>
#include <stdio.h>
#include <string.h>   /* strlen */
#include <time.h>     /* time */

#include "libxml/parserInternals.h"
#include "libxml/xmlmemory.h"
#include "libxml/debugXML.h"
#include "libxml/HTMLtree.h"
#include "libxml/xmlIO.h"
#include "libxml/DOCBparser.h"
#include "libxml/xinclude.h"
#include "libxml/xmlerror.h"
#include "libxml/catalog.h"
#include "libxml/tree.h"
#include "libxml/HTMLparser.h"

#define SIZE (10*1024*1024)

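/* Read the whole file into a NUL-terminated heap buffer (at most SIZE-1 bytes). */
char *
read_all(const char *name)
{
    FILE *f = fopen(name, "rb");
    char *s;
    size_t n;

    if (! f) {
        perror(name);
        exit(1);
    }

    s = malloc(SIZE);
    if (! s) {
        perror("malloc");
        exit(1);
    }

    /* fread returns a size_t, so errors are detected with ferror(), not n < 0 */
    n = fread(s, 1, SIZE, f);
    if (ferror(f) || n == SIZE) {
        /* error or file too large */
        fprintf(stderr, "Could not load file, got %lu bytes\n",
                (unsigned long) n);
        exit(1);
    }
    fclose(f);

    s[n] = '\0';
    return s;
}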



int
main(int argc, char **argv)
{
    const char *str;
    htmlParserCtxtPtr ctxt;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    str = read_all(argv[1]);

    /* time() only has one-second resolution, which is enough on a big file */
    printf("%ld - parsing with htmlParseDocument\n", (long) time(NULL));
    ctxt = htmlCreateMemoryParserCtxt(str, (int) strlen(str));
    htmlParseDocument(ctxt);
    printf("%ld - parsing with htmlParseDoc\n", (long) time(NULL));
    htmlParseDoc((xmlChar *) str, NULL);
    printf("%ld - parsing with htmlParseFile\n", (long) time(NULL));
    htmlParseFile(argv[1], NULL);
    printf("%ld - done\n", (long) time(NULL));

    return 0;
}






--- Daniel Veillard <veillard redhat com> wrote:
> On Wed, Oct 15, 2003 at 03:35:27PM -0700, Jerome Pesenti wrote:
> > There seems to be a big performance hit (x4 to x10)
> > when using htmlParseDoc instead of htmlParseFile on
> > big files.
> >
> > Interestingly, in 2.5.8 there was the same kind of
> > problem when parsing XML, but it seems to be fixed in
> > 2.5.11.
> >
> > Any idea what's going on?
>
>   Hum, I think it was about some buffer copies as the
> input was consumed when the buffer was initialized with a very
> large input. Can you still see this with the CVS version?
>
> Daniel
>
> -- 
> Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
> veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
> http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
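
To illustrate the kind of cost Daniel describes above, here is a toy model of
a consumer that shifts the unread part of a buffer to the front after each
piece it has used. This is only a sketch under my own assumptions: the
function name consume_by_shifting and the chunk parameter are made up for the
example, and it is not taken from libxml2's input code, which I have not
checked.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy model only, NOT libxml2's actual input handling.
 * Consume a buffer of `len` bytes in pieces of `chunk` bytes, shifting the
 * unread remainder to the front of the buffer after each piece.
 * Returns the total number of bytes moved by memmove().
 */
size_t
consume_by_shifting(char *buf, size_t len, size_t chunk)
{
    size_t moved = 0;

    if (chunk == 0)
        return 0;               /* nothing can be consumed, avoid looping */

    while (len > 0) {
        size_t used = (chunk < len) ? chunk : len;

        /* ... a parser would process buf[0..used) here ... */

        /* shift the unread tail down to the start of the buffer */
        memmove(buf, buf + used, len - used);
        moved += len - used;
        len -= used;
    }
    return moved;
}
```

With len = 1000 and chunk = 100 this moves 900 + 800 + ... + 100 = 4500
bytes in total, so the copying grows roughly as len * len / (2 * chunk). If
something like this happens when the whole document is handed over as one
large memory buffer, while a file is read in small blocks, it could account
for a gap of the size I am seeing, but that is a guess.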




