[xml] Libxml2 HTML parsing

From: "Stan Santiago" <SSantiago adinfocenter com>
To: <xml gnome org>
Subject: [xml] Libxml2 HTML parsing
Date: Thu, 11 Nov 2010 16:36:44 -0500

Greetings.

I just started using the Libxml2 library for HTML parsing. One of the requirements is to parse multiple HTML fragments separately and
combine them into a single HTML document at the end. However, <html> and <body> tags get added to each fragment that is processed.

I was looking at the thread at http://mail.gnome.org/archives/xml/2010-January/msg00112.html and it seems to describe exactly the issue I have. I thought adding the
HTML_PARSE_NOIMPLIED option would resolve it, but that doesn't seem to work. In fact, the htmlCtxtUseOptions(...) function doesn't
recognize the HTML_PARSE_NOIMPLIED option.

Here is part of the source code I've written. I'm using libxml2 2.7.8, the latest version. The following code is executed for
each HTML fragment that is processed.

...
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);
int i = htmlCtxtUseOptions(parser, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NOIMPLIED);
printf("HTML CTXT %d\n", i); // prints 8192, the value of HTML_PARSE_NOIMPLIED, i.e. the option that was not recognized
htmlParseChunk(parser, htmlFragment, strlen(htmlFragment), 0);
...
htmlNodeDump(buffer, doc, xmlDocGetRootElement(doc)); // Adds <html> and <body> tags for each fragment!
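
For reference, htmlCtxtUseOptions() returns 0 on success and otherwise the set of options it did not recognize, so the 8192 printed above can be read as a bitmask. The following is only a minimal sketch of that reading. The OPT_* names and both helper functions are hypothetical; they mirror the bit values of the corresponding HTML_PARSE_* flags so that the snippet compiles without the libxml2 headers.

```c
/* Hypothetical stand-ins for libxml2's HTML_PARSE_* flags (same bit values),
   redefined here so the sketch does not depend on the libxml2 headers. */
enum {
    OPT_RECOVER   = 1 << 0,   /* mirrors HTML_PARSE_RECOVER   */
    OPT_NOERROR   = 1 << 5,   /* mirrors HTML_PARSE_NOERROR   */
    OPT_NOWARNING = 1 << 6,   /* mirrors HTML_PARSE_NOWARNING */
    OPT_NOIMPLIED = 1 << 13   /* mirrors HTML_PARSE_NOIMPLIED (8192) */
};

/* Hypothetical helper: returns 1 if `flag` is among the bits that the
   options call handed back, meaning the option was rejected. */
static int option_rejected(int returned, int flag)
{
    return (returned & flag) != 0;
}

/* Hypothetical helper: returns the subset of `requested` that was accepted,
   i.e. the requested bits minus the rejected ones. */
static int options_accepted(int requested, int returned)
{
    return requested & ~returned;
}
```

With the value from the snippet above, option_rejected(8192, OPT_NOIMPLIED) is 1 while the same check for the other three flags is 0, which matches the observation that only HTML_PARSE_NOIMPLIED is not being applied.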

Any pointers or suggestions on how to work around this issue?

Thanks!
Stan
