Re: [libxml++] HTML Parser Subclass (based on domparser.cc)



Hi Laurent,

This looks interesting; I'll have a closer look at it a bit later.
However, the integration into libxml++ will not happen in 2.6, since we froze the API on Monday. We'll soon discuss starting a new unstable branch, where the HtmlParser should probably find its place.

Thanks,

Christophe


Laurent Hoss wrote:

Hi all,

I discovered the cool libxml++ yesterday on my quest for the best C++ XML parser (bindings, since libxml2 seems to be the best C parser anyway ;). libxml is not new to me though: I used it extensively in Perl thanks to the very complete XML::LibXML CPAN module. Now one of my main motivations is to parse HTML files into a DOM tree from which I can extract nodes with XPath.
In Perl that was easy, since the module has the HTML parser included.
Therefore, after a thorough search of the API, I was a bit disappointed that there was no HTML parser support in libxml++... but thanks to the clean APIs of libxml(++), and after a little reading, I had no difficulty at all building my own subclass (based on domparser.cc), apart from some little quirks (like the extra encoding parameter in some HTML parser functions) :)

In fact libxml2 has a really tolerant HTML parser (I used it in Perl for mirroring/parsing whole dynamic websites :D ). It even returns a usable document when it has hit parser errors, but to get a document back in such a case one has to turn off the well-formedness check, which I did in my temporary htmlparser implementation. (Unfortunately there is always a segfault at the end of a run of my edited 'dom_xpath/main.cc' HTML parsing example app when '!context_->wellFormed' is ignored. The experimenting was done in the 'HtmlParser::parse_context' method.)

I hope HTML parsing can be included in the main distribution (maybe better with the wellFormed check on)... To compile the whole library with my htmlparser class, I added the class to all the files (Makefile.am files, libxml++.h, ...) that mention 'domparser'.

Included are the C++ source and header files of the htmlparser class (or should I have sent diffs against the domparser.cc/h originals?) plus my HTML parsing example, which shows all the //a[@href] links with their attribute contents.

Hopefully the segfault can be easily solved with the knowledge of the lead developers (which I don't have yet ;). I guess it's just something I'm missing; otherwise I'll try to find the memory leak using a debugger (or is there a better way?)

Thanks,
Laurent





