Re: [libxml++] HTML Parser Subclass (based on domparser.cc)

From: Christophe de VIENNE <cdevienne alphacent com>
To: libxmlplusplus-general lists sourceforge net, laurenth gmx net
Subject: Re: [libxml++] HTML Parser Subclass (based on domparser.cc)
Date: Wed, 03 Mar 2004 11:49:09 +0100

Hi Laurent,

This looks interesting; I'll take a closer look at it a bit later.

However, the integration into libxml++ will not happen in 2.6, since we froze the API on Monday. We'll soon discuss starting a new unstable branch, in which the HtmlParser should probably have its place.


Thanks,

Christophe


Laurent Hoss wrote:

Hi all,
I discovered the cool libxml++ yesterday on my quest for the best C++ XML parser (bindings, because libxml2 seems to be the best C parser anyway ;). Libxml is not new to me, though: I used it extensively in Perl thanks to the very complete XML::LibXML CPAN module. Now one of my main motivations is to parse HTML files into a DOM tree where I can extract nodes with XPath.
In Perl that was easy, since it has the HTML parser included.
Therefore, after a thorough search of the API, I was a bit disappointed that there was no HTML parser support in libxml++... but thanks to the clean APIs of libxml(++) and after a little reading, I had no difficulty at all building my own subclass (based on domparser.cc), except for some little quirks (like the extra encoding parameter in some HTML parser functions) :)
In fact libxml2 has a really tolerant HTML parser (I used it in Perl for mirroring/parsing whole dynamic websites :D ); it even returns a good XML document when it has hit parser errors. But to get a document returned in such a case, one has to turn off the 'wellformedness' check, which I did in my temporary htmlparser implementation. (Unfortunately there's always a segfault at the end of a run of my edited 'dom_xpath/main.cc' HTML parsing example app when ignoring '!context_->wellFormed'?! The experimenting was done in the 'HtmlParser::parse_context' method.)
I hope HTML parsing can be included in the main distribution (maybe better with the wellFormed check on)... To compile the whole library with my htmlparser class, I added the class to all the files (Makefile.am files, libxml++.h...) containing 'domparser'.
Included are the C++ and include files of the htmlparser class (or should I have taken diffs from the domparser.cc/h originals?) plus my HTML parsing example, which shows all the //a[@href] links with their attribute contents.
Hopefully the segfault can be easily solved with the knowledge of the lead developers (which I don't have yet ;). I guess it's just something I'm missing; otherwise I'll try to find the memory leak using a debugger (or is there a better way??)
Thanks,
Laurent

References:
- [libxml++] HTML Parser Subclass (based on domparser.cc)
  - From: Laurent Hoss
