Re: [libxml++] HTML Parser Subclass (based on domparser.cc)
- From: Christophe de VIENNE <cdevienne alphacent com>
- To: libxmlplusplus-general lists sourceforge net, laurenth gmx net
- Subject: Re: [libxml++] HTML Parser Subclass (based on domparser.cc)
- Date: Wed, 03 Mar 2004 11:49:09 +0100
This looks interesting, I'll take a closer look at it a bit later.
However, the integration into libxml++ will not occur in 2.6, since we
froze the API on Monday.
We'll soon discuss the start of a new unstable branch, in which the
HtmlParser should probably have its place.
Laurent Hoss wrote:
I discovered the cool libxml++ yesterday on my quest for the best C++
XML parser (bindings, since libxml2 seems to be the best C parser anyway
;). Libxml is not new to me though; I used it extensively in Perl
thanks to the very complete XML::LibXML CPAN module.
Now one of my main motivations is to parse HTML files into a DOM tree
where I can extract nodes with XPath.
In Perl that was easy: it has the HTML parser included.
Therefore, after a thorough search of the API, I was a bit disappointed
that there was no HTML parser support in libxml++...
But thanks to the clean APIs of libxml(++), and after a little
reading, I had no difficulty at all building my own subclass (based
on domparser.cc), apart from some little quirks (like the extra encoding
parameter in some HTML parser functions) :)
In fact libxml2 has a really tolerant HTML parser (I used it in Perl
for mirroring/parsing whole dynamic websites :D ); it even returns a
good XML document when it has hit parser errors, but to get a document
returned in such a case one has to turn off the 'wellformedness' check,
which I did in my temporary htmlparser implementation.
( Unfortunately, there's always a segfault at the end of a run of my
edited 'dom_xpath/main.cc' HTML parsing example app when ignoring
'!context_->wellFormed' ?! The experimenting was done in the
'HtmlParser::parse_context' method. )
I hope HTML parsing can be included in the main distribution (maybe
better with the wellFormed check on)...
To compile the whole library with my htmlparser class, I added the
class in all the files (Makefile.am files, libxml++.h...) containing
Included are the C++ and include files of the htmlparser class (or should
I have taken diffs from the domparser.cc/h originals?) plus my HTML
parsing example, which shows all the //a[@href] links with their
Hopefully the segfault can be easily solved with the knowledge of the
lead developers (which I don't have yet ;).
I guess it's just something I'm missing; otherwise I'll try to find the
memory leak using a debugger (or is there a better way?)