Hello all,
First of all, I'd like to say that I'm new to this list. I started
working with libxml2 three days ago and so far I'm very happy with it.
I'm using it under Windows with Borland's C++ Builder.
My main use is the HTML parser. I'm processing thousands of HTML pages
and running them through libxslt to get my desired output. I have a few
problems that I hope you can help me with:
1) Right now I'm simply calling htmlParseDoc with encoding=NULL to build
the tree I need for the XSLT engine. This gives me a well-formed tree, but
not a valid one. Is there an option that makes the HTML parser build a
valid document as well?
2) Is there any way to speed up the HTML parser? I'm not using any
options, only calling htmlParseDoc. What worries me is that I've also
tested a separate library called HtmlAgilityPack, which is managed code,
and it processes an HTML file faster than libxml2's HTML parser AND
outputs a tree that is both well-formed and valid. In my tests libxml2
performs amazingly well on XML and XSL files, so I don't understand how
managed, marshalled code can be faster here. I must be doing something
wrong. Maybe the HTML parser is not intended to produce valid trees, which
is fine by me, but I'd like it at least to be faster.
I really hope to get some answers. I fell in love with this tool and want
to use it, but performance is my main concern and I'd hate to have to
switch to an alternative.
Thank you very much
Liron