Re: [xml] htmlParser questions

From: Daniel Veillard <veillard redhat com>
To: Liron <magilam netvision net il>
Cc: xml gnome org
Subject: Re: [xml] htmlParser questions
Date: Tue, 24 Jan 2006 07:32:05 -0500

On Mon, Jan 23, 2006 at 04:03:19PM +0100, Liron wrote:

1) Right now I'm simply using htmlParseDoc with "encoding=NULL" to build the tree I need for the XSL 
engine. This function gives me a well-formed tree, but not a valid one. I wanted to know if there's an 
option to make the htmlParser build a valid document as well.


  Valid in which sense? SGML DTD validity is way too complex.
You could use the XML serialization to get XML well-formedness of the output.
But IMHO, since HTML is just the input, the validity concern should be
on the XSLT result, and validity there can be ensured by the stylesheet design.

2) Is there any way to speed up the work of htmlParser? I'm not using any options and only calling 
htmlParseDoc. The thing that worries me is that I've also tested a separate library called HtmlAgilityPack, 
which is managed code, and it processes an HTML file faster than libxml's HTML parser AND outputs a 
well-formed+valid tree. From my tests libxml has amazing performance on XML and XSL files, so I don't 
understand how managed and marshalled code can work better and faster. I must be doing something wrong. 
Maybe the htmlParser is not intended for valid trees, which is also fine by me, but I'd like it at least to be 
faster.


  The HTML parser should not be that much slower than the XML parser, maybe there
is a problem introduced recently in the code. But it's the first time I 
hear a complaint about the HTML parser speed, strange. Maybe a bit of profiling
could help in understanding what's happening.

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

