Re: [xml] xmllint --html --xmlout

On Mon, Feb 12, 2007 at 07:42:13AM -0500, Elliotte Harold wrote:
How robust is

xmllint --html --xmlout ?

Is it possible to confuse it so badly it won't continue or will generate 
ill-formed markup? Or will it keep on trucking no matter what?

  The HTML parser will generate an in-memory tree, no matter what.
The tree may be bizarre from an XML perspective as a result. The
XML serializer doesn't try to detect error conditions, though in the
past we have fixed some cases where the two options together generated
non-well-formed XML.
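
  One way to check this on your own input is to round-trip it: run the
HTML through --html --xmlout, then feed the result back to the XML parser
with --noout. The following is only a minimal sketch. The helper name
html_to_xml_check and the sample markup are made up for illustration, and
it assumes xmllint is on the PATH (it reports "skipped" otherwise).

```shell
#!/bin/sh
# Hypothetical helper: parse HTML, serialize as XML, then re-parse the
# result with the XML parser to see whether it is well-formed.
html_to_xml_check() {
    # $1: HTML text to test
    if ! command -v xmllint >/dev/null 2>&1; then
        echo "skipped: xmllint not installed"
        return 0
    fi
    # HTML parser messages go to stderr; drop them so only the
    # serialized tree is captured.
    xml=$(printf '%s' "$1" | xmllint --html --xmlout - 2>/dev/null) || true
    # --noout parses without printing; the exit status tells us whether
    # the XML parser accepted the serialized document.
    if printf '%s' "$xml" | xmllint --noout - 2>/dev/null; then
        echo "well-formed"
    else
        echo "NOT well-formed"
    fi
}

# Deliberately messy sample: unclosed tags plus an unknown element.
result=$(html_to_xml_check '<p>unclosed <b>bold <bogon>text</p><br>')
echo "$result"
```

A "NOT well-formed" result on real input would be worth reporting to the
list, since those are the cases that have been fixed when found.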

How does the HTML parser handle bogons (unrecognized elements)? Are they 
treated as empty or dropped or something else?

  The HTML parser will try to preserve as much data as possible in the
case of errors.
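
  To see what that means for a particular unrecognized element, it is
easiest to look at the serialized tree and the parser messages side by
side. This is a rough sketch, not a statement of what every libxml2
version prints: the helper name inspect_bogon and the <bogon> sample are
invented for illustration, and it assumes xmllint is installed.

```shell
#!/bin/sh
# Hypothetical helper: show the serialized tree (stdout) and the HTML
# parser's messages (stderr) separately for one piece of input.
inspect_bogon() {
    # $1: HTML text containing an element the parser does not know
    if ! command -v xmllint >/dev/null 2>&1; then
        echo "skipped: xmllint not installed"
        return 0
    fi
    errfile=$(mktemp)
    # Keep stdout and stderr apart so the tree and the messages
    # can be inspected independently.
    out=$(printf '%s' "$1" | xmllint --html --xmlout - 2>"$errfile") || true
    echo "--- serialized tree (stdout) ---"
    printf '%s\n' "$out"
    echo "--- parser messages (stderr) ---"
    cat "$errfile"
    rm -f "$errfile"
}

report=$(inspect_bogon '<p>before <bogon>inside</bogon> after</p>')
printf '%s\n' "$report"
```

Whether and how the messages section is filled in depends on the libxml2
version; the point is only that the data ends up in the tree and any
complaints are reported separately.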

How good an alternative is this for TagSoup and Tidy?

  I would have to understand TagSoup and Tidy internals to answer this,
so I can't. The point is that the libxml2 HTML parser won't really try to
'fix' the input; it will raise error messages when facing things it doesn't
understand. Most of the policies about how to correct problems are IMHO
dependent on the use case, and there is the tree API to fix things according
to your needs.

I'm working on a book about converting messy old HTML to clean XHTML, 
and I'm trying to decide exactly how much of each tool to recommend when.

  The libxml2 HTML parser has been used in many real-world tools, like HTML
indexers. It will consume almost anything, but it doesn't try to add many
correction recipes on top of that. This was discussed on the list a couple
of years ago, and that's where the libxml2 HTML parsing error-handling
principles were set up.


Red Hat Virtualization group
Daniel Veillard      | virtualization library
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
