Re: [xml] xmllint and HTML

From: Nick Kew <nick webthing com>
To: Daniel Veillard <veillard redhat com>
Cc: Alberto Manuel Brandão Simões <albie alfarrabio di uminho pt>, <xml gnome org>
Subject: Re: [xml] xmllint and HTML
Date: Fri, 31 Oct 2003 01:00:41 +0000 (GMT)

On Thu, 30 Oct 2003, Daniel Veillard wrote:

On Thu, Oct 30, 2003 at 02:59:42PM +0000, Alberto Manuel Brandão Simões wrote:

Hi!

Although I may seem to be complaining, please do not take it that way.
Maybe it is my bad English :)

1. I used to use xmllint --html to process HTML, or generic HTML
files with some special tags of my own. Now xmllint complains (OK, I can
ignore the complaints :D but it would be nice to have a generic
pseudo-HTML processor).


  That's a BAD idea !!!  Suppose the libxml2 HTML parser sees

  <p>
  <foo>

Does <foo> close <p>? Is <foo> itself expected to be closed?


That would be defined by the DTD.  Or it could be built into the
processor, as HTML4 is in libxml's htmlParser.

It's entirely possible to use <foo/> with htmlParser and get meaningful
behaviour.  The parser will, by default, generate the SAX events as if
it were XML.
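
As a rough illustration of that point (a heuristic parser reporting a
nonstandard <foo/> as ordinary start and end events), here is a minimal
sketch. It uses Python's standard-library html.parser as a stand-in, not
libxml2's htmlParser, so it shows the general idea rather than libxml2's
actual behaviour. EventLogger and tag_events are names made up for this
example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start/end tag events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Parse markup and return the list of (kind, tag) events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# <foo/> is unknown to HTML, yet it is reported as a start event
# immediately followed by an end event, much as an XML parser would.
events = tag_events("<p>text<foo/></p>")
```

Here the unknown element needs no special handling: the default
handling of an XML-style empty tag simply calls the start and end
handlers in turn.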

If you want to extend the base syntax, *use XML*! It was designed
precisely to overcome the limitations that an SGML HTML parser has.


On the contrary, an SGML parser has fewer limitations, due to the
far greater flexibility and expressiveness of the language.  For
serious HTML work, and especially for nonstandard DTDs, the answer
is to use a true SGML parser.  OpenSP is the obvious recommendation,
though if it's just HTML with a few nonstandard elements then a
heuristic parser (libxml2/htmlParser or Tidy) may serve.

So "switch to XHTML" is the only acceptable answer at this point;
any other design decision is just broken, sorry!


Only if the design constrains you to use an XML-based processor.
Which is fair enough in the context of this list, but not so much
in the wider world.

2. The second problem is that an empty tag (again with xmllint --html)
produces a complaint, for example:
 _.xml:2: error: Unexpected end tag : hr
 <hr></hr>


Is that in your markup, or is xmllint inserting a spurious </hr>?
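
If the </hr> is in the source markup, one plausible reading is that
the complaint reflects HTML 4.01 declaring hr as EMPTY, which forbids an
end tag. The sketch below checks for that case using Python's
standard-library html.parser. It is not how xmllint or libxml2 detect
the problem, and EmptyEndTagChecker and forbidden_end_tags are
hypothetical names for this example.

```python
from html.parser import HTMLParser

# Elements declared EMPTY in the HTML 4.01 DTD (end tag forbidden).
HTML4_EMPTY = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


class EmptyEndTagChecker(HTMLParser):
    """Collect explicit end tags written for HTML 4.01 EMPTY elements."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_startendtag(self, tag, attrs):
        # An XML-style <hr/> has no separate end tag, so do not flag it.
        pass

    def handle_endtag(self, tag):
        if tag in HTML4_EMPTY:
            self.problems.append(tag)


def forbidden_end_tags(markup):
    """Return the names of EMPTY elements that were given an end tag."""
    checker = EmptyEndTagChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


problems = forbidden_end_tags("<html>\n<hr></hr>\n</html>")
```

A plain <hr> or an XML-style <hr/> is not flagged. Only an explicit
</hr> is.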

-- 
Nick Kew
