Re: [xml] HTMLparser: whitespace in <body> tags

On Sun, Sep 26, 2004 at 12:17:12PM +1000, Malcolm Tredinnick wrote:
One constant on this list is that whenever I try to claim anything based
on the specs, I inevitably screw up and Daniel corrects me. It's good
for my humility. :-)

  Well ... I think I understand XML, but I know I don't understand 
HTML in its glory details (sic.)

So don't read too much into the above, but I have tried to give you some
references that might be useful. I'm not really sure what the solution
here is, since my understanding is that true HTML (non-XHTML) parsing is
kind of a value-add in libxml and not as fully implemented as XML
parsing. To get this completely correct, I think you need to teach
libxml to detect the DTD you are using and adapt appropriately.
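
As a rough illustration of the "detect the DTD and adapt" idea, here is a
minimal sketch in plain C. It assumes the public identifier of the DOCTYPE is
already available as a string (in libxml2 it would presumably be read from the
document's internal subset, e.g. via xmlGetIntSubset(), but that wiring is not
shown). The names html_dtd_flavor, classify_html_dtd and
body_allows_character_data are hypothetical and not part of libxml2. The rule
it encodes is the HTML 4 content model: the Strict DTD restricts <body> to
block-level content, while the Transitional and Frameset DTDs also allow
inline content and text there.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of HTML 4.x DTDs; not a libxml2 type. */
typedef enum {
    HTML_DTD_UNKNOWN = 0,
    HTML_DTD_STRICT,
    HTML_DTD_TRANSITIONAL,
    HTML_DTD_FRAMESET
} html_dtd_flavor;

/*
 * Map a DOCTYPE public identifier such as
 * "-//W3C//DTD HTML 4.01 Transitional//EN" to a flavor.
 * Anything that is not an HTML 4.x identifier is reported as unknown.
 */
html_dtd_flavor classify_html_dtd(const char *public_id)
{
    if (public_id == NULL)
        return HTML_DTD_UNKNOWN;
    if (strstr(public_id, "//DTD HTML 4") == NULL)
        return HTML_DTD_UNKNOWN;
    if (strstr(public_id, "Transitional") != NULL)
        return HTML_DTD_TRANSITIONAL;
    if (strstr(public_id, "Frameset") != NULL)
        return HTML_DTD_FRAMESET;
    /* HTML 4.x identifier with no qualifier: the Strict DTD. */
    return HTML_DTD_STRICT;
}

/*
 * Return 1 if text directly inside <body> is allowed by the DTD,
 * 0 if <body> is limited to block-level content (Strict).
 * Unknown DTDs are treated permissively, which is an assumption
 * of this sketch rather than anything the specs mandate.
 */
int body_allows_character_data(html_dtd_flavor flavor)
{
    assert(flavor >= HTML_DTD_UNKNOWN && flavor <= HTML_DTD_FRAMESET);
    return flavor != HTML_DTD_STRICT;
}
```

A parser could consult body_allows_character_data() when it meets whitespace
directly under <body> and decide whether to keep it as a text node or drop it.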

  There is a huge set of HTML-related data in HTMLparser.c. It is mostly
external contributions, and I am very reluctant to touch it since I know
1/ it's very easy to break things in nasty ways ... the nastiness is
in the bug reports I get back when this occurs 2/ it's even harder to try
to fix them without breaking someone else's expected pattern.
  HTML parsing sucks, it's a fact, and libxml2's own way of doing things
is not gonna improve on that baseline, unfortunately.


Daniel Veillard      | Red Hat Desktop team
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
