Re: [xml] Apparently incorrect paragraph wrapping in HTML parser
- From: Bruce Miller <bruce miller nist gov>
- To: iSteve <isteve deadcd org>
- Cc: xml gnome org
- Subject: Re: [xml] Apparently incorrect paragraph wrapping in HTML parser
- Date: Thu, 12 Jan 2006 11:20:33 -0500
> Anyway, I do not see any reason why parser should mess with the document
> in first place; it's supposed to parse it, not alter it deliberately
> according to what it thinks that may be the right solution. Could
> someone please explain me why to alter the document?

Well, since I'm mostly free-loading here, I'll offer an explanation
of the _general_ question of "why [a] parser should mess with...":
The way I look at it, the thing you're calling the "document" is
actually a serialization of an abstract Document. From that POV,
the parser isn't _changing_ the document, it's _deducing_ it from
the serialization.
Now, the main point of XML was to improve the correspondence
between the serialization and the document. But SGML
(of which HTML is an application) has a notion of implicitly opened and closed elements,
to make the serializations easier to type. This means things
like: If we find plain text where it shouldn't be, but a <p> can
be, then we'll open a <p> first. And so on. In such cases,
my point of view is that the document includes the <p>, even though
the serialization doesn't. Furthermore, this notion by itself
doesn't guarantee that the document is unique!
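
To make the implied-<p> rule concrete, here is a toy sketch in Python. It is not libxml2's code and not a real SGML parser; the names `BLOCK` and `deduce_children` and the (kind, value) token format are made up for illustration, and the element set is a small assumed subset of the DTD's block elements. It only shows the idea: text that is not allowed directly in a container, where a <p> is allowed, ends up inside a <p> in the deduced tree.

```python
# Toy subset of block-level element names (assumption, for illustration only).
BLOCK = {"p", "div", "ul", "table", "hr", "blockquote"}

def deduce_children(tokens):
    """Deduce the children of a block-only container from (kind, value) tokens.

    kind is "text" or "element"; elements are assumed block-level and empty.
    """
    children = []
    implied_p = None                      # the implied <p> currently open, if any
    for kind, value in tokens:
        if kind == "text":
            if not value.strip():
                continue                  # whitespace alone never opens a <p>
            if implied_p is None:
                implied_p = ("p", [])     # text is not allowed here, but <p> is
                children.append(implied_p)
            implied_p[1].append(value)
        elif value in BLOCK:
            implied_p = None              # a block element closes the implied <p>
            children.append((value, []))
        else:
            raise ValueError(f"unsupported element in this sketch: {value}")
    return children

serialization = [("text", "hello "), ("element", "hr"), ("text", "world")]
print(deduce_children(serialization))
# prints [('p', ['hello ']), ('hr', []), ('p', ['world'])]
```

The serialization never mentions a <p>, yet the deduced children contain two of them. Whether a given parser should make this deduction for a given DTD is a separate question.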
Of course, given this liberty, and the use of HTML by millions of
people who didn't understand this concept, we've ended up with
what is fondly (not!) referred to as Tag Soup. And HTML parsers
in browsers go to great lengths to guess what the author might have meant.
To come back to your specific point about plain text in the body.
Frankly, I'm surprised that the HTML DTD is defined to allow it,
and it would appear that the <p> needn't be added, and perhaps shouldn't.
I don't know what Daniel had in mind in creating the <p>; usually
he has good reasons. Personally, I avoid parsing non-xml html like

> And please, do not say "to be compliant with standards", because
> standards to my best knowledge do not require the parser to "fix" the
> document (though I may be wrong, I doubt standards would require such a
> thing) by adding tags in case it's not considered correct.

With XML, I think parsers are, if anything, required NOT to fix a document.
With HTML, is it "fixing" or "deducing" the document?
Maybe I haven't told you anything you don't already know, and this
message comes across as condescending, but my point is that there
are several things that might be called "The Document":
* the text in the file
* a tree structure that breaks up the text in the file on tags
* a tree structure that breaks up the text and also corresponds to the DTD
Maybe you're content with the 2nd, rather than the 3rd?
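
As a rough illustration of the 2nd notion, the sketch below builds a tree purely from the tags as written, using `html.parser` from Python's standard library. The class `TagTree` and helper `tag_tree` are hypothetical names for this example. It has no DTD knowledge, so it never adds an element the text does not spell out; it also does not special-case empty elements such as <img>, so the sample input avoids them.

```python
from html.parser import HTMLParser

class TagTree(HTMLParser):
    """Build a tree from the tags exactly as written; no DTD knowledge."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]          # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data)

def tag_tree(text):
    parser = TagTree()
    parser.feed(text)
    parser.close()                        # flush any buffered trailing text
    return parser.root

print(tag_tree("<body>hello<div>x</div></body>"))
# prints ('#root', [('body', ['hello', ('div', ['x'])])])
```

Here "hello" hangs directly under body, just as it does in the file. A tree of the 3rd kind would come from a parser that also consults the DTD and may therefore contain elements the text never mentions.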

> PS.: The <p> tag injection is not correct anyway. "<img>" tag is inline,
> yet, not wrapped into <p>. Still want to keep it?
> For details, see: http://www.w3.org/TR/REC-html40/sgml/dtd.html#inline
> '<!ENTITY % special "A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB |
> SUP | SPAN | BDO">
> <!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; |
> %formctrl;">
> <!ENTITY % block "P | %heading; | %list; | %preformatted; | DL | DIV |
> NOSCRIPT | BLOCKQUOTE | FORM | HR | TABLE | FIELDSET | ADDRESS">'