Re: [xml] HTMLparser: whitespace in <body> tags

Whoops, forgot to CC this to the list.

----------  Forwarded Message  ----------

Subject: Re: [xml] HTMLparser: whitespace in <body> tags
Date: September 25, 2004 11:33 pm
From: Benj Carson <benjcarson digitaljunkies ca>
To: Malcolm Tredinnick <malcolm commsecure com au>

Thanks for the pointers.

Section 9.1 of the HTML 4.01 spec says that although consecutive
whitespace occurrences can be collapsed to a single whitespace character
(outside of PRE elements), inter-word whitespace should still be
displayed. So I don't think libxml should be eating all of the spaces.
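
To make the section 9.1 behaviour concrete, here is a minimal sketch in Python (not libxml code). The function name collapse_whitespace is made up for illustration, and the character set is my reading of the white space characters listed in section 9.1. The point is that a run of whitespace collapses to one space, never to nothing:

```python
import re

# White space per my reading of HTML 4.01 section 9.1: space, tab,
# form feed, zero-width space, plus CR and LF line breaks.
_HTML_WS = re.compile("[ \t\f\u200b\r\n]+")


def collapse_whitespace(text):
    """Collapse each run of whitespace to one space (hypothetical helper)."""
    return _HTML_WS.sub(" ", text)


# The whitespace between the two inline elements survives as one space.
print(repr(collapse_whitespace("<b>one</b>  \n <i>two</i>")))
```

This only applies outside of PRE elements; a real implementation would have to skip those.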

I've written a patch that solves this problem for me (see below).  I
chose to prevent the creation of implicit <p> tags by removing "body"
from htmlNoContentElements.  This may or may not be desired, but made
the most sense to me.

This is tricky to pin down: the problem is that you are using the HTML 4.01
Loose DTD, whereas libxml implements parsing according to the HTML 4.01
Strict DTD. In 4.01 Loose, the body element is defined as:

        <!ELEMENT BODY O O (%flow;)* +(INS|DEL) -- document body -->
        <!ENTITY % flow "%block; | %inline;">
        <!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; |

So a body element can contain PCDATA directly, which appears at odds
with its inclusion in htmlNoContentElements. However, in 4.01 Strict,
the body element is:

        <!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->

and you cannot have inline content directly in the body (and even
assuming an implicit P element is not strictly valid -- opening P tags
are required).
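
The difference between the two content models can be sketched as a lookup table. This is only an illustration in Python, not how libxml represents things: the BLOCK and INLINE sets are abbreviated, hand-copied subsets of the %block; and %inline; entities, and allowed_in_body is a hypothetical helper name.

```python
# Abbreviated subsets of the HTML 4.01 %block; and %inline; entities.
# These are assumptions for illustration, not the complete lists.
BLOCK = {"p", "div", "ul", "ol", "pre", "table", "blockquote", "form", "hr"}
INLINE = {"#PCDATA", "a", "b", "i", "em", "strong", "span", "img", "br"}

BODY_CONTENT = {
    # Loose:  (%flow;)* +(INS|DEL), where %flow; is %block; | %inline;
    "loose": BLOCK | INLINE | {"ins", "del"},
    # Strict: (%block;|SCRIPT)+ +(INS|DEL)
    "strict": BLOCK | {"script", "ins", "del"},
}


def allowed_in_body(child, dtd):
    """Return True if `child` may appear directly inside BODY under `dtd`."""
    return child in BODY_CONTENT[dtd]


for dtd in ("loose", "strict"):
    print(dtd, allowed_in_body("#PCDATA", dtd), allowed_in_body("p", dtd))
```

Text directly inside the body passes the check under Loose and fails it under Strict, which is the mismatch described above.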

Right, this makes sense to me.  Unfortunately, choosing a specific DTD or
reformatting my input isn't really a viable option for me in this case.
I'm actually writing an HTML to PDF converter using PHP (using PHP's
DomDocument extension, which in turn uses libxml).  I'd like the converter
to be as forgiving as possible in terms of its input, and to behave as much
like a normal web browser as it can.  For the most part, libxml really
shines here since it is quite tolerant of all the malformed HTML that
people like to write, and it saves me having to write a (slow, and likely
poor-quality) HTML parser in PHP.

To get this completely correct, I think you need to teach
libxml to detect the DTD you are using and adapt appropriately.

This also makes sense to me.  I'm kinda new to hacking libxml, but if this
is what needs to happen I'm willing to give it a shot.  Of course if I
shouldn't bother because someone else is already working in that direction,
or if it's not a feature that's really wanted, I'll leave it alone ;-).
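
For what it's worth, the detection step itself could look roughly like the sketch below. It is Python rather than libxml C, detect_dtd is a hypothetical name, and falling back to "loose" when there is no recognizable DOCTYPE is my own assumption (it seemed the most browser-like choice), not anything libxml does today:

```python
import re

# Public identifiers of the three HTML 4.01 DTDs.
PUBLIC_IDS = {
    "-//W3C//DTD HTML 4.01//EN": "strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "loose",
    "-//W3C//DTD HTML 4.01 Frameset//EN": "frameset",
}

# Match <!DOCTYPE html PUBLIC "..."> and capture the quoted public identifier.
_DOCTYPE = re.compile(
    r"""<!DOCTYPE\s+html\s+PUBLIC\s+(["'])(.*?)\1""", re.IGNORECASE
)


def detect_dtd(html, default="loose"):
    """Guess which HTML 4.01 DTD a document declares (hypothetical helper)."""
    match = _DOCTYPE.search(html)
    if match is None:
        return default  # no DOCTYPE at all: assume the forgiving DTD
    return PUBLIC_IDS.get(match.group(2), default)


doc = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"><html><body>hi</body></html>'
print(detect_dtd(doc))
print(detect_dtd("<html><body>hi</body></html>"))
```

The parser could then consult the result when deciding whether text directly inside the body is allowed.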
