Re: [xml] html parsing incomplete - bug?




Daniel Veillard wrote:
On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:
On 13/10/2009, Stefan Behnel wrote:
I wonder why the parser stops parsing here, though. Is '\0' explicitly
considered an invalid character in (broken) HTML, or is it really just the
usual C EOS slip?
It's certainly invalid, though could be recoverable.

In the various html versions: HTML 4 defers to the SGML spec which I'm
not rich enough to consult, XHTML 1 defers to XML which we all know
says nulls are verboten, and the current HTML 5 draft is pretty clear:

<http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>

"All U+0000 NULL characters in the input must be replaced by U+FFFD
REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
error."

(this is all in the context of a decoded-to-Unicode stream, not raw
UTF-16 etc.)
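
As a rough illustration of the replacement rule quoted above, here is a minimal Python sketch that operates on an already-decoded string. The function name `preprocess` and the idea of returning the offsets of the offending characters are my own assumptions for the example; this is not how libxml2 or any HTML5 parser is actually implemented.

```python
NUL = "\x00"
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def preprocess(text):
    """Replace every U+0000 in a decoded string with U+FFFD.

    Returns the cleaned text plus the offsets where a NUL was found,
    so a caller could report each one as a parse error.
    (Hypothetical helper, for illustration only.)
    """
    errors = [i for i, ch in enumerate(text) if ch == NUL]
    return text.replace(NUL, REPLACEMENT), errors

cleaned, errors = preprocess("<p>a\x00b</p>")
```

The point is only that the input survives in full and each NUL is recorded, rather than parsing stopping at the first one.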

  When HTML5 becomes a Last Call draft or something, then I think it
will make sense to update the parser to use the same recovery
tricks.

In any case, the parser should either apply the above replacement rule or
report an error when encountering a '\0' byte in the input stream.
Currently, it just silently terminates.


Note that the 0 in content may have cut the input at the Python->C
interface layer. But for sure, libxml2 internals don't like 0 in content.

We also pass UCS4-encoded data through the same code, so, no, that's not an
issue here.
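
For readers unfamiliar with the "C EOS slip" mentioned earlier in the thread, here is a small Python sketch using the standard `ctypes` module. It only demonstrates the general effect of treating a buffer as a NUL-terminated C string versus passing it with an explicit length; it is not the code path lxml or libxml2 uses, and the sample data is made up.

```python
import ctypes

data = b"<p>a\x00b</p>"

# Copy the bytes into a C char array (ctypes appends a trailing NUL).
buf = ctypes.create_string_buffer(data)

# Read as a NUL-terminated C string: everything after the first 0 byte is lost.
as_c_string = buf.value

# Read with an explicit length: the embedded 0 byte is preserved.
full = buf.raw[:len(data)]
```

Passing an explicit length along with the pointer is what allows data with embedded zero bytes, such as UCS4, to cross the boundary intact.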

Stefan


