Re: [xml] html parsing incomplete - bug?

From: Daniel Veillard <veillard redhat com>
To: "Martin (gzlist)" <gzlist googlemail com>
Cc: xml gnome org, Lydia Patrovic <lydia patrovic rbcmail ru>
Subject: Re: [xml] html parsing incomplete - bug?
Date: Tue, 13 Oct 2009 14:38:50 +0200

On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:

On 13/10/2009, Stefan Behnel <stefan_ml behnel de> wrote:


I wonder why the parser stops parsing here, though. Is '\0' explicitly
considered an invalid character in (broken) HTML, or is it really just the
usual C EOS slip?


It's certainly invalid, though could be recoverable.

In the various HTML versions: HTML 4 defers to the SGML spec, which I'm
not rich enough to consult; XHTML 1 defers to XML, which we all know
says nulls are verboten; and the current HTML 5 draft is pretty clear:

<http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>

"All U+0000 NULL characters in the input must be replaced by U+FFFD
REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
error."

(This is all in the context of a decoded-to-Unicode stream, not raw
UTF-16 etc.)
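
For illustration, the draft's rule quoted above could be sketched roughly
as follows. This is only a minimal sketch operating on an already decoded
string; the function name replace_nulls is hypothetical and is not part of
libxml2, lxml, or any HTML5 parser.

```python
REPLACEMENT_CHARACTER = "\ufffd"  # U+FFFD

def replace_nulls(text):
    """Replace each U+0000 in decoded text with U+FFFD.

    Returns the cleaned text and the number of NULs found; under the
    draft's wording each occurrence counts as a parse error.
    (Hypothetical helper, for illustration only.)
    """
    parse_errors = text.count("\x00")
    return text.replace("\x00", REPLACEMENT_CHARACTER), parse_errors

cleaned, parse_errors = replace_nulls("<p>before\x00after</p>")
```

The point of the sketch is that the NUL is treated as a recoverable error:
the rest of the input is kept and stays available to the parser.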


  Once HTML5 becomes a Last Call draft or something similar, I think it
will make sense to try to update the parser to use the same recovery
tricks.
  Note that the 0 in the content may have truncated the input at the
Python->C interface layer. But libxml2 internals certainly don't like 0
in content either.
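
  A rough way to see the kind of truncation meant here, using Python's
ctypes module rather than the actual binding code (which is not shown in
this thread): a buffer read as a NUL-terminated C string ends at the first
0 byte, even though the underlying bytes are all still there. The helper
name c_string_view is hypothetical.

```python
import ctypes

def c_string_view(data):
    """Return what C code treating `data` as a NUL-terminated string sees.

    Hypothetical helper for illustration; it does not reproduce how any
    particular Python binding hands data to libxml2.
    """
    buf = ctypes.create_string_buffer(data)
    # .value stops at the first 0 byte, like strlen()-based C code would.
    return buf.value

data = b"<p>before\x00after</p>"
truncated = c_string_view(data)

# The full buffer is intact; only the C-string view is cut short.
full = ctypes.create_string_buffer(data).raw[:len(data)]
```

An interface that passes an explicit length alongside the pointer avoids
this particular cut, though whatever consumes the data may still reject
the 0 byte.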

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
