Re: [xml] html parsing incomplete - bug?
- From: "Martin (gzlist)" <gzlist googlemail com>
- To: Stefan Behnel <stefan_ml behnel de>
- Cc: xml gnome org, Lydia Patrovic <lydia patrovic rbcmail ru>
- Date: Tue, 13 Oct 2009 13:22:12 +0100
On 13/10/2009, Stefan Behnel <stefan_ml behnel de> wrote:
> I wonder why the parser stops parsing here, though. Is '\0' explicitly
> considered an invalid character in (broken) HTML, or is it really just the
> usual C EOS slip?
It's certainly invalid, though it could be recoverable.
In the various HTML versions: HTML 4 defers to the SGML spec, which I'm
not rich enough to consult; XHTML 1 defers to XML, which we all know
says nulls are verboten; and the current HTML 5 draft is pretty clear:
<http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>
"All U+0000 NULL characters in the input must be replaced by U+FFFD
REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
error."
(This is all in the context of a decoded-to-Unicode stream, not raw
UTF-16 etc.)
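
To make the quoted draft rule concrete, here is a minimal sketch of that preprocessing step in Python. It is not what libxml2 does and not taken from any parser's source; the function name preprocess_html_text is made up for illustration, and it assumes the input has already been decoded to a Unicode string.

```python
from typing import Tuple

NUL = "\u0000"          # U+0000 NULL
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess_html_text(text: str) -> Tuple[str, int]:
    """Hypothetical helper: apply the quoted HTML 5 draft rule to decoded text.

    Every U+0000 is replaced by U+FFFD, and each occurrence is counted as
    one parse error. Parsing continues past the NUL instead of stopping.
    """
    parse_errors = text.count(NUL)
    return text.replace(NUL, REPLACEMENT), parse_errors


# Works on an already-decoded string, not on raw bytes.
cleaned, errors = preprocess_html_text("<p>before\u0000after</p>")
```

With this approach the text after the NUL is kept, and the caller can decide whether the reported parse errors matter.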
Martin