Re: [xml] html parsing incomplete - bug?

From: Stefan Behnel <stefan_ml behnel de>
To: "Martin (gzlist)" <gzlist googlemail com>
Cc: xml gnome org, Lydia Patrovic <lydia patrovic rbcmail ru>
Subject: Re: [xml] html parsing incomplete - bug?
Date: Tue, 13 Oct 2009 13:56:06 +0200


Martin (gzlist) wrote:

On 13/10/2009, Stefan Behnel <stefan_ml behnel de> wrote:

Lydia Patrovic wrote:

Note the "main&amp;20090924_2" attribute value, which can be interpreted as an
unterminated entity.

:) Nice little Freudian copy&paste quoting error. Here's the line from the
real 'HTML' file:

<script type="text/javascript" src="merge.php?f=main&20090924_2"></script>

Note the unescaped '&' character in the URL.
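
For illustration only, here is a minimal sketch of how that URL would look once the '&' is escaped for use inside an attribute value. It uses Python's standard html.escape; the variable names are made up for this example and are not taken from the original page or from any parser.

```python
from html import escape

# The URL exactly as it appears in the src attribute of the real file.
raw_url = "merge.php?f=main&20090924_2"

# Escape '&' (and quotes) so the value is safe inside an HTML attribute.
escaped_url = escape(raw_url, quote=True)

# Rebuild the script tag with the escaped attribute value.
script_tag = '<script type="text/javascript" src="%s"></script>' % escaped_url
print(script_tag)
```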


I'd have thought the embedded null at byte 532 would be the cause. Try
bytes.replace("\x00", "") before treating it as a C string. That seems to
get the document parsed pretty much as expected for me.
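
A rough sketch of that workaround, assuming the document is held as a Python 3 bytes object (where both arguments to replace() must themselves be bytes). The sample markup and the helper name strip_nul_bytes are invented for this example. The ctypes call only demonstrates what a reader that treats the buffer as a NUL-terminated C string would see; it does not claim to show what the HTML parser itself does internally.

```python
import ctypes


def strip_nul_bytes(data: bytes) -> bytes:
    """Return a copy of data with every embedded NUL byte removed."""
    return data.replace(b"\x00", b"")


# Hypothetical input with one embedded NUL between two paragraphs.
sample = b"<html><body><p>before</p>\x00<p>after</p></body></html>"

# Reading the buffer as a C string stops at the first NUL byte.
as_c_string = ctypes.create_string_buffer(sample).value

# After stripping, the whole document survives the same treatment.
cleaned = strip_nul_bytes(sample)
cleaned_as_c_string = ctypes.create_string_buffer(cleaned).value

print(len(sample), len(as_c_string), len(cleaned_as_c_string))
```

The cleaned bytes can then be handed to whichever parser is in use.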


Interesting. Sounds totally like the right solution.

I wonder why the parser stops parsing here, though. Is '\0' explicitly
considered an invalid character in (broken) HTML, or is it really just the
usual C EOS slip?

Stefan

