Re: [xml] strange encoding behavior when parsing HTML files
- From: Aaron Patterson <aaron patterson gmail com>
- To: xml gnome org
- Subject: Re: [xml] strange encoding behavior when parsing HTML files
- Date: Fri, 17 Apr 2009 09:39:22 -0700
On Fri, Apr 17, 2009 at 1:53 AM, Daniel Veillard <veillard redhat com> wrote:
> On Thu, Apr 16, 2009 at 01:51:10PM -0700, Aaron Patterson wrote:
>> Hi,
>>
>> There seems to be strange behavior in libxml2 with regard to encoding
>> when parsing an HTML file. If an HTML file contains a meta tag
>> hinting at the encoding, libxml2 will use the encoding in the meta tag
>> *unless* there are strange characters before the meta tag.
>>
>> If there are strange characters before the meta tag, libxml2 will
>> guess the encoding and use the guessed encoding for the rest of the
>> document even though the meta tag reported the correct encoding.
>> What's worse, libxml2 will report that it used the encoding from the
>> meta tag even though the content of the output document shows that it
>> did not.
>>
>> Here is an example of the behavior in action:
>>
>> http://gist.github.com/96641
>>
>> fail.html fails, and success.html "does the right thing".
>>
>> Should I report this in bugzilla?
>
> Yes please. Encoding handling is a real problem in HTML because
> content can appear before the meta tag (if there is one at all), so
> the parser has to start parsing before it knows the declared encoding.
> XML fixed this with the XML declaration (xmlDecl) and rules for
> parsing it without any a priori encoding information.
I've reported the bug here:

http://bugzilla.gnome.org/show_bug.cgi?id=579317

I wasn't sure how to set the priority. I set it to critical because my
data is incorrect and the only workaround I have is to parse the
document myself, look for the encoding, and then pass that encoding to
libxml2.
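
For illustration, the workaround could look roughly like the sketch below. It is not taken from the gist or the bug report. It assumes Python, and the helper name `sniff_meta_charset`, the 4096-byte scan limit, and the regular expression are all hypothetical choices. It only covers the "look for the encoding" step. The result would then be handed to the parser explicitly, for example as the `encoding` argument of libxml2's `htmlReadMemory()`.

```python
import re

# Hypothetical helper, not part of libxml2: scan the first bytes of an HTML
# document for a charset declared in a <meta> tag. The pattern is a rough
# approximation and does not handle every way a charset can be declared.
_META_CHARSET = re.compile(
    rb"""<meta\b[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(data: bytes, limit: int = 4096):
    """Return the charset named in a meta tag as a str, or None if absent."""
    # Work on raw bytes so that odd characters before the meta tag cannot
    # cause a decoding error or influence the result.
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii")


if __name__ == "__main__":
    # A non-ASCII byte appears before the meta tag, as in the failing case.
    html = (
        b"<html><head><title>caf\xe9</title>"
        b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1">'
        b"</head><body></body></html>"
    )
    encoding = sniff_meta_charset(html)
    print(encoding)
    # The detected name would then be passed explicitly to the parser
    # instead of letting the parser guess.
```

Scanning bytes rather than decoded text is deliberate: the encoding is not yet known at this point, and the characters that trigger the wrong guess sit before the meta tag.
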
Thanks for the help!
--
Aaron Patterson
http://tenderlovemaking.com/