[xml] strange encoding behavior when parsing HTML files

From: Aaron Patterson <aaron patterson gmail com>
To: xml gnome org
Subject: [xml] strange encoding behavior when parsing HTML files
Date: Thu, 16 Apr 2009 13:51:10 -0700

Hi,

There seems to be strange behavior in libxml2 with regard to encoding
when parsing an HTML file.  If an HTML file contains a meta tag
hinting at the encoding, libxml2 will use the encoding in the meta tag
*unless* there are strange characters before the meta tag.

If there are strange characters before the meta tag, libxml2 will
guess the encoding and use the guessed encoding for the rest of the
document even though the meta tag reported the correct encoding.
What's worse, libxml2 reports that it used the encoding from the meta
tag, even though the serialized output of the document shows that it
did not.
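
To make the expected behavior concrete, here is a minimal Python sketch
of "honor the meta tag even when other bytes come first".  It is not
libxml2 code and does not reproduce libxml2's detection logic; the
helper names (declared_encoding, decode_html), the 4096-byte scan
window, the ISO-8859-1 fallback, and the sample document are all
assumptions made for illustration.

```python
import re

# Hypothetical helper, not a libxml2 API: look for a
# <meta ... charset=...> declaration in the raw bytes, before any decoding.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def declared_encoding(raw: bytes, limit: int = 4096):
    """Return the charset named in a meta tag within the first `limit` bytes, or None."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode with the declared charset; fall back if none is declared or it is unusable."""
    encoding = declared_encoding(raw) or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        return raw.decode(fallback, errors="replace")


# Made-up sample: non-ASCII text appears before the meta tag.
sample = (
    "<html><head><title>caf\u00e9</title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    "</head><body>\u65e5\u672c\u8a9e</body></html>"
).encode("utf-8")

print(declared_encoding(sample))
print("caf\u00e9" in decode_html(sample))
```

A caller could use a pre-scan like this to pass the encoding to the
parser explicitly instead of relying on detection; whether that avoids
the problem described here is untested.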

Here is an example of the behavior in action:

  http://gist.github.com/96641

fail.html fails, and success.html "does the right thing".

Should I report this in bugzilla?

Thanks!

-- 
Aaron Patterson
http://tenderlovemaking.com/

Follow-Ups:
- Re: [xml] strange encoding behavior when parsing HTML files
  - From: Daniel Veillard
