Re: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?

From: Matt Poff <matt poff headfirst co nz>
To: xml gnome org
Subject: Re: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?
Date: Tue, 13 Jan 2009 10:44:20 +1300

Thanks for your help, guys,

If I download the file with curl, the UTF-8 is preserved and visible, so
the problem is specific to xmllint. I even tried downloading the HTML file,
running it through iconv and then xmllint, but this made no difference.

Does the HTML document start with a byte order mark? Does it include a
<meta http-equiv='Content-Type' content='text/html;charset=utf-8'> tag?
If you have libxml2 download the content, does the HTTP response contain
a Content-Type:text/html;charset=utf-8 header?

I've done a bit more experimenting and think I've nailed it. The problem seems to occur because the source HTML has some markup before the <meta> tag that declares the UTF-8 charset, and that markup contains a UTF-8 (non-ASCII) character. The parser hits that character before it has seen the charset declaration, and this seems to throw off the rest of the parsing. If I make sure the charset <meta> tag is the first tag in the <head>, the encoding comes out as expected.

Interesting that xmllint's --encode flag is ignored under these circumstances, though! I guess I'll have to strip this tag out before I pass the source to xmllint.
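
For anyone hitting the same thing, here is a rough sketch of the reordering workaround described above: move the charset <meta> tag so that it is the first thing inside <head> before handing the file to xmllint. The function name hoist_charset_meta and the regular expressions are my own illustration, not anything from libxml2, and regex-based HTML editing is only an approximation that assumes one well-formed <head> and one charset <meta> tag.

```python
import re

# Assumption: the charset is declared in a <meta ...> tag containing "charset=",
# either the http-equiv form or the short <meta charset=...> form.
META_CHARSET = re.compile(rb"<meta\b[^>]*charset\s*=[^>]*>", re.IGNORECASE)
# Opening <head> tag (the \b keeps this from matching <header>).
HEAD_OPEN = re.compile(rb"<head\b[^>]*>", re.IGNORECASE)


def hoist_charset_meta(html: bytes) -> bytes:
    """Return html with the first charset <meta> tag moved to directly after <head>.

    Works on raw bytes so the document is never decoded with a guessed encoding.
    The input is returned unchanged if there is no <head>, no charset <meta>,
    or the <meta> appears before the end of the <head> opening tag.
    """
    head = HEAD_OPEN.search(html)
    meta = META_CHARSET.search(html)
    if head is None or meta is None or meta.start() < head.end():
        return html
    # Cut the <meta> tag out of its current position...
    without_meta = html[:meta.start()] + html[meta.end():]
    # ...and re-insert it immediately after the opening <head> tag.
    insert_at = head.end()
    return without_meta[:insert_at] + meta.group(0) + without_meta[insert_at:]
```

The output of this could then be written to a file, or piped to something like "xmllint --html --xmlout -", in place of the original download. I have only described the idea here, so treat it as a starting point rather than a tested fix.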

Thanks again,

Matt

References:
- [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?
  - From: Matt Poff
- Re: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?
  - From: Bjoern Hoehrmann
