Re: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?

* Matt Poff wrote:
I have a migration pipeline that takes HTML files with UTF-8 encoded
characters and pipes them through xmllint to produce valid XHTML. This
is then queried by an XSLT file called by ETL scripts. However, no
matter what flags I use on xmllint, I cannot get it to output the
XHTML with the UTF-8 encoding preserved.

Do you specify the encoding when calling htmlCtxtRead* or whatever you
are using to parse the document? Generally, it would be better to check
what values are stored in memory by querying parts of the document, than
relying on the serialized result.
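
To illustrate why the serialized result can mislead, here is a minimal
sketch. It uses Python's standard xml.etree.ElementTree as a stand-in for
libxml2, and the sample markup is made up. It assumes the output in
question differs from the input only in how non-ASCII characters are
written (raw UTF-8 bytes versus numeric character references), which may
or may not be what is happening in this pipeline.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same paragraph serialized two ways.
raw_utf8 = "<p>Dageb\u00fcll</p>".encode("utf-8")  # literal UTF-8 bytes
char_refs = b"<p>Dageb&#xFC;ll</p>"                # pure ASCII with a character reference


def text_in_memory(serialized: bytes) -> str:
    """Parse the bytes and return the root element's text as a Unicode string."""
    return ET.fromstring(serialized).text


# The byte streams differ, but the parsed values are the same string.
same_bytes = raw_utf8 == char_refs
same_text = text_in_memory(raw_utf8) == text_in_memory(char_refs)
```

If the two forms compare equal once parsed, the document content is
intact and only the serialization differs.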

If I download the file with curl the UTF-8 is preserved and visible, so
it's specific to xmllint. I even tried downloading the HTML file,
running it through iconv and then xmllint, but this made no difference.

Does the HTML document start with a byte order mark? Does it include a
<meta http-equiv='Content-Type' content='text/html;charset=utf-8'> tag?
If you have libxml2 download the content, does the HTTP response contain
a Content-Type:text/html;charset=utf-8 header?
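
The first two questions can be answered for a saved copy of the file
with a few lines of Python. This is only a rough sketch: the function
name encoding_hints is made up for illustration, the regular expression
is deliberately simple and is not how libxml2 detects encodings, and the
HTTP header has to be checked separately.

```python
import codecs
import re

# Deliberately simple pattern for a charset declared in a <meta> tag;
# a real HTML parser handles many more cases.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def encoding_hints(data: bytes) -> dict:
    """Report the encoding hints found in the raw bytes of an HTML file."""
    # A UTF-8 byte order mark is the three bytes EF BB BF at the very start.
    has_bom = data.startswith(codecs.BOM_UTF8)

    # Only look near the start of the document for a <meta> charset.
    match = META_CHARSET.search(data[:4096])
    meta_charset = match.group(1).decode("ascii").lower() if match else None

    # Check whether the whole byte stream decodes as UTF-8 at all.
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    return {"bom": has_bom, "meta_charset": meta_charset, "valid_utf8": valid_utf8}
```

A file with no byte order mark and no meta charset, but with valid UTF-8
bytes, gives the parser nothing to go on except an encoding passed in
explicitly.
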
Björn Höhrmann · mailto:bjoern hoehrmann de ·
Am Badedeich 7 · Telefon: +49(0)160/4415681 ·
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · 
