[xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?

From: Matt Poff <matt poff headfirst co nz>
To: xml gnome org
Subject: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?
Date: Tue, 13 Jan 2009 09:14:10 +1300

Hi all,

I've run out of options trying to get correctly encoded output from xmllint, so hopefully someone on the list has an idea.

I have a migration pipeline that takes HTML files with UTF-8 encoded characters and pipes them through xmllint to produce valid XHTML. This is then queried by XSLT files called by ETL scripts. However, no matter what flags I use on xmllint, I cannot get it to output the XHTML with the UTF-8 encoding preserved. If I specify UTF-8 encoding I end up with what looks like double-encoded UTF-8 characters, and if I don't specify an encoding, the original UTF-8 is mapped to HTML entities, but these too look like entities for two characters per original UTF-8 character. It seems that the UTF-8 is corrupted early in the parsing process.
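For readers unfamiliar with the symptom, the following is a minimal Python sketch of what "double-encoded" UTF-8 looks like. It assumes, purely for illustration, that the input bytes are misread as ISO-8859-1 (Latin-1) before being written out again; the message does not establish that this is what xmllint is doing. The function names and the sample string are hypothetical.

```python
# Illustration only: simulate UTF-8 bytes being misread as Latin-1.
# ASSUMPTION: a Latin-1 misread is the cause; the original post does not confirm this.

def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode each byte as one Latin-1 character."""
    raw = text.encode("utf-8")       # the bytes actually stored in the file
    return raw.decode("latin-1")     # every byte becomes its own character

def as_entities(text: str) -> str:
    """Render non-ASCII characters as decimal numeric character references."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

def repair(garbled: str) -> str:
    """Undo the misread, valid only if the damage is exactly the one simulated above."""
    return garbled.encode("latin-1").decode("utf-8")

original = "café"
garbled = misread_as_latin1(original)

print(garbled)                 # cafÃ©
print(as_entities(garbled))    # caf&#195;&#169;  (two entities for one character)
print(as_entities(original))   # caf&#233;        (what correct output would contain)
print(repair(garbled))         # café
```

Two entities (or two odd-looking characters) where one was expected is the pattern to look for when checking whether output has been damaged in this way.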

If I download the file with curl the UTF-8 is preserved and visible, so the problem is specific to xmllint. I even tried downloading the HTML file, running it through iconv and then xmllint, but this made no difference.

As far as I'm aware I'm running the latest version of xmllint (packaged with OS X Leopard), but I have also tried running the process on RHEL5 with the same results.

Is there a fix for this, or can anyone suggest a workaround? Transformation of the HTML to strict XHTML is critical to my workflow, as is preserving the UTF-8.


Many thanks,
Matt

Follow-Ups:
- Re: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?
  - From: Bjoern Hoehrmann
- Re: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?
  - From: Sebastian Rahtz
