[xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?

Hi all,

I've run out of options trying to get correctly encoded output from xmllint so hopefully someone has an idea here on the list.

I have a migration pipeline that takes HTML files containing UTF-8 encoded characters and pipes them through xmllint to produce valid XHTML. The XHTML is then queried by an XSLT file called by ETL scripts. However, no matter what flags I use on xmllint, I cannot get it to output the XHTML with the UTF-8 encoding preserved. If I specify UTF-8 encoding, I end up with what look like double-encoded UTF-8 characters; if I don't specify an encoding, the original UTF-8 characters are mapped to HTML entities, but these too look wrong: each original character comes out as entities for two characters. It seems that the UTF-8 is corrupted early in the parsing process.
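To make the symptom concrete, here is a minimal Python sketch of how this kind of output can arise. It assumes (this is a guess, not something confirmed for xmllint) that the parser reads the UTF-8 bytes as ISO-8859-1 and then serializes the result. The function names are hypothetical and only serve the illustration.

```python
def double_encode(text: str) -> bytes:
    """Mimic UTF-8 bytes being misread as ISO-8859-1 and written out as UTF-8 again."""
    utf8_bytes = text.encode("utf-8")        # what the source file actually contains
    misread = utf8_bytes.decode("latin-1")   # assumed wrong guess made by the parser
    return misread.encode("utf-8")           # serialized again as UTF-8


def as_entities(text: str) -> str:
    """Mimic the same misread text being written out as numeric character entities."""
    misread = text.encode("utf-8").decode("latin-1")
    return "".join(f"&#{ord(ch)};" for ch in misread)


original = "é"                    # one character, two bytes in UTF-8 (0xC3 0xA9)
print(double_encode(original))    # b'\xc3\x83\xc2\xa9'
print(as_entities(original))      # &#195;&#169;
```

If the output I am seeing matches this pattern (two entities, or four bytes, for a single accented character), that would point to the input encoding being guessed wrongly rather than the output encoding being set wrongly.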

If I download the file with curl, the UTF-8 is preserved and visible, so the problem is specific to xmllint. I even tried downloading the HTML file and running it through iconv before xmllint, but this made no difference.
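One thing that can be checked on the downloaded bytes is whether they are valid UTF-8 and whether the document declares a charset at all, since a missing declaration could leave the parser to guess. The sketch below uses only the Python standard library; `inspect_html` and the sample document are hypothetical, and the regular expression is a rough heuristic rather than a real HTML parser.

```python
import re

# Rough heuristic: find charset=... inside a <meta ...> tag (either meta form).
CHARSET_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def inspect_html(raw: bytes) -> dict:
    """Report whether raw bytes decode as UTF-8 and which charset, if any, is declared."""
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    match = CHARSET_RE.search(raw)
    declared = match.group(1).decode("ascii").lower() if match else None
    return {"valid_utf8": valid_utf8, "declared_charset": declared}


# Hypothetical sample: valid UTF-8 content, but no charset declaration.
sample = "<html><head><title>café</title></head><body></body></html>".encode("utf-8")
print(inspect_html(sample))
```

A result with `valid_utf8` true and `declared_charset` empty would be consistent with the file being fine on disk but ambiguous to the parser.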

As far as I'm aware, I'm running the latest version of xmllint (the one packaged with Mac OS X Leopard), but I have also tried running the process on RHEL5 with the same results.

Is there a fix for this, or can anyone suggest a workaround? Transforming the HTML to strict XHTML is critical to my workflow, as is preserving the UTF-8.

Many thanks,
