[xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?
- From: Matt Poff <matt poff headfirst co nz>
- To: xml gnome org
- Date: Tue, 13 Jan 2009 09:14:10 +1300
Hi all,
I've run out of options trying to get correctly encoded output from
xmllint so hopefully someone has an idea here on the list.
I have a migration pipeline that takes HTML files with UTF-8 encoded
characters and pipes them through xmllint to produce valid XHTML. This
is then queried by XSLT files called by ETL scripts. However, no
matter what flags I use on xmllint, I cannot get it to output the
XHTML with the UTF-8 encoding preserved. If I specify UTF-8 encoding I
end up with what looks like double-encoded UTF-8 characters and if I
don't specify encoding, the original UTF-8 is mapped to HTML entities
but these too look like entities for two UTF-8 characters. It seems
that, early in the parsing process, the UTF-8 is corrupted.
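
To make the symptom concrete, here is a minimal Python sketch of the kind of
corruption described above. It assumes the UTF-8 bytes are being read as
ISO-8859-1 somewhere along the way; that is only a guess at the cause, not
something confirmed in xmllint. The sample string and the helper names are
made up for illustration.

```python
# Sketch of the suspected failure mode: UTF-8 bytes misread as ISO-8859-1.
# Assumption: the sample text and helper names are illustrative only and
# are not taken from xmllint or libxml2.

def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode those bytes as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


original = "caf\u00e9"  # "café": one non-ASCII character, two bytes in UTF-8
garbled = misread_as_latin1(original)

# Case 1: output encoding set to UTF-8 -> the two bytes of "é" become four.
double_encoded = garbled.encode("utf-8")

# Case 2: no output encoding -> two entities appear where one was expected.
entities = to_numeric_entities(garbled)

print(double_encoded)  # b'caf\xc3\x83\xc2\xa9'
print(entities)        # caf&#195;&#169;
```

If this is what is happening, the damage is reversible for text like this
sample: encoding the garbled string back to ISO-8859-1 and decoding it as
UTF-8 returns the original.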
If I download the file with curl the UTF-8 is preserved and visible so
it's specific to xmllint. I even tried downloading the HTML file,
running it through iconv and then xmllint, but this made no difference.
As far as I'm aware I'm running the latest version of xmllint
(packaged with OSX Leopard) but have also tried to run the process in
RHEL5 with the same results.
Is there a fix for this, or can anyone suggest a workaround?
Transformation of the HTML to strict XHTML is critical to my workflow,
as is preserving the UTF-8.
Many thanks,
Matt