Re: [xml] Adding parsed 8859-1 content to a UTF-8 document ...
- From: Daniel Veillard <veillard redhat com>
- To: denverrox denver <denverrox303 yahoo com>
- Cc: xml gnome org
- Subject: Re: [xml] Adding parsed 8859-1 content to a UTF-8 document ...
- Date: Sun, 20 Jul 2008 10:29:44 -0400
On Sat, Jul 19, 2008 at 03:29:32PM -0700, denverrox denver wrote:
I'm using libxml2 (specifically the HTMLparser/tree modules, and the xpath library) to perform
transformation operations on HTML input files, and have run into a character encoding issue:
Specifically, I have two HTML documents, one in 8859-1 encoding, and the other in UTF-8.
First I parse both documents into DOM trees.
Then I run an XPath query on the 8859-1 document, clone the result-set nodes using "xmlCopyNodeList",
and use "xmlAddNextSibling" to add the 8859-1 document content to a document that was originally UTF-8.
This results in the 8859-1 content not being correctly serialized if I output the UTF-8 document. Special
characters are garbled, etc.
I guess that assertion needs a more precise description. Any character
in 8859-1 will be encoded with 1 or 2 bytes in UTF-8 without problem.
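To make the "1 or 2 bytes" point concrete, here is a minimal sketch of the byte-level mapping in plain C. The helper name latin1_to_utf8 is made up for illustration and is not part of libxml2; it only relies on the fact that every ISO-8859-1 byte value equals its Unicode code point (U+0000 to U+00FF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): encode one ISO-8859-1
 * byte as UTF-8. Writes 1 or 2 bytes into out and returns that count. */
static size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    assert(out != NULL);

    if (c < 0x80) {
        /* ASCII range: the byte is unchanged in UTF-8. */
        out[0] = c;
        return 1;
    }

    /* 0x80..0xFF: two-byte sequence 110xxxxx 10xxxxxx. */
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}
```

For example, the 8859-1 byte 0xE9 (e with acute accent) becomes the two bytes 0xC3 0xA9, while 'A' (0x41) stays a single byte. If a serialized document shows something other than this kind of mapping, the bytes were probably interpreted under the wrong encoding at some point before or after the tree was built.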
Based on the libxml2 encodings webpage (http://xmlsoft.org/encoding.html),
it seems that libxml2 converts all character encodings to UTF-8 internally. Therefore unless I'm
misunderstanding something, the 8859-1 document should be in UTF-8 after parsing.
Yes, it is in UTF-8 internally.
Is there any reason why this serialization problem should occur, if both the 8859-1 document and UTF-8
document are converted to native UTF-8 by libxml2? Shouldn't it "just work"? My impression is that you can
freely copy cloned nodesets between documents, as they're all internally in UTF-8. Careful review of the
libxml2 encodings page seems to agree with this assertion, so I'm quite stumped.
I don't think there should be any problem
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/