Re: [xml] Scripting languages and character encodings
- From: Malcolm Tredinnick <malcolm commsecure com au>
- To: Steve Ball <Steve Ball zveno com>
- Cc: xml gnome org
- Subject: Re: [xml] Scripting languages and character encodings
- Date: Thu, 22 Jan 2004 10:13:48 +1100
On Thu, 2004-01-22 at 08:15, Steve Ball wrote:
> Other language bindings have probably solved this problem,
> so I'm posting looking for advice...
>
> I'm developing the Tcl binding for libxml2 and there is currently
> an issue with character encodings. Tcl (v8.1+) is very good at
> dealing with character encodings - when reading data from an
> input channel it automatically converts the data to UTF-8
> (the internal encoding). My libxml2 wrapper code gets a memory
> buffer containing the UTF-8-encoded document and passes that
> to libxml2... works great!
>
> However, some documents specify their character encoding in
> the XML declaration, i.e. <?xml version='1.0' encoding='iso-8859-1'?>
> It would appear that libxml2 expects the document to contain
> characters in that encoding (which is perfectly reasonable!),
> but Tcl has already converted those to UTF-8. Of course, this
> results in an error. Similar problems occur on output.
>
> My question is, can I tell the libxml2 parser that a document
> has a certain character encoding, overriding what the XML
> declaration says?
The short answer is "no". If you are changing the encoding of the data
in the document, then you need to change the encoding in the XML
declaration as well.
This shouldn't be a particularly onerous pre-parsing step, since it is
one attribute in the first line of the file.
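
To make the pre-parsing step concrete, here is a rough sketch in Python
(not Tcl, and not part of libxml2). It assumes the buffer has already
been converted to UTF-8, has no byte order mark, and starts directly
with the XML declaration. The name declare_utf8 is made up for
illustration.

```python
import re

# Matches an XML declaration at the very start of the buffer, up to and
# including the encoding pseudo-attribute:
#   group 1: '<?xml version=...'
#   group 2: whitespace plus 'encoding='
#   group 3: the quote character around the encoding name
_XML_DECL = re.compile(
    rb"""^(<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*'))"""
    rb"""(\s+encoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\3"""
)


def declare_utf8(doc: bytes) -> bytes:
    """Return doc with the declared encoding name replaced by UTF-8.

    Only the declaration is touched; the document body is left alone.
    Documents without an encoding declaration are returned unchanged,
    since a parser will then default to UTF-8 anyway.
    """
    return _XML_DECL.sub(rb"\g<1>\g<2>\g<3>UTF-8\g<3>", doc, count=1)


if __name__ == "__main__":
    src = b"<?xml version='1.0' encoding='iso-8859-1'?>\n<doc>caf\xc3\xa9</doc>"
    print(declare_utf8(src).decode("utf-8"))
```

The rewritten buffer is what you would then hand to the parser. The same
substitution should be easy to express with Tcl's regsub.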
The issue of overriding the encoding declaration has been discussed on
the list a few times previously, and Daniel's position has always been
that libxml is not going to handle input that is not well-formed XML
(the XML specification requires as much). The most recent thread on this
is titled "Control over encoding declaration (prolog and meta)" and
started on 14 Jan 2004 (check the online archives). Obviously you don't
want to use the re-encoding approach mentioned in that thread, since you
would then go from ISO-8859-1 -> UTF-8 -> ISO-8859-1 -> libxml, which is
less than efficient, but the justifications may be of interest to you.