[xml] Scripting languages and character encodings



Other language bindings have probably solved this problem already,
so I'm posting to ask for advice...

I'm developing the Tcl binding for libxml2 and there is currently
an issue with character encodings.  Tcl (v8.1+) is very good at
dealing with character encodings - when reading data from an
input channel it automatically converts the data to UTF-8
(the internal encoding).  My libxml2 wrapper code gets a memory
buffer containing the UTF-8-encoded document and passes that
to libxml2... works great!

However, some documents specify their character encoding in
the XML declaration, e.g. <?xml version='1.0' encoding='iso-8859-1'?>
It would appear that libxml2 expects the document to contain
characters in that encoding (which is perfectly reasonable!),
but Tcl has already converted those to UTF-8.  Of course, this
results in an error.  Similar problems occur on output.
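
To make the mismatch concrete, here is a minimal sketch. It uses
Python's standard-library parser (expat-based) purely as a stand-in,
not libxml2 or Tcl, and the sample document and variable names are
made up for illustration. It parses the same UTF-8 bytes twice: once
trusting the declaration, and once with the encoding overridden.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the declaration claims iso-8859-1, but the bytes
# are really UTF-8 (as they would be after a channel-level conversion).
text = "<?xml version='1.0' encoding='iso-8859-1'?><p>caf\u00e9</p>"
utf8_bytes = text.encode("utf-8")

# Trusting the declaration: each UTF-8 byte is read as one Latin-1 character,
# so the two-byte sequence for U+00E9 comes out as two separate characters.
trusted = ET.fromstring(utf8_bytes)

# Overriding the declaration: tell the parser the bytes are UTF-8.
override = ET.XMLParser(encoding="utf-8")
overridden = ET.fromstring(utf8_bytes, parser=override)

print(ascii(trusted.text))     # garbled text
print(ascii(overridden.text))  # the intended text
```

In this stand-in the first parse silently yields garbled text rather
than raising an error, since every byte is a valid Latin-1 character;
how a given parser reacts to the mismatch may differ.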

My question is, can I tell the libxml2 parser that a document
has a certain character encoding, overriding what the XML
declaration says?

Cheers,
Steve Ball

Steve Ball            |   XSLT Standard Library   | Training & Seminars
Zveno Pty Ltd         |     Web Tcl Complete      |   XML XSL Schemas
http://www.zveno.com/ |      TclXML TclDOM        | Tcl, Web Development
Steve Ball zveno com  +---------------------------+---------------------
Ph. +61 2 6242 4099   |   Mobile (0413) 594 462   | Fax +61 2 6242 4099



