[xml] Ignoring Character Encodings


OK, so I realise I may be treading on dangerous ground here (some of the
posts in the archive about encodings get quite scary!), but I have a small
question regarding character encodings.

Currently, all documents opened by our application get converted to our own
standard internal encoding. When I pass the document through to libxml, I
map the doc from our encoding to UTF-8 and map the output from libxml back
again. Fine, no problem.

As is entirely expected (and indeed essential as far as normal XML parsing
is concerned), if the document itself contains an encoding declaration in
the <?xml...?> line, libxml wants to switch encoding to the one specified,
and not the UTF-8 I'm giving it.

As I know the doc has already been converted into UTF-8 before libxml gets
to see it (because I did it), is there any way of telling it to ignore the
encoding declaration contained in the doc, and to stick with the one I've
told it to use?

I'm not supposed to modify the document in any way (i.e. to temporarily hide
the encoding declaration), but I could if I really had to - in reality, I'd
probably just change the encoding code in libxml by hand, ending up with our
own 'doctored' version.
Also, it would be a waste of computing power to change the doc back into
its real encoding only for libxml to change it straight back again!

If there isn't a proper way using the API that I've missed, I've got a patch
which just adds an extra run-time flag in xmlSetFeature(), and an if
statement in xmlSwitchEncoding() that will just break out of the function if
an encoding is already set (and the flag is set, of course).
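
To show the idea, here is a rough standalone sketch of the guard. It is not
the actual libxml source or my actual patch: the struct, the flag name and
the function name are all made up for illustration, and the real
xmlSwitchEncoding() takes different arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a parser context; NOT libxml's real struct. */
typedef struct {
    const char *encoding;      /* encoding currently in force, or NULL if none */
    int keep_caller_encoding;  /* hypothetical run-time flag set by the caller */
} sketch_ctxt;

/*
 * Hypothetical analogue of the encoding switch.
 * Returns 1 if the declared encoding was applied, 0 if it was ignored.
 */
int sketch_switch_encoding(sketch_ctxt *ctxt, const char *declared)
{
    assert(ctxt != NULL && declared != NULL);

    /* The extra "if": an encoding is already set and the flag is on,
     * so break out and leave the caller's choice alone. */
    if (ctxt->keep_caller_encoding && ctxt->encoding != NULL)
        return 0;

    /* Normal behaviour: honour the declaration found in the document. */
    ctxt->encoding = declared;
    return 1;
}
```

With the flag off, or with no encoding set beforehand, the sketch behaves as
before, so normal parsing should be unaffected.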

