Re: [xml] Ignoring Character Encodings



From: "Daniel Veillard" <veillard redhat com>
So the correct thing I'd need to do is to either remove (hide) the encoding
declaration or to re-encode the doc back into its original format before
feeding it to the parser?

  Well, if you know that the documents will be declared as ISO-8859-1 but are
already converted to UTF-8, you could override the default converters for this
encoding by supplying what are basically identity converters.
   This is described on the same page I pointed you to in the previous mail:
     http://xmlsoft.org/encoding.html
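
A minimal sketch of what such an identity converter could look like, assuming
the four-argument callback shape (out, outlen, in, inlen) used by the libxml
encoding handlers. The name identity_convert and the return convention (bytes
written, -1 on bad arguments) are assumptions for illustration. Registration
with the library, for example through xmlNewCharEncodingHandler, is left out
so that the sketch compiles on its own.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical identity converter: the input is assumed to be UTF-8
 * already, so bytes are copied through unchanged.
 *
 * On entry *inlen is the number of input bytes available and *outlen
 * the capacity of the output buffer. On return *inlen holds the bytes
 * consumed and *outlen the bytes produced.
 * Returns the number of bytes written, or -1 on invalid arguments
 * (an assumed convention, not taken from the library).
 */
int identity_convert(unsigned char *out, int *outlen,
                     const unsigned char *in, int *inlen)
{
    int n;

    if (out == NULL || outlen == NULL || in == NULL || inlen == NULL)
        return -1;
    if (*outlen < 0 || *inlen < 0)
        return -1;

    /* Copy no more than the output buffer can hold. */
    n = (*inlen < *outlen) ? *inlen : *outlen;

    /*
     * If the copy is truncated, back off to a character boundary so a
     * multi-byte UTF-8 sequence is not split. Continuation bytes have
     * the bit pattern 10xxxxxx.
     */
    while (n > 0 && n < *inlen && (in[n] & 0xC0) == 0x80)
        n--;

    memcpy(out, in, (size_t)n);
    *inlen = n;
    *outlen = n;
    return n;
}
```

The same function could serve as both the input and the output converter,
since the copy is symmetric. A caller has to cope with the case where nothing
is consumed because the output buffer is smaller than one character.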


Good idea, thanks.
I don't always know that they would be in ISO-8859-1, but we certainly don't
support an infinite number of encodings! At least this way avoids the need
to modify or break the way the library works.

On the subject of encodings (a small sidetrack), how do the encoding functions
in libxml handle memory management?
In theory, converting from ISO to UTF-8 could increase the amount of memory
required to store the output by up to 4 times (I think some ISO-8859-1
characters convert to 3 or 4 byte long UTF-8 characters, don't they?)
I've had a look at the ISO-UTF encoding functions, and they seem to just assume
that there will be enough memory to store the extra bytes in the output
buffer.
And I get lost in the nest of function callbacks when trying to trace things
back.
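
On the size question: ISO-8859-1 covers only code points U+0000 to U+00FF, so
each input byte becomes at most 2 bytes of UTF-8, never 3 or 4. The worst case
is therefore a doubling. Below is a minimal sketch of a bounded converter,
assuming the same (out, outlen, in, inlen) convention as the library's
callbacks. The name latin1_to_utf8 and the return convention are hypothetical,
and this is not the library's own implementation. It stops when the output
buffer is full instead of assuming there is room.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical bounded ISO-8859-1 to UTF-8 converter.
 *
 * On entry *inlen is the number of input bytes and *outlen the output
 * capacity. On return *inlen holds the bytes consumed and *outlen the
 * bytes produced. Returns the number of bytes written, or -1 on
 * invalid arguments (an assumed convention).
 */
int latin1_to_utf8(unsigned char *out, int *outlen,
                   const unsigned char *in, int *inlen)
{
    int i = 0;  /* input bytes consumed */
    int o = 0;  /* output bytes produced */

    if (out == NULL || outlen == NULL || in == NULL || inlen == NULL)
        return -1;

    while (i < *inlen) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII range: one byte, copied unchanged. */
            if (o + 1 > *outlen)
                break;
            out[o++] = c;
        } else {
            /* 0x80..0xFF: two bytes, 110000xx 10xxxxxx. */
            if (o + 2 > *outlen)
                break;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        i++;
    }

    *inlen = i;
    *outlen = o;
    return o;
}
```

With this shape the caller can either size the output buffer at twice the
input length up front, or call the function repeatedly with whatever buffer is
available and resume from the number of bytes consumed.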

I'm only curious because if I was going to use your method, I might as well
use these new encoding functions to handle the translation from our internal
format to UTF-8 instead of doing it as a separate step beforehand - removes an
unnecessary stage in the processing. Our format is quite close to ISO-8859-1.

Thanks,
Richard

