[xml] Character encoding fixes



Hi Daniel, All,

I have my ISO-8859-* character conversion ready, but I'm still struggling
with the elimination of 'native' ISO-8859-2 etc. handling, i.e. the legacy
behaviour of using a non-UTF-8 internal representation.

*** ctxt->charset

To simplify matters, I first propose the (near) elimination of
ctxt->charset. As it always holds XML_CHAR_ENCODING_UTF8 when
a non-enumerated encoding is handled, its significance cannot be
that high.

My current working hypothesis is that it actually specifies libxml's
internal encoding, and as we want to finally get rid of all non-UTF8
internal encodings, ctxt->charset can be eliminated. In that process
some zombie else clauses for non-UTF8 internal encodings can
be eliminated too.

The only place where it may have significance is at "case XML_PARSER_START"
in parser.c. If there is a valid case for not autodetecting encoding from 
the first four bytes, ctxt->charset can be used as a flag for this purpose.
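
To illustrate the idea, here is a minimal standalone sketch of a flag that
suppresses autodetection. It is not libxml2 code: the enum, both function
names and the "forced" parameter are made up for illustration, and the byte
patterns are the ones listed in Appendix F of the XML 1.0 specification.
UCS-4 byte order marks are left out to keep it short.

```c
#include <stddef.h>

/* Hypothetical stand-in for the parser's encoding enumeration. */
typedef enum {
    SKETCH_ENC_NONE = 0,   /* nothing detected / nothing forced */
    SKETCH_ENC_UTF8,
    SKETCH_ENC_UTF16LE,
    SKETCH_ENC_UTF16BE,
    SKETCH_ENC_UCS4LE,
    SKETCH_ENC_UCS4BE,
    SKETCH_ENC_EBCDIC
} sketch_encoding;

/* Guess the encoding from the first bytes of the input. */
static sketch_encoding
sketch_detect(const unsigned char *in, size_t len)
{
    if (in == NULL)
        return SKETCH_ENC_NONE;
    if (len >= 4) {
        /* Patterns without a byte order mark, based on "<" or "<?xm". */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x3C)
            return SKETCH_ENC_UCS4BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x00)
            return SKETCH_ENC_UCS4LE;
        if (in[0] == 0x4C && in[1] == 0x6F && in[2] == 0xA7 && in[3] == 0x94)
            return SKETCH_ENC_EBCDIC;
        if (in[0] == 0x3C && in[1] == 0x3F && in[2] == 0x78 && in[3] == 0x6D)
            return SKETCH_ENC_UTF8;   /* ASCII-compatible, assume UTF-8 */
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x3F && in[3] == 0x00)
            return SKETCH_ENC_UTF16LE;
        if (in[0] == 0x00 && in[1] == 0x3C && in[2] == 0x00 && in[3] == 0x3F)
            return SKETCH_ENC_UTF16BE;
    }
    /* Byte order marks. */
    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return SKETCH_ENC_UTF8;
    if (len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF)
            return SKETCH_ENC_UTF16BE;
        if (in[0] == 0xFF && in[1] == 0xFE)
            return SKETCH_ENC_UTF16LE;
    }
    return SKETCH_ENC_NONE;
}

/*
 * "forced" plays the role proposed for ctxt->charset: anything other
 * than SKETCH_ENC_NONE means "do not autodetect, use this encoding".
 */
static sketch_encoding
sketch_initial_encoding(sketch_encoding forced,
                        const unsigned char *in, size_t len)
{
    if (forced != SKETCH_ENC_NONE)
        return forced;
    return sketch_detect(in, len);
}
```

With this shape, the field carries no information about the internal
representation any more; it only records whether the caller overrode
detection.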

*** xmlParseCharEncoding((const char *) encoding)

The use of this function is commented as "registered
set of known encodings", but when it detects that the
encoding is "known", the semantics of this are rather
weak. It only means that libxml2 has assigned the encoding an
enumeration value; it doesn't mean that libxml2 can actually
handle it.

What to do with this function?

a) Leave as is for unknown clients but never call it from libxml2.

b) Make it return "known encoding" only for the encodings
libxml2 can actually handle without registered handlers.

c) Make it return "known encoding" for encodings libxml2
can handle natively or due to registered handlers (but then
there's no difference to xmlGetCharEncodingHandler).
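
The gap between "has an enumeration value" and "can actually be handled"
can be shown with a small standalone model. Nothing here is libxml2 API:
the table, the enum and both functions are hypothetical, and which entries
have a handler is an arbitrary assumption chosen to show the mismatch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical enumeration; SK_ENC_ERROR means "name not enumerated". */
typedef enum {
    SK_ENC_ERROR = -1,
    SK_ENC_UTF8 = 1,
    SK_ENC_8859_2,
    SK_ENC_EUC_JP
} sk_enc;

typedef struct {
    const char *name;
    sk_enc id;
    int has_handler;   /* assumption: 1 if a converter is available */
} sk_entry;

static const sk_entry sk_table[] = {
    { "UTF-8",      SK_ENC_UTF8,   1 },
    { "ISO-8859-2", SK_ENC_8859_2, 1 },
    { "EUC-JP",     SK_ENC_EUC_JP, 0 }   /* enumerated, but no converter */
};

#define SK_TABLE_LEN (sizeof(sk_table) / sizeof(sk_table[0]))

/* Weak semantics: "known" only means the name maps to an enum value.
 * Exact string match only, to keep the sketch short. */
static sk_enc
sk_parse_name(const char *name)
{
    size_t i;

    if (name == NULL)
        return SK_ENC_ERROR;
    for (i = 0; i < SK_TABLE_LEN; i++) {
        if (strcmp(name, sk_table[i].name) == 0)
            return sk_table[i].id;
    }
    return SK_ENC_ERROR;
}

/* Stronger semantics, as in options b) and c): "known" means convertible. */
static int
sk_can_handle(const char *name)
{
    size_t i;

    if (name == NULL)
        return 0;
    for (i = 0; i < SK_TABLE_LEN; i++) {
        if (strcmp(name, sk_table[i].name) == 0)
            return sk_table[i].has_handler;
    }
    return 0;
}
```

In this model "EUC-JP" is accepted by sk_parse_name() yet rejected by
sk_can_handle(), which is exactly the case where a caller relying on the
weak check would go wrong.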

Option a), retiring it, seems best.

Regards,
Peter Jacobi






