Re: [xml] Character encoding fixes



On Mon, Jul 14, 2003 at 11:16:31AM +0200, Peter Jacobi wrote:
Hi Daniel, All,

I have my ISO-8859-* character conversion ready, but I'm still struggling
with the elimination of 'native' ISO-8859-2 etc. handling, i.e. the legacy
behaviour of using a non-UTF-8 internal representation.

  Well, the parser should never generate a non-UTF-8 internal representation.
The only way to run into trouble with this at the moment is if one modifies
an existing tree with non-UTF-8 encoded strings, or did I miss something?
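
To illustrate the failure mode described above: a caller holding an
ISO-8859-1 string has to convert it to UTF-8 before putting it into the
tree. The following is a minimal, self-contained sketch of that conversion.
The helper name latin1_to_utf8 is hypothetical and not part of libxml2;
libxml2 ships its own converter, isolat1ToUTF8(), which is deliberately not
used here so the example compiles without the library.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of libxml2: convert a NUL-terminated
 * ISO-8859-1 string into a newly allocated, NUL-terminated UTF-8 string.
 * Returns NULL on allocation failure; the caller frees the result. */
unsigned char *latin1_to_utf8(const unsigned char *in)
{
    size_t len = 0;
    while (in[len] != 0)
        len++;

    /* Worst case: every input byte expands to two output bytes. */
    unsigned char *out = malloc(2 * len + 1);
    if (out == NULL)
        return NULL;

    unsigned char *p = out;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            *p++ = c;                                   /* ASCII maps to itself */
        } else {
            *p++ = (unsigned char) (0xC0 | (c >> 6));   /* lead byte  110000xx */
            *p++ = (unsigned char) (0x80 | (c & 0x3F)); /* trail byte 10xxxxxx */
        }
    }
    *p = 0;
    return out;
}
```

Only the converted buffer should then be handed to the tree-modifying
calls; passing the raw ISO-8859-1 bytes is what produces a tree that is
no longer valid UTF-8.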

*** ctxt->charset

To simplify matters I first propose the (near) elimination of
ctxt->charset. As it always holds XML_CHAR_ENCODING_UTF8 when
a non-enumerated encoding is handled, its significance cannot be
that high.

  Hum, it will have to remain in the structure for compatibility at least.
I'm also wondering if it isn't used for "progressive detection", for example
at the point where the parser has detected that the encoding is
ASCII-compatible but has not yet read the encoding declaration.

My current working hypothesis is that it actually specifies libxml's
internal encoding, and as we want to finally get rid of all non-UTF8
internal encodings, ctxt->charset can be eliminated. In that process
some zombie else clauses for non-UTF8 internal encodings can
be eliminated too.

  This is true in general, but we need to be cautious about correctness there.

The only place where it may have significance is at "case XML_PARSER_START"
in parser.c. If there is a valid case for not autodetecting encoding from 
the first four bytes, ctxt->charset can be used as a flag for this purpose.

  Yes, and that's the case quite often.
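
For readers unfamiliar with the four-byte autodetection mentioned above,
here is a rough sketch of the idea, following the byte patterns listed in
Appendix F of the XML 1.0 specification. The enum and function names are
hypothetical and this is not the code from parser.c; libxml2's own values
live in its xmlCharEncoding enum.

```c
#include <stddef.h>

/* Hypothetical enumeration for this sketch only. */
typedef enum {
    SKETCH_ENC_NONE = 0,   /* nothing could be detected */
    SKETCH_ENC_UTF8,
    SKETCH_ENC_UTF16LE,
    SKETCH_ENC_UTF16BE,
    SKETCH_ENC_UCS4LE,
    SKETCH_ENC_UCS4BE,
    SKETCH_ENC_EBCDIC
} sketch_encoding;

/* Guess the encoding family from the first bytes of a document. */
sketch_encoding sketch_detect_encoding(const unsigned char *in, size_t len)
{
    if (in == NULL)
        return SKETCH_ENC_NONE;

    if (len >= 4) {
        /* Four-byte patterns go first so that the UCS-4 forms are not
         * mistaken for the shorter UTF-16 byte order marks. */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x3C)
            return SKETCH_ENC_UCS4BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x00)
            return SKETCH_ENC_UCS4LE;
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0xFE && in[3] == 0xFF)
            return SKETCH_ENC_UCS4BE;
        if (in[0] == 0xFF && in[1] == 0xFE && in[2] == 0x00 && in[3] == 0x00)
            return SKETCH_ENC_UCS4LE;
        if (in[0] == 0x4C && in[1] == 0x6F && in[2] == 0xA7 && in[3] == 0x94)
            return SKETCH_ENC_EBCDIC;
        /* "<?xm" in an ASCII-compatible encoding. UTF-8 is only a
         * provisional guess until the encoding declaration is read. */
        if (in[0] == 0x3C && in[1] == 0x3F && in[2] == 0x78 && in[3] == 0x6D)
            return SKETCH_ENC_UTF8;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x3F && in[3] == 0x00)
            return SKETCH_ENC_UTF16LE;
        if (in[0] == 0x00 && in[1] == 0x3C && in[2] == 0x00 && in[3] == 0x3F)
            return SKETCH_ENC_UTF16BE;
    }
    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return SKETCH_ENC_UTF8;            /* UTF-8 byte order mark */
    if (len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF)
            return SKETCH_ENC_UTF16BE;     /* UTF-16 byte order mark */
        if (in[0] == 0xFF && in[1] == 0xFE)
            return SKETCH_ENC_UTF16LE;
    }
    return SKETCH_ENC_NONE;
}
```

A caller that already knows the encoding from an external source (the case
discussed above for not autodetecting) would simply skip this call and
install its converter directly.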

o xmlParseCharEncoding((const char *) encoding)

The use of this function is commented as checking the "registered
set of known encodings", but when it reports that an encoding is
"known", the semantics are rather weak. It only means that libxml2
has assigned the encoding an enumeration value; it does not mean
that libxml2 can actually handle it.
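
The gap between "known" and "handled" can be made concrete with a small
sketch. Everything below is hypothetical (the sk_ names, the table, and the
choice of EUC-JP as the enumerated-but-unconvertible entry are assumptions
for illustration, not libxml2's actual tables): one lookup answers "does
this name have an enumeration value?", the other answers "can we actually
convert it?".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical enumeration and table, for illustration only. */
typedef enum {
    SK_ENC_ERROR = -1,
    SK_ENC_UTF8 = 1,
    SK_ENC_8859_1,
    SK_ENC_EUC_JP
} sk_encoding;

typedef struct {
    const char *name;      /* encoding name, matched exactly in this sketch */
    sk_encoding id;        /* enumeration value assigned to the name */
    int has_converter;     /* nonzero if a converter is really available */
} sk_entry;

static const sk_entry sk_table[] = {
    { "UTF-8",      SK_ENC_UTF8,   1 },
    { "ISO-8859-1", SK_ENC_8859_1, 1 },
    { "EUC-JP",     SK_ENC_EUC_JP, 0 },  /* enumerated, but no converter */
};

static const sk_entry *sk_find(const char *name)
{
    size_t n = sizeof(sk_table) / sizeof(sk_table[0]);

    if (name == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        if (strcmp(sk_table[i].name, name) == 0)
            return &sk_table[i];
    return NULL;
}

/* "Known" in the weak sense: the name has an enumeration value. */
sk_encoding sk_parse_encoding(const char *name)
{
    const sk_entry *e = sk_find(name);
    return e != NULL ? e->id : SK_ENC_ERROR;
}

/* "Handled" in the strong sense: a converter is actually available. */
int sk_can_convert(const char *name)
{
    const sk_entry *e = sk_find(name);
    return e != NULL && e->has_converter;
}
```

In this sketch sk_parse_encoding("EUC-JP") succeeds while
sk_can_convert("EUC-JP") reports failure, which is the weak semantics
described above.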

What to do with this function?

  Keep it around until it's clear that it is neither used nor needed anymore.

a) Leave as is for unknown clients but never call it from libxml2

b) Make it return "known encoding" only for the encodings
libxml2 can actually handle without registered handlers?

c) Make it return "known encoding" for encodings libxml2
can handle natively or due to registered handlers (but then
there's no difference to xmlGetCharEncodingHandler).

Seems a) - retiring is best.

  Yes, sounds right, but this does not mean removal. Once we are sure it's
deprecated, move it to parserInternals.c and have it return an error code
like the other deprecated functions there.
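
As a rough sketch of the deprecation pattern described above: the symbol is
kept so that existing clients still link, but it warns once and returns an
error code. All names here are hypothetical and this is not taken from
parserInternals.c; the warning counter exists only so the behaviour can be
checked.

```c
#include <stdio.h>

/* Hypothetical names throughout; this is not libxml2 source. */
#define SKETCH_ENCODING_ERROR (-1)

/* Counts emitted warnings so the warn-once behaviour is observable. */
static int sketch_warnings_emitted = 0;

/* Deprecated stub: the symbol stays for binary compatibility, but it no
 * longer does any work and always reports an error to the caller. */
int sketch_deprecated_parse_encoding(const char *name)
{
    static int warned = 0;

    (void) name;  /* the argument is intentionally ignored */
    if (!warned) {
        fprintf(stderr,
                "sketch_deprecated_parse_encoding() deprecated function reached\n");
        warned = 1;
        sketch_warnings_emitted++;
    }
    return SKETCH_ENCODING_ERROR;
}
```

Warning only on the first call keeps old clients from flooding stderr
while still making the deprecation visible.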

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


