Re: [xml] setting the default charset ?



On Sat, Jul 28, 2001, at 02:48:01 -0400, Daniel Veillard wrote:

Mmmmh, yes, it looks like it fixes my problem. However, it's possible it
brings a deviation from standard or reasonable behaviour:
    1) I call xmlSwitchEncoding(KOI8-R);
    2) I start parsing a file
    3) That file happens to be encoded in UTF-16.

    Then you made an error. Might be due to a confusion between locale
and encoding. If you say xmlSwitchEncoding(KOI8-R) then you have to be sure
it is actually encoded in this encoding. You cannot just drop whatever locale
is there and expect this to work. In XML there is no guess work. If you guess
and you are wrong then you die, it is normal and has to stay this way.

I'm not guessing any more than the default does (which is "if there's no
encoding spec and if it looks asciish, then it must be UTF-8"). What I want
to say is "if you don't know, and if it looks asciish, then it's not UTF-8
but <foo>". This is not so much guesswork! And if the file is not asciish (EBCDIC,
UTF-16, UCS-4, whatever), then none of this logic shall apply. Oh, and since
I make the call before I ask libxml to parse, I'm taking responsibility for
asking libxml2 to slightly deviate from the standard.

    Sorry I'm gonna stick to the standard here. You MUST know the encoding
if you decide to try to override the default behaviour.

My goal is not to override the encoding as understood by the library. What I
need to override is the "this looks like asciish, I'll believe it's UTF-8
until I see an encoding="..." attribute" default (and standard) behaviour.
I've got an application where 8-bit files without encoding="..." (files
produced by libxml1) are encoded in who knows what encoding. As an
application writer, I want to take responsibility to tell libxml2 "if you
think it's an 8-bit encoding, and it's asciish, then I know what I'm asking
for, please break the standard and do what I need. But if finally the file
is not broken and has an encoding="..." specification *in the file*, then
please go back to standard behaviour."
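
The rule described above can be sketched as a standalone pre-scan over the
first bytes of a file. This is only a minimal sketch in plain C: nothing in it
is libxml2 API, classify_prolog and the ENC_* names are hypothetical, and the
signature checks are a simplification of the encoding autodetection outlined
in the XML specification.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical verdicts; none of these names exist in libxml2. */
typedef enum {
    ENC_STANDARD,           /* BOM or non-ASCII signature: normal autodetection */
    ENC_DECLARED,           /* ASCII-compatible, XML declaration names an encoding */
    ENC_FALLBACK_CANDIDATE  /* ASCII-compatible, no encoding named in a declaration */
} enc_verdict;

/* Index of the first occurrence of needle in buf[0..len), or len if absent. */
static size_t find_bytes(const unsigned char *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    for (size_t i = 0; i + n <= len; i++)
        if (memcmp(buf + i, needle, n) == 0)
            return i;
    return len;
}

/* Decide whether an application-supplied 8-bit fallback may apply. */
enc_verdict classify_prolog(const void *data, size_t len)
{
    const unsigned char *buf = data;

    /* Byte-order marks: UTF-8 (EF BB BF), UTF-16 (FE FF or FF FE). */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_STANDARD;
    if (len >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||
                     (buf[0] == 0xFF && buf[1] == 0xFE)))
        return ENC_STANDARD;

    /* A NUL among the first four bytes suggests UTF-16 or UCS-4 without a BOM. */
    for (size_t i = 0; i < len && i < 4; i++)
        if (buf[i] == 0x00)
            return ENC_STANDARD;

    /* "<?xm" as it appears in EBCDIC. */
    if (len >= 4 && memcmp(buf, "\x4C\x6F\xA7\x94", 4) == 0)
        return ENC_STANDARD;

    /* ASCII-compatible input: look for encoding inside the XML declaration only. */
    if (len >= 5 && memcmp(buf, "<?xml", 5) == 0) {
        size_t end = find_bytes(buf, len, "?>");  /* end of the declaration */
        if (find_bytes(buf, end, "encoding") < end)
            return ENC_DECLARED;
    }
    return ENC_FALLBACK_CANDIDATE;
}
```

An application would apply its own fallback (KOI8-R in the example at the top
of this message) only on ENC_FALLBACK_CANDIDATE, and hand every other file to
the parser untouched so that standard behaviour is preserved.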

Remember, my problem happens because I have to support a body of
libxml1-generated files, which are incomplete in the "encoding='...'"
department.

What *is* possible is that xmlSwitchEncoding() is not the right semantic. If
xmlSwitchEncoding() from an application point of view means 
"I know the truth, whatever you see, the encoding is <foo>, ignore the
rest", then I need another call (like
xmlAdviseDefault8BitEncodingAndBreakTheRules(bool YesIAmSure, xmlEncoding enc)). 
And then xmlSwitchEncoding() (the user-visible thing) needs to make sure the
encoding="..." attribute is subsequently ignored (is that the case right now?
I don't know).

My alternative: load the files once first, analyse them, fix them if they're
libxml1-generated ones, and then only feed them to libxml2. And I have to
drag this non-optimal behaviour forever. Yukk. 

        -- Cyrille

-- 
Grumpf.




