Re: [xml] setting the default charset ?

Cyrille Chepelov wrote:

On Sat, Jul 28, 2001, at 02:48:01 -0400, Daniel Veillard wrote:

Mmmmh, yes, it looks like it fixes my problem. However, it's possible it
brings a deviation from standard or reasonable behaviour:
    1) I call xmlSwitchEncoding(KOI8-R);
    2) I start parsing a file
    3) That file happens to be encoded in UTF-16.

    Then you made an error. Might be due to a confusion between locale
and encoding. If you say xmlSwitchEncoding(KOI8-R) then you have to be sure
it is actually encoded in this encoding. You cannot just drop whatever locale
is there and expect this to work. In XML there is no guess work. If you guess
and you are wrong then you die, it is normal and has to stay this way.

I'm not guessing any more than the default (which is "if there's no encoding
spec and if it looks asciish, then it must be UTF-8"). What I want to say is
"if you don't know, and if it looks asciish, then it's not UTF-8 but <foo>".
This is not so much guesswork ! And if the file is not asciish (EBCDIC,
UTF-16, UCS-4, whatever), then none of this logic shall apply. Oh, and since
I make the call before I ask libxml to parse, I'm taking responsibility for
asking libxml2 to slightly deviate from the standard.

    Sorry I'm gonna stick to the standard here. You MUST know the encoding
if you decide to try to override the default behaviour.

My goal is not to override the encoding as understood by the library. What I
need to override is the "this looks like asciish, I'll believe it's UTF-8
until I see an encoding="..." attribute" default (and standard) behaviour.
I've got an application where 8-bit files without encoding="..." (files
produced by libxml1) are encoded in Who knows what encoding. As an
application writer, I want to take responsibility to tell libxml2 "if you
think it's an 8-bit encoding, and it's asciish, then I know what I'm asking
for, please break the standard and do what I need. But if finally the file
is not broken and has an encoding="..." specification *in the file*, then
please go back to standard behaviour."
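
To make the requested rule concrete, here is a minimal sketch of the decision
itself, written without any libxml calls. The name wants_fallback_encoding()
is hypothetical and nothing like it exists in libxml2; the function only looks
at the first bytes of a buffer and reports whether an application-supplied
default would apply (ASCII-looking 8-bit input with no encoding="..." in the
XML declaration). It is an illustration under those assumptions, not a
complete implementation of the autodetection rules in the XML specification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a libxml2 function.
 * Returns 1 if the input looks like an ASCII-compatible 8-bit document
 * that does not declare its encoding, 0 if standard behaviour should
 * apply (BOM present, UTF-16/UCS-4/EBCDIC detected, or encoding declared). */
int wants_fallback_encoding(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i;

    assert(buf != NULL);

    /* A byte order mark settles the question (UTF-8, UTF-16, UCS-4 LE). */
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return 0;
    if (len >= 2 && ((p[0] == 0xFE && p[1] == 0xFF) ||
                     (p[0] == 0xFF && p[1] == 0xFE)))
        return 0;

    /* A NUL among the first four bytes: UTF-16 or UCS-4 without a BOM. */
    for (i = 0; i < len && i < 4; i++)
        if (p[i] == 0)
            return 0;

    /* "<?xm" as it appears in EBCDIC. */
    if (len >= 4 && memcmp(p, "\x4C\x6F\xA7\x94", 4) == 0)
        return 0;

    /* No XML declaration at all: nothing in the file names an encoding. */
    if (len < 6 || memcmp(p, "<?xml", 5) != 0 ||
        (p[5] != ' ' && p[5] != '\t' && p[5] != '\r' && p[5] != '\n'))
        return 1;

    /* Scan the declaration for an encoding pseudo-attribute. */
    for (i = 5; i + 1 < len; i++) {
        if (p[i] == '?' && p[i + 1] == '>')
            return 1;               /* declaration ended without one */
        if (len - i >= 8 && memcmp(p + i, "encoding", 8) == 0)
            return 0;               /* the file speaks for itself */
    }
    return 0;   /* declaration not terminated in this buffer: stay standard */
}
```

An application would call this on the first few hundred bytes of a file and
only consider its own default when the result is 1; in every other case the
parser's standard behaviour is left alone.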

Remember, my problem happens because I have to support a body of
libxml1-generated files, which are incomplete in the "encoding='...'"

What *is* possible is that xmlSwitchEncoding() is not the right semantic. If
xmlSwitchEncoding() from an application point of view means
"I know the truth, whatever you see, the encoding is <foo>, ignore the
rest", then I need another call (like
xmlAdviseDefault8BitEncodingAndBreakTheRules(bool YesIAmSure, xmlEncoding enc)).
And then xmlSwitchEncoding() (the user-visible thing) needs to make sure the
encoding="..." attribute is subsequently ignored (is it the case right now ?
I don't know ; I'm not sure).

My alternative: load the files once first, analyse them, fix them if they're
libxml1-generated ones, and then only feed them to libxml2. And I have to
drag this non-optimal behaviour forever. Yukk.
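
For what it is worth, the "fix them first" route could look roughly like the
sketch below. add_missing_encoding() is a hypothetical name, not a libxml
call. It assumes the caller already knows the input is an ASCII-compatible
8-bit file (no BOM, not UTF-16) and knows which legacy encoding to declare.
It inserts encoding="..." right after the version value, or prepends a whole
declaration when there is none, and copies everything else untouched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical pre-processing helper, not part of libxml2.
 * Copies `in` to `out`, declaring `enc` if the document does not declare
 * an encoding itself. Returns the number of bytes written, or 0 if `out`
 * is too small. The output is not NUL-terminated. */
size_t add_missing_encoding(const char *in, size_t len, const char *enc,
                            char *out, size_t cap)
{
    char insert[128] = "";
    size_t split = 0;           /* offset in `in` where `insert` goes */
    size_t end = 0, i;
    int n = 0;

    assert(in != NULL && enc != NULL && out != NULL);

    if (len >= 6 && memcmp(in, "<?xml", 5) == 0 &&
        (in[5] == ' ' || in[5] == '\t' || in[5] == '\r' || in[5] == '\n')) {
        /* Locate the "?>" that closes the declaration. */
        for (i = 5; i + 1 < len; i++)
            if (in[i] == '?' && in[i + 1] == '>') { end = i; break; }

        if (end != 0) {
            int declared = 0;
            for (i = 5; i + 8 <= end; i++)
                if (memcmp(in + i, "encoding", 8) == 0)
                    declared = 1;

            if (!declared) {
                /* The first quoted value in the declaration is the version. */
                size_t open = 5, close;
                while (open < end && in[open] != '"' && in[open] != '\'')
                    open++;
                close = open + 1;
                while (close < end && in[close] != in[open])
                    close++;
                if (close < end) {
                    split = close + 1;
                    n = snprintf(insert, sizeof insert,
                                 " encoding=\"%s\"", enc);
                }
            }
        }
    } else {
        /* No declaration at all: prepend a complete one. */
        n = snprintf(insert, sizeof insert,
                     "<?xml version=\"1.0\" encoding=\"%s\"?>\n", enc);
    }

    if (n < 0 || (size_t)n >= sizeof insert || len + (size_t)n > cap)
        return 0;

    memcpy(out, in, split);
    memcpy(out + split, insert, (size_t)n);
    memcpy(out + split + (size_t)n, in + split, len - split);
    return len + (size_t)n;
}
```

Files that already carry an encoding="..." come out byte-for-byte identical,
so the pass is safe to run over a mixed collection; the cost is exactly the
extra read-and-rewrite step complained about above.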

This is getting into the area where I was last month with HTML and
libxml. I think the existing override should most certainly work in the
way it does now - with no ambiguity. However, if you have a lot of
broken XML to process, you clearly need more than that.

When this happened to me with HTML it was clear that libxml should
provide the extra things needed. Broken HTML is the norm, and not the
exception. Everyone faces exactly the same set of problems trying to
cope with it (at least they do if they handle a wide cross-section of
the world's HTML). Here I am not so sure. There could be many broken
forms of XML around, and they are all just plain _wrong_. XML is well
enough defined that these things should not have happened. I'm not
convinced that libxml has any place working around broken XML. Maybe you
need to locally patch libxml to tolerate the particular garbage you are
throwing at it.

