Re: [xml] setting the default charset ?



On Sun, Jul 29, 2001 at 01:34:41 +0800, Steve Underwood wrote:


This is getting into the area where I was last month with HTML and
libxml. I think the existing override should most certainly work in the
way it does now - with no ambiguity. However, if you have a lot of
broken XML to process, you clearly need more than that.

When this happened to me with HTML it was clear that libxml should
provide the extra things needed. Broken HTML is the norm, and not the
exception. Everyone faces exactly the same set of problems trying to
cope with it (at least they do if they handle a wide cross-section of
the world's HTML). Here I am not so sure. There could be many broken
forms of XML around, and they are all just plain _wrong_. XML is well
enough defined that these things should not have happened. I'm not
convinced that libxml has any place working around broken XML. Maybe you
need to locally patch libxml to tolerate the particular garbage you are
throwing at it.

OK, my previous post was a bit inflammatory. Let me summarise my situation,
my wishes and the solutions I see:

0) Definitions
--------------

I call "asciish" an 8-bit charset whose lower 7-bit part more or less
matches ASCII. In an XML context, if an XML file begins with the four bytes
0x3c, 0x3f, 0x78, 0x6d, then in my book it's asciish. ISO 8859, UTF-8 and
ECMA-102 fall into this category.
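
As a minimal sketch of that definition (the function name looks_asciish is
made up for illustration, it is not part of libxml), the four-byte test can
be written as:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer starts with 0x3c 0x3f 0x78 0x6d (the bytes of
 * "<?xm" in ASCII), i.e. the stream is "asciish" in the sense above.
 * Returns 0 otherwise, including when fewer than four bytes are given. */
static int looks_asciish(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = { 0x3c, 0x3f, 0x78, 0x6d };

    assert(buf != NULL);
    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}
```

Note that this only says the stream is asciish; it says nothing about
which asciish charset it is.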

The problem with asciish is that you can't automatically detect which
flavour is used; you have to store that information somewhere if you want to
read your data stream back unambiguously. In XML, this role is filled by the
encoding="..." specification.

1) Situation
------------
  The application I'm working on is dia, a diagram package which is now more
or less part of GNOME. Of the 1244 computers running Debian which respond to
the Popularity Contest (PopCon) survey, 73 run dia actively and 211 casually.
Most other distributions carry dia, including non-Linux ones (we regularly
get reports from Solaris and Windows users). Many people are running
outdated versions, so while I plead shared guilt for not having switched
from libxml1 earlier, we'll have to deal with libxml1-generated files for a
while. And yes, this means libxml1-generated "garbage".

  Libxml1 (at least the way we used it) generated XML files with no
encoding="..." specification, and didn't care at all what flavour of asciish
was used. In dia's context, what happened is that we wrote lots of files
encoded in whatever 8-bit charset each user happened to be running dia in.
So we have 8859-1 files, 8859-2, KOI8-R, Big5 and several other encodings,
and we have no way to tell them apart. What we can do, however, is to assume
that if a user loads a file without encoding specification, then he wrote
the file, and it's probable that the file has been encoded using the same
"local" charset as the one he's running dia in today. This is not an
assumption I want libxml2 to make on my behalf; it's an assumption I wish
libxml2 would accept when I tell it to make it.
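
For illustration, here is one way an application could find out that
"local" charset on a POSIX system. This is only a sketch: local_charset is
a made-up helper name, the "US-ASCII" fallback is an arbitrary choice, and
nl_langinfo() is not available on every platform dia runs on (Windows would
need something else).

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>

/* Returns the charset name of the user's current locale, for example
 * "ISO-8859-1" or "KOI8-R". Side effect: switches LC_CTYPE to the locale
 * named by the environment. */
static const char *local_charset(void)
{
    const char *cs;

    setlocale(LC_CTYPE, "");      /* pick up LANG / LC_ALL / LC_CTYPE */
    cs = nl_langinfo(CODESET);    /* POSIX: never NULL, but may be "" */
    assert(cs != NULL);
    return cs[0] != '\0' ? cs : "US-ASCII";   /* arbitrary fallback */
}
```

The returned name is what the application would hand to the parser as the
assumed encoding of a file that carries no encoding specification.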

2) Wishes
---------
With Daniel's latest patch, libxml2 has two modes of action:
        1) (normal behaviour) it assumes the XML files are not broken.
        2) xmlSwitchEncoding() has been called before parsing; in that
        case libxml2 won't try to detect the encoding, and will follow
        whatever it has been told. However, this won't prevent
        xmlParseEncodingDecl() from being run and from overriding what the
        user asked for (this is not very consistent).

What I think would be nice to have:
        * discourage the use of xmlSwitchEncoding() by applications.
        * provide a pair of functions:
                - xmlAdvise8BitDefaultEncoding()  (I'd even s/8Bit/Asciish/)
                - xmlForceEncoding()
        which have to be run on a context before parsing takes place.

xmlForceEncoding() would do what xmlSwitchEncoding() is supposed to do (from
an application's point of view): disable libxml's encoding detection routines,
and use whatever the user supplied, because the user knows better.

xmlAdvise8BitDefaultEncoding() would be used to say "if it's an 8-bit
asciish encoding, then don't assume it's UTF-8; assume <this encoding>
instead. However, if the EncodingDecl is present, follow it." [*]

Both of these calls mean the application takes responsibility for deviating
from the standard; if that would appease concerns, these calls could be
made to work only if an "int magic" parameter was equal to a magic constant
(defined like:
        #define YES_BREAK_THE_STANDARD 21354
). So, the usage would be something like:
        xmlAdvise8BitDefaultEncoding(YES_BREAK_THE_STANDARD, enc);
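
To make the intended precedence explicit, here is a sketch of the decision
the two proposed calls would lead to. None of this is libxml2 code: struct
enc_policy and effective_encoding are hypothetical names, the two fields
stand for whatever xmlForceEncoding() and xmlAdvise8BitDefaultEncoding()
would record on the context, and the magic parameter is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-context settings; not part of libxml2. */
struct enc_policy {
    const char *forced;   /* from the proposed xmlForceEncoding(), or NULL */
    const char *advised;  /* from the proposed xmlAdvise8BitDefaultEncoding(),
                             or NULL */
};

/* declared: encoding named by the EncodingDecl, or NULL if there is none.
 * asciish:  non-zero if the stream starts with the bytes of "<?xm".
 * detected: what autodetection would pick on its own (e.g. "UTF-8"). */
static const char *effective_encoding(const struct enc_policy *p,
                                      const char *declared,
                                      int asciish,
                                      const char *detected)
{
    assert(p != NULL);
    if (p->forced != NULL)
        return p->forced;    /* the application knows better, full stop */
    if (declared != NULL)
        return declared;     /* an EncodingDecl always beats the advice */
    if (asciish && p->advised != NULL)
        return p->advised;   /* replaces only the UTF-8 default */
    return detected;
}
```

The difference between the two calls is just the order of the tests:
forcing wins over the EncodingDecl, advising loses to it.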

[*] What is an HTML agent supposed to do when it's been told a stream is
in encoding "foo" but a <META> tag tells it it's in "bar"? It looks like a
very similar situation to me (with the difference that, as you point out, in
the HTML world garbage is the norm).

3) Solutions
------------
Currently, I have a few options and non-options:
        * Have a private fork of libxml with the behaviour above. This is
totally out of the question. It's not some in-house private app that I have to
make accept a known body of broken data, but a GNOME and GNU app.
        * Hope the above wish finally makes sense and gets integrated.
        * For each file the user wants to open, open it first, and check its
contents. If it's asciish and it lacks the EncodingDecl, then fix the file
on the fly (making whatever assumption I think fits) and have libxml open
the temp file. Otherwise, hand the file over to libxml.
        This means I have to build a parser for the <?xml ... ?> stuff in my
application. This also means that, whether or not the file to be loaded is
broken, I have to first gzopen() it, then close it, then have libxml
re-open it. Basically, this sucks, but if I do this, then I don't need to
wait for yesterday's patch to become mainstream (if I have to do the laundry
myself, I'll do it whole). But I'll surely spend the time coding that
support swearing and cursing against the people responsible for that
situation <grin/>.
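
A rough sketch of that last option, working on the start of the file
already read into a NUL-terminated buffer (the gzopen() and temp-file parts
are left out). Both function names are made up for illustration, and the
"parser" is deliberately naive: it only looks for the word "encoding"
before the closing "?>" and inserts the missing attribute right after the
version value.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the XML declaration at the start of buf names an encoding,
 * 0 if there is a declaration without one, and -1 if buf does not start
 * with a complete "<?xml ... ?>" declaration. */
static int xml_decl_has_encoding(const char *buf)
{
    const char *end, *enc;

    assert(buf != NULL);
    if (strncmp(buf, "<?xml", 5) != 0 || !isspace((unsigned char)buf[5]))
        return -1;                     /* also rejects "<?xml-stylesheet" */
    end = strstr(buf, "?>");
    if (end == NULL)
        return -1;
    enc = strstr(buf + 5, "encoding");
    return (enc != NULL && enc < end) ? 1 : 0;
}

/* Copies buf to out with encoding="<enc>" inserted after the version value.
 * Returns 0 on success, -1 if there is nothing to fix, the declaration
 * cannot be understood, or out is too small. */
static int add_encoding_decl(const char *buf, const char *enc,
                             char *out, size_t outsz)
{
    const char *end, *ver, *openq, *closeq;
    size_t head;
    int n;

    if (xml_decl_has_encoding(buf) != 0)
        return -1;
    end = strstr(buf, "?>");           /* non-NULL: checked just above */
    ver = strstr(buf, "version");
    if (ver == NULL || ver > end)
        return -1;
    openq = strpbrk(ver, "\"'");       /* opening quote of the version value */
    if (openq == NULL || openq > end)
        return -1;
    closeq = strchr(openq + 1, *openq);    /* matching closing quote */
    if (closeq == NULL || closeq > end)
        return -1;
    head = (size_t)(closeq - buf) + 1;     /* up to and including that quote */
    n = snprintf(out, outsz, "%.*s encoding=\"%s\"%s",
                 (int)head, buf, enc, buf + head);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Inserting right after the version value keeps the attribute order the XML
declaration requires (version, then encoding, then standalone).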

        -- Cyrille

-- 
Grumpf.




