Re: [xml] i18n limitations in libxml2



Thomas Broyer wrote:

> On 08/06/01 04:18:05, Steve Underwood wrote:
> > The meta tag processing code specifically ignores any language selection
> > if an explicit textual language selection has already been made for the
> > current document.

> Just to avoid misunderstandings, we are talking about character encodings,
> not languages.

Yes, we are talking about encoding. I was being sloppy.
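
For concreteness: as far as I can tell, the "explicit selection" is the
encoding argument a caller hands to libxml2's HTML parser. A rough sketch
(the buffer and the encoding name are made-up examples):

    #include <libxml/HTMLparser.h>

    void parse_page(const char *page_bytes)
    {
        /* Explicit encoding: the META Content-Type processing leaves
         * the encoding alone, since the choice was already made. */
        htmlDocPtr forced = htmlParseDoc(BAD_CAST page_bytes, "ISO-8859-1");

        /* No explicit encoding: a META http-equiv="Content-Type"
         * charset can take effect when the parser sees one. */
        htmlDocPtr sniffed = htmlParseDoc(BAD_CAST page_bytes, NULL);

        if (forced != NULL)
            xmlFreeDoc(forced);
        if (sniffed != NULL)
            xmlFreeDoc(sniffed);
    }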

> > Can someone explain the logic of this? The normal behaviour of HTML
> > parsers is to allow language meta tags to override the initial language
> > setting of the document,

> Actually that's not true.
> In <http://www.w3.org/TR/html4/charset.html>, 5.2.2 Specifying the
> character encoding:
>   “To sum up, conforming user agents must observe the following priorities
>    when determining a document's character encoding (from highest priority
>    to lowest):
>       1. An HTTP "charset" parameter in a "Content-Type" field.
>       2. A META declaration with "http-equiv" set to "Content-Type" and
>          a value set for "charset".
>       3. The charset attribute set on an element that designates an
>          external resource.
>    In addition to this list of priorities, the user agent may use
>    heuristics and user settings. For example, many user agents use a
>    heuristic to distinguish the various encodings used for Japanese text.
>    Also, user agents typically have a user-definable, local default
>    character encoding which they apply in the absence of other indicators.”

I must admit I didn't actually look at the specs. I'm so used to dealing
with the messiness of real world pages and browsers that I didn't think
of doing that :(  In practice, real pages commonly expect the opposite
behaviour - that the meta tag will override the HTTP header - and
browsers tend to give it to them. At least I now see the logic in the
libxml2 code.
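
The priority list quoted above is mechanical enough to write down. As a
sketch in C (not actual libxml2 code - every name here is hypothetical):

    /* Pick a document's encoding per HTML 4, section 5.2.2. Each
     * argument is what that source yielded, or NULL if it is absent. */
    const char *
    choose_encoding(const char *http_charset,  /* 1. HTTP Content-Type */
                    const char *meta_charset,  /* 2. META http-equiv   */
                    const char *attr_charset,  /* 3. charset attribute */
                    const char *user_default)  /* user-set fallback    */
    {
        if (http_charset != NULL)
            return http_charset;
        if (meta_charset != NULL)
            return meta_charset;
        if (attr_charset != NULL)
            return attr_charset;
        /* Heuristics and user settings rank below all three. */
        return user_default;
    }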

The heuristic approach is indeed widely used with Japanese, but doesn't
work well with other languages. For example, several programs exist to
distinguish between various Asian encodings. They do very well on a
fair-sized text, but rather poorly on small pieces. A very brief HTML page
can be short enough to fool them, whether or not you strip tags before
the analysis.
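
To see why short inputs defeat them, consider the simplest possible
detector - a "could this be UTF-8?" check (my own sketch, not taken from
any of those programs; detectors for EUC-CN, Big5, EUC-JP, Shift_JIS and
so on score byte statistics in the same spirit). The verdict rests
entirely on multibyte sequences, and a brief page may contain almost none:

    #include <stddef.h>

    static int looks_like_utf8(const unsigned char *buf, size_t len)
    {
        size_t i = 0, j, extra, multibyte = 0;

        while (i < len) {
            unsigned char c = buf[i];

            if (c < 0x80) { i++; continue; }         /* plain ASCII */
            else if ((c & 0xE0) == 0xC0) extra = 1;  /* 2-byte sequence */
            else if ((c & 0xF0) == 0xE0) extra = 2;  /* 3-byte sequence */
            else if ((c & 0xF8) == 0xF0) extra = 3;  /* 4-byte sequence */
            else return 0;                           /* invalid lead byte */

            if (i + extra >= len)
                return 0;                            /* truncated sequence */
            for (j = 1; j <= extra; j++)
                if ((buf[i + j] & 0xC0) != 0x80)
                    return 0;                        /* bad continuation */
            multibyte++;
            i += extra + 1;
        }
        /* A short, mostly-ASCII page gives multibyte == 0: no evidence
         * either way, which is exactly the failure mode above. */
        return multibyte > 0;
    }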

A _huge_ number of real world pages simply rely on the browser default
encoding being the right one - especially Chinese pages, and especially
those produced with MS tools. If you aren't displaying the HTML in a
browser (as I am not right now), this makes handling the page very
woolly and unpredictable. This much I have no choice but to live with.

> > and any previous language meta tags.

> This is not defined by HTML, and "correct" documents are unlikely to
> contain multiple, conflicting META HTTP-EQUIV="Content-Type" elements.

What percentage of real world HTML pages would pass validation? We
still need to serve them up tolerably well to users. I like the XML idea
- reject any imperfection. Hopefully that principle will avoid the HTML
chaos in the XML world. But then, maybe not...
 
> > The current behaviour of libxml2 seems to achieve the opposite of
> > general parser behaviour.

> "General" parser behaviour and _conforming_ parser behaviour are not the
> same.
> If I understand correctly what you're talking about, libxml2 has a
> conforming behaviour; otherwise, just forget what I said...


