Re: [xml] i18n limitations in libxml2

Hi Daniel,

Daniel Veillard wrote:

  sorry for not having jumped in earlier, I am quite busy these
days with other stuff...

On Fri, Jun 08, 2001 at 11:29:22PM +0800, Steve Underwood wrote:
I must admit I didn't actually look at the specs. I'm so used to dealing
with the messiness of real world pages and browsers, I didn't actually
think of doing that :(  Commonly real pages expect the opposite
behaviour - that the meta tag will override the HTTP header - and the
browsers tend to give it to them. At least I now see the logic in the
libxml2 code.

   Actually the logic in libxml2 encoding handling is inherited from
the XML specification. Briefly, the XML specification has a relatively
strict way of detecting encodings and handling them, including a
pedantic way of dealing with encoding errors. The HTML parser encoding
support has been built on top of the facility designed for the XML parser
and this explains some of the points you would not expect from a native
HTML tool.

I haven't used anything but UTF-8 in my XML work so far, so I haven't
bothered learning much about encoding issues. However, if I am reading
the XML spec correctly, any encoding listed by W3C is OK for XML. The
current push parser mode in libxml2 only seems to allow for the much
more limited set of encodings listed in the XML spec itself. Maybe I am
misreading things.

   Considering the HTML specification, I would expect the way of
handling encoding support to be not too different from the XML way
on purpose for the very simple point that HTML is switching to XML.
The strict XML rules are gonna be the norm for the next generations
of HTML languages.

  But ...

A _huge_ number of real world pages just expect the browser default
encoding to coax the right encoding behaviour -  especially Chinese
pages, and especially those produced with MS tools. If you aren't
displaying the HTML in a browser (I am not right now) this makes
handling the page very woolly and unpredictable. This much I have no
choice but to live with.

  as you said very well we have to live with billions of existing
HTML pages and a huge number of tools using and producing HTML "like"
  The goal of the libxml HTML parser is to handle "real world" HTML,
so I'm ready to make the additions required. However usually adding
support for X might break existing Y support, especially for real
world HTML in my experience. So I will try to do this step by step.

and any previous language meta tags.

This is not defined by HTML, and "correct" documents are unlikely to
contain multiple, conflicting META HTTP-EQUIV="Content-Type" elements.

What percentage of real world HTML pages would pass a validation? We

   1% maybe ...

Ooooh. An optimist :)

Anyway, getting back to my original need to specify the encoding for
push parsing, which is not currently possible. It seems that adding a
new "push creation" function, and making no other changes, would be
entirely consistent with the HTML spec. Maybe I should do that, and see
how many real world pages produce the wrong result.

   Let's avoid the multiple META HTTP-EQUIV="Content-Type" for the moment
and focus on infrastructure needs.
   How many interfaces to the HTML parser are missing the facility to
use a predefined encoding declaration ? Could you list them ?

I think the multiple meta tag issue is not a problem. If people include
that level of garbage in their HTML they really do deserve trouble.
People expecting (and getting) the effect that a meta tag overrides a
MIME encoding specification is more an issue. I think the best course
here is to implement things as per the HTML spec and see how much still
fails to function correctly. I don't understand the logic in the spec,
since I would think a meta tag would be a much more reliable source of
an encoding specification, since it is a part of the document itself.

Bottom line: The missing pieces are those which would allow an encoding
from a MIME header (or some similar source) to be correctly applied as a
parser is constructed:

- A form of push parser create function which allows an arbitrary
character set from a MIME header to be fed to it

- Fixes to htmlSAXParseDoc and htmlParseDoc so they actually process
their encoding parameter.

The first change can't break anything, since it needs a new call. The
second change needs to occur with some caution.
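
To make the "encoding from a MIME header" case concrete, here is a minimal sketch in plain C of pulling the charset out of a Content-Type value and applying the precedence discussed in this thread (HTTP header first, then the META tag, then a caller default). The helper names charset_from_content_type and pick_encoding are hypothetical and are not libxml2 functions. How the resulting name is handed to the parser is exactly the piece the message says is missing, so that step is not shown.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): copy the charset parameter
 * of a Content-Type value such as "text/html; charset=GB2312" into out.
 * Returns 1 if a non-empty charset was found, 0 otherwise. */
static int charset_from_content_type(const char *value, char *out,
                                     size_t outlen)
{
    static const char key[] = "charset=";
    const size_t keylen = sizeof(key) - 1;
    size_t n = 0;

    assert(value != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    for (const char *p = value; *p != '\0'; p++) {
        size_t i = 0;
        /* Case-insensitive match of "charset=" starting at p. */
        while (i < keylen && p[i] != '\0' &&
               tolower((unsigned char)p[i]) == key[i])
            i++;
        if (i != keylen)
            continue;
        p += keylen;
        if (*p == '"')          /* the value may be quoted */
            p++;
        while (*p != '\0' && *p != '"' && *p != ';' &&
               !isspace((unsigned char)*p) && n + 1 < outlen)
            out[n++] = *p++;
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}

/* Precedence sketch: the HTTP header wins over the META tag, and the
 * caller-supplied fallback is used only when neither names a charset. */
static const char *pick_encoding(const char *http_content_type,
                                 const char *meta_content_type,
                                 const char *fallback,
                                 char *buf, size_t buflen)
{
    if (http_content_type != NULL &&
        charset_from_content_type(http_content_type, buf, buflen))
        return buf;
    if (meta_content_type != NULL &&
        charset_from_content_type(meta_content_type, buf, buflen))
        return buf;
    return fallback;
}
```

Swapping the two checks in pick_encoding would give the behaviour that many real pages apparently expect (META overriding the header), which makes it easy to compare the two policies on a sample of documents.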
   On the issue of iconv not supporting (or improperly supporting) some
encoding variants, I think the right way to address this is twofold:
    - get in touch with the people maintaining iconv (and possibly
      glibc), registering a bug report on about
      glibc iconv support might be a good way to get this fixed assuming
      there is a clear way to name this encoding and the actual encoding
      description is easily available and complete.
    - implement the pair of functions to convert to/from UTF8 and
      register them with xmlRegisterCharEncodingHandler()

 Note: it could be a good idea to check whether the iconv in glibc-2.3
has fixes for the iconv problems.
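
As an illustration of the second option in the list above, here is a sketch of what such a pair of conversion functions might look like. It assumes the callback shape libxml2 uses for encoding handlers, as I understand it: out/outlen and in/inlen buffers, with *inlen and *outlen updated to the bytes consumed and produced, and a negative return on failure. The encoding itself ("x-toy", byte-for-byte identical to ISO-8859-1) and the function names are invented for the example. The block does not include libxml2 headers, so the registration call appears only in a comment.

```c
#include <assert.h>
#include <string.h>

/* Input side: toy single-byte encoding -> UTF-8.
 * Stops cleanly when the output buffer is full. */
static int toy_to_utf8(unsigned char *out, int *outlen,
                       const unsigned char *in, int *inlen)
{
    int i = 0, o = 0;

    assert(out != NULL && outlen != NULL && in != NULL && inlen != NULL);
    while (i < *inlen) {
        unsigned char c = in[i];
        int need = (c < 0x80) ? 1 : 2;
        if (o + need > *outlen)
            break;
        if (c < 0x80) {
            out[o++] = c;
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        i++;
    }
    *inlen = i;     /* bytes consumed */
    *outlen = o;    /* bytes produced */
    return o;
}

/* Output side: UTF-8 -> toy encoding.
 * Returns -2 when the input cannot be represented (assumed convention). */
static int utf8_to_toy(unsigned char *out, int *outlen,
                       const unsigned char *in, int *inlen)
{
    int i = 0, o = 0, rc = 0;

    assert(out != NULL && outlen != NULL && in != NULL && inlen != NULL);
    while (i < *inlen && o < *outlen) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = c;
            i++;
        } else if (c == 0xC2 || c == 0xC3) {
            if (i + 1 >= *inlen)
                break;                  /* incomplete sequence: need more */
            if ((in[i + 1] & 0xC0) != 0x80) {
                rc = -2;                /* malformed continuation byte */
                break;
            }
            out[o++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            rc = -2;                    /* outside U+0000..U+00FF */
            break;
        }
    }
    *inlen = i;
    *outlen = o;
    return rc < 0 ? rc : o;
}

/* With libxml2 available, registration would look roughly like:
 *   xmlRegisterCharEncodingHandler(
 *       xmlNewCharEncodingHandler("x-toy", toy_to_utf8, utf8_to_toy));
 */
```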

I don't know of any current glibc iconv bugs, though I'm sure there are
some. It hasn't had long enough for all the quirks to be ironed out, and
some of those quirks are too political to be ironed out quickly.

The problems I have been facing recently are embrace-and-extend
character set issues. The latest MS Simplified Chinese Outlook, for
example, seems to use a default Chinese font whose name is not valid
GB2312. That screws up a pretty large selection of
HTML e-mails. The patch I posted last week (which was not intended to be
applied verbatim, since it does not allow control of the bad character
recovery procedure) causes libxml2 to function just like Netscape 4.7x,
Mozilla 0.9 and older versions of IE on Windows (which don't seem to
support the extended GB2312 characters), as far as I can tell. These all
seem to give one question mark for every byte that must be skipped
before characters can be decoded properly again. Mozilla is slightly
different, in that the question mark it displays is a kind of fancy
black diamond with a question mark in it.

Extended (official or unofficial) character sets seem a fairly
widespread issue, especially for East Asian encodings. This problem is,
therefore, also widespread. To give tolerable results it must be
possible to ride over bad characters in HTML, regardless of whether
iconv has any problems. Registering encoding conversion functions is no
solution. There are already perfectly good conversions for all the
character sets I regularly use in the current iconv. Documents are
arriving in bulk that claim to be in one of these encodings, but don't
quite follow the standard. iconv correctly returns a bad character
condition when it hits bad characters. Everything is correct except the
source material - a pretty common situation with HTML.
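
The recovery behaviour described above (one question mark for every byte that has to be skipped before decoding can resume) can be sketched with plain iconv. This is not the posted patch, only an illustration of the idea. The function name convert_lenient is hypothetical, and the sketch assumes a glibc-style iconv() whose second argument is char **.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert `in` from encoding `from` to UTF-8,
 * writing '?' for each undecodable input byte and then resuming.
 * The output is NUL-terminated. Returns the number of bytes skipped,
 * or -1 if the encoding is unknown or the output buffer is too small. */
static int convert_lenient(const char *from, const char *in, size_t inlen,
                           char *out, size_t outlen)
{
    assert(from != NULL && in != NULL && out != NULL && outlen > 0);

    iconv_t cd = iconv_open("UTF-8", from);
    char *src = (char *)in;     /* iconv() does not modify the input */
    char *dst = out;
    int skipped = 0;

    if (cd == (iconv_t)-1)
        return -1;
    outlen--;                   /* reserve room for the trailing NUL */

    while (inlen > 0) {
        if (iconv(cd, &src, &inlen, &dst, &outlen) != (size_t)-1)
            break;              /* everything converted */
        /* EILSEQ: invalid sequence. EINVAL: truncated sequence at the end.
         * Anything else (for example E2BIG) is treated as fatal here. */
        if ((errno != EILSEQ && errno != EINVAL) || outlen == 0) {
            iconv_close(cd);
            return -1;
        }
        *dst++ = '?';           /* one marker per skipped byte */
        outlen--;
        src++;
        inlen--;
        skipped++;
        iconv(cd, NULL, NULL, NULL, NULL);  /* reset conversion state */
    }
    *dst = '\0';
    iconv_close(cd);
    return skipped;
}
```

A clean document takes the first branch and is converted in a single call, so it is unaffected. A caller that wants strict behaviour (XML, or HTML validation) could treat any non-zero return value as an error, which is one way to provide the on/off control discussed in this thread.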

I think my patch is basically the right solution. It just needs suitable
control, so it is switched off for XML and anyone trying to validate an
HTML page can control it. This should be a low risk change, since a
clean document is completely unaffected by the change. If someone is
using the parser as a validator, however, there would be compatibility

