Re: [xml] i18n limitations in libxml2



  Sorry for not having jumped in earlier, I am quite busy these
days with other stuff...

On Fri, Jun 08, 2001 at 11:29:22PM +0800, Steve Underwood wrote:
I must admit I didn't actually look at the specs. I'm so used to dealing
with the messiness of real world pages and browsers, I didn't actually
think of doing that :(  Commonly real pages expect the opposite
behaviour - that the meta tag will override the HTTP header - and the
browsers tend to give it to them. At least I now see the logic in the
libxml2 code.

   Actually the logic in libxml2 encoding handling is inherited from
the XML specification. Briefly, the XML specification has a relatively
strict way of detecting encodings and handling them, including a
pedantic way of dealing with encoding errors. The HTML parser encoding
support has been built on top of the facility designed for the XML parser,
and this explains some of the behaviour you would not expect from a native
HTML tool.

   Considering the HTML specification, I would expect its handling of
encodings to be deliberately close to the XML way, for the simple
reason that HTML is switching to XML. The strict XML rules are going
to be the norm for the next generations of HTML languages.

  But ...

A _huge_ number of real world pages just expect the browser default
encoding to coax the right encoding behaviour -  especially Chinese
pages, and especially those produced with MS tools. If you aren't
displaying the HTML in a browser (I am not right now) this makes
handling the page very woolly and unpredictable. This much I have no
choice but to live with.

  As you said very well, we have to live with billions of existing
HTML pages and a huge number of tools using and producing HTML-like
resources.
  The goal of the libxml HTML parser is to handle "real world" HTML,
so I'm ready to make the additions required. However, in my experience
adding support for X often breaks existing support for Y, especially
with real world HTML. So I will try to do this step by step.
  
and any previous language meta tags.

This is not defined by HTML, and "correct" documents are unlikely to
contain multiple, conflicting META HTTP-EQUIV="Content-Type" elements.

What percentage of real world HTML pages would pass a validation? We

   1% maybe ...

Anyway, getting back to my original need to specify the encoding for
push parsing, which is not currently possible. It seems that adding a
new "push creation" function, and making no other changes, would be
entirely consistent with the HTML spec. Maybe I should do that, and see
how many real world pages produce the wrong result.

   Let's avoid the multiple META HTTP-EQUIV="Content-Type" issue for the
moment and focus on infrastructure needs.
   How many interfaces to the HTML parser are missing the facility to
use a predefined encoding declaration? Could you list them?

   On the issue of iconv not supporting (or improperly supporting) some
encoding variants, I think the right way to address this is twofold:
    - get in touch with the people maintaining iconv (and possibly
      glibc). Registering a bug report on bugzilla.redhat.com about
      glibc iconv support might be a good way to get this fixed, assuming
      there is a clear way to name this encoding and the actual encoding
      description is easily available and complete.
    - implement the pair of functions to convert to/from UTF-8 and
      register them with xmlRegisterCharEncodingHandler()
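
   To illustrate the second point, here is a minimal sketch of such a
pair of converters. It is only an illustration under assumptions: the
encoding is a stand-in where each byte value equals its Unicode code
point (i.e. Latin-1, which libxml2 already handles natively), and the
names example_to_utf8() and utf8_to_example() are hypothetical. The
signatures follow the shape libxml2 expects for its input/output
conversion callbacks as far as I know; check encoding.h for the exact
return-code conventions of your version.

```c
/*
 * Sketch of a to/from UTF-8 converter pair for a single-byte encoding.
 * Stand-in encoding (assumption): byte value == Unicode code point.
 * On entry *inlen and *outlen hold the buffer sizes; on return they
 * hold the bytes consumed and the bytes produced.
 */

/* Custom encoding -> UTF-8. Returns the number of bytes written. */
int example_to_utf8(unsigned char *out, int *outlen,
                    const unsigned char *in, int *inlen)
{
    int i = 0, o = 0;

    while (i < *inlen) {
        unsigned char c = in[i];

        if (c < 0x80) {
            if (o + 1 > *outlen)
                break;                      /* output buffer full */
            out[o++] = c;
        } else {
            if (o + 2 > *outlen)
                break;                      /* output buffer full */
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        i++;
    }
    *inlen = i;
    *outlen = o;
    return o;
}

/* UTF-8 -> custom encoding. Returns the number of bytes written,
 * or -2 if the input is malformed or not representable. */
int utf8_to_example(unsigned char *out, int *outlen,
                    const unsigned char *in, int *inlen)
{
    int i = 0, o = 0;

    while (i < *inlen && o < *outlen) {
        unsigned char c = in[i];

        if (c < 0x80) {
            out[o++] = c;
            i++;
        } else if (c == 0xC2 || c == 0xC3) {
            if (i + 1 >= *inlen)
                break;                      /* incomplete sequence */
            if ((in[i + 1] & 0xC0) != 0x80) {
                *inlen = i;
                *outlen = o;
                return -2;                  /* bad continuation byte */
            }
            out[o++] = (unsigned char)(((c & 0x03) << 6) |
                                       (in[i + 1] & 0x3F));
            i += 2;
        } else {
            *inlen = i;
            *outlen = o;
            return -2;                      /* outside U+0000..U+00FF */
        }
    }
    *inlen = i;
    *outlen = o;
    return o;
}
```

   With libxml2, the two functions would then be wrapped in a handler
under the encoding's name, for instance with xmlNewCharEncodingHandler(),
which to my knowledge also registers it, so that the parser picks the
handler up when it meets that encoding name.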
 
 Note: it could be a good idea to check whether the iconv in glibc-2.3
has fixes for the iconv problems.

Daniel

-- 
Daniel Veillard      | Red Hat Network http://redhat.com/products/network/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/



