Re: [xml] i18n limitations in libxml2

On Sun, Jun 10, 2001 at 05:16:22PM +0800, Steve Underwood wrote:
I haven't used anything but UTF-8 in my XML work so far, so I haven't
bothered learning much about encoding issues. However, if I am reading
the XML spec correctly, any encoding listed by W3C is OK for XML. The
current push parser mode in libxml2 only seems to allow for the much
more limited set of encodings listed in the XML spec itself. Maybe I am
misreading things.

  Hmm, no, the XML push parser should be able to work with any encoding:

orchis:~/XML -> ./xmllint --push test/isolat1 
<?xml version="1.0" encoding="ISO-8859-1"?>
orchis:~/XML -> 

  or I misunderstood something you said.

   Let's avoid the multiple META HTTP-EQUIV="Content-Type" for the moment
and focus on infrastructure needs.
   How many interfaces to the HTML parser are missing the facility to
use a predefined encoding declaration? Could you list them?

I think the multiple meta tag issue is not a problem. If people include
that level of garbage in their HTML they really do deserve trouble.
People expecting (and getting) the effect that a meta tag overrides a
MIME encoding specification is more an issue. I think the best course
here is to implement things as per the HTML spec and see how much still
fails to function correctly. I don't understand the logic in the spec,
since I would think a meta tag would be a much more reliable source of
an encoding specification, since it is a part of the document itself.

  Well, except when the document encoding is converted between creation
and parsing and the tools used to convert are too dumb to remove or
update the META tag(s). (It is probably with such tools that you end up
with multiple META HTTP-EQUIV="Content-Type" definitions :-\ )
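
For reference, the precedence under discussion (in the HTML 4 spec a
charset given by the transport, such as an HTTP or MIME header, ranks
above a META declaration inside the document) could be sketched roughly
as below. This is illustrative Python, not libxml2 code: the name
pick_encoding, its parameters and the ISO-8859-1 fallback are assumptions
made for the example, and taking the first usable META when several are
present is only one possible policy.

```python
from typing import Optional, Sequence


def pick_encoding(header_charset: Optional[str],
                  meta_charsets: Sequence[str],
                  fallback: str = "iso-8859-1") -> str:
    """Choose an encoding for an HTML document (hypothetical helper).

    header_charset: charset from the MIME/HTTP Content-Type, if any.
    meta_charsets:  charsets found in META HTTP-EQUIV declarations,
                    in document order (possibly several, possibly none).
    """
    # The transport-level declaration wins when it is present.
    if header_charset and header_charset.strip():
        return header_charset.strip().lower()
    # Otherwise fall back to the first non-empty META declaration.
    for charset in meta_charsets:
        if charset and charset.strip():
            return charset.strip().lower()
    # Nothing declared anywhere: use the assumed default.
    return fallback
```

With this ordering a stale META tag left behind by a conversion tool is
harmless whenever the header carries a charset.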

Bottom line: The missing pieces are those which would allow an encoding
from a MIME header (or some similar source) to be correctly applied as a
parser is constructed:

- A form of push parser create function which allows an arbitrary
character set from a MIME header to be fed to it

   Okay, makes sense.

- Fixes to htmlSAXParseDoc and htmlParseDoc so they actually process
their encoding parameter.

   Okay, makes sense.
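
To make the first request concrete, here is a rough sketch of the idea of
a push-style consumer that is told its character set up front, from a
MIME header, rather than sniffing it from the document. It uses only the
Python standard library and is not the libxml2 interface; the names
charset_from_content_type and PushDecoder are invented for illustration,
and the ISO-8859-1 default is an assumption.

```python
import codecs
from email.message import Message


def charset_from_content_type(content_type: str,
                              default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None.
    return msg.get_content_charset() or default


class PushDecoder:
    """Decode a document fed in arbitrary chunks, with a preset charset."""

    def __init__(self, charset: str) -> None:
        # Raises LookupError if the charset name is unknown to Python.
        self._decoder = codecs.getincrementaldecoder(charset)()

    def push(self, chunk: bytes, final: bool = False) -> str:
        # Bytes of a character split across chunks are buffered until
        # the rest of the character arrives.
        return self._decoder.decode(chunk, final)


if __name__ == "__main__":
    charset = charset_from_content_type('text/html; charset="GB2312"')
    decoder = PushDecoder(charset)
    # Two GB2312 characters, deliberately split in the middle of each.
    pieces = [b"\xd6", b"\xd0\xce", b"\xc4"]
    text = "".join(decoder.push(p) for p in pieces) + decoder.push(b"", final=True)
    print(charset, text)
```

The point of the design is that the charset is fixed when the decoder is
constructed, so chunk boundaries never affect which encoding is used.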

 Note: it could be a good idea to check whether the iconv in glibc-2.3
has fixes for the iconv problems.

I don't know of any current glibc iconv bugs, though I'm sure there are
some. It hasn't had long enough for all the quirks to be ironed out, and
some of those quirks are too political to be ironed out quickly.

The problems I have been facing recently are embrace-and-extend
character set issues. The latest MS Simplified Chinese Outlook, for
example, seems to use a default Chinese font whose name is not valid
GB2312. That screws up a pretty large selection of
HTML e-mails. The patch I posted last week (which was not intended to be
applied verbatim, since it does not allow control of the bad character
recovery procedure) causes libxml2 to function just like Netscape 4.7x,
Mozilla 0.9 and older versions of IE on Windows (which don't seem to
support the extended GB2312 characters), as far as I can tell. These all
seem to give one question mark for every byte that must be skipped
before characters can be decoded properly again. Mozilla is slightly
different, in that the question mark it displays is a kind of fancy
black diamond with a question mark in it.
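
The recovery behaviour described above (one question mark for every byte
skipped until decoding can resume) might look roughly like the sketch
below. It is written in Python against the standard gb2312 codec purely
to illustrate the technique; it is not the posted patch, and the function
name and the marker parameter are assumptions. It re-decodes the tail
after each bad byte, which is simple but not efficient.

```python
def decode_one_marker_per_byte(data: bytes,
                               encoding: str = "gb2312",
                               marker: str = "?") -> str:
    """Decode data, emitting one marker per undecodable byte skipped."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left in one go.
            out.append(data[pos:].decode(encoding))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice that was passed in.
            bad = pos + err.start
            # Keep the part that decoded cleanly before the bad byte.
            out.append(data[pos:bad].decode(encoding))
            # Skip exactly one byte, emit exactly one marker, retry.
            out.append(marker)
            pos = bad + 1
    return "".join(out)
```

Passing marker="\ufffd" instead would produce the Unicode replacement
character rather than a plain question mark.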

  Okay, this could be implemented, but the patch would have broken the
strict error checking of the XML parser. It seems I need to provide some
way to differentiate between HTML and XML encoding error handling.

I think my patch is basically the right solution. It just needs suitable
control, so it is switched off for XML and anyone trying to validate an
HTML page can control it. This should be a low risk change, since a
clean document is completely unaffected by the change. If someone is
using the parser as a validator, however, there would be compatibility

  Okay I will look at this again,
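
As a rough illustration of the control being asked for (recovery always
off for XML, on by default for HTML but switchable by someone validating
a page), a sketch in Python follows. The function decode_document and its
kind and recover parameters are hypothetical, and Python's built-in
"replace" error handler stands in for the recovery procedure, so bad
input shows up as U+FFFD rather than as question marks.

```python
def decode_document(data: bytes, encoding: str, *,
                    kind: str, recover: bool = True) -> str:
    """Decode a document with XML-strict or HTML-lenient error handling.

    kind:    "xml" or "html" (hypothetical switch for this sketch).
    recover: for HTML only, whether to tolerate undecodable bytes.
    """
    if kind not in ("xml", "html"):
        raise ValueError("kind must be 'xml' or 'html'")
    # XML keeps strict checking no matter what the caller asked for.
    lenient = recover and kind == "html"
    # "strict" raises UnicodeDecodeError; "replace" substitutes U+FFFD.
    return data.decode(encoding, errors="replace" if lenient else "strict")
```

A clean document decodes identically in both modes; only input containing
undecodable bytes behaves differently.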


Daniel Veillard      | Red Hat Network
veillard redhat com  | libxml Gnome XML XSLT toolkit | Rpmfind RPM search engine
