Re: libxml2 in gnome 1.4



On Thu, Mar 22, 2001 at 11:57:16AM +0800, James Henstridge wrote:
> For instance with libxml2, if you ignore encodings and try to output a
> string like "ÏÖ" ("\317\326"),

   Well, in this case what you output is not defined. In ISO-8859-1 this
will mean "ÏÖ"; in other encodings it may mean something different
or simply break. By outputting "\317\326" you silently ignored the problem,
but the problem is there. Someone could also try to output "\0\317\0\326",
i.e. "ÏÖ" in UTF-16, and there would have been errors in that case too.
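
A small sketch of the point above, in Python rather than with libxml itself:
the same two octets read differently depending on which encoding is assumed.
ISO-8859-7 is picked here only as an example of "other encodings", and the
helper name interpretations() is hypothetical.

```python
raw = b"\xcf\xd6"  # the octets written as "\317\326" in the message


def interpretations(data):
    """Decode the same bytes under several assumed encodings.

    A value of None means the bytes are not legal in that encoding.
    """
    results = {}
    for enc in ("iso-8859-1", "iso-8859-7", "utf-8"):
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


found = interpretations(raw)

# The UTF-16 (big-endian) form of the ISO-8859-1 reading of those octets.
utf16 = "\u00cf\u00d6".encode("utf-16-be")
```

Without an explicit encoding, none of these readings is more "right" than
the others, which is why the output is undefined.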

> it seemed to do a pretty non strict
> UTF-8 interpretation and ignore the 6th bit on the second character,
> treating it as a single UTF-8 character (the PI symbol in this case).  So

  This is separate. That seems to be a bug, but I didn't get any report
on it until now (why didn't you post it?). There is still time to fix it
and raise an error at save time.
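
To make the reported behaviour concrete, here is a hypothetical
reconstruction in Python. It is not libxml2's actual code, and both function
names are made up for illustration. A decoder that never checks the top two
bits of the second byte turns "\317\326" into U+03D6 (the Greek pi symbol),
while a strict one rejects it.

```python
def lax_decode_2byte(b1, b2):
    """Combine two bytes as UTF-8 without validating the second byte."""
    return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))


def strict_decode_2byte(b1, b2):
    """Decode a two-byte UTF-8 sequence, rejecting malformed input."""
    if b1 & 0xE0 != 0xC0:
        raise ValueError("0x%02x is not a two-byte lead byte" % b1)
    if b1 < 0xC2:
        raise ValueError("overlong encoding")
    if b2 & 0xC0 != 0x80:
        # A continuation byte must look like 10xxxxxx.
        raise ValueError("0x%02x is not a continuation byte" % b2)
    return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))


lax_result = lax_decode_2byte(0xCF, 0xD6)

try:
    strict_result = strict_decode_2byte(0xCF, 0xD6)
except ValueError:
    strict_result = None
```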

> Will programs that use libxml1 and not handle character encodings
> correctly break like this as well? 

  I think so. Just fetch the RPMs and load them, and you can test!

> If so, then this doesn't sound like a
> compatible change (even if it keeps the API binary compatible, it is a big
> change to the semantics).

  It's not a change in semantics; libxml now simply tries to detect
(and in some cases correct) applications making incorrect use of XML.
The semantics are defined by the XML spec:
    http://www.w3.org/TR/REC-xml#charencoding

--------------
It is a fatal error when an XML processor encounters an entity with an
encoding that it is unable to process. It is a fatal error if an XML
entity is determined (via default, encoding declaration, or higher-level
protocol) to be in a certain encoding but contains octet sequences that
are not legal in that encoding. It is also a fatal error if an XML
entity contains no encoding declaration and its content is not legal
UTF-8 or UTF-16.
--------------
  
  In the absence of encoding information (which is the case when the
document is loaded), you should expect UTF-8 or UTF-16. I don't think the
existing API led anybody to think that libxml would use UTF-16 :-)
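
The rule quoted from the spec can be observed with any conforming parser.
The sketch below uses Python's standard xml.etree.ElementTree (backed by
expat, not libxml), and the helper name parse_text() is hypothetical. The
same ISO-8859-1 octets are rejected without an encoding declaration and
accepted with one.

```python
import xml.etree.ElementTree as ET


def parse_text(data):
    """Return the root element's text, or None if the parser rejects it."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# No encoding declaration: the content must be legal UTF-8, and it is not.
undeclared = b"<doc>\xcf\xd6</doc>"

# Same octets, but the document now says which encoding it uses.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><doc>\xcf\xd6</doc>'

undeclared_text = parse_text(undeclared)
declared_text = parse_text(declared)
```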

  The page on I18N for libxml has been there for one year:
   http://xmlsoft.org/encoding.html

  And the subject is raised about every two weeks on the libxml mailing
list, so calling this a big change is a bit biased. I would rather
state it as "everybody knew it was broken, but nobody took up the challenge
of fixing libxml1 or the apps". Of course the fact that libxml2 got separated
from the Gnome development platform didn't help. libxml-1.2.12 is
a chance to catch the train in time for 2.0...
  
> This should all get a lot easier with gtk 2.0 and libxml2 :)

  Not necessarily. In both cases you will need to handle the transcoding
from the user input method to UTF-8. gtk 2.0 will certainly help, but
I doubt it will handle all cases of input.
  Being able to spot the points where you have I18N problems and get
them reported seems to me a good way to speed up the cleanup needed
for 2.0 anyway.
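
As a rough illustration of the transcoding step, here is a Python sketch.
It assumes the application knows the encoding of its input, and the function
name to_utf8() is hypothetical. Decoding strictly means bad input is reported
at the point where it enters the program rather than later at save time.

```python
def to_utf8(raw, source_encoding):
    """Transcode bytes from a known source encoding to UTF-8.

    Raises UnicodeDecodeError if raw is not legal in source_encoding.
    """
    return raw.decode(source_encoding, errors="strict").encode("utf-8")


# ISO-8859-1 input as it might come from a legacy input method.
converted = to_utf8(b"\xcf\xd6", "iso-8859-1")

# Input wrongly claimed to be UTF-8 is caught immediately.
try:
    to_utf8(b"\xcf\xd6", "utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```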

Daniel

-- 
Daniel Veillard      | Red Hat Network http://redhat.com/products/network/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

_______________________________________________
gnome-hackers mailing list
gnome-hackers gnome org
http://mail.gnome.org/mailman/listinfo/gnome-hackers



