Re: my worry about the recent libxml change



> >    At the xml file level, sure. But the new parser will only provide
> > UTF8 characters, i.e. the first level of indetermination related to the
> > encoding of the XML serialization is removed. Apps only have to deal with
> > UTF8 input.
> > 
> >    The scheme saying "we will use what is the current locale" is IMHO
> > broken: you won't be able to keep your configuration files if you change
> > your locale. It's also completely incoherent in the sense that it kind of
> > works only for the ISO-Latin families; I doubt it will ever work for
> > SHIFT-JIS, EUC-xxx or, even worse, UTF16 (I assume all the existing APIs
> > in Gnome-1.4 would completely crash if I were to deliver 16-bit char
> > strings if that were the actual serialization used, and remember, a number
> > of Windows applications use it when serializing XML!). In a nutshell
> > it's a kludge and you people need to change your mindset: assuming
> > 'locale' encoding for serialization doesn't work.
> 
>  If libxml saved the charset name in XML headers (and later used it when
> opening), the approach I proposed wouldn't be a kludge and would be very
> consistent (yes, files written under one locale would be correctly readable
> under any other locale, since the charset the XML data is in would be present
> in the XML header).

  Let's get things straight. Libxml2 has supported saving in a given encoding,
automatic conversion on input, yadda, yadda, for one year now. But it only
exposes UTF8 to the application, simply for sanity reasons. I don't intend
to change libxml2 nor the new libxml1 parser. The old libxml1 does something
completely unpredictable; I will let it rot and die, because it simply needs
to.
  I still don't accept saving in the user locale because *it's not always
possible!* You must understand that given an XML infoset (or a parsed tree
in Unicode) the operation of "serializing to charset XYZ" can fail. Even
if the node content may always be translatable using character references,
the Name production of the XML specification doesn't accept them, so element
and attribute names cannot be escaped that way and this is an unreliable
operation.
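
  To make that failure mode concrete, here is a minimal sketch in Python
(not libxml's C API). The helper name serialize_element is hypothetical and
only illustrates the asymmetry: text content can fall back to numeric
character references, while a name that the target charset cannot represent
makes the whole serialization fail.

```python
from xml.sax.saxutils import escape


def serialize_element(name: str, text: str, charset: str) -> bytes:
    """Serialize <name>text</name> to `charset` (hypothetical helper)."""
    # Character data has an escape hatch: anything the charset cannot
    # represent is written as a numeric character reference (&#NNNN;).
    body = escape(text).encode(charset, errors="xmlcharrefreplace")
    # Names have none: XML does not allow references inside a Name, so an
    # unencodable name raises UnicodeEncodeError here.
    tag = name.encode(charset, errors="strict")
    return b"<" + tag + b">" + body + b"</" + tag + b">"
```

  With ISO-8859-1 as the target, content containing a Japanese character
still serializes (as a character reference), but an element whose name is
written in Japanese cannot be serialized at all.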
  Moreover this is also a costly operation, translating on input and output
systematically ain't cheap.
  Last but not least, XML must not be hand edited. If your reasoning behind
keeping XML files in a given locale is the ability for the user to edit them,
then it's the best way to end up with the kind of XML errors that Gnome
has all over the place right now (i.e. an advertised XML encoding that
diverges from the actual content).
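
  As an illustration of that kind of divergence, the following Python sketch
checks whether a file's bytes actually decode under the encoding it
advertises. Both function names are hypothetical, and the sketch assumes an
ASCII-compatible encoding with no byte order mark; it is not how libxml
detects encodings.

```python
import re


def declared_encoding(raw: bytes) -> str:
    """Return the encoding named in the XML declaration, else UTF-8."""
    m = re.match(
        rb'<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']', raw
    )
    return m.group(1).decode("ascii") if m else "utf-8"


def content_matches_declaration(raw: bytes) -> bool:
    """True if the bytes decode cleanly under the advertised encoding."""
    try:
        raw.decode(declared_encoding(raw))
        return True
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes, or an encoding name Python does not know.
        return False
```

  Note the check only catches one direction: a file advertised as
ISO-8859-1 but really holding UTF-8 bytes decodes without error (every byte
is valid Latin-1) and silently yields garbage.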

>  Also, I think it would be nice to provide functions for getting text from xml
> tree in local charset, and for setting strings in local charset to xml tree.
> These functions should be provided by libxml of course.. In that case, these

- libxml2 uses iconv for the conversions; libxml1 doesn't use iconv.
- libxml won't do any charset detection, it's not the proper place:
  libxml is platform agnostic, and charset detection is highly platform
  dependent.
- Providing a function to convert a UTF8 string to a given charset
  can be done in libxml2, but it's gonna be costly because it would require
  calling the iconv open/conversion/close functions on every call.
  Better done at another level which knows the charset you want to convert
  strings to and keeps the iconv handle for this encoding for the full
  length of the processing.
  Using another library than iconv would be stupid:
    - it's very complex
    - it would have the same problem, because most I18N conversions need
      a state
    - it would be bloat, since iconv is present, I think, on all the platforms
      that Gnome targets.
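
  The "keep the converter open at a higher level" idea can be sketched as
follows. This uses Python's incremental codecs as a stand-in for an iconv
handle; the class LocaleWriter and its methods are hypothetical names, not
part of libxml or iconv. One stateful converter is created per target
charset and reused for every UTF8 string the parser hands out.

```python
import codecs


class LocaleWriter:
    """Hypothetical helper owning one converter for a whole processing run."""

    def __init__(self, charset: str):
        # Opened once, like a single iconv_open() for the run.
        self._encoder = codecs.getincrementalencoder(charset)(errors="strict")

    def convert(self, utf8_bytes: bytes) -> bytes:
        # The encoder remembers shift state between calls, so stateful
        # charsets such as ISO-2022-JP need no per-call escape sequences.
        return self._encoder.encode(utf8_bytes.decode("utf-8"))

    def finish(self) -> bytes:
        # Flush any pending state (e.g. switch back to ASCII) at the end.
        return self._encoder.encode("", final=True)
```

  Typical use would be to create one LocaleWriter for the output charset,
call convert() on each string taken from the tree, and append finish() once
at the end. For a stateful charset the result is also shorter than
converting each string independently, since the shift sequences are not
repeated around every piece.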

  Don't try to design I18N stuff too fast, you will get it wrong.

Daniel

-- 
Daniel Veillard      | Red Hat Network http://redhat.com/products/network/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/



