Re: [xml] UTF-8-encoded filenames on Win32 and libxml2



On 14.04.2005 02:09, Tor Lillqvist wrote:
Hi,

Ho,

As you might know, from 2.6, GLib is using UTF-8 as the file name
encoding for all (hopefully) of its API on Windows. It provides
so-called gstdio wrappers in <glib/gstdio.h> for the standard POSIX
and C functions that take pathnames as arguments, like g_open().

On Unix, these wrappers are simply #defines for the actual C or POSIX
function. On Windows, they convert from UTF-8 to wide characters and
call the C library's wide character function, for instance _wopen() in
g_open(). (Let's ignore Win9x for now.)

That ignorance can be performed without any difficulty. That Win9x thing is not an operating system. It is a bad joke.
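
For illustration, here is a minimal sketch of what such a wrapper might look like. It is not GLib's actual code: the name utf8_open is hypothetical, error handling is reduced to returning -1, and the Win32 branch assumes MultiByteToWideChar() and _wopen() are available, as they are on NT-based systems.

```c
#include <fcntl.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#include <unistd.h>
#endif

/* Hypothetical helper (not GLib's implementation): open a file whose
 * name is given in UTF-8. Returns a file descriptor, or -1 on failure. */
int utf8_open(const char *utf8_path, int flags, int mode)
{
#ifdef _WIN32
    /* First call only computes the length in wide chars, NUL included. */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, NULL, 0);
    if (n == 0)
        return -1;

    wchar_t *wpath = malloc((size_t)n * sizeof *wpath);
    if (wpath == NULL)
        return -1;

    /* Second call performs the UTF-8 to UTF-16 conversion. */
    if (MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, n) == 0) {
        free(wpath);
        return -1;
    }

    int fd = _wopen(wpath, flags, mode);
    free(wpath);
    return fd;
#else
    /* On Unix the name is passed through to the POSIX call unchanged. */
    return open(utf8_path, flags, mode);
#endif
}
```

A caller would use utf8_open() exactly like open(), passing a UTF-8 path on every platform.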

There were two reasons for this change:

1) Windows file names *are* in Unicode in the file system, so it's
certainly most correct to handle them as Unicode and not shoehorn them
into a restricted codepage representation. For instance, support file
names with Cyrillic letters on a Western European Windows box. I think
it is also relatively common in CJK locales to use characters not in
the corresponding double-byte codepage.

You bet. I have a box with British English Windows and still have files which employ Japanese katakana, hiragana and/or kanji in their names.

2) In the double-byte code pages the trailing byte can be '\\', which
otherwise is a directory separator. This means that all code that
scans pathnames byte by byte looking for backslashes (either stepping
through a string manually, or using strchr() or strrchr()) is broken
by design, and would need to be rewritten heavily with ugly ifdefs to
use multi-byte string functions on Win32. There are a lot of such
places. UTF-8 doesn't have any such issue.

Precisely, Unicode does not have the issue. UTF-8, UTF-16 and UTF-32 are just coding forms for the same standard. They are algorithmically convertible.
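
The trailing-byte problem from point 2 can be shown with a small sketch. The function naive_basename and the two sample paths are made up for this illustration. The sample uses the katakana letter SO, which is the byte pair 0x83 0x5C in Shift_JIS (code page 932) and 0xE3 0x82 0xBD in UTF-8.

```c
#include <string.h>

/* Naive byte scan: treat everything after the last 0x5C byte ('\\')
 * as the file name. This is the pattern that breaks on DBCS paths. */
const char *naive_basename(const char *path)
{
    const char *sep = strrchr(path, '\\');
    return sep ? sep + 1 : path;
}

/* dir\<SO>.xml in Shift_JIS / CP932: the trail byte of SO is 0x5C. */
const char path_cp932[] = "dir\\" "\x83\x5C" ".xml";

/* The same name in UTF-8: every byte of SO is >= 0x80. */
const char path_utf8[] = "dir\\" "\xE3\x82\xBD" ".xml";
```

With path_cp932 the scan stops at the trail byte of the character and returns ".xml", losing the name. With path_utf8 it returns the complete file name, because bytes of a multi-byte UTF-8 sequence never fall in the ASCII range.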

Now, upper level GNOME libraries that use GLib can mostly be converted
trivially to use the gstdio wrappers. (I use "GNOME" in a loose sense
here. Of course a GNOME desktop as such doesn't and won't exist on
Windows, but many of the GNOME libraries are being ported to Windows
so that it will be possible to build GNOME applications on Windows.)

Now, a problem is libraries that don't use GLib, but are widely used
by GNOME libraries. For instance libxml2.

Yes.

As the GNOME libs get "UTF-8 aware", i.e. are converted to use the
gstdio wrappers, what should be done with pathnames passed to libxml2?
If I convert them to system codepage, this means it won't work to have
XML files with pathnames that aren't representable in the system
codepage. This will not be good, as the intention otherwise is to make
everything work just fine with any non-ASCII file name.

I found one earlier message to this list about this issue,
http://mail.gnome.org/archives/xml/2001-October/msg00072.html . There
the suggested solution was to override libxml2's default I/O
interface. Presumably this would be by calling
xmlRegisterInputCallbacks() with an open callback that would call the
gstdio wrappers, but otherwise would be more or less a copy of the
default xmlFileOpen(). Is this still the recommended approach?

Plugging in your own IO is still the recommended approach. I hope that will someday change on all platforms. I would love to see Unicode as mandatory for file name storage everywhere. In fact, I would love it if all non-Unicode encodings would just vanish.
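
A rough sketch of such a plug-in follows, under stated assumptions. The callback signatures follow libxml2's xmlInputMatchCallback, xmlInputOpenCallback, xmlInputReadCallback and xmlInputCloseCallback types. The utf8_file_* names and the USE_GLIB / USE_LIBXML2 build switches are hypothetical and exist only so the sketch also builds without either library. The match rule is a simplification and does not reproduce the file: URI handling of the default xmlFileOpen().

```c
#include <stdio.h>
#include <string.h>

#ifdef USE_GLIB                 /* hypothetical build switch */
#include <glib/gstdio.h>
#define utf8_fopen g_fopen      /* takes a UTF-8 name on Windows */
#else
#define utf8_fopen fopen        /* fallback so the sketch builds without GLib */
#endif

/* Claim every name that does not look like a URL (a simplification). */
int utf8_file_match(const char *filename)
{
    return strstr(filename, "://") == NULL;
}

/* Open callback: the returned FILE* becomes the I/O context. */
void *utf8_file_open(const char *filename)
{
    return utf8_fopen(filename, "rb");
}

/* Read callback: number of bytes read, 0 at end of file, -1 on error. */
int utf8_file_read(void *context, char *buffer, int len)
{
    FILE *fp = context;
    size_t n = fread(buffer, 1, (size_t)len, fp);
    if (n == 0 && ferror(fp))
        return -1;
    return (int)n;
}

/* Close callback: 0 on success, -1 on failure. */
int utf8_file_close(void *context)
{
    return fclose((FILE *)context) == 0 ? 0 : -1;
}

#ifdef USE_LIBXML2              /* hypothetical build switch */
#include <libxml/xmlIO.h>

/* Register the callbacks; libxml2 returns the handler index or -1. */
int register_utf8_file_input(void)
{
    return xmlRegisterInputCallbacks(utf8_file_match, utf8_file_open,
                                     utf8_file_read, utf8_file_close);
}
#endif
```

An application would call register_utf8_file_input() once at startup, after which file names handed to the libxml2 parser would go through the UTF-8 aware open.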

Now, using Unicode file names by default would certainly make libxml2 inoperable on all Windows incarnations which don't use the NTFS filesystem. I would welcome that.

But there are embedded platforms. Never forget, libxml2 does not only power desktops like KDE or GNOME; it is also used on embedded hardware. How many of these can afford to support the full Unicode range, given the memory and storage constraints?

Ciao,
Igor









