[xml] UTF-8-encoded filenames on Win32 and libxml2



Hi,

As you might know, starting with version 2.6, GLib uses UTF-8 as the
file name encoding for all (hopefully) of its API on Windows. It
provides so-called gstdio wrappers in <glib/gstdio.h>, such as
g_open(), for the standard POSIX and C functions that take pathnames
as arguments.

On Unix, these wrappers are simply #defines for the actual C or POSIX
function. On Windows, they convert from UTF-8 to wide characters and
call the C library's wide character function, for instance _wopen() in
g_open(). (Let's ignore Win9x for now.)
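
To make the Windows branch concrete, here is a minimal sketch of what
such a wrapper does. It is not GLib's actual code: utf8_open() is a
made-up name standing in for g_open(), and I use a real function on
both platforms where GLib uses a plain #define on Unix.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#include <unistd.h>
#endif

/* Hypothetical stand-in for g_open(): takes a UTF-8 file name on
 * every platform and returns a file descriptor, or -1 with errno set. */
static int utf8_open(const char *utf8_name, int flags, int mode)
{
    assert(utf8_name != NULL);
#ifdef _WIN32
    /* First call: ask how many wide characters are needed,
     * including the terminating NUL. Fails on invalid UTF-8. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8_name, -1, NULL, 0);
    if (n <= 0) {
        errno = EINVAL;
        return -1;
    }
    wchar_t *wname = malloc((size_t)n * sizeof *wname);
    if (wname == NULL) {
        errno = ENOMEM;
        return -1;
    }
    /* Second call: do the actual UTF-8 to UTF-16 conversion. */
    MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, wname, n);

    /* Call the C library's wide character function. */
    int fd = _wopen(wname, flags, mode);
    int saved_errno = errno;
    free(wname);
    errno = saved_errno;
    return fd;
#else
    /* On Unix the name is passed through unchanged. */
    return open(utf8_name, flags, mode);
#endif
}
```

A caller uses it exactly like open(), for instance
utf8_open(name, O_RDONLY, 0), and never sees the wide character
string.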

There were two reasons for this change:

1) Windows file names *are* in Unicode in the file system, so it's
certainly most correct to handle them as Unicode and not shoehorn them
into a restricted codepage representation. For instance, this makes it
possible to support file names with Cyrillic letters on a Western
European Windows box. I think it is also relatively common in CJK
locales to use characters not in the corresponding double-byte
codepage.

2) In the double-byte code pages the trailing byte can be '\\', which
otherwise is a directory separator. This means that all code that
scans pathnames byte by byte looking for backslashes (either stepping
through a string manually, or using strchr() or strrchr()) is broken
by design, and would need to be rewritten heavily with ugly ifdefs to
use multi-byte string functions on Win32. There are a lot of such
places. UTF-8 doesn't have any such issue.
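
A small illustration of point 2, as a sketch only. The helper
basename_bytewise() and the two sample paths are made up for this
example. The character used is U+8868, which is encoded as the bytes
0x95 0x5C in Shift-JIS (codepage 932) and as 0xE8 0xA1 0xA8 in UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Typical byte-wise code: treat everything after the last backslash
 * byte as the file name part of the path. */
static const char *basename_bytewise(const char *path)
{
    assert(path != NULL);
    const char *sep = strrchr(path, '\\');
    return sep != NULL ? sep + 1 : path;
}

/* The path C:\dir\<U+8868>.xml in codepage 932: the trailing byte of
 * the double-byte character is 0x5C, the same value as '\\'. */
static const char cp932_path[] = "C:\\dir\\\x95\x5c.xml";

/* The same path in UTF-8: every byte of a multi-byte sequence is
 * 0x80 or above, so none of them can be mistaken for '\\'. */
static const char utf8_path[] = "C:\\dir\\\xe8\xa1\xa8.xml";
```

With the codepage 932 string, basename_bytewise() splits the
character in half and returns just ".xml". With the UTF-8 string it
returns the whole file name, as intended.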

Now, upper level GNOME libraries that use GLib can mostly be converted
trivially to use the gstdio wrappers. (I use "GNOME" in a loose sense
here. Of course a GNOME desktop as such doesn't and won't exist on
Windows, but many of the GNOME libraries are being ported to Windows
so that it will be possible to build GNOME applications on Windows.)

The problem is libraries that don't use GLib but are widely used by
GNOME libraries, for instance libxml2.

As the GNOME libs get "UTF-8 aware", i.e. are converted to use the
gstdio wrappers, what should be done with pathnames passed to libxml2?
If I convert them to the system codepage, XML files whose pathnames
aren't representable in that codepage can't be opened. That would not
be good, as the intention otherwise is to make everything work just
fine with any non-ASCII file name.

I found one earlier message to this list about this issue,
http://mail.gnome.org/archives/xml/2001-October/msg00072.html . There
the suggested solution was to override libxml2's default I/O
interface. Presumably this would be by calling
xmlRegisterInputCallbacks() with an open callback that would call the
gstdio wrappers, but otherwise would be more or less a copy of the
default xmlFileOpen(). Is this still the recommended approach?

--tml



