Re: g_getenv() encoding on Windows



J. Ali Harlow writes:
 > I've got no objection from an ABI point of view since I haven't built  
 > 2.6.0 yet anyway.

Good.

 > I'd like to ask more about the motivation, however.

(Let's restrict ourselves to NT-based Windows here; I don't really
want to think about Win9x with its 16-bit roots...)

The situation is quite similar to that of file names, which is the
main reason I suggested this. Environment variables are stored by
the system as Unicode. (To verify, try setting environment variables
in the Control Panel's System applet, switching keyboard layout so
that you can enter non-system-codepage Unicode chars into the
variable's name and/or value. The non-system-codepage chars will show
up as question marks in the output from Command Prompt's "set"
command, but they are correctly accessible from a program using the
wide-char Win32 API.)

To be as generic as possible and work in all circumstances, GLib
should use the wide-character Win32 API to manipulate environment
variables. (Just using the wide-character C runtime API is not enough,
see below.)

Another reason is that environment variables often contain path names
directly anyway (like PATH, GTK2_RC_FILES, GTK_IM_MODULE), or are
often used to construct file names. Thus it would be cleaner if
g_getenv() provided them in UTF-8 right away; otherwise you would
typically call g_locale_to_utf8() on the return value anyway.
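
As a rough illustration of the conversion step involved: once the
value has been fetched as UTF-16 (on Windows that would be through
GetEnvironmentVariableW()), it has to be re-encoded as UTF-8 before it
is handed to the caller. In GLib itself g_utf16_to_utf8() does this
job; the function below is a hypothetical stand-alone helper, not GLib
code, and it assumes that unpaired surrogates should become U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not a GLib function): convert a NUL-terminated
 * UTF-16 string to a newly malloc()ed UTF-8 string.  Unpaired
 * surrogates are replaced by U+FFFD, which is an assumption of this
 * sketch.  Returns NULL if the allocation fails. */
char *
utf16_to_utf8 (const uint16_t *in)
{
  size_t len = 0, o = 0, i;
  unsigned char *out;

  while (in[len] != 0)
    len++;

  /* One UTF-16 code unit yields at most 3 bytes of UTF-8; a surrogate
   * pair is 2 units and yields 4 bytes, so 3 bytes per unit suffice. */
  out = malloc (len * 3 + 1);
  if (out == NULL)
    return NULL;

  for (i = 0; i < len; i++)
    {
      uint32_t c = in[i];

      if (c >= 0xD800 && c <= 0xDBFF
          && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
          /* Combine a high/low surrogate pair into one code point.
           * Reading in[i + 1] is safe: at worst it is the NUL. */
          c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
          i++;
        }
      else if (c >= 0xD800 && c <= 0xDFFF)
        c = 0xFFFD; /* unpaired surrogate */

      if (c < 0x80)
        out[o++] = (unsigned char) c;
      else if (c < 0x800)
        {
          out[o++] = (unsigned char) (0xC0 | (c >> 6));
          out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        }
      else if (c < 0x10000)
        {
          out[o++] = (unsigned char) (0xE0 | (c >> 12));
          out[o++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        }
      else
        {
          out[o++] = (unsigned char) (0xF0 | (c >> 18));
          out[o++] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
          out[o++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          out[o++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
  out[o] = '\0';
  return (char *) out;
}
```

The caller owns the returned string and releases it with free().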

 > As I understand it, the point of switching to UTF-8 for filenames is
 > that MS-Windows stores these in Unicode internally rather than in
 > the system codepage so you can run into problems if the filename
 > can't be encoded in the system codepage.

How environment variables work in Windows is a mess. (Hardly a
surprise, is it?) 

Starting from the system's perspective, environment variables are
stored as Unicode (UTF-16) internally, per-process, in an area not
accessible to the process (AFAIK).

The calls GetEnvironmentVariable() and SetEnvironmentVariable() access
this area. (Like most of the Win32 API, the calls exist in wide
character (Unicode) and "ANSI" (system codepage) versions, suffixed
with "W" and "A". The "W" versions are the "real" ones, the "A" ones
just wrappers, AFAIK.)

Then we have the C runtime, which keeps its own copy of the
environment variables (!). This copy is set up by the C library
startup code, and is pointed to by the Unix-like char **environ
variable. But that's not all. The C runtime provides parallel "ANSI"
and wide-char APIs (getenv(const char*) vs. _wgetenv(const wchar_t*),
putenv(const char*) vs. _wputenv(const wchar_t*)). The environ pointer
also has a wide-char counterpart, wchar_t **_wenviron.

A further complication is that applications can have either a normal
"main(int,char**)" or a "wmain(int,wchar_t**)" function. I haven't
really done much experimenting with this; I don't even know if the
mingw compiler supports the wmain stuff. But anyway, in a "main" app
the C runtime initializes its environment table in the system
codepage, and in a "wmain" app as Unicode. As most (all)
GLib/GTK+-using apps presumably have a "main" and not a "wmain",
environment variables not expressible in the system codepage are
presumably broken right from the start.
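
To make that loss concrete, here is a toy model of the narrowing step.
It is only a sketch under loud assumptions: it pretends that the
system codepage is ISO-8859-1 and that every character outside it
becomes '?'. The real conversion depends on the actual codepage and
may also apply best-fit mappings. narrow_to_latin1() is a made-up
name, not a C runtime or Win32 function.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model (assumption: the codepage is ISO-8859-1): copy a
 * NUL-terminated UTF-16 string into a byte buffer, replacing every
 * code unit the codepage cannot express with '?'.  The output is
 * always NUL-terminated if out_size > 0.  Returns the number of
 * characters that were replaced, i.e. irreversibly lost. */
size_t
narrow_to_latin1 (const uint16_t *in, char *out, size_t out_size)
{
  size_t lost = 0, o = 0, i;

  if (out_size == 0)
    return 0;

  for (i = 0; in[i] != 0 && o + 1 < out_size; i++)
    {
      if (in[i] < 0x100)
        out[o++] = (char) in[i];
      else
        {
          out[o++] = '?';
          lost++;
        }
    }
  out[o] = '\0';
  return lost;
}
```

Two different Unicode values can narrow to the same byte string, so
once the table has been built this way the original value cannot be
recovered from it.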

(To even further complicate things, in the multi-thread-safe C library
MSVCRT.DLL, which is used by GLib, GTK+ etc, and is supposed to be
used by GLib/GTK+-using apps, the environ pointers are actually macros
that expand to a function call. But let's ignore that for now.)

Here is a quote from a message I sent some years ago to the
mingw-users list:

  You don't need Win32 source to check how environment variables work in
  the C runtime. The C runtime sources are enough. You get the sources
  to (most of) msvcrt.dll with the Platform SDK, which is freely
  downloadable (well, "freely installable") from Microsoft.

  After a quick glance it seems to me that it works like you say most
  Unix C runtimes do it; there is a global _environ variable that points
  to an array of char pointers.

  I don't see, offhand, any indication that the size of the environment
  in the C runtime would be limited by anything else than heap space. In
  particular, check the putenv.c and setenv.c files.

  (If you look at the sources, they are complicated a bit by having a
  separate environment array for Unicode- and non-Unicode programs
  (_environ and _wenviron), and still having things work if a Unicode
  program calls a non-Unicode function that calls putenv(), for
  instance. A Unicode program is here defined as one that is compiled
  with -D_UNICODE and then uses the t-forms, defined in <tchar.h>, of
  various C runtime functions to access the wide char versions. This is
  orthogonal to accessing the wide versions of Win32 API functions,
  which is indicated by compiling with -DUNICODE.)

  One thing that makes environment variables more interesting is that
  the operating system also maintains environment variables. See
  GetEnvironmentStrings(), GetEnvironmentVariable() and
  SetEnvironmentVariable(). (Compare to Unix, where the environment is
  strictly something in the C runtime only. Umm, except that it is
  passed to a new process in the exec functions, but otherwise a Unix
  kernel doesn't know anything about it.) The Microsoft C runtime
  initialises the C environment from the one maintained by the OS, and
  after changing or adding an environment variable, it updates the one
  in the OS, too.

  Hmm, now I think I see a possible explanation for your problem. If the
  SetEnvironmentVariable() call fails, for instance because of a too
  large environment variable (the limit is 32K), __crtsetenv() returns
  -1, even if the C runtime environment has been updated correctly. Upon
  __crtsetenv() returning -1, putenv() then free()s the space allocated
  for the copy of the environment variable, leaving a dangling pointer
  in _environ! Seems like a clear bug in Microsoft's code to me.

  If you modify _environ directly, the OS view of the environment won't
  be affected. But if you start child processes with the exec or spawn
  functions, they will pass _environ (as modified by you) to
  CreateProcess().
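
The table layout mentioned in the quoted message (a NULL-terminated
array of pointers to "NAME=value" strings) can be sketched as below.
This is only an illustration of the data structure, not the C
runtime's code; lookup_env() is a hypothetical name. It compares names
case-sensitively, as on Unix, whereas Windows treats environment
variable names case-insensitively.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search a NULL-terminated array of "NAME=value"
 * strings for NAME and return a pointer to the value inside the
 * table, or NULL if NAME is not present.  Case-sensitive. */
const char *
lookup_env (char *const *envp, const char *name)
{
  size_t len = strlen (name);

  for (; envp != NULL && *envp != NULL; envp++)
    {
      /* The name must match in full and be followed by '='. */
      if (strncmp (*envp, name, len) == 0 && (*envp)[len] == '=')
        return *envp + len + 1;
    }
  return NULL;
}
```

Note that the returned pointer points into the table itself, which is
why freeing an entry while the table still refers to it leaves a
dangling pointer.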

Try setting the environment variable "FOO" to a value containing both
Latin and Cyrillic characters, for instance. Then run this sample
program:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int
main (int argc, char **argv)
{
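  wchar_t buf[100], *p;
  int i, n;

  /* Returns the value's length in wchar_t's (0 if FOO is unset), or the
   * required buffer size if the value does not fit in buf. */
  n = (int) GetEnvironmentVariableW (L"FOO", buf, 100);
  printf ("GetEnvironmentVariableW: %d: \"", n);
  if (n >= 100)
    n = 0; /* too long for buf; don't read past the end of it */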

  for (i = 0; i < n; i++)
    if (buf[i] == '\\')
      printf ("\\\\");
    else if (buf[i] < 0x80)
      printf ("%c", buf[i]);
    else
      printf ("\\u%04x", buf[i]);
  printf ("\"\n");

  p = _wgetenv (L"FOO");
  printf ("_wgetenv: %p: \"", (void *) p);

  if (p != NULL)
    while (*p)
      {
	if (*p == '\\')
	  printf ("\\\\");
	else if (*p < 0x80)
	  printf ("%c", *p);
	else
	  printf ("\\u%04x", *p);
	p++;
      }
  printf ("\"\n");

  return 0;
}

You will notice that GetEnvironmentVariableW() works correctly, but
_wgetenv() doesn't: the non-system-codepage chars are replaced by
question marks.

--tml




