Re: charset issue on Windows help needed



No wonder you're lost.  You're jumping around between different character
encodings (Linux UTF-8 versus Windows UTF-16) and different character
repertoires (a full 255+ character language set down to the 127-character
ASCII subset without diacritics) from different operating systems (Linux
and Windows).

Let's clarify: multi-lingual web pages are all UTF-8, especially in
Linux land.  Emails are all encoded in UTF-8 too, if you want your
multilingual emails to display the characters from all the different
languages correctly.

Now going back to PRE-WINDOWS (DOS-only) days: legacy ASCII with its
per-country code pages ended up giving us different characters with
different diacritics encoded in different, incompatible ways.

When WINDOWS came along, it brought its own opinion: "let's represent
all the characters in the world with two bytes", called UTF-16, AKA
Windows Unicode (originally the fixed-width UCS-2).  This actually makes
sense when running and storing strings in RAM at execution time.

Unix said "hey, let's do it with 4 bytes", called UTF-32, just to be
different and even more future-proof, i.e. ready for the Vulcan language
and yet-to-be-discovered tribal languages.  This also makes sense when
running and storing strings in RAM at execution time.

The real standard surfaced when web browsers wanted to display multiple
languages in one web page: UTF-8.  In the early days you had to resort
to text files tweaked into UTF-8 encoding using special text editors
that could handle it.  UTF-8 is more complicated because each character
on the planet is represented with a different number of bytes.  When
storage was a concern, UTF-8 won because it takes up less space, using
only the necessary number of bytes to represent each character.

UTF-8's strengths:
- backward compatibility with ASCII
- no endianness / byte-order issues
- as a result of the other two, better for storage on disk
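You can see the variable-width encoding at work with a few u8 string
literals (a minimal C++11 sketch; the byte counts in the comments are
what the UTF-8 encoding produces):

#include <iostream>
#include <string>

int main()
{
    // u8"" literals are encoded as UTF-8 on every platform.
    std::string a = u8"A";     // plain ASCII letter:    1 byte
    std::string l = u8"Ł";     // L with stroke, U+0141: 2 bytes
    std::string e = u8"€";     // euro sign, U+20AC:     3 bytes
    std::cout << a.size() << " " << l.size() << " " << e.size() << "\n";
    // prints: 1 2 3
}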

-----------------------------------------
It goes without saying the moment you hit windows, you were bound to run
into problems :)

I hope you are using boost, g++ and gtkmm.  You shouldn't be tripping
over small details like this now.  There are OS helper calls for path
separators that you should use.

For the path separator, boost::filesystem::path("/").native()
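For example (a minimal sketch; note that native() returns the path in
the native string type, while make_preferred() is what actually converts
the separators, and path::preferred_separator is the separator character
itself):

#include <boost/filesystem.hpp>
#include <iostream>

int main()
{
    namespace fs = boost::filesystem;
    fs::path p("c:/gcdev/data");
    p.make_preferred();               // '/' becomes '\' on Windows
    std::cout << p.string() << "\n";  // c:\gcdev\data on Windows
}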


Your env var still has the diacritics:
GNC_DOT_DIR is set to "c:\gcdev\Łukasz"


yet gdb prints val_win (a wchar_t string) as L"c:\\gcdev\\Lukasz" and
val_utf8 as "c:\\gcdev\\Lukasz"

Who set the environment variable?
Who created the directory?
Why are they different to begin with?

Do you want to preserve the diacritics or not?

---------------------------------------
If you do want to preserve the diacritics, then proceed with UTF-16 aka
wchar_t strings.  I am guessing this is what you want.  Just use the
standard string classes, but make sure that
.c_str() will return the correct character type pointer:
const charT* c_str() const noexcept;
which will give you the correct pointer to an array of UTF-16 characters
for all the WINDOWS OS API calls.
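For instance (a minimal sketch using the GNC_DOT_DIR variable from this
thread; the buffer size is arbitrary):

#include <windows.h>
#include <string>

int main()
{
    // Hold the value in a std::wstring; .c_str() then yields the
    // const wchar_t* that the wide (W) Win32 APIs expect.
    wchar_t buf[4096];
    DWORD len = GetEnvironmentVariableW(L"GNC_DOT_DIR", buf, 4096);
    std::wstring val = (len && len < 4096) ? std::wstring(buf, len)
                                           : std::wstring();
    MessageBoxW(NULL, val.c_str(), L"GNC_DOT_DIR", MB_OK);
    return 0;
}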


C++ has string classes that deal with all of this elegantly:

#include <string>
#include <locale>
#include <codecvt>

std::string s = u8"Hello, World!";
// std::codecvt_utf8_utf16 is the facet that converts between UTF-8 and
// UTF-16 (the plain std::codecvt<char16_t,char,mbstate_t> facet has a
// protected destructor and can't be used with wstring_convert directly).
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;

std::u16string u16 = convert.from_bytes(s);
std::string u8 = convert.to_bytes(u16);

If you ever compared the length in bytes of these two strings, you would
notice they are very different.
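For this ASCII-only example, u8 holds 13 bytes while u16 holds 13
two-byte code units, i.e. 26 bytes; note that u16string::size() counts
code units, so multiply by sizeof(char16_t) to get bytes.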

------------------------------------------------
If you don't want to preserve the diacritics, then you could consider
using the source code for uni2ascii with the "-d" switch.
http://www.billposer.org/Software/uni2ascii_man.html
uni2ascii - convert UTF-8 Unicode to various 7-bit ASCII representations

-d
    Strip diacritics. This converts single codepoints representing
characters with diacritics to the corresponding ASCII character and
deletes separately encoded diacritics.


On 09/16/2014 09:19 PM, Fernando Rodriguez wrote:
On Tuesday 16 September 2014 6:56:11 PM Geert Janssens wrote:
On Saturday 13 September 2014 14:24:35 Fernando Rodriguez wrote:
On Saturday 13 September 2014 4:21:18 PM Geert Janssens wrote:
Thanks a lot !

I'll try to apply a similar approach in gnucash for the
home dir use case.

For my second case, anybody know how to read an
environment variable directly in win32 using wide char
functions ?


Geert

You'd use the GetEnvironmentVariable function. What compiler and IDE
are you using?

I'm using mingw32, gcc 4.8.1.

In Windows every API function that deals with text has two variants,
so there's GetEnvironmentVariableA and GetEnvironmentVariableW.

If you use Visual Studio there's an option in the project properties to
select the encoding.  It #defines a symbol that makes each API macro
expand to one variant or the other, and you would use the TCHAR type to
store the strings: it's a typedef for char when using ANSI and wchar_t
when using Unicode.  (There's a bunch of string typedefs along the same
lines, for example LPSTR is char*, LPWSTR is wchar_t*, and LPTSTR is
char* for ANSI and wchar_t* for Unicode.)  I'm not sure what the symbol
is called, so if you're not using VS you can just use the wide char
variants directly, or look at the headers and find out what you need to
define.
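(For reference, the symbol is UNICODE, plus _UNICODE for the CRT
headers.  A minimal sketch of the mechanism:)

#define UNICODE      // must come before <windows.h>
#define _UNICODE     // same, for the CRT/<tchar.h> headers
#include <windows.h>
#include <tchar.h>

int main()
{
    TCHAR buf[4096];  // wchar_t here; would be char without UNICODE
    // GetEnvironmentVariable is a macro that expands to
    // GetEnvironmentVariableW because UNICODE is defined.
    DWORD len = GetEnvironmentVariable(TEXT("PATH"), buf, 4096);
    return len ? 0 : 1;
}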

Thanks for the additional detail (for some reason my previous mail got
truncated by the list software).

I have now written this function:

#ifdef G_OS_WIN32
#define BUFSIZE 4096
The maximum size of a user-defined environment variable is 32,767
characters. There is no technical limitation on the size of the
environment block. However, there are practical limits depending on the
mechanism used to access the block. For example, a batch file cannot set
a variable that is longer than the maximum command line length.

On computers running Microsoft Windows XP or later, the maximum length
of the string that you can use at the command prompt is 8191 characters.
On computers running Microsoft Windows 2000 or Windows NT 4.0, the
maximum length of the string that you can use at the command prompt is
2047 characters.
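A common pattern that avoids picking a fixed BUFSIZE at all (a sketch
relying on the documented behavior that a too-small buffer makes the
call return the required size in characters):

#include <windows.h>
#include <string>

std::wstring get_env_w(const wchar_t* name)
{
    // With a zero-length buffer the call returns the required size,
    // including the terminating null (or 0 if the variable is not set).
    DWORD needed = GetEnvironmentVariableW(name, NULL, 0);
    if (needed == 0)
        return std::wstring();
    std::wstring val(needed, L'\0');
    DWORD written = GetEnvironmentVariableW(name, &val[0], needed);
    val.resize(written);  // 'written' excludes the terminating null
    return val;
}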

You shouldn't be defining functions like this.  Let the operating
system and MinGW/Cygwin do what they do best for you.  Like I said, try
to stay at the higher-level, easier-to-use APIs in C++/boost/gtkmm.
You're making it harder on yourself.


static gchar* get_env_utf8 (const gchar* var_name)
{
    LPWSTR val_win;
    gunichar2 *var_name_win;
    gchar *val_utf8;
    guint32 retval;

    ENTER();
    val_win = (LPWSTR) malloc (BUFSIZE * sizeof(WCHAR));
    if (!val_win)
        return NULL; /* Out of memory... */

    /* Convert the variable name to UTF-16; free the temporary again */
    var_name_win = g_utf8_to_utf16 (var_name, -1, NULL, NULL, NULL);
    retval = GetEnvironmentVariableW ((LPCWSTR) var_name_win, val_win, BUFSIZE);
    g_free (var_name_win);

    if (0 == retval)
    {
        free (val_win);
        return NULL;  /* Variable not set */
    }

    if (BUFSIZE < retval)
    {
        PWARN("Value of environment variable %s is longer than %d. "
              "The code can't handle this, so returning NULL instead.",
              var_name, BUFSIZE);
        free (val_win);
        return NULL; /* Whoa, path is way too long... */
    }

    /* May still return NULL if the UTF-16 to UTF-8 conversion fails */
    val_utf8 = g_utf16_to_utf8 ((gunichar2*) val_win, -1, NULL, NULL, NULL);
    free (val_win);
    return val_utf8;
}
#endif

But in the end val_utf8 still doesn't keep the special characters.

If the environment variable GNC_DOT_DIR is set to "c:\gcdev\Łukasz",
val_win is printed in gdb as L"c:\\gcdev\\Lukasz" and
val_utf8 as "c:\\gcdev\\Lukasz"

If I examine the individual characters using
print val_win[9] and print val_win[10], those result in
L'L' and L'u'.  To me that looks as if there are no wide characters in
the original string. :(

This is really puzzling me. What am I missing ?

Geert

Sorry for taking so long, I've been struggling with something of my own. Did 
you ever get this sorted?

I'm not sure what's wrong. Are you running gdb from the console in
Windows or from an IDE?

I would try printing the bytes as hexadecimal to make sure you're not being 
lied to (use a short pointer to print them) or display it on a messagebox:

MessageBoxW(NULL, val_win, NULL, MB_OK);
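For the hex route, a minimal sketch (assuming val_win is the wide
buffer from the function above):

#include <cstdio>

// Print each UTF-16 code unit in hex: a real 'Ł' shows up as 0141,
// a plain ASCII 'L' as 004c.
static void dump_hex(const wchar_t* s)
{
    for (; *s; ++s)
        std::printf("%04x ", static_cast<unsigned>(*s));
    std::printf("\n");
}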

Also how did you set the env variable?
Can you echo it on the console or look it up on the GUI to make sure it's set 
right?


Let me know what you find, I'm curious.



