Re: [gnumeric-list][PATCH - new version] handle unicode and codepages for Import/Export of Excel files



On Wed, 14 Mar 2001, it was written:

 Hello Jody, 

It is too close to a release to try something like this.
I need more time to consider and test it.  Possibly after release.

 OK. Feel free to ask any questions about the patch.
 
On Tue, Mar 13, 2001 at 05:02:50PM +0400, gnumeric-list-admin gnome org wrote:
@@ -587,12 +594,23 @@
            ans->hidden = MS_BIFF_H_VISIBLE;
            break;
    }
+#if 0
    if (ver == MS_BIFF_V8) {
-           int slen = MS_OLE_GET_GUINT16 (q->data + 6);
+           int slen = MS_OLE_GET_GUINT16 (q->data + 6);            
            ans->name = biff_get_text (q->data + 8, slen, NULL);
-   } else {
+   } else 
+#endif
+   { 
+           /* 
+            * there are test files produced by non-latin1 Excel (e.g. 
+            * russian version) that prove that branch above is 
+            * incorrect. It seems test files that insured author of branch
+            * above were produced by latin1 version of Excel - 
+            * in that case q->data[7] is always 0, so it can be attributed
+            * to length of sheet name or to the string header.
+            *                      - Vlad Harchev <hvv hippo ru>
+            */
            int slen = MS_OLE_GET_GUINT8 (q->data + 6);
-
            ans->name = biff_get_text (q->data + 7, slen, NULL);
    }
This is one of the few areas that the XL file format docs are very
clear on.  I'll need to study this in more detail before it can go
in.

 Then the spec lies. In the file I sent you, which uses Russian sheet names,
the first two sheets have 5-character-long names stored in Unicode, and
for that file the following is true:
MS_OLE_GET_GUINT8 (q->data + 6) == 5
MS_OLE_GET_GUINT8 (q->data + 7) == 1 (string header - that sets 'word_chars'
                        to 1 in biff_string_get_flags())
MS_OLE_GET_GUINT8 (q->data + 8) -> string in Unicode (parsed as Unicode
                                  since the string header at q->data+7 says so)

 If interpreted using the branch I commented out - i.e. according to the spec,
as you say - it's parsed as
MS_OLE_GET_GUINT16 (q->data + 6) == 261 == (0x100 + 0x5) - i.e. the string
                                will be 261 characters long!
MS_OLE_GET_GUINT8 (q->data + 8) string without header - will be treated as a
                        multibyte string (and will be of the form
                0x4 0x30 0x4 0x32 ... in that file - since the Unicode page for
                Russian is the 5th)

 Yes, for strings that are encoded using a Windows codepage - i.e. in
multibyte form, e.g. ASCII - MS_OLE_GET_GUINT8 (q->data + 7) is 0, which can
be attributed either to the 2nd byte of a 16-bit-wide length or to a string
header with no flags set.

 So it's clear that the MS spec for BIFF v8 is broken in this respect.
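
 To make the two readings concrete, here is a small standalone sketch (not
part of the patch). The helper names, the struct and the example bytes are
made up for illustration; only the values at offsets 6 and 7 (5 and 1) come
from the file described above, and the little-endian read is assumed to match
what MS_OLE_GET_GUINT16 does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for MS_OLE_GET_GUINT8 / MS_OLE_GET_GUINT16
 * (assumed to be plain little-endian reads). */
static unsigned get_u8 (const uint8_t *p) { return p[0]; }
static unsigned get_u16_le (const uint8_t *p) { return p[0] | ((unsigned) p[1] << 8); }

typedef struct {
	unsigned length;      /* number of characters in the sheet name   */
	int      is_unicode;  /* non-zero if characters are 16 bits wide  */
	size_t   text_offset; /* where the character data starts          */
} SheetNameHeader;

/* Reading argued for in this message:
 * 8-bit length at +6, string header (flags) at +7, characters from +8. */
static SheetNameHeader
parse_len8_plus_flags (const uint8_t *data)
{
	SheetNameHeader h;
	h.length      = get_u8 (data + 6);
	h.is_unicode  = (get_u8 (data + 7) & 0x01) != 0;
	h.text_offset = 8;
	return h;
}

/* Reading used by the branch under "#if 0":
 * 16-bit length at +6, characters from +8, no string header. */
static SheetNameHeader
parse_len16 (const uint8_t *data)
{
	SheetNameHeader h;
	h.length      = get_u16_le (data + 6);
	h.is_unicode  = 0;
	h.text_offset = 8;
	return h;
}

/* Made-up record: bytes 0..5 and the 10 text bytes are zero placeholders;
 * offsets 6 and 7 hold the values seen in the Russian test file. */
static const uint8_t example_record[18] = {
	0, 0, 0, 0, 0, 0,
	0x05,             /* +6 */
	0x01              /* +7 */
};
```

 With these bytes parse_len8_plus_flags() gives length 5 with the Unicode flag
set, while parse_len16() gives length 261. When the byte at +7 is 0 both
readings give the same length, which is why latin1 files cannot tell them
apart.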

+static char*
+get_locale_charset_name()
+{
+#ifndef HAVE_ICONV
+   return "";
+#else
+   static char* charset = NULL;
+
+   if (charset)
+           return charset;
+           
+#ifdef _NL_CTYPE_CODESET_NAME
+   charset = nl_langinfo (_NL_CTYPE_CODESET_NAME);
+#elif defined(CODESET)
+   charset = nl_langinfo (CODESET);
+#else
+   {
+           char* locale = setlocale(LC_CTYPE,NULL);
+           char* tmp = strchr(locale,'.');
+           if (tmp)
+                   charset = tmp+1;
+   }
+#endif  
+   if (!charset)
+           charset = "ISO-8859-1";
+   charset = g_strdup(charset);
+   return charset;
^^^^^
This seems like a recipe for problems.
Is the caller responsible for freeing the string?  What ensures
that the value will be valid?  If that is the case then the
!HAVE_ICONV case will fail.

 No, the caller never frees the returned value. The function you've quoted is
private, so only functions in that module can call it, and they know what its
semantics are.
 The returned string is never freed, but since it's allocated only once, it
doesn't hurt.
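
 For comparison, one way to get the same "computed once, never freed by the
caller" behaviour without any heap allocation is to cache the name in a static
buffer. This is only a sketch under assumptions, not what the patch does: the
function name and the 64-byte buffer size are made up, it relies on POSIX
nl_langinfo (CODESET) being available, and it keeps the patch's "ISO-8859-1"
fallback.

```c
#include <langinfo.h>
#include <stdio.h>

/* Hypothetical variant of get_locale_charset_name(): the result lives in a
 * static buffer, so the caller must not free it and nothing is leaked. */
static const char *
locale_charset_name_sketch (void)
{
	static char cached[64];   /* size is an arbitrary assumption */
	const char *name;

	if (cached[0] != '\0')
		return cached;    /* already computed on an earlier call */

	name = nl_langinfo (CODESET);
	if (name == NULL || name[0] == '\0')
		name = "ISO-8859-1";   /* same fallback as in the patch */

	/* Copy, because the string returned by nl_langinfo may be
	 * overwritten by later calls or by setlocale(). */
	snprintf (cached, sizeof cached, "%s", name);
	return cached;
}
```

 Every call returns the same pointer, and the ownership rule ("never free the
result") then holds in both the HAVE_ICONV and the !HAVE_ICONV case.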

+#endif
+}

+guint
+excel_iconv_win_codepage()
+{
+   char* lang = NULL;
+   static guint codepage = 0;
+   char* env_lang;
+   
+   if (codepage)
+           return codepage;
+           
+   /* the code below is executed only once */
If it is called only once why leak?

 No, the function is called multiple times, but it caches the answer in the
static variable 'codepage'. The value is computed only the first time the
function is called; on subsequent calls the cached value (from the static
variable) is returned.
 But yes, 'lang' is leaked on the first call - I didn't code the freeing, to
keep the function body simple. If you think it's better to free it after
use, I can rewrite the patch to do so.

 Best regards,
  -Vlad




