Re: [xml] How to use iconv library to convert a UTF-8 string to GB2312 string



Li Baoli wrote:

Dear All,

 I'd like to develop a program dealing with XML documents which contain
Chinese text encoded in GB2312. I used the libxml2 library. Unfortunately, I
encountered problems while converting a UTF-8 string to a GB2312 string using
the iconv library. Below is the code I wrote. Would you please give me some
hints about how to solve this problem? Thanks very much!

Li,

This is off topic for the XML list, but I don't know which list would be
the right one, so I'll answer your question here.

Does anyone know where general iconv questions get answered?


======================================================
/*function invocation:*/

fprintf(stdout, "The converted result is %s\n",
codesetswitch("UTF-8","GB2312", UTF_8_String));

------------------------------------------

/*function declaration:*/

char *strcodechange(const char *from_code, const char *to_code, unsigned
char *instr)
{
iconv_t cd;
char from[BUFSIZ], to[BUFSIZ];

^^^^^^^^^^^^^^
You're returning a pointer to this 'to' array as your converted
buffer?

'to' is a local array, so the pointer is invalid once this function
returns. You'll need to allocate a buffer or pass one into this function.
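
A minimal sketch of the "pass a buffer in" option follows. The name convert_into and its return convention are hypothetical, not part of iconv or libxml2, and the single terminating '\0' assumes a byte-oriented target encoding such as GB2312 (it would not be enough for UTF-16 or UCS-4).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` into the
 * caller-supplied buffer `out` of `out_size` bytes.
 * Returns 0 on success, -1 on failure (errno is left as set by iconv). */
int convert_into(const char *to_code, const char *from_code,
                 const char *in, char *out, size_t out_size)
{
    if (out_size == 0) {
        errno = E2BIG;
        return -1;
    }

    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;      /* iconv wants char **, it does not write to the input */
    size_t ileft = strlen(in);     /* number of input bytes: must be initialized */
    char *outptr = out;
    size_t oleft = out_size - 1;   /* keep one byte for the terminating NUL */

    size_t ret = iconv(cd, &inptr, &ileft, &outptr, &oleft);
    if (ret != (size_t)-1)
        ret = iconv(cd, NULL, NULL, &outptr, &oleft); /* flush any shift state */

    int saved_errno = errno;
    iconv_close(cd);

    if (ret == (size_t)-1) {
        errno = saved_errno;
        return -1;
    }
    *outptr = '\0';                /* iconv itself never terminates the output */
    return 0;
}
```

A call such as convert_into("GB2312", "UTF-8", UTF_8_String, buf, sizeof buf) could then stand in for the original invocation, provided the platform's iconv knows both encoding names. No intermediate 'from' copy is needed at all.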

Basically, this is a poorly designed interface. For example, how do you
handle long strings? This code will overflow the buffers and hence is a
security risk. Also, how do you handle partial characters or stateful
encodings like ISO-2022? I don't remember what the standard says, but you
should expect that the iconv_t conversion descriptor holds stateful
information about where you are in your conversion.

a) Manage your buffer sizes more carefully
b) Keep your conversion context (the iconv_t) around for managing long strings

Whoops, this then points to using an interface like iconv itself.
Maybe you need to integrate iconv more closely, or perhaps keep a limited
interface like the one you have but be much more careful about managing
your buffers.
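
Points a) and b) can be sketched together as follows: one iconv_t is kept open for the whole string and the output buffer is grown whenever iconv reports E2BIG. The helper name convert_alloc is hypothetical, the starting capacity of 32 bytes is deliberately small to show the growth path, and the single '\0' terminator again assumes a byte-oriented target encoding.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert `in_len` bytes of `in` and return a malloc'd,
 * NUL-terminated result that the caller must free(). Returns NULL on failure.
 * If `out_len` is not NULL it receives the converted length in bytes. */
char *convert_alloc(const char *to_code, const char *from_code,
                    const char *in, size_t in_len, size_t *out_len)
{
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = 32;               /* small on purpose; doubled on E2BIG */
    char *out = malloc(cap);
    char *inptr = (char *)in;
    size_t ileft = in_len;
    size_t used = 0;
    int flushing = 0;

    while (out != NULL) {
        char *outptr = out + used;
        size_t oleft = cap - used - 1;   /* keep one byte for the NUL */
        size_t ret = flushing
            ? iconv(cd, NULL, NULL, &outptr, &oleft)      /* emit final shift sequence */
            : iconv(cd, &inptr, &ileft, &outptr, &oleft); /* same cd, so state is kept */
        used = (size_t)(outptr - out);

        if (ret != (size_t)-1) {
            if (flushing)
                break;                   /* input converted and state flushed */
            flushing = 1;
            continue;
        }
        if (errno != E2BIG) {            /* EILSEQ, EINVAL, ...: give up */
            free(out);
            out = NULL;
            break;
        }
        cap *= 2;                        /* output full: grow and continue */
        char *bigger = realloc(out, cap);
        if (bigger == NULL) {
            free(out);
            out = NULL;
            break;
        }
        out = bigger;
    }

    iconv_close(cd);
    if (out != NULL) {
        out[used] = '\0';
        if (out_len != NULL)
            *out_len = used;
    }
    return out;
}
```

Because inptr and ileft are advanced by iconv on every call, the loop simply resumes where the previous call stopped once more room is available.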

There is one major problem with iconv, though admittedly it's an edge
case. When the input stream is corrupted, iconv gives you very little
information about how to recover, which forces you to embed knowledge of
the encoding in your own code (if you want to be able to recover from
errors, that is).
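
For illustration only, here is one crude recovery strategy that needs no knowledge of the encoding: when iconv fails with EILSEQ or EINVAL, drop a single input byte, emit a '?' marker, and try again. The helper name convert_lossy is hypothetical, the '?' marker assumes an ASCII-compatible target encoding, and a single bad multibyte character may produce several markers, which is exactly the lack of information described above.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `in_len` bytes of `in` into `out`, replacing
 * each unconvertible input byte with '?'.
 * Returns the number of bytes replaced, or -1 on any other failure. */
long convert_lossy(const char *to_code, const char *from_code,
                   const char *in, size_t in_len,
                   char *out, size_t out_size)
{
    if (out_size == 0)
        return -1;

    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;
    size_t ileft = in_len;
    char *outptr = out;
    size_t oleft = out_size - 1;   /* keep one byte for the NUL */
    long skipped = 0;

    while (ileft > 0) {
        if (iconv(cd, &inptr, &ileft, &outptr, &oleft) != (size_t)-1)
            break;                 /* remaining input converted cleanly */

        /* E2BIG, or no room left for the marker: report failure */
        if ((errno != EILSEQ && errno != EINVAL) || oleft == 0) {
            skipped = -1;
            break;
        }

        /* inptr now points at the offending byte: skip it, leave a marker */
        inptr++;
        ileft--;
        *outptr++ = '?';
        oleft--;
        skipped++;
    }

    iconv_close(cd);
    *outptr = '\0';
    return skipped;
}
```

Anything smarter, such as resynchronizing on the next character boundary, requires knowing how the source encoding is laid out.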

// char *from_code, *to_code;
char *tptr;
const char *fptr;
size_t ileft, oleft, ret;

^^^^^^^

When is ileft initialized? It is never set to the number of input bytes
before iconv() is called. This is definitely another problem.


cd = iconv_open((const char *)to_code, (const char *)from_code);

strcpy(from, (const char *)instr);

^^^^^^^^^^^

Potential buffer overflow if instr is longer than BUFSIZ - 1 bytes.


memset(to,0,BUFSIZ);

^^^^^^

No need to call memset here; it just costs time. Writing a single
terminating '\0' after the converted output is enough.


if (cd == (iconv_t)-1)
{
 /** iconv_open failed*/
 (void) fprintf(stderr, "iconv_open(%s, %s) failed\n", to_code,
from_code);
 return NULL;
}

fptr = from;
tptr = to;
oleft = BUFSIZ;
ret = iconv(cd, &fptr, &ileft, &tptr, &oleft);


Potential mismatch on buffer sizes. Depending on the direction of the
conversion, the output (for example when converting to UTF-8) may be much
larger than the input string in the worst case.
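
A small sketch of sizing the output buffer from the input length instead of using a fixed BUFSIZ. The helper name output_size_guess and the idea of passing an expansion factor are assumptions for illustration; a factor of 4 is a common rule of thumb rather than a guarantee, so the E2BIG error from iconv still has to be handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: estimate an output buffer size as
 * in_len * expansion + 1 (the extra byte is for a terminating NUL).
 * Returns 0 if the expansion factor is 0 or the result would overflow size_t. */
size_t output_size_guess(size_t in_len, size_t expansion)
{
    if (expansion == 0 || in_len > (SIZE_MAX - 1) / expansion)
        return 0;
    return in_len * expansion + 1;
}
```

For example, output_size_guess(strlen(instr), 4) gives a size to pass to malloc before the conversion is attempted.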

if (ret != (size_t)-1)
{
 /** iconv succeeded*/
 (void) fprintf(stdout,"string:%s ; result: %s\n", from, to);
}
else
{
 (void)fprintf(stderr, "iconv error!\n");
 return NULL;
}

(void) iconv_close(cd);

return to;
}
---------------
/*OutPut*/

iconv error!

^^^^^^^
Probably due to the uninitialized ileft. But this is only the first of
many problems.


==================================================================

 Best regards,

 Li Baoli

--------------------------------------------------
Li Baoli
Institute of Computational Linguistics
Department of Computer Science and Technology
Peking University
Beijing, 100871   Phone: 86-10-6276 5835 ext 203
P.R. China        Email: libaoli btamail net cn
--------------------------------------------------



_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml gnome org
http://mail.gnome.org/mailman/listinfo/xml
 






