Re: [xml] How do I read German Umlaute - entities from an XML-File using libxml?



On Fri, Oct 05, 2001 at 08:42:29AM -0700, Bill Moseley wrote:
At 11:11 AM 10/05/01 -0400, Daniel Veillard wrote:
 The libxml callback gives UTF8 encoded strings. It seems that you expect
ISO-8859-1 encoded ones. You need to convert between the two. There is a routine
called isolat1ToUTF8, use it or change your program to use UTF8 encoding.
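
To make the relationship between the two encodings concrete, here is a
minimal sketch of the Latin-1 to UTF-8 direction. It is a hypothetical
stand-alone helper (the name latin1_to_utf8 and its size_t parameters are
assumptions made for illustration), not libxml's isolat1ToUTF8. It shows why
an umlaut that is one byte in ISO-8859-1 arrives from the parser as two bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in, NOT the libxml routine: converts ISO-8859-1
 * bytes to UTF-8. Returns the number of bytes written, or -1 if the
 * output buffer is too small. */
static int latin1_to_utf8(unsigned char *out, size_t outsize,
                          const unsigned char *in, size_t inlen)
{
    size_t o = 0;

    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII is identical in both encodings. */
            if (o + 1 > outsize)
                return -1;
            out[o++] = c;
        } else {
            /* U+0080..U+00FF become a two-byte sequence:
             * 110000xx 10xxxxxx */
            if (o + 2 > outsize)
                return -1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return (int)o;
}
```

For example, the Latin-1 byte 0xE4 (a with umlaut) comes out as the two
bytes 0xC3 0xA4, which is what a program expecting ISO-8859-1 would then
display as two unrelated characters.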

BTW- Is there a way to get UTF8Toisolat1 to replace *non* 8859-1 chars with
something else (e.g. a space)?  

  No, hardcoding a given behaviour in case of error makes no sense.

Say I have a long UTF-8 string with a number of non-ASCII chars that *do*
convert to Latin-1, but one character that doesn't.  UTF8Toisolat1 returns
an error and I'm forced to use the UTF-8 string, which means I lose the
characters that would have been converted (and, worse, end up using the
UTF-8 bytes as if they were Latin-1).

  No, UTF8Toisolat1 will convert everything it can. It may stop:
     - and return -2 if there is a conversion error; the inlen and outlen
       return values allow you to process the next UTF8 char the way you
       like and continue.
     - if the out buffer is full.
 This is explained in the documentation:
    http://xmlsoft.org/html/libxml-encoding.html#UTF8TOISOLAT1

 That convention is the same as all iconv() filters.
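
 The stop-and-resume convention described above can be sketched as a small
wrapper loop. This is an illustration under stated assumptions, not code
from libxml: utf8_to_latin1_lossy and demo_convert are hypothetical names.
The function-pointer type copies the shape of UTF8Toisolat1(out, &outlen,
in, &inlen), and demo_convert is a tiny stand-in that follows the same
convention (on return, inlen holds the bytes consumed and outlen the bytes
produced; -2 means the next character cannot be converted) so that the
sketch compiles without libxml.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as libxml's UTF8Toisolat1(out, &outlen, in, &inlen). */
typedef int (*utf8_to_latin1_fn)(unsigned char *out, int *outlen,
                                 const unsigned char *in, int *inlen);

/* Hypothetical stand-in converter following the convention above.
 * It only handles ASCII and the two-byte sequences for U+0080..U+00FF. */
static int demo_convert(unsigned char *out, int *outlen,
                        const unsigned char *in, int *inlen)
{
    int i = 0, o = 0, rc = 0;

    while (i < *inlen) {
        unsigned char c = in[i];
        if (c < 0x80) {
            if (o >= *outlen) break;            /* out buffer full */
            out[o++] = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < *inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            if (o >= *outlen) break;            /* out buffer full */
            out[o++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            rc = -2;                            /* not representable */
            break;
        }
    }
    *inlen = i;     /* bytes consumed */
    *outlen = o;    /* bytes produced */
    return rc;
}

/* Convert as much as possible; on each -2, emit `subst`, skip the
 * offending UTF-8 character and resume. Returns the number of bytes
 * written, or -1 if the converter stopped for any other reason
 * (for example a full output buffer). */
static int utf8_to_latin1_lossy(utf8_to_latin1_fn conv,
                                unsigned char *out, int outsize,
                                const unsigned char *in, int insize,
                                unsigned char subst)
{
    int ipos = 0, opos = 0;

    assert(conv != NULL);
    while (ipos < insize) {
        int ilen = insize - ipos;
        int olen = outsize - opos;
        int rc = conv(out + opos, &olen, in + ipos, &ilen);

        ipos += ilen;
        opos += olen;
        if (rc == -2) {
            if (opos >= outsize)
                return -1;
            out[opos++] = subst;
            /* Skip the lead byte and any continuation bytes (10xxxxxx). */
            ipos++;
            while (ipos < insize && (in[ipos] & 0xC0) == 0x80)
                ipos++;
        } else if (rc < 0 || ipos < insize) {
            return -1;
        }
    }
    return opos;
}
```

 With libxml linked in, UTF8Toisolat1 has the matching signature and could
be passed as conv in place of demo_convert. The substitution policy (a
space, a question mark, dropping the character) stays in the caller, which
is the point of the convention.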

Most of the time the characters that don't convert are an entity for some
symbol that I wouldn't care about anyway.

Does that make any sense? 

   For you, maybe

Would that be helpful to anyone else?

   For some maybe, and would make it useless for everybody else.

Daniel

-- 
Daniel Veillard      | Red Hat Network http://redhat.com/products/network/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/



