[xml] UTF8Toisolat1
- From: Bill Moseley <moseley hank org>
- To: xml gnome org
- Subject: [xml] UTF8Toisolat1
- Date: Thu, 27 Sep 2001 08:41:27 -0700
Hi,
I'm wondering how to deal with characters that don't map to ISO 8859-1
characters.
Unfortunately, Swish-e indexes eight bit chars only, so I need to map to
8859-1. So, in my character handler I call UTF8Toisolat1(). But if that
call fails, I just pass on to swish-e what libxml called me with, which is
UTF-8. That means I might index words with UTF-8 bytes as if they were ASCII.
Of course, there's no good solution. I need to pass on to the swish-e
indexer what libxml passes me. Just because one character couldn't be
converted doesn't mean I should throw out all the text the parser sent me.
The other option would be to replace the non-8859-1 character with a space
character.
It seems like most of the conversion errors I'm seeing are from entities
outside of the Latin-1 range - and often for symbols that swish-e would
want to ignore anyway.
I'm a bit lost in the libxml entity code and I'm not very familiar with
encodings in general. Is there a way in the API to a) detect that an
entity is outside Latin-1, and b) replace it with another character (such
as a space)? That's not perfect, but at least I won't be mistaking a
byte in a UTF-8 sequence for an ASCII character, and my calls to
UTF8Toisolat1() will be happier.
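
For illustration, here is a minimal sketch of the "replace with a space" idea done by hand, outside libxml. It is not a libxml API: the function name utf8_to_latin1_lossy and its parameters are hypothetical, and it assumes the character handler receives a UTF-8 buffer with a known length. Anything above U+00FF, and any malformed byte, becomes the substitute character instead of failing the whole buffer.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of libxml): convert UTF-8 to Latin-1,
 * writing `subst` for every code point above U+00FF and for every
 * malformed, overlong, or truncated sequence.
 * `out` needs room for `inlen` bytes, since the Latin-1 output is never
 * longer than the UTF-8 input. Returns the number of bytes written.
 */
size_t utf8_to_latin1_lossy(const unsigned char *in, size_t inlen,
                            unsigned char *out, unsigned char subst)
{
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const unsigned long min_cp[] = { 0x0, 0x80, 0x800, 0x10000 };
    size_t i = 0, o = 0;

    while (i < inlen) {
        unsigned char c = in[i];
        unsigned long cp;
        size_t extra, k;

        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else {
            /* Stray continuation byte or invalid lead byte. */
            out[o++] = subst;
            i++;
            continue;
        }

        /* Sequence runs past the end of the buffer. */
        if (extra > inlen - i - 1) {
            out[o++] = subst;
            break;
        }

        /* Fold in the continuation bytes (each must look like 10xxxxxx). */
        for (k = 1; k <= extra; k++) {
            if ((in[i + k] & 0xC0) != 0x80)
                break;
            cp = (cp << 6) | (in[i + k] & 0x3F);
        }
        if (k <= extra) {
            /* Bad continuation byte: emit one substitute, resume at it. */
            out[o++] = subst;
            i += k;
            continue;
        }

        i += extra + 1;
        out[o++] = (cp >= min_cp[extra] && cp <= 0xFF) ? (unsigned char)cp
                                                       : subst;
    }
    return o;
}
```

Called with ' ' as the substitute, the trademark sign in the warning quoted below (U+2122, three bytes in UTF-8) would come out as a single space, while a character such as e-acute (U+00E9) would come out as the one Latin-1 byte 0xE9. Whether a space or outright removal is the better substitute depends on how the indexer splits words.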
Thanks,
/usr/doc/packages/rpm/RPM-Changes/RPM-Changes-3.html:65: warning: Failed to
convert internal UTF-8 to Latin-1.
Indexing w/o conversion.
transparent. If this works the devil better buy some Prestone™.</LI>
^
Bill Moseley
mailto:moseley hank org