[xml] UTF8Toisolat1


I'm wondering how to deal with characters that don't map to ISO 8859-1.

Unfortunately, Swish-e indexes eight bit chars only, so I need to map to
8859-1.  So, in my character handler I call UTF8Toisolat1().  But if that
call fails, I just pass on to swish-e what libxml called me with, which is
UTF-8.  That means I might index words with UTF-8 chars as ASCII.

Of course, there's no good solution.  I need to pass on to the swish-e
indexer what libxml passes me.  Just because one character couldn't be
converted doesn't mean I should throw out all the text the parser sent me.  

The other option would be to replace the non-8859-1 character with a space.

It seems like most of the conversion errors I'm seeing are from entities
outside of the Latin-1 range - and often for symbols that swish-e would
want to ignore anyway.

I'm a bit lost in the libxml entity code and I'm not very familiar with
encodings in general.  Is there a way in the API to a) detect that an
entity is outside Latin-1, and b) replace it with another character (such
as a space)?  That's not perfect, but at least I won't be mistaking a
byte in a UTF-8 sequence for an ASCII character, and my calls to
UTF8Toisolat1() will be happier.
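
To make option (b) concrete, here is a rough sketch of the substitution done
by hand, outside libxml.  The helper name utf8_to_latin1_lossy() and its
signature are made up for illustration and are not part of the libxml API.
It assumes the caller supplies an output buffer at least as large as the
input, and it writes one substitute byte for every character above U+00FF
and for every malformed sequence.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a libxml function): copy UTF-8 input to a
 * Latin-1 buffer, writing `subst` for each character that does not fit
 * in Latin-1 and for each malformed sequence.  Returns the number of
 * bytes written.  `out` must have room for at least `inlen` bytes.
 */
size_t utf8_to_latin1_lossy(const unsigned char *in, size_t inlen,
                            unsigned char *out, unsigned char subst)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);

    while (i < inlen) {
        unsigned char c = in[i];
        unsigned long cp;
        size_t need;   /* continuation bytes expected after the lead byte */
        size_t k;

        if (c < 0x80)                { cp = c;        need = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; need = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; need = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; need = 3; }
        else {
            /* Stray continuation byte or invalid lead byte. */
            out[o++] = subst;
            i++;
            continue;
        }

        /* Fold in the continuation bytes (10xxxxxx) that are present. */
        for (k = 1; k <= need && i + k < inlen
                    && (in[i + k] & 0xC0) == 0x80; k++)
            cp = (cp << 6) | (in[i + k] & 0x3F);

        if (k <= need) {
            /* Sequence was cut short: emit one substitute and move on. */
            out[o++] = subst;
            i += k;
            continue;
        }
        i += need + 1;

        /* Keep ASCII and U+0080..U+00FF; everything else is replaced. */
        if (need == 0 || (need == 1 && cp >= 0x80 && cp <= 0xFF))
            out[o++] = (unsigned char)cp;
        else
            out[o++] = subst;
    }
    return o;
}
```

With a space as the substitute, the UTF-8 form of the trademark sign in the
warning below (bytes E2 84 A2) would come out as a single space, while an
accented Latin-1 letter such as e-acute (C3 A9) would come out as the single
byte E9.  A character handler could still try UTF8Toisolat1() first and only
fall back to something like this when that call reports a failure.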


/usr/doc/packages/rpm/RPM-Changes/RPM-Changes-3.html:65: warning: Failed to
convert internal UTF-8 to Latin-1.
Indexing w/o conversion.
transparent. If this works the devil better buy some Prestone&trade;.</LI>

Bill Moseley
mailto:moseley hank org
