[xml] encoding issue using libxml with swish-e

Hi list,

If anyone can shed any light on this issue we're suffering I'd be
really grateful.  So far no amount of web trawling and debugging has
got me anywhere.

We're using libxml with swish-e as part of our site indexing/searching.
Our trouble is that we're unable to accurately parse/encode accented
characters.  Much code crawling has led us to the point where libxml
is parsing the data, and a few confusing symptoms are showing up.

We're trying to parse a file with a single word in it as a test
case.  The word we're trying to parse is cinémathèque.  We've tried
parsing the word both as the literal cinémathèque and as
cin&#233;math&#232;que.  Both give the same result.

When we watch swish-e indexing this file, the word is parsed by libxml
and handed to swish-e as cinÚmathÞque, i.e. the characters appear to
have been encoded incorrectly.
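
One quick way to narrow down where the garbling happens is to reproduce
it outside the indexer.  The sketch below is only a sanity check, written
in Python for brevity, and it rests on an assumption that is not
confirmed anywhere above: that the debug output was viewed in a Windows
console using code page 850.  Under that assumption, correct ISO-8859-1
bytes would display exactly as cinÚmathÞque, which would mean the bytes
themselves are fine and only the display is off.  The variable names are
illustrative.

```python
# Sanity check: what does the word look like when bytes in one encoding
# are displayed using another? (Assumption: the console used cp850.)
word = "cinémathèque"

# Correct ISO-8859-1 bytes, misread as DOS code page 850.
latin1_bytes = word.encode("iso-8859-1")
seen_in_cp850 = latin1_bytes.decode("cp850")

# For comparison: UTF-8 bytes misread as ISO-8859-1 look different
# (two characters for every accented letter).
utf8_bytes = word.encode("utf-8")
seen_utf8_as_latin1 = utf8_bytes.decode("iso-8859-1")

print(seen_in_cp850 == "cinÚmathÞque")          # True
print(len(seen_utf8_as_latin1) - len(word))     # 2
```

If the garbled form matches the first case, the data reaching swish-e
may already be valid ISO-8859-1; if it matches the second, the text is
arriving as UTF-8 and being treated as a single-byte encoding.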

Then, when we query the resulting index, the situation gets even
stranger: on our Windows quasi-test box the results actually work fine,
with the accented e's displaying correctly.  However, when we do the
same, with the same surrounding code, on Solaris, the e's are returned
as unrecognisable characters.  A later HTML-encoding function suggests
these characters are �

An example of our XML follows:
<?xml version="1.0" encoding="iso-8859-1" ?>
        <subpage title="melbourne
cin&#233;math&#232;que">&lt;P&gt;Cinémathèque offers a diverse program
of classic, cult, animation, experimental, documentary, silent and short
films. &lt;/P&gt;

We've also tried it with an explicit encoding of 
<?xml version="1.0" encoding="UTF-8" ?>

No change.
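
For what it's worth, the encoding declaration only tells the parser how
to decode the input bytes; it does not control the encoding of what the
parser hands back.  libxml2 works in UTF-8 internally, so the consumer
receives UTF-8 whichever declaration is used.  The sketch below
illustrates the general idea with Python's standard-library parser, which
is a stand-in and not libxml; the closed, single-line document is a
hypothetical variant of the sample above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed variant of the sample, stored as ISO-8859-1 bytes.
doc = (
    '<?xml version="1.0" encoding="iso-8859-1" ?>'
    '<subpage title="melbourne cin&#233;math&#232;que">Cinémathèque</subpage>'
).encode("iso-8859-1")

root = ET.fromstring(doc)
title = root.get("title")   # character references already resolved
body = root.text            # literal accented bytes already decoded

# The parsed strings are Unicode text. Which bytes the consumer sees
# depends on how they are encoded on the way out, not on the declaration.
utf8 = title.encode("utf-8")          # two bytes per accented letter
latin1 = title.encode("iso-8859-1")   # one byte per accented letter

print(utf8)
print(latin1)
```

If the indexer expects single-byte ISO-8859-1 text, the output of the
parser would need converting before it is stored, which may be why
changing the declaration alone makes no difference.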

Our environment is:

Can anyone shed any light?

Many thanks in advance.

Tref Gare
Development Consultant
Level 19/114 William St, Melbourne VIC 3000
email: trefg areeba com au
phone: +61 3 9642 5553
fax: +61 3 9642 1335
website: http://www.areeba.com.au
