[libxml++] Charset conversion error -- ignoring encoding declaration?



   Hi,

   I'm trying to use the SAX parser from libxml++ to read a simple XML
file generated from a third-party program. At the head of the file is
an XML declaration specifying the charset encoding:

<?xml version="1.0" encoding="ISO-8859-1"?>

  A short distance into the file is the following text:

<sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for &#xA3;80,000. [S]</sub-title>

   (Just in case that's got mangled in transit, that's the numeric
character reference for 0xA3, the UK pound sign in ISO-8859-1.)

   When I pass this to libxml++, a Glib::Error is thrown,
complaining about "Invalid byte sequence in conversion input". It
seems that libxml++ is reading the &#xA3; and converting it to the
single byte 0xA3, then trying to interpret that as UTF-8, which it
isn't. I've tried converting the input chunk before I pass it to the
parser (using Glib::convert), but obviously that doesn't help: the
conversion sees the character reference only as its component ASCII
characters, and doesn't turn it into a byte sequence.
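
   To illustrate why a lone 0xA3 byte is rejected as UTF-8, here is a
minimal standalone sketch (standard library only, no libxml++ or glibmm
involved). The function names latin1_to_utf8 and looks_like_utf8 are
made up for this example, and the validity check is deliberately
simplified. It only shows the encoding difference, not what libxml++
actually does internally:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
// Each Latin-1 byte has the same value as its Unicode code point,
// so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Hypothetical helper: simplified structural UTF-8 check.
// It verifies lead/continuation byte patterns only; it does not
// reject overlong forms or surrogates.
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte

        if (i + extra >= s.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

   With these, looks_like_utf8("\xA3") is false, because 0xA3 on its
own has the bit pattern of a continuation byte, while
latin1_to_utf8("\xA3") gives the two bytes 0xC2 0xA3, which is the
UTF-8 encoding of the pound sign.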

   How do I handle this input correctly with libxml++? Do I have to
preprocess each chunk manually to convert the character entities
before passing it to the parser, or is there some way of persuading
the SaxParser to do it?

   Thanks,
   Hugo.

-- 
=== Hugo Mills: hugo     carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- "What are we going to do tonight?" "The same thing we do ---     
            every night, Pinky.  Try to take over the world!"            
