On Thu, Nov 29, 2007 at 08:54:56AM +0100, Murray Cumming wrote:
> On Wed, 2007-11-28 at 19:42 +0000, Hugo Mills wrote:
> > Hi,
> >
> > I'm trying to use the SAX parser from libxml++ to read a simple XML
> > file generated from a third-party program. At the head of the file is
> > an XML declaration specifying the charset encoding:
> >
> > <?xml version="1.0" encoding="ISO-8859-1"?>
> >
> > A short distance into the file is the following text:
> >
> > <sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for £80,000. [S]</sub-title>
> >
> > (Just in case that's got mangled in transit, that's the
> > entity/character literal 0xa3, for the UK Pound symbol in ISO-8859-1).
> >
> > When I pass this to libxml++, I get a Glib::Error thrown,
> > complaining about "Invalid byte sequence in conversion input". It
> > seems that libxml++ is reading the &#A3; and converting it to a byte,
> > then trying to interpret that as UTF-8, which it isn't. I've tried
> > converting the input chunk before I pass it to the parser (using
> > Glib::convert), but obviously that isn't working, as it's processing
> > the entity as its component characters, rather than converting it to a
> > byte sequence.
>
> What does xmllint say?

Not much:

hrm vlad:calliope $ xmllint --output tmp.xml whatson.xml
hrm vlad:calliope $ grep 80,000 tmp.xml
  <sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for <A3>80,000. [S]</sub-title>

It converts the problem entity correctly into the single byte it
represents (as this is an ISO-8859-1 document). Also:

hrm vlad:calliope $ xmllint --debugent --output tmp.xml whatson.xml
new input from file: whatson.xml
DOCUMENT
No entities in internal subset
No entities in external subset

Hugo.

--
=== Hugo Mills: hugo carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Unix: For controlling fungal diseases in crops. ---
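
For anyone following the thread, the byte-level behaviour described above can be reproduced outside libxml++. The following is only a minimal sketch using Python's standard library, not the libxml++ code path. The `doc` bytes are a hypothetical stand-in for the relevant part of whatson.xml, and `is_valid_utf8` is a helper made up for illustration. It shows that the lone byte 0xA3 is not valid UTF-8, that a parser honouring the ISO-8859-1 declaration decodes it to U+00A3, and that the UTF-8 form of that character is the two bytes C2 A3.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for part of whatson.xml: the pound sign is stored
# as the raw ISO-8859-1 byte 0xA3, matching the declared encoding.
doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
       b'<sub-title lang="en">a cheque for \xa380,000. [S]</sub-title>')


def is_valid_utf8(data: bytes) -> bool:
    """Illustrative helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The raw document is not valid UTF-8, because 0xA3 on its own is a
# continuation byte with no lead byte before it.
raw_ok = is_valid_utf8(doc)

# The parser honours the declared encoding and returns a Unicode string.
text = ET.fromstring(doc).text

# Re-encoding the character data as UTF-8 yields a valid byte sequence.
utf8 = text.encode("utf-8")

print(raw_ok, repr(text), utf8)
```

If a byte like 0xA3 is handed on as-is to code that expects UTF-8, a conversion error of the kind quoted above is what one would expect to see; whether that is what actually happens inside libxml++ here is not established by this sketch.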