Re: [libxml++] Charset conversion error -- ignoring encoding declaration?



On Thu, Nov 29, 2007 at 10:26:08AM +0000, Hugo Mills wrote:
> On Thu, Nov 29, 2007 at 08:54:56AM +0100, Murray Cumming wrote:
> > On Wed, 2007-11-28 at 19:42 +0000, Hugo Mills wrote:
> > > Hi,
> > > 
> > >    I'm trying to use the SAX parser from libxml++ to read a simple XML
> > > file generated from a third-party program. At the head of the file is
> > > an XML declaration specifying the charset encoding:
> > > 
> > > <?xml version="1.0" encoding="ISO-8859-1"?>
> > > 
> > >   A short distance into the file is the following text:
> > > 
> > > <sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for &#xA3;80,000. [S]</sub-title>
> > > 
> > >    (Just in case that's got mangled in transit, that's the
> > > entity/character literal 0xa3, for the UK Pound symbol in ISO-8859-1).
> > > 
> > >    When I pass this to libxml++, I get a Glib::Error thrown,
> > > complaining about "Invalid byte sequence in conversion input". It
> > > seems that libxml++ is reading the &#xA3; and converting it to a byte,
> > > then trying to interpret that as UTF-8, which it isn't. I've tried
> > > converting the input chunk before I pass it to the parser (using
> > > Glib::convert), but obviously that isn't working, as it's processing
> > > the entity as its component characters, rather than converting it to a
> > > byte sequence.

   I've done a bit more poking, and the parser still rejects the same
character after I run the file through xmllint and convert it to UTF-8
with iconv. If I remove the character entirely, it then chokes in the
same way on an e-acute (again, I checked that it's correctly UTF-8
encoded) later on in the file.
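
   For reference, here is a minimal sketch of the kind of byte-level
check I mean by "correctly UTF-8 encoded". It is plain standard C++,
not libxml++ or Glib code, and the helper names encode_utf8 and
is_valid_utf8 are made up for illustration. It only shows that U+00A3
is the single byte 0xA3 in ISO-8859-1 but the pair 0xC2 0xA3 in UTF-8,
so a lone 0xA3 can never pass as UTF-8; it does not claim to show
where the conversion error in the parser actually comes from.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp <= 0x10FFFF and is not a surrogate; no checks are made.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: structural UTF-8 check. Each lead byte must be
// followed by the right number of continuation bytes (10xxxxxx).
// Simplified: it rejects the invalid leads 0xC0, 0xC1 and 0xF5..0xFF,
// but does not reject overlong 3/4-byte forms or encoded surrogates.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (b < 0x80) {
            extra = 0;                      // ASCII
        } else if (b >= 0xC2 && b <= 0xDF) {
            extra = 1;                      // 2-byte sequence
        } else if (b >= 0xE0 && b <= 0xEF) {
            extra = 2;                      // 3-byte sequence
        } else if (b >= 0xF0 && b <= 0xF4) {
            extra = 3;                      // 4-byte sequence
        } else {
            return false;                   // stray continuation or bad lead
        }
        if (i + extra >= s.size()) {
            return false;                   // sequence truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;               // not a continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}
```

   Calling is_valid_utf8 on the raw chunk before handing it to the
parser is one way to confirm that the input bytes themselves are
well-formed, which is what I see for the iconv-converted file.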

   My system is configured to use en_GB.UTF-8 as the locale, and I'm
using libxml++ 2.14.0 from Debian stable, if that makes a difference.

   I've created a minimal test case showing the problem, available
from [1]. The test.sh script may need to be tweaked, depending on
where your libs are.

   Hugo.

[1] http://www.darksatanic.net/test-case.tar.gz

-- 
=== Hugo Mills: hugo     carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Nothing wrong with being written in Perl... Some of my best ---   
                      friends are written in Perl.                       
