Re: [libxml++] Charset conversion error -- ignoring encoding declaration?

On Thu, Nov 29, 2007 at 08:54:56AM +0100, Murray Cumming wrote:
> On Wed, 2007-11-28 at 19:42 +0000, Hugo Mills wrote:
> > Hi,
> > 
> >    I'm trying to use the SAX parser from libxml++ to read a simple XML
> > file generated from a third-party program. At the head of the file is
> > an XML declaration specifying the charset encoding:
> > 
> > <?xml version="1.0" encoding="ISO-8859-1"?>
> > 
> >   A short distance into the file is the following text:
> > 
> > <sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for &#xA3;80,000. [S]</sub-title>
> > 
> >    (Just in case that's got mangled in transit, that's the
> > entity/character literal 0xa3, for the UK Pound symbol in ISO-8859-1).
> > 
> >    When I pass this to libxml++, I get a Glib::Error thrown,
> > complaining about "Invalid byte sequence in conversion input". It
> > seems that libxml++ is reading the &#xA3; and converting it to a byte,
> > then trying to interpret that as UTF-8, which it isn't. I've tried
> > converting the input chunk before I pass it to the parser (using
> > Glib::convert), but obviously that isn't working, as it's processing
> > the entity as its component characters, rather than converting it to a
> > byte sequence.
> What does xmllint say?

   Not much:

hrm vlad:calliope $ xmllint --output tmp.xml whatson.xml
hrm vlad:calliope $ grep 80,000 tmp.xml 
        <sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for <A3>80,000. [S]</sub-title>

   It converts the problem entity correctly into the single byte it
represents (as this is an ISO-8859-1 document). Also:

hrm vlad:calliope $ xmllint --debugent --output tmp.xml whatson.xml
new input from file: whatson.xml
No entities in internal subset
No entities in external subset
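
   To make the byte-level difference concrete, here is a minimal
sketch in plain C++ (no libxml++ or glibmm calls). The helper names
encode_utf8, latin1_to_utf8 and is_valid_utf8 are made up for
illustration. It assumes the reference &#xA3; should end up as the
code point U+00A3, which in UTF-8 is the two bytes 0xC2 0xA3, whereas
the single byte 0xA3 that xmllint writes for the ISO-8859-1 output is
not valid UTF-8 on its own:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one Unicode code point (up to U+10FFFF) as UTF-8.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF, so converting
// to UTF-8 is just encoding each byte value as a code point.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        out += encode_utf8(c);
    }
    return out;
}

// Structural UTF-8 check only: each lead byte must be followed by the
// right number of continuation bytes. Overlong forms and surrogates
// are not rejected here.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) {
            extra = 0;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (extra > s.size() - i - 1) {
            return false;  // sequence truncated at end of input
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) {
                return false;
            }
        }
        i += extra + 1;
    }
    return true;
}
```

   With these helpers, is_valid_utf8 rejects the string "\xA3"
followed by "80,000", while latin1_to_utf8 of the same string passes
the check. In real code Glib::convert or iconv would do the
conversion; the hand-rolled version is only there to show the bytes.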


=== Hugo Mills ===
  PGP key: 515C238D
         --- Unix: For controlling fungal diseases in crops. ---         

