[libxml++] Fwd: specifying the encoding of a document



I couldn't post this question to the libxml list because it rejects my
email because of a "suspicious header", but Daniel answered it off-list.
I'm posting it to the libxml++ list for reference.

> >> If I understand correctly, I think that libxml will auto-detect a
> >> document's encoding when I call something like xmlParseDocument(), by
> >> examining the bytes and then the encoding declaration in the document
> >> itself.
> >
> >   Right ... to some extent
> >
> >> However, I think that this is not guaranteed to succeed.
> >
> >   True, but this is bad behaviour from the XML generation side:
> > the encoding should always be indicated in the XML declaration if it's
> > not UTF-8/UTF-16.
>
> So, if it's not UTF-8 or UTF-16, but it specifies the encoding in the
> XML declaration, then libxml (or any parser) can always guess the
> encoding well enough to be able to read that XML declaration?

  Yes, cf. Appendix F.
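
For reference, the guessing step of Appendix F can be sketched roughly as below. This is only an illustration of the idea, not libxml2's actual code: the function name guess_encoding_family() and the returned labels are made up for this note, and the unusual UCS-4 byte orders listed in the appendix are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the Appendix F table: leading bytes and what they imply. */
struct signature {
    unsigned char bytes[4];
    size_t length;
    const char *family;
};

/* Hypothetical helper: guess the encoding family from the first bytes of
 * a document, which is enough to go on and read the XML declaration. */
const char *guess_encoding_family(const unsigned char *buf, size_t len)
{
    static const struct signature table[] = {
        /* With a byte order mark (4-byte marks before 2-byte ones). */
        {{0x00, 0x00, 0xFE, 0xFF}, 4, "UCS-4BE"},
        {{0xFF, 0xFE, 0x00, 0x00}, 4, "UCS-4LE"},
        {{0xFE, 0xFF}, 2, "UTF-16BE"},
        {{0xFF, 0xFE}, 2, "UTF-16LE"},
        {{0xEF, 0xBB, 0xBF}, 3, "UTF-8"},
        /* Without a byte order mark: look for "<?xm" or "<" patterns. */
        {{0x00, 0x00, 0x00, 0x3C}, 4, "UCS-4BE"},
        {{0x3C, 0x00, 0x00, 0x00}, 4, "UCS-4LE"},
        {{0x00, 0x3C, 0x00, 0x3F}, 4, "UTF-16BE"},
        {{0x3C, 0x00, 0x3F, 0x00}, 4, "UTF-16LE"},
        {{0x3C, 0x3F, 0x78, 0x6D}, 4, "ASCII-compatible"},
        {{0x4C, 0x6F, 0xA7, 0x94}, 4, "EBCDIC"},
    };
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (len >= table[i].length &&
            memcmp(buf, table[i].bytes, table[i].length) == 0)
            return table[i].family;
    }
    /* No mark and no declaration: the spec says to assume UTF-8. */
    return "UTF-8";
}
```

An "ASCII-compatible" result only means that the declaration can be read byte by byte; the real encoding name (ISO-8859-1, Shift-JIS, ...) still has to come from the encoding="..." attribute.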

> >> So, how should I specify the document encoding explicitly when
> >> calling something like xmlParseDocument()?
> >
> >   libxml2 follows exactly the guidelines of the XML spec for character
> > encoding detection, i.e. appendix F of the XML Recommendation:
> >   http://www.w3.org/TR/REC-xml/#sec-guessing
> >   - autodetection for UTF-8/UTF-16
> >   - use of the encoding information in the XML declaration
> >   - locally user provided encoding overriding the previous two
> >
> >   If the encoding is user-provided, then it's not done at the
> > xmlParseDocument() level but higher in the API, when building the
> > parser context or preparing the read, as in xmlReadDoc() or
> > xmlCtxtReadDoc().


Murray Cumming
murrayc murrayc com
www.murrayc.com
www.openismus.com



