Re: [xml] Ignoring Character Encodings



From: "Daniel Veillard" <veillard redhat com>
I'm feeding the document in through the push parser in large chunks, but
as libxml gets to the encoding declaration, it suddenly switches encoding
on me and starts reporting errors that the UTF-8 chars I'm giving it don't
match the ISO-8859-1 (for example) encoding the document says it's in, and
used to be in before I got to it. Subsequently, the errors from the libxml
encoder cause the parse to fail.

  Right, if the document declares itself to be in ISO-8859-1 but is
actually in UTF-8, that's a fatal error: your document is not XML. You
need to fix it before handing it to the parser (well, this can be argued
in the case where the framework provides the encoding, as when using the
HTTP encoding information associated with the Content-Type).
  But I consider libxml2 rejecting a document whose declared encoding
does not match the actual one to be a feature, not an error.


I understand - I was just hoping for a "We've already done the encoding
for you" type of option. The situation occurs because the user can load the
document into our app, during which we have to convert it to our internal
encoding (otherwise all of our existing functionality falls apart). They can
then edit the doc inside the app before asking for it to be parsed. As such,
they still need to specify the encoding declaration in the doc so we know
how to decode it on load and can save it in the right encoding. (The original
document before we load it really will be ISO-8859-1, and will remain so
if they save the document back out post-parsing.)

So the correct thing I'd need to do is to either remove (hide) the encoding
declaration or to re-encode the doc back into its original format before
feeding it to the parser?
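
To make the two options concrete, here is a rough sketch. It is Python
only for brevity, it assumes the internal format is a decoded Unicode
string, and the helper names (declared_encoding, hide_declaration,
reencode_to_declared) are made up for illustration; none of this is
libxml API. Either function would produce the bytes to push to the parser.

```python
import re

# Matches an XML declaration at the very start of the text.
# Group 1: '<?xml version="..."'; group 2: the whole encoding
# pseudo-attribute (optional); group 3: the encoding name.
_XML_DECL = re.compile(
    r'^(<\?xml\s+version\s*=\s*["\'][^"\']*["\'])'
    r'(\s+encoding\s*=\s*["\']([^"\']*)["\'])?'
)


def declared_encoding(text):
    """Return the encoding named in the XML declaration, or None."""
    m = _XML_DECL.match(text)
    if m and m.group(2):
        return m.group(3)
    return None


def hide_declaration(text):
    """Option 1: drop the encoding pseudo-attribute and emit UTF-8 bytes.

    Without a declaration (and without a BOM) an XML parser assumes
    UTF-8, so the bytes and the declaration no longer disagree.
    """
    m = _XML_DECL.match(text)
    if m and m.group(2):
        text = m.group(1) + text[m.end():]
    return text.encode("utf-8")


def reencode_to_declared(text):
    """Option 2: encode the text back into the encoding it declares.

    Raises UnicodeEncodeError if the user typed a character that the
    declared encoding cannot represent.
    """
    return text.encode(declared_encoding(text) or "utf-8")


# Hypothetical document as held in the app's internal (decoded) form.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<a>\u00e9</a>'
utf8_bytes = hide_declaration(doc)
latin1_bytes = reencode_to_declared(doc)
```

Option 1 only touches the copy handed to the parser, so the document
the user sees and saves keeps its original declaration. Option 2 can fail
if the user has entered characters outside the declared encoding.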

Richard


