Re: [xml] xmlSaveFormatFileEnc() creating invalid XML
- From: Daniel Veillard <veillard redhat com>
- To: Murray Cumming <murrayc murrayc com>
- Cc: xml gnome org
- Subject: Re: [xml] xmlSaveFormatFileEnc() creating invalid XML
- Date: Wed, 14 Sep 2011 16:34:52 +0800
On Wed, Sep 14, 2011 at 10:19:20AM +0200, Murray Cumming wrote:
On Wed, 2011-09-14 at 16:10 +0800, Daniel Veillard wrote:
On Fri, Sep 09, 2011 at 04:30:45PM +0200, Murray Cumming wrote:
On Fri, 2011-09-09 at 10:21 -0400, Jason Viers wrote:
On 9/9/2011 05:37, Murray Cumming wrote:
Here is a simple test case that takes the text from an apparently-valid
UTF-8 file
Not all valid UTF-8 is valid in XML. Only a subset, as defined in
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
Note that Form Feed (0xC) is not allowed. Your original input document
contains a formfeed character, and this is what ends up being invalid.
It's not a matter of escaping; form feed as a literal byte, numeric
reference, etc., is not allowed.
Stripping the form feed from the input allows it to serialize properly.
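As a rough illustration of that stripping step, here is a minimal sketch in plain C that drops every character outside the XML 1.0 Char production from a NUL-terminated UTF-8 buffer before it is handed to an API such as xmlNewText(). The helper names (xml10_is_char, utf8_decode, xml10_strip_invalid) are made up for this example and are not part of libxml2. Dropping malformed UTF-8 bytes as well is an assumption about what the caller wants.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if code point c matches the XML 1.0 "Char" production:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
int xml10_is_char(unsigned long c)
{
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD) ||
           (c >= 0x10000 && c <= 0x10FFFF);
}

/* Decode one UTF-8 sequence starting at s (NUL-terminated input).
 * Stores the code point in *cp and returns the sequence length in
 * bytes, or 0 if the bytes are malformed or an overlong encoding. */
size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        if ((s[1] & 0xC0) != 0x80)
            return 0;
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return *cp >= 0x80 ? 2 : 0;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return *cp >= 0x800 ? 3 : 0;
    }
    if ((s[0] & 0xF8) == 0xF0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80 ||
            (s[3] & 0xC0) != 0x80)
            return 0;
        *cp = ((unsigned long)(s[0] & 0x07) << 18) |
              ((unsigned long)(s[1] & 0x3F) << 12) |
              ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return *cp >= 0x10000 ? 4 : 0;
    }
    return 0;
}

/* Remove, in place, every character that XML 1.0 does not allow
 * (and every malformed byte). Returns the number of items dropped. */
size_t xml10_strip_invalid(char *buf)
{
    unsigned char *src = (unsigned char *)buf;
    unsigned char *dst = src;
    size_t dropped = 0;

    while (*src) {
        unsigned long cp;
        size_t n = utf8_decode(src, &cp);

        if (n == 0) {            /* malformed byte: skip it */
            src++;
            dropped++;
            continue;
        }
        if (xml10_is_char(cp)) { /* allowed: keep the whole sequence */
            memmove(dst, src, n);
            dst += n;
        } else {                 /* e.g. form feed (0xC): drop it */
            dropped++;
        }
        src += n;
    }
    *dst = '\0';
    return dropped;
}
```

Calling xml10_strip_invalid() on a writable copy of the text before casting it to xmlChar * would remove the form feed. Whether to drop, replace, or reject such characters is a decision for the application.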
Ah, I didn't know that it couldn't be there even if escaped. Thanks.
Shouldn't libxml warn about that at the same time that it would escape
characters such as & and < rather than writing invalid XML?
It's a choice: either you make all APIs validate all input strings,
or you rely on the client to do it. In libxml2 I took the second path,
and that was decided 10+ years ago. The parser, on the other hand, is
strict, but that's mandatory to follow the spec.
OK. Thanks. Is that documented?
Yes and no.
You used http://xmlsoft.org/html/libxml-tree.html#xmlNewText
which takes an xmlChar *, which you cast from a string.
At the end of http://xmlsoft.org/FAQ.html#Developer the FAQ states:
---------------------------------------
# So what is this funky "xmlChar" used all the time?
It is a null terminated sequence of utf-8 characters. And only utf-8!
You need to convert strings encoded in different ways to utf-8 before
passing them to the API. This can be accomplished with the iconv library
for instance.
---------------------------------------
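The conversion the FAQ mentions could look roughly like the following sketch, which uses the POSIX iconv API to turn a string in some other encoding into UTF-8 before it is passed on as an xmlChar *. The function name to_utf8 is hypothetical, and the 4x output buffer bound is an assumption that is comfortable for single-byte encodings such as ISO-8859-1. If the conversion fails for any reason, including a buffer that is too small, the sketch just returns NULL.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert the NUL-terminated string `in`, encoded as `from_enc`
 * (an iconv encoding name), to a newly allocated UTF-8 string.
 * Returns NULL on any failure; the caller frees the result. */
char *to_utf8(const char *in, const char *from_enc)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(in);
    size_t out_size = in_left * 4 + 1;   /* assumed upper bound */
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)in;           /* iconv takes a non-const pointer */
    char *out_ptr = out;
    size_t out_left = out_size - 1;      /* keep room for the terminator */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }

    *out_ptr = '\0';
    return out;
}
```

Note that this only fixes the encoding. A character such as form feed converts to UTF-8 without error and still has to be checked against what XML allows.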
Usually we have problems with a different encoding being passed, rather
than errors due to characters that are in Unicode but not accepted by XML
(there are not that many). Maybe that should be made clearer.
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel veillard com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/