Re: [xml] xmlSaveFormatFileEnc() creating invalid XML



On Wed, Sep 14, 2011 at 10:19:20AM +0200, Murray Cumming wrote:
On Wed, 2011-09-14 at 16:10 +0800, Daniel Veillard wrote:
On Fri, Sep 09, 2011 at 04:30:45PM +0200, Murray Cumming wrote:
On Fri, 2011-09-09 at 10:21 -0400, Jason Viers wrote:
On 9/9/2011 05:37, Murray Cumming wrote:
Here is a simple test case that takes the text from an apparently-valid
UTF-8 file

Not all valid UTF-8 is valid in XML.  Only a subset, as defined in
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets

Note that Form Feed (0xC) is not allowed.  Your original input document 
contains a formfeed character, and this is what ends up being invalid.  
It's not a matter of escaping; form feed as a literal byte, numeric 
reference, etc., is not allowed.
Stripping the form feed from the input allows it to serialize properly.
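
As a rough illustration of the point above, here is a minimal sketch in plain C of the Char production from the XML 1.0 spec linked earlier, plus a helper that strips the disallowed C0 control bytes (such as form feed) from a UTF-8 buffer. The names xml10_is_char and strip_xml10_controls are made up for this example and are not part of libxml2. The strip helper only looks at bytes below 0x20, which in UTF-8 are always complete single-byte characters; it does not decode multi-byte sequences, so it would not catch U+FFFE or U+FFFF.

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function).
 * Returns 1 if the code point matches the XML 1.0 Char production:
 *   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 */
int xml10_is_char(unsigned long cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Hypothetical helper (not a libxml2 function).
 * Removes, in place, every byte below 0x20 that XML 1.0 does not allow
 * (everything except tab, line feed and carriage return).
 * Bytes >= 0x20 are copied unchanged, so multi-byte UTF-8 sequences
 * pass through untouched. Returns the new string length.
 */
size_t strip_xml10_controls(char *s)
{
    char *dst = s;
    const char *src;

    for (src = s; *src != '\0'; src++) {
        unsigned char b = (unsigned char)*src;
        if (b >= 0x20 || xml10_is_char(b))
            *dst++ = *src;
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

Running something like strip_xml10_controls() on the text before handing it to xmlNewText() is one way a client could do the filtering itself; rejecting the input with an error instead of silently dropping bytes would be an equally reasonable choice.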

Ah, I didn't know that it couldn't be there even if escaped. Thanks.

Shouldn't libxml warn about that at the same time that it would escape
characters such as & and < rather than writing invalid XML?

  It's a choice, either you make all APIs validate all input strings
or you rely on the client to do it. In libxml2 I took the second path
and that was decided 10+ years ago. The parser on the other hand is
strict but that's mandatory to follow the spec.

OK. Thanks. Is that documented?

  yes and no,

You used http://xmlsoft.org/html/libxml-tree.html#xmlNewText
which takes an xmlChar *, which you cast from a string.
At the end of http://xmlsoft.org/FAQ.html#Developer, the FAQ states:

---------------------------------------
# So what is this funky "xmlChar" used all the time?

It is a null terminated sequence of utf-8 characters. And only utf-8!
You need to convert strings encoded in different ways to utf-8 before
passing them to the API. This can be accomplished with the iconv library
for instance.
---------------------------------------
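
The conversion step the FAQ mentions could look roughly like the sketch below, which uses the POSIX iconv interface to turn an ISO-8859-1 string into UTF-8. The function name latin1_to_utf8 and the choice of ISO-8859-1 as the source encoding are assumptions for the sake of the example. Note that this only fixes the encoding; the result can still contain characters that XML does not allow.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2 or iconv).
 * Converts the NUL-terminated ISO-8859-1 string `in` to UTF-8,
 * writing at most out_size bytes (including the terminator) to `out`.
 * Returns 0 on success, -1 on failure (including a too-small buffer).
 */
int latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;      /* iconv() takes a non-const pointer */
    size_t in_left = strlen(in);
    char *outp = out;
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return -1;
    out_left = out_size - 1;     /* keep room for the terminating NUL */

    /* iconv_open() takes the target encoding first, then the source. */
    cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}
```

The output buffer can then be cast to const xmlChar * and passed to the libxml2 API. On some platforms iconv lives in a separate library and needs -liconv at link time.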

  Usually we have problems with a different encoding being passed, rather
than errors due to characters that are in Unicode but not accepted by XML
(there are not that many). Maybe that should be made clearer.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


