Re: Removing form feed characters



On 09/09/2011 16:42, Murray Cumming wrote:
I recently discovered that XML doesn't allow the form feed character
(0xC) in text children (CDATA) even when escaped, though libxml does not
complain about it, and even writes it out to (then invalid) XML.

So I'm thinking about checking for it in Element::set_child_text(),
Element::add_child_text(), ContentNode::set_content() and others. We
could remove the character and maybe warn on stderr.

However, this could cause a processing slowdown, even when people are
not providing a string with that character. Thoughts?

Hello,

In fact, you can't escape characters in a CDATA block (http://www.w3.org/TR/2008/REC-xml-20081126/#sec-cdata-sect).
Moreover, not all characters are valid. The only valid ones are (http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Char):
Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

So, in the ASCII table, no character below space (0x20) is allowed except tab (0x9), line feed (0xA) and carriage return (0xD). The forbidden ones include common characters like form feed (0xC), escape (0x1B) and backspace (0x8), as well as numerous more exotic ones.
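
To make the rule above concrete, here is a minimal C++ sketch of the Char production, plus a helper that drops the forbidden control characters from a UTF-8 string. The names is_xml_char and strip_forbidden_controls are hypothetical; they are not part of libxml2 or libxml++. The helper only handles code points below 0x20, which is safe to do byte by byte because UTF-8 multi-byte sequences never contain bytes in that range. It does not detect the other invalid code points (surrogates, #xFFFE, #xFFFF).

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true if the code point matches the XML 1.0 Char production.
constexpr bool is_xml_char(char32_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// Hypothetical helper: removes the control characters below 0x20 that XML 1.0
// forbids (everything except tab, line feed and carriage return).
// Assumes the input is UTF-8; other invalid code points are left untouched.
std::string strip_forbidden_controls(std::string text)
{
    auto forbidden = [](unsigned char byte) {
        return byte < 0x20 && !is_xml_char(byte);
    };
    text.erase(std::remove_if(text.begin(), text.end(), forbidden), text.end());
    return text;
}
```

A check of this kind costs one pass over the string, so it is linear in the text length whether or not a forbidden character is present.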

I think this should be handled by the wrapped library, libxml2, not by libxml++.

Remark for all XML users: when CDATA blocks are used to store non-XML data (such as images), it is recommended to encode the data with Base64 or a similar encoding (see http://en.wikipedia.org/wiki/Base64).
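
For illustration, a minimal Base64 encoder could look like the sketch below. The function name base64_encode is hypothetical and is not provided by libxml2 or libxml++. It uses the standard alphabet with '=' padding, so its output contains only characters that are valid in XML text.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes arbitrary bytes with the standard Base64
// alphabet and '=' padding.
std::string base64_encode(const std::string& data)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string out;
    out.reserve((data.size() + 2) / 3 * 4);

    for (std::size_t i = 0; i < data.size(); i += 3) {
        const std::size_t remaining = data.size() - i;

        // Pack up to three bytes into a 24-bit group; missing bytes stay zero.
        std::uint32_t group = static_cast<unsigned char>(data[i]) << 16;
        if (remaining > 1) group |= static_cast<unsigned char>(data[i + 1]) << 8;
        if (remaining > 2) group |= static_cast<unsigned char>(data[i + 2]);

        // Emit four 6-bit symbols, replacing the unused ones with padding.
        out += alphabet[(group >> 18) & 0x3F];
        out += alphabet[(group >> 12) & 0x3F];
        out += remaining > 1 ? alphabet[(group >> 6) & 0x3F] : '=';
        out += remaining > 2 ? alphabet[group & 0x3F] : '=';
    }
    return out;
}
```

The encoded string can then be stored as ordinary text content, and the reader decodes it after parsing.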

Regards,
Mathias Lorente

