Re: [xml] Bug#500015: Cannot parse feed containing SOH character

On Thu, Sep 25, 2008 at 10:43:41PM +0200, Mike Hommey wrote:

I got this forwarded as a wishlist bug for libxml2, but that doesn't
sound right to me. I always thought control characters are not allowed
in XML, though looking in the XML spec, I can't find anything.

Daniel, what do you think?

  Your mail was lost among the 150+ bounce mails that accumulated on the
list in a few days. Make 100% sure your posting address is the one you're
subscribed with; at such a rate of bounces I can miss valid posts in
the mass of spam and errors.

As a matter of fact, the XML spec says (

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

so SOH (#x1) is not a valid char for an XML document.

  That's correct
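
  For illustration only, here is a minimal Python sketch of the Char
production quoted above. The names XML_CHAR_RANGES, is_xml_char and
first_invalid_char are made up for this example; they are not part of
libxml2 or of any other library.

```python
# Ranges copied from production [2] Char of the XML 1.0 spec.
# All names in this block are hypothetical, chosen for the example.
XML_CHAR_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)


def is_xml_char(code_point):
    """Return True if code_point matches the Char production."""
    return any(lo <= code_point <= hi for lo, hi in XML_CHAR_RANGES)


def first_invalid_char(text):
    """Return (index, code point) of the first non-Char character, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return index, ord(ch)
    return None
```

With these definitions is_xml_char(0x1) is False, since SOH falls below
#x9 and outside every listed range, while tab (#x9) is accepted.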

I don't think this is a correct inference.  In, it says

 Consequently, XML processors MUST accept any character in the range
 specified for Char. ]

 Character Range

 [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |     /* any Unicode character,
              [#xE000-#xFFFD] |                        excluding the surrogate
              [#x10000-#x10FFFF]                       blocks, FFFE, and FFFF. */

but it doesn't specify that it must accept *only* characters in that
range.  In fact, the next paragraph states

 All XML processors MUST accept the UTF-8 and UTF-16 encodings of
 Unicode 3.1 ...

In, the
list of Unicode 3.1 characters, the SOH character is the second entry.

  That's bull...t

The allowed set of characters is enumerated in the Char production, it's
that simple. Put a character outside that range in the document (whatever
the encoding used) and the processor MUST consider this a fatal error, raise
it to the application, and stop passing data to the application from that
point in the document.
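
  A rough way to see this behaviour, as a sketch rather than anything
authoritative: the Python snippet below feeds a document containing SOH
to the standard library's xml.etree.ElementTree, which is backed by expat
and not libxml2. The helper name parse_or_report is hypothetical. It also
shows that byte 0x01 is perfectly valid UTF-8, so being encodable is not
the same thing as being allowed by Char.

```python
import xml.etree.ElementTree as ET


def parse_or_report(data):
    """Parse XML bytes; report the parser's fatal error instead of raising.

    Hypothetical helper for this example. ElementTree uses expat here,
    not libxml2, so the error text differs from libxml2's.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        return "fatal error: %s" % err
    return "well-formed"


# SOH is encodable in UTF-8 (a single byte 0x01) ...
soh_bytes = "\x01".encode("utf-8")

# ... yet a document containing it is still rejected by the parser.
good = parse_or_report(b"<feed>ok</feed>")
bad = parse_or_report(b"<feed>" + soh_bytes + b"</feed>")
```

Here good is "well-formed", while bad starts with "fatal error:" followed
by whatever message the parser gives.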


Daniel Veillard      | libxml Gnome XML XSLT toolkit
daniel veillard com  | Rpmfind RPM search engine | virtualization library
