Re: [xml] Bug#500015: Cannot parse feed containing SOH character
- From: Mike Hommey <mh glandium org>
- To: xml gnome org
- Subject: Re: [xml] Bug#500015: Cannot parse feed containing SOH character
- Date: Thu, 25 Sep 2008 22:43:41 +0200
Hi,
I got this forwarded as a wishlist bug for libxml2, but that doesn't
sound right to me. I always thought control characters are not allowed
in XML, though looking in the XML spec, I can't find anything
definitive...
Daniel, what do you think?
Mike
PS: You can see the whole thread on
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=500015
On Wed, Sep 24, 2008 at 07:30:39PM -0700, Matt Kraai wrote:
> On Wed, Sep 24, 2008 at 10:12:41AM -0700, Rodrigo Gallardo wrote:
> > > The feed at
> > >
> > > http://jc.ngo.org.uk/~nik/use.perl.journals.rss
> > >
> > > currently contains a SOH character (i.e., the 0x01 character). When I
> > > click on it in Liferea, it displays the following error message:
> > >
> > > XML Parsing Error: reference to invalid character number
> > > Location: file:///
> > > Line Number 20, Column 45:
> > >
> > > <pre>Aha. On the line 580 of that we have a  character. Which leads me to
> > > --------------------------------------------^
> > >
> > > The feed has a UTF-8 encoding declaration and the SOH character is a
> > > valid Unicode character, so I think this error is in error.
> >
> > As a matter of fact, the XML spec says (http://www.w3.org/TR/REC-xml/#dt-character)
> > that
> >
> > Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> >
> > so SOH (#x1) is not a valid char for an XML document.
>
> I don't think this is a correct inference. In
> http://www.w3.org/TR/REC-xml/#charsets, it says
>
> Consequently, XML processors MUST accept any character in the range
> specified for Char.
>
> Character Range
>
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>     /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
>
> but it doesn't specify that it must accept *only* characters in that
> range. In fact, the next paragraph states
>
> All XML processors MUST accept the UTF-8 and UTF-16 encodings of
> Unicode 3.1 ...
>
> In http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt, the
> list of Unicode 3.1 characters, the SOH character is the second entry.
>
> --
> Matt http://ftbfs.org/
>
>
>
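
For reference, the Char production quoted in the thread can be written out as a small membership test. This is only a sketch in Python, not anything libxml2 does internally; the names XML_CHAR_RANGES, is_xml_char and first_non_xml_char are made up for illustration. It simply encodes the six ranges of production [2] and reports whether a code point such as SOH (0x01) falls inside them.

```python
# Ranges copied from production [2] Char of the XML 1.0 spec, as quoted above.
# All names in this block are hypothetical helpers, not part of any library.
XML_CHAR_RANGES = (
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(code_point: int) -> bool:
    """Return True if code_point lies in one of the Char ranges."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


def first_non_xml_char(text: str):
    """Return (index, code point) of the first character outside Char, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return index, ord(ch)
    return None


if __name__ == "__main__":
    print(is_xml_char(0x01))                      # SOH
    print(is_xml_char(0x09))                      # TAB
    print(first_non_xml_char("we have a \x01 character"))
```

With these ranges, SOH is outside Char while TAB, LF and CR are inside it, which is the reading Rodrigo gives above. Whether a processor is required to reject characters outside Char is the separate question Matt raises, and this check does not settle it.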