Re: [xml] Problem with CDATA entities



Daniel Veillard <veillard redhat com> writes:

The w3 dtd has this in it:

  <!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space, 
                                    U+00A0 ISOnum -->
[...]
And the error from xmllint one gets is related directly to this:

  http://www.w3.org/TR/html40/HTMLlat1.ent:12: parser error : Entity value required

  Your XML file references an SGML DTD fragment, which has a different syntax.
As a result your XML is not an XML file; it is not well formed, but only
a validating XML parser fetching the external subset can detect it.
  As far as I can tell xmllint is right and the error message is quite accurate.

<snip/>

  In SGML! You are using an XML parser. Show me how you generate

   <!ENTITY nbsp   CDATA "&#160;" 

from the production [70] of
  http://www.w3.org/TR/REC-xml/#NT-EntityDecl

Seems people are so used to digesting any crap in RSS that they didn't even
manage to find this monstrosity, which any validating XML parser should show.
Blame them, not libxml2, thanks.

I apologize, Daniel. You are, of course, quite right.

The clue is even there in the DTD file where it says SGML. It is also
referenced in the HTML 4.01 specification as SGML.

The XHTML specification also mentions it and provides the correct URL
for the XML version of the same entity set:

   http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent

I've asked O'Reilly to switch their feed to this.
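
For anyone who wants to see the difference for themselves, here is a
minimal sketch in Python. It uses the standard library's expat-based
ElementTree parser rather than libxml2, and it puts the declarations in
an internal subset instead of fetching the W3C files, so it only
illustrates the syntax rule; it does not reproduce the xmllint run. The
document strings and the try_parse helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# SGML-style declaration, as in HTMLlat1.ent: the CDATA keyword is not
# allowed by the XML EntityDecl production.
SGML_STYLE = '<!DOCTYPE d [<!ENTITY nbsp CDATA "&#160;">]><d>a&nbsp;b</d>'

# XML-style declaration, the form used in xhtml-lat1.ent.
XML_STYLE = '<!DOCTYPE d [<!ENTITY nbsp "&#160;">]><d>a&nbsp;b</d>'


def try_parse(doc):
    """Return the root element's text, or None if the parser rejects doc."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    # The SGML form should be rejected; the XML form should expand &nbsp;.
    print(repr(try_parse(SGML_STYLE)))
    print(repr(try_parse(XML_STYLE)))
```

The first document is rejected because of the CDATA keyword; the second
parses, and &nbsp; expands to U+00A0.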


As you say, it probably shows that people are using cruddy tools to
parse RSS, which is a shame.

I've written an RSS aggregator using libxml2 (and libxslt), driven from a
mixture of shell and Python. It works quite well, save for issues like
this.

I'll release it as free software at some point.



Nic Ferrier


