[xml] Problem with CDATA entities



The problem described in this article relates to the Debian libxml2.
Here's the version report from xmllint:

  xmllint: using libxml version 20616
  compiled with: DTDValid FTP HTTP HTML C14N Catalog XPath XPointer XInclude Iconv Unicode Regexps Automata Schemas


I'm having a problem with CDATA entities. You can see the same problem
by doing this:

  xmllint 'http://www.oreillynet.com/meerkat/?_fl=rss10&t=ALL&c=5136'

In other words, download the O'Reilly ONJAVA RSS feed. This feed
includes the HTML Latin-1 entity set through a parameter entity, like
this:

  <!DOCTYPE rdf:RDF [
  <!ENTITY % HTMLlat1 PUBLIC
     "-//W3C//ENTITIES Latin1//EN//HTML"
     "http://www.w3.org/TR/PR-html40/HTMLlat1.ent">
  %HTMLlat1;
  ]>

The W3C entity file contains declarations like these:

  <!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space, 
                                    U+00A0 ISOnum -->
  <!ENTITY iexcl  CDATA "&#161;" -- inverted exclamation mark, U+00A1 ISOnum -->
  <!ENTITY cent   CDATA "&#162;" -- cent sign, U+00A2 ISOnum -->
  <!ENTITY pound  CDATA "&#163;" -- pound sign, U+00A3 ISOnum -->
  <!ENTITY curren CDATA "&#164;" -- currency sign, U+00A4 ISOnum -->
  <!ENTITY yen    CDATA "&#165;" -- yen sign = yuan sign, U+00A5 ISOnum -->
  <!ENTITY brvbar CDATA "&#166;" -- broken bar = broken vertical bar,
                                    U+00A6 ISOnum -->

The errors xmllint reports point directly at these declarations:

  http://www.w3.org/TR/html40/HTMLlat1.ent:12: parser error : Entity value required
  <!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space,
                  ^
  http://www.w3.org/TR/html40/HTMLlat1.ent:12: parser error : Space required before 'NDATA'
  <!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space,
                  ^
  http://www.w3.org/TR/html40/HTMLlat1.ent:12: parser error : xmlParseEntityDecl: entity nbsp not terminated
  <!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space,
                ^

This is clearly wrong: the CDATA keyword declares that the entity's
value is not to be parsed further. Expanding nbsp as declared above,
for example, will result in:

  &nbsp;   =>    '&#160;'

whereas:

  <!ENTITY nbsp "&#160;">

will expand to:

  &nbsp;   =>    ' '
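
As a point of comparison, here is a minimal sketch that feeds both
declaration forms to a different parser: Python's standard
xml.etree.ElementTree, which is built on expat rather than libxml2.
The two document strings, the root element name r and the helper
expand() are made up for illustration, and the result says nothing
about libxml2 beyond what the xmllint output above already shows.

```python
import xml.etree.ElementTree as ET

# Hypothetical test documents: the same entity declared in the plain
# XML form and in the form with the CDATA keyword used by HTMLlat1.ent.
XML_STYLE = '<!DOCTYPE r [<!ENTITY nbsp "&#160;">]><r>a&nbsp;b</r>'
SGML_STYLE = '<!DOCTYPE r [<!ENTITY nbsp CDATA "&#160;">]><r>a&nbsp;b</r>'


def expand(doc):
    """Return the root element's text, or the parser's error message."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError as err:
        return "parse error: %s" % err


# repr() makes a no-break space visible as '\xa0' instead of a blank.
print(repr(expand(XML_STYLE)))
print(repr(expand(SGML_STYLE)))
```

If this other parser also rejects the second form, the difference
probably lies in how the declaration is written rather than in the
Debian build of libxml2.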


Interestingly, this:

  http://www.flightlab.com/~joe/sgml/cdata.html

suggests that there is common confusion about CDATA entities.



Nic Ferrier


