Re: [xml] disabling entity replacement



* Alex Khesin <alexk google com> [2006-04-19 07:00]:
> "If type="html", then this element contains entity escaped
> html.
>
> <title type="html">
>  AT&amp;amp;T bought &lt;b&gt;by SBC&lt;/b&gt;!
> </title>

Yes. There was a long discussion about whether to allow this at
all. In the end the Atom WG conceded that reality is suboptimal.
Escaped markup is still considered harmful.[1]

> The spec might be broken, from an XML perspective, but it is
> already in the wild.

Err, yeah, Atom is broken because it lets you transport XML as
XML. Exactly...

> Here is a snippet from a valid Atom 1.0 feed,
> http://www.intertwingly.net/blog/index.atom:
>
> <content type="xhtml">
>   ...
>  <pre class="code">&lt;script src="pager.js" type="text/javascript" 
>  /&gt;</pre>

Sure. That is completely and entirely unambiguous. When the SAX
parser gives you CHARACTER events, you get a string with all
entities decoded, but *it's a string.* Period. No markup
anywhere in sight. If there are angle brackets inside, they don't
mean anything.
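
To make that concrete, here is a minimal sketch in Python using the
standard library's xml.sax module. It is only an illustration, not code
from this thread; the TextCollector class and collect_text helper are
names made up for the example. The handler joins every chunk delivered
to characters(), and what comes out is a plain string with the entities
already decoded.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data it is given."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them all.
        self.chunks.append(content)


def collect_text(xml_bytes):
    """Parse xml_bytes and return the concatenated character data."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


doc = b'<pre class="code">&lt;script src="pager.js" type="text/javascript" /&gt;</pre>'
text = collect_text(doc)
# text is ordinary character data. It happens to contain '<' and '>',
# but the parser reported no element named "script".
print(text)
```

The only element event the handler would see here is for pre. The
script "tag" is just characters.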

> But I now know how to fix this, taking inspiration from
> http://feedparser.org/ - I will introduce entities back when
> type="xhtml".  Suboptimal, but works.

It's not suboptimal. It's exactly how XML works. The parser
always gives you decoded strings. Period. If you give this string
to a serialiser, then the serialiser will make sure it remains a
string by escaping any characters appropriately. Period. You
should not even think about which characters have special
meaning. You let the parser decode and you let the serialiser
encode. If you do it any other way, you risk being called a
bozo[2], opening yourself up to cross-site scripting attacks,
etc.
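
A small round-trip sketch of that division of labour, again in Python
and again only an illustration (xml.etree.ElementTree stands in for
whatever parser and serialiser you actually use; the variable names are
invented for the example). The parser hands back the decoded string,
and the serialiser re-escapes it without the application code ever
touching an ampersand.

```python
import xml.etree.ElementTree as ET

# The type="html" title from earlier in this message, on one line.
source = '<title type="html">AT&amp;amp;T bought &lt;b&gt;by SBC&lt;/b&gt;!</title>'

# Let the parser decode: one level of escaping is removed.
title = ET.fromstring(source)
decoded = title.text

# Let the serialiser encode: assign the string as text, never as markup.
copy = ET.Element("title", {"type": "html"})
copy.text = decoded
serialised = ET.tostring(copy, encoding="unicode")

print(decoded)
print(serialised)
```

Here decoded is the HTML fragment as a string, and serialised carries
it with the escaping restored. No manual search and replace of special
characters is involved at any point.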

[1]: http://norman.walsh.name/threads/nwn-escapedmarkup
[2]: http://hsivonen.iki.fi/producing-xml/

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>


