Re: [xml] Why HTML numeric entity reference &#151; is parsed into 2 byte hex sequence c2 97?




On Jul 15, 2008, at 7:13 PM, Yuri wrote:

I parsed an HTML document that declares itself as HTML 4.01 (<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">) using the 'htmlParseChunk' function. The HTML contains the entity reference &#151; and libxml2 parsed it as 2 bytes: c2 97. Browsers show it as a double dash.

The HTML standard -- http://www.w3.org/TR/html401/sgml/entities.html -- doesn't mention this entity at all. Googling it gives some vague references to the "mdash" entity, which according to the HTML 4.01 standard has to be written as &#8212; or &mdash;.
But this mdash symbol is displayed the same way as &#151; in browsers.

libxml2 is following the HTML 4.01 specification here. Per <http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1>, numeric character references specify the character as an ISO 10646 character number. As such, &#151; (decimal 151, hex 0x97) corresponds to U+0097 (END OF GUARDED AREA), which is represented as C2 97 when encoded as UTF-8.
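
The arithmetic can be checked outside libxml2. The following is only a minimal Python sketch, not libxml2 code; the helper name utf8_hex is made up for illustration. It treats the number in the reference as a character number and prints the UTF-8 bytes:

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of one character number as spaced hex bytes."""
    # chr() maps the number straight to the Unicode / ISO 10646 character,
    # which is the literal HTML 4.01 reading of a numeric character reference.
    encoded = chr(code_point).encode("utf-8")
    return " ".join(f"{byte:02x}" for byte in encoded)


print(utf8_hex(151))   # &#151;  read literally as U+0097: c2 97
print(utf8_hex(8212))  # &#8212; read literally as U+2014: e2 80 94
```

The first line matches the two bytes reported in the question; the second shows what an actual em dash looks like in UTF-8.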

Web browsers intentionally deviate from the standard to improve compatibility with existing web content. They do this by changing their behavior for numeric character references that correspond to the Microsoft extensions to Latin-1, known as windows-1252. Numeric character references in the range 0x80-0x9F (128-159) are interpreted as referring to characters in the windows-1252 character set rather than to ISO 10646 character numbers. Following this logic, &#151; is read as windows-1252 byte 0x97, which corresponds to U+2014 (EM DASH).
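
A rough way to reproduce the browser behavior after parsing is sketched below in Python. This is an assumption-laden illustration rather than what any particular browser or libxml2 does internally: the function name browser_style_char is hypothetical, Python's "cp1252" codec stands in for the windows-1252 table, and leaving the five unassigned positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) unchanged is a choice made here, not something taken from the specification.

```python
def browser_style_char(code_point: int) -> str:
    """Resolve a numeric character reference the lenient, browser-like way."""
    if 0x80 <= code_point <= 0x9F:
        try:
            # Treat the number as a windows-1252 byte instead of a
            # Unicode / ISO 10646 character number.
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # Position is unassigned in Python's cp1252 table.
            # Assumption for this sketch: fall back to the literal reading.
            return chr(code_point)
    # Outside 0x80-0x9F the literal reading is used unchanged.
    return chr(code_point)


dash = browser_style_char(151)
print(f"U+{ord(dash):04X}")        # U+2014
print(dash.encode("utf-8").hex())  # e28094
```

Looping this function over 128 through 159 gives the full set of references that are affected by the remapping.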

My questions are:
Why does libxml2 parse &#151; as c2 97?

Hopefully the explanation above addresses this.

Does this HTML document violate the HTML standard since it has the illegal reference &#151;?

No, &#151; is a valid numeric character reference as per <http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1>.

Where is the complete list of all such numeric entities that can be found in various HTML documents that aren't in the standard but nevertheless are understood by browsers?

Based on the specification, the complete list of numeric character references would be equivalent to the complete list of ISO 10646 character numbers.

Kind regards,

Mark Rowe



