Re: [xml] Why HTML numeric entity reference &#151; is parsed into 2 byte hex sequence c2 97?
- From: Mark Rowe <mrowe apple com>
- To: yuri rawbw com
- Cc: xml gnome org
- Subject: Re: [xml] Why HTML numeric entity reference &#151; is parsed into 2 byte hex sequence c2 97?
- Date: Tue, 15 Jul 2008 20:52:42 -0700
On Jul 15, 2008, at 7:13 PM, Yuri wrote:
I parsed the HTML claiming that it's HTML 4.01 (<!DOCTYPE HTML
PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">) using
'htmlParseChunk' function.
HTML contains entity reference &#151; and libxml2 parsed it as 2
bytes: c2 97. Browsers show it as double-dash.
HTML standard -- http://www.w3.org/TR/html401/sgml/entities.html --
doesn't mention this entity at all.
Googling it gives some vague references to the "mdash" entity which
according to HTML-4.01 standard has to look like &#8212; or &mdash;.
But this mdash symbol is displayed the same way as &#151; in browsers.
libxml2 is following the HTML 4.01 specification here. Per
<http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1>, numeric
character references specify the character as an ISO 10646
character number. As such, &#151; corresponds to U+0097 (END OF
GUARDED AREA), which is represented as C2 97 when encoded as UTF-8.
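To make the C2 97 arithmetic concrete, here is a minimal Python sketch of two-byte UTF-8 encoding. The function name `utf8_encode_two_byte` is made up for illustration and has nothing to do with libxml2's internals; it only shows how code point 151 (0x97) ends up as those two bytes.

```python
def utf8_encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes.

    Hypothetical helper for illustration only; real code would simply
    call chr(code_point).encode("utf-8").
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    lead = 0xC0 | (code_point >> 6)      # 110xxxxx carries the upper bits
    trail = 0x80 | (code_point & 0x3F)   # 10xxxxxx carries the low 6 bits
    return bytes([lead, trail])


# 151 decimal is 0x97, i.e. U+0097 when read as an ISO 10646 number.
encoded = utf8_encode_two_byte(151)
print(encoded.hex(" "))
```

The result agrees with Python's built-in encoder, `chr(151).encode("utf-8")`.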
Web browsers intentionally deviate from the standard to improve
compatibility with existing web content. They do this by changing
their behavior for numeric character references that correspond to the
Microsoft extensions to latin-1, known as windows-1252. Numeric
character references in the range 0x80 - 0x9f (128 - 159) are
interpreted as referring to characters in the windows-1252 character
set rather than to ISO 10646 character numbers. Following this logic,
&#151; corresponds to U+2014 (EM DASH).
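The two interpretations can be compared side by side with a short Python sketch. `resolve_numeric_reference` and its `browser_compat` flag are hypothetical names invented for this example, and Python's `cp1252` codec is assumed to be a reasonable stand-in for the windows-1252 table that browsers consult.

```python
import html


def resolve_numeric_reference(number: int, browser_compat: bool = False) -> str:
    """Resolve the number from a reference such as &#151; to a character.

    Hypothetical helper. With browser_compat=False the number is taken as
    an ISO 10646 code point, as HTML 4.01 specifies. With
    browser_compat=True, numbers in 0x80..0x9F are looked up in
    windows-1252 instead.
    """
    if browser_compat and 0x80 <= number <= 0x9F:
        try:
            return bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # position unassigned in Python's cp1252 codec; fall through
    return chr(number)


strict = resolve_numeric_reference(151)                       # U+0097
compat = resolve_numeric_reference(151, browser_compat=True)  # U+2014

print(f"strict: U+{ord(strict):04X} -> {strict.encode('utf-8').hex(' ')}")
print(f"compat: U+{ord(compat):04X} -> {compat.encode('utf-8').hex(' ')}")

# Python's own html.unescape applies the browser-style remapping.
print(f"html.unescape: U+{ord(html.unescape('&#151;')):04X}")
```

The strict path reproduces the C2 97 bytes that libxml2 emits, while the compatibility path yields the three-byte UTF-8 encoding of the em dash.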
My questions are:
Why libxml2 parses &#151; as c2 97?
Hopefully the explanation above addresses this.
Does this HTML document violate HTML standard since it has illegal
reference &#151;?
No, &#151; is a valid numeric character reference as per
<http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1>.
Where is the complete list of all such numeric entities that can be
found in various HTML documents that aren't in standard but
nevertheless are understood by browsers?
Based on the specification, the complete list of numeric character
references would be equivalent to the complete list of ISO 10646
character numbers.
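For the references that browsers reinterpret (the 0x80 - 0x9f range), the affected numbers can be enumerated with a rough Python sketch. It assumes Python's `cp1252` codec mirrors the windows-1252 table; the function name `windows_1252_overrides` is hypothetical, and individual browsers may treat the unassigned positions differently.

```python
import unicodedata


def windows_1252_overrides() -> dict:
    """Map numbers in 0x80..0x9F to their windows-1252 characters.

    Hypothetical helper. Positions that Python's cp1252 codec leaves
    unassigned are skipped rather than guessed at.
    """
    table = {}
    for number in range(0x80, 0xA0):
        try:
            table[number] = bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # no windows-1252 character at this position
    return table


overrides = windows_1252_overrides()
for number, char in overrides.items():
    print(f"&#{number}; -> U+{ord(char):04X} {unicodedata.name(char)}")
```

Every number outside that range is read as a plain ISO 10646 character number under both interpretations.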
Kind regards,
Mark Rowe