[xml] Why HTML numeric entity reference &#151; is parsed into 2 byte hex sequence c2 97?



I parsed an HTML document that declares itself as HTML 4.01 (<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">) using the 'htmlParseChunk' function. The document contains the numeric character reference &#151;, and libxml2 parsed it into 2 bytes: c2 97. Browsers show it as a double-dash.

The HTML standard -- http://www.w3.org/TR/html401/sgml/entities.html -- doesn't mention this entity at all. Googling it gives some vague references to the "mdash" entity, which according to the HTML 4.01 standard has to be written as &#8212; or &mdash;.
But this mdash symbol is displayed the same way as &#151; in browsers.

My questions are:
Why does libxml2 parse &#151; as c2 97?
Does this HTML document violate the HTML standard, since it contains the illegal reference &#151;? Where is the complete list of all such numeric entities that are found in various HTML documents and aren't in the standard, but are nevertheless understood by browsers?
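
For reference, here is a minimal Python sketch of the arithmetic that seems to be involved. It is not libxml2 code. It assumes that libxml2 takes 151 literally as code point U+0097 and stores it as UTF-8, and that browsers instead treat 151 as byte 0x97 in the Windows-1252 code page. The variable names are only illustrative.

```python
# Code point named by the numeric reference &#151; (decimal 151 = hex 0x97).
code_point = 151

# Assumption 1: the parser takes 151 literally as U+0097 and stores it as UTF-8.
utf8_bytes = chr(code_point).encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # c2 97

# Assumption 2: browsers read 151 as byte 0x97 in Windows-1252 instead.
cp1252_char = bytes([code_point]).decode("cp1252")
print(ord(cp1252_char))  # 8212, the same number as in &#8212;
```

If both assumptions hold, the c2 97 output is simply the UTF-8 form of U+0097, and the browser rendering matches &#8212; because byte 0x97 in Windows-1252 is the same dash character.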

Yuri



