[xml] Why HTML numeric entity reference &#151; is parsed into 2 byte hex sequence c2 97?
- From: Yuri <yuri rawbw com>
- To: xml gnome org
- Subject: [xml] Why HTML numeric entity reference &#151; is parsed into 2 byte hex sequence c2 97?
- Date: Tue, 15 Jul 2008 19:13:43 -0700
I parsed an HTML document that declares itself as HTML 4.01 (<!DOCTYPE HTML PUBLIC
"-//W3C//DTD HTML 4.01 Transitional//EN">) using the 'htmlParseChunk' function.
The document contains the numeric reference &#151;, and libxml2 parsed it as 2 bytes:
c2 97. Browsers show it as a double dash.
The HTML standard (http://www.w3.org/TR/html401/sgml/entities.html)
doesn't mention this entity at all.
Googling it gives some vague references to the "mdash" entity, which
according to the HTML 4.01 standard has to be written as &mdash; or &#8212;.
But this mdash symbol is displayed the same way as &#151; in browsers.
My questions are:
Why does libxml2 parse &#151; as c2 97?
Does this HTML document violate the HTML standard, since it contains the
illegal reference &#151;?
Where is the complete list of all such numeric entities that can be
found in various HTML documents and that aren't in the standard but are
nevertheless understood by browsers?
Yuri
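
The byte values in the question can be checked with a little arithmetic. The sketch below is only an illustration, not a description of how libxml2 works internally: it shows that c2 97 is what you get when 151 is taken literally as a Unicode code point (U+0097) and encoded as UTF-8, and that byte 0x97 is the em dash in Windows-1252. The helper name utf8_hex is hypothetical, and the idea that browsers read &#151; through Windows-1252 is an assumption that this message does not confirm.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in chr(code_point).encode("utf-8"))


# &#151; taken literally as Unicode code point 151 (U+0097, a C1 control character).
literal = utf8_hex(151)

# &#8212;, the numeric form of mdash in HTML 4.01 (U+2014).
mdash = utf8_hex(8212)

# Assumption: byte value 151 (0x97) read as a Windows-1252 character instead.
cp1252_char = bytes([151]).decode("cp1252")

print(literal)                  # c2 97
print(mdash)                    # e2 80 94
print(cp1252_char == "\u2014")  # True: 0x97 is the em dash in Windows-1252
```

Under these assumptions, c2 97 is the UTF-8 form of U+0097, while a real mdash (U+2014) would be encoded as e2 80 94.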