RE: [xml] HTML-parser: encoding?



Peter Ring wrote at 30 Nov 2001 11:15:25 +0100:
...
HTML 4.0 specifies how to state the character set in an HTML 4.0 document.
The character encoding
(http://www.w3.org/TR/1998/REC-html40-19980424/types.html#type-charset) must

This refers only to "charset" attributes, which is the lowest priority
way of specifying the character encoding in HTML.

Section 5.2.2., Specifying the character encoding, [1] defines three
ways to specify the character encoding:

1. An HTTP "charset" parameter in a "Content-Type" field. 
2. A META declaration with "http-equiv" set to "Content-Type" and a
   value set for "charset".
3. The charset attribute set on an element that designates an external
   resource.
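
As a rough illustration of that priority order, here is a minimal Python
sketch. The names charset_from_content_type and pick_encoding are
hypothetical, and the sketch assumes the three candidate values have already
been pulled out of the HTTP response, the META element, and the linking
element. It is not how libxml or any particular user agent does it.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def pick_encoding(http_content_type=None, meta_content=None, link_charset=None):
    """Pick an encoding label, highest-priority source first (hypothetical helper)."""
    candidates = (
        charset_from_content_type(http_content_type),    # 1. HTTP Content-Type field
        charset_from_content_type(meta_content),         # 2. META http-equiv content
        link_charset.lower() if link_charset else None,  # 3. charset attribute on a link
    )
    for label in candidates:
        if label:
            return label
    return None  # nothing declared; the caller has to fall back to a default or a guess
```

For example, pick_encoding("text/html", "text/html; charset=windows-1252",
"utf-8") skips the HTTP field, which carries no charset parameter, and takes
the label from the META declaration.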

...
To the extent that HTML 4.0 and HTML 4.01 are SGML applications, the code
points 0x7f .. 0x9f are still not valid as character data, see

That is not true.

The HTML document can be in any encoding, provided the user agent can
map from the characters used in the document to the characters used in
the document character set.

http://www.w3.org/TR/1998/REC-html40-19980424/sgml/sgmldecl.html and
http://www.w3.org/TR/html4/sgml/sgmldecl.html. So, strictly speaking, even
with charset=windows-1252 you are not allowed to encode "U+201C : LEFT
DOUBLE QUOTATION MARK" as a 0x93 byte. The conforming method would be a
character entity reference, &ldquo;, which is defined in "-//W3C//ENTITIES
Special//EN//HTML", or a numerical reference, &#8220;, or (for HTML 4.01)
&#x201C;.

libxml, being an XML parser, might be expected to ignore the SGML
declaration, but it is still, I think, a failure mode that should not go
unnoticed. What about a call-back so that the application could decide what
to do?

XML is (or was intended to be) SGML on the Web.  XML uses SGML's
rationalisation of the independence of the storage representation of
the document (i.e., the document's character encoding) from the
document's document character set.  The same rationalisation is
applied to HTML.  That's why numeric character references like &#8220;
work in both HTML and XML no matter what character encoding is used
for the document.
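
A small Python sketch of that independence, for illustration only. It
assumes ASCII-compatible encodings and uses the standard library's
html.unescape, which follows HTML5 reference rules rather than an HTML 4.0
SGML parser; the helper name resolve is hypothetical.

```python
import html

LDQUO = "\u201c"  # U+201C LEFT DOUBLE QUOTATION MARK

# Named, decimal, and hexadecimal references to the same character
references = ["&ldquo;", "&#8220;", "&#x201C;"]


def resolve(reference, encoding):
    """Store the reference in the given encoding, decode it, then expand it."""
    stored = reference.encode(encoding)  # ASCII-only text, so this never fails here
    return html.unescape(stored.decode(encoding))


results = {
    (ref, enc): resolve(ref, enc)
    for ref in references
    for enc in ("ascii", "iso-8859-1", "windows-1252", "utf-8")
}

# True when every reference expands to U+201C under every encoding tried
all_same = all(ch == LDQUO for ch in results.values())
```

The reference itself consists only of ASCII characters, so the storage
encoding never gets a chance to change what it points to.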

XML uses MIME headers and the XML declaration and text declaration to
indicate the document's (actually, the entity's) character encoding.
HTML uses the HTTP "charset" parameter, the <META> element, and the
"charset" attribute to indicate the document's character encoding.
Once the HTML user agent or the XML processor "knows" the document's
character encoding, it can map from the character encoding (provided
it recognises the encoding) to the document character set: ISO/IEC
10646 for HTML 4.0, Unicode 3.0 (and beyond) for XML.
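
The mapping step can be sketched in Python along the following lines. This
is only an illustration under the assumption that Python's codec registry
stands in for "the encodings the processor recognises"; to_code_points is a
hypothetical helper, not part of libxml.

```python
import codecs


def to_code_points(data, encoding_label):
    """Map stored bytes to a list of code points, if the encoding is recognised."""
    try:
        codec = codecs.lookup(encoding_label)
    except LookupError:
        return None  # unrecognised encoding: no mapping is possible
    text, _ = codec.decode(data)  # strict mode: raises on bytes invalid in the encoding
    return [ord(ch) for ch in text]


# The same character stored under two different encodings
as_cp1252 = to_code_points(b"\x93", "windows-1252")
as_utf8 = to_code_points(b"\xe2\x80\x9c", "utf-8")

# A label the registry does not know
unknown = to_code_points(b"\x93", "x-no-such-encoding")
```

Both recognised cases end up at the single code point 0x201C, which is the
point of separating the storage encoding from the document character set.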

Regards,


Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin                mailto:tony graham sun com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708


[1] http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#spec-char-encoding


