RE: [xml] HTML-parser: encoding?



Well I sort of hoped for someone to point out that I'd missed something.

Are you saying that the user agent is free to ignore (part of) the SGML
declaration wrt. CHARSET?

The guidance in the HTML 4.01 rec amounts to this:

"5.2.1 Choosing an encoding http://www.w3.org/TR/html4/charset.html#h-5.2.1
Authoring tools (e.g., text editors) may encode HTML documents in the
character encoding of their choice, and the choice largely depends on the
conventions used by the system software. These tools may employ any
convenient encoding that covers most of the characters contained in the
document, provided the encoding is _correctly labeled_. Occasional
characters that fall outside this encoding may still be represented by
_character references_. These always refer to the document character set,
not the character encoding."

"Correctly labeled" can be achieved in a number of ways, as you point out, but
must refer to an IANA registered character set in any case.

And "character references" are described as:

"5.3 Character references http://www.w3.org/TR/html4/charset.html#h-5.3
A given character encoding may not be able to express all characters of the
document character set. For such encodings, or when hardware or software
configurations do not allow users to input some document characters
directly, authors may use SGML character references. Character references
are a character encoding-independent mechanism for entering any character
from the document character set.

Character references in HTML may appear in two forms:

 * Numeric character references (either decimal or hexadecimal). 
 * Character entity references."

Which might be interpreted to the effect that in case of a windows-1252
encoded document, "U+201C : LEFT DOUBLE QUOTATION MARK" can be encoded as

 * A character entity reference, &ldquo;
 * A numerical reference to the document character set (ISO-10646),
   &#8220;, or (for HTML 4.01) &#x201C;.
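
As a rough check of the equivalence claimed above, here is a minimal Python
sketch. It assumes the three references in question are &ldquo;, &#8220; and
&#x201C;, and it uses the standard library's html.unescape, which follows
HTML5 rules rather than the HTML 4.01 SGML declaration, so treat it as an
illustration only.

```python
import html

# The three reference forms for U+201C LEFT DOUBLE QUOTATION MARK
# (assumed spellings: named entity, decimal reference, hexadecimal reference).
forms = ["&ldquo;", "&#8220;", "&#x201C;"]

# Resolve each reference to the character it denotes.
decoded = [html.unescape(form) for form in forms]

for form, char in zip(forms, decoded):
    print(f"{form:10} -> U+{ord(char):04X}")
```

All three resolve to the same code point regardless of the byte encoding the
document was saved in, which is the point of the quoted section 5.3.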

But the following is not a valid encoding:

  * A 0x93 byte (because of the SGML declaration).

And the following is not possible:

  * A numerical reference to the document's character encoding (windows-1252),
    &#147; or &#x93; (because numerical references go to ISO-10646, and at
    that code point, there's a C1 control, SET TRANSMIT STATE).
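
To make the two readings of 0x93 concrete, a small Python sketch follows. It
only shows what the value means as a windows-1252 byte versus as an ISO-10646
code point; it does not settle whether the byte is valid character data in an
HTML 4.01 document.

```python
import unicodedata

# Reading 1: 0x93 is a byte in a windows-1252 encoded stream.
raw = bytes([0x93])
as_cp1252 = raw.decode("windows-1252")

# Reading 2: 0x93 is a code point in the document character set,
# which is what a numeric reference with that value addresses.
as_code_point = chr(0x93)

print(f"byte 0x93 in windows-1252 -> U+{ord(as_cp1252):04X}",
      unicodedata.name(as_cp1252))
print(f"code point 0x93           -> U+{ord(as_code_point):04X}",
      "category", unicodedata.category(as_code_point))
```

The first line reports LEFT DOUBLE QUOTATION MARK; the second reports general
category Cc, i.e. a control character.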

If we know the character encoding is windows-1252 or whatever, we can safely
map invalid character data and almost safely map obviously unintended
numerical references to the intended characters.
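
A minimal sketch of the "almost safe" mapping for numerical references, in
Python. The function name repair_c1_references and its assumed_encoding
parameter are made up for illustration; this is not how libxml2's HTML parser
does it. It rewrites only references in the C1 range 0x80..0x9F and leaves
everything else, including values the assumed encoding does not define,
untouched.

```python
import re

# Matches decimal (&#147;) and hexadecimal (&#x93;) numeric character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def repair_c1_references(text, assumed_encoding="windows-1252"):
    """Replace numeric references that point into the C1 control range
    with the character the author probably meant, assuming the value was
    really a byte in `assumed_encoding` (hypothetical helper)."""

    def fix(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not 0x80 <= code_point <= 0x9F:
            return match.group(0)  # an ordinary reference, keep as written
        try:
            return bytes([code_point]).decode(assumed_encoding)
        except UnicodeDecodeError:
            return match.group(0)  # undefined in that encoding, keep as written

    return _NUMERIC_REF.sub(fix, text)


print(repair_c1_references("&#147;quoted&#x94; and &#8220;fine&#8221;"))
```

Keeping the original reference when no mapping exists means nothing is lost
for the unknown cases; it only postpones the decision.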

If we don't know the encoding, then what? How do we preserve the fact that
these characters represent something that we don't know?


Kind regards,
Peter Ring


-----Original Message-----
From: Tony Graham [mailto:tony graham sun com]
Sent: Friday, November 30, 2001 12:02 PM
To: xml gnome org
Subject: RE: [xml] HTML-parser: encoding?


<snip />

...
To the extent that HTML 4.0 and HTML 4.01 are SGML applications, the code
points 0x7f .. 0x9f are still not valid as character data, see

That is not true.

The HTML document can be in any encoding, provided the user agent can
map from the characters used in the document to the characters used in
the document character set.

<snip />


