RE: [xml] HTML-parser: encoding?



While those code points can be used for character data in XML (and XHTML),
they are not valid in an HTML document.
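
To make the XML side of that statement concrete, here is a minimal Python
sketch of the Char production from http://www.w3.org/TR/REC-xml#NT-Char. The
function name is_xml_char is my own, not part of libxml or any other library;
the ranges are those given in the XML 1.0 recommendation.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # includes 0x7F..0x9F
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Every code point in the disputed range 0x7F..0x9F is an XML Char.
all_allowed = all(is_xml_char(cp) for cp in range(0x7F, 0xA0))
```

Being allowed by Char only says the code points may appear in XML; it says
nothing about what an HTML SGML declaration permits.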

In HTML 2.0 (http://www.ietf.org/rfc/rfc1866.txt), they are not defined as
text characters in the SGML declaration (section 9.5, "SGML Declaration for
HTML"), and it is stated that "a larger character repertoire will be
specified in a future version of HTML. The document character set will be
[ISO-10646], or some subset that agrees with [ISO-10646]; in particular, all
numeric character references must use code positions assigned by
[ISO-10646]."

Neither are they valid in an HTML 3.2 document; see
http://www.w3.org/TR/REC-html32#sgmldecl.

HTML 4.0 specifies how to state the character set in an HTML 4.0 document.
The character encoding
(http://www.w3.org/TR/1998/REC-html40-19980424/types.html#type-charset) must
be registered with IANA (http://www.iana.org/assignments/character-sets)
according to http://www.ietf.org/rfc/rfc2278.txt. Here's the record for
Microsoft CP 1252:

  Name: windows-1252
  MIBenum: 2252
  Source: Microsoft  (see ../character-set-info/windows-1252)       [Wendt]
  Alias: None

The descriptions are available at
http://www.isi.edu/in-notes/iana/assignments/character-set-info/; they'll
usually point to a published specification, in this case
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm. You'll find an
image of the character set and the mapping to ISO-10646.

To the extent that HTML 4.0 and HTML 4.01 are SGML applications, the code
points 0x7f .. 0x9f are still not valid as character data, see
http://www.w3.org/TR/1998/REC-html40-19980424/sgml/sgmldecl.html and
http://www.w3.org/TR/html4/sgml/sgmldecl.html. So, strictly speaking, even
with charset=windows-1252 you are not allowed to encode "U+201C : LEFT
DOUBLE QUOTATION MARK" as a 0x93 byte. The conforming method would be a
character entity reference, &ldquo;, which is defined in "-//W3C//ENTITIES
Special//EN//HTML", or a numerical reference, &#8220;, or (for HTML 4.01)
&#x201C;.
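
As an illustration of the conforming method, here is a rough Python sketch
that decodes windows-1252 bytes and writes every non-ASCII character as a
decimal numeric character reference. The function name cp1252_to_ncr is
hypothetical; the decoding relies on Python's built-in "cp1252" codec, which
raises UnicodeDecodeError for the few byte values it leaves undefined (0x81,
for instance).

```python
def cp1252_to_ncr(raw: bytes) -> str:
    """Decode windows-1252 bytes; emit non-ASCII characters as &#N; references."""
    text = raw.decode("cp1252")  # 0x93 becomes U+201C, 0x94 becomes U+201D
    return "".join(
        ch if ord(ch) < 0x80 else f"&#{ord(ch)};"
        for ch in text
    )

# Curly quotes stored as the bytes 0x93 and 0x94.
result = cp1252_to_ncr(b"\x93quoted\x94")
print(result)  # &#8220;quoted&#8221;
```

The output is pure ASCII, so it no longer depends on the charset label at all.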

libxml, being an XML parser, might be expected to ignore the SGML
declaration, but it is still, I think, a failure mode that should not go
unnoticed. What about a call-back so that the application could decide what
to do?
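
To show what I mean by a call-back, here is a small Python sketch. It is not
libxml's API; decode_html_bytes, the handler signature and assume_cp1252 are
all invented for illustration, and the input is assumed to be labelled
ISO-8859-1. The decoder hands every byte in 0x7f .. 0x9f to the application
instead of passing it through silently.

```python
from typing import Callable

# Hypothetical handler: receives (byte offset, offending byte), returns replacement text.
Handler = Callable[[int, int], str]

def decode_html_bytes(raw: bytes, on_invalid: Handler) -> str:
    """Decode bytes as ISO-8859-1, deferring 0x7F..0x9F to the application."""
    out = []
    for offset, byte in enumerate(raw):
        if 0x7F <= byte <= 0x9F:
            out.append(on_invalid(offset, byte))
        else:
            out.append(chr(byte))  # ISO-8859-1 maps each byte to the same code point
    return "".join(out)

def assume_cp1252(offset: int, byte: int) -> str:
    """One possible policy: reinterpret the byte as windows-1252."""
    try:
        ch = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return "\ufffd"  # byte is undefined in Python's cp1252 codec
    return f"&#{ord(ch)};"

fixed = decode_html_bytes(b"a\x93b", assume_cp1252)
```

Another application could just as well log the offset and abort, or drop the
byte; the point is only that the decision is not made silently by the parser.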

Kind regards,
Peter Ring


-----Original Message-----
From: Daniel Veillard [mailto:veillard redhat com]
Sent: Thursday, November 29, 2001 11:43 PM
To: Elizabeth Mattijsen
Cc: Melvyn Sopacua; xml gnome org
Subject: Re: [xml] HTML-parser: encoding?


On Thu, Nov 29, 2001 at 11:28:59PM +0100, Elizabeth Mattijsen wrote:
At 05:25 PM 11/29/01 -0500, Daniel Veillard wrote:
Changing it to numeric entities would actually be best, as it wouldn't lose
any information.  Hmm...  but can you actually do that?  Wouldn't the next
time you read this into an xml parser, re-create the encoding error again
(having the entity processed)?
  no it would be fine as long as they are in the ranges defined by
XML as valid Chars.

Yes, but they aren't!  That's why they are causing problems in the first 
place.  Or am I missing something here?

  yes 7f-A0 are in the range
  http://www.w3.org/TR/REC-xml#NT-Char

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml gnome org
http://mail.gnome.org/mailman/listinfo/xml


