Re: [xml] HTML-parser: encoding?




I don't agree with the pre-processing, but _can_ agree with the iconv 
post-processing.  Would be nice if it would be part of xmllint, though...

  What are the range in 00-FF which are not part of ISO 8859-1,
is that just [80-A0[ ?

Actually it's not even [80-A0[.

As I recently learned (from Martin v. Loewis on the python xml mailing list):
*All* bytes are valid characters in ISO-8859-1 (it is a common
misconception about Latin-1 that 128-159 are not defined).

see
http://208.56.196.240/misc/ISO-8859-1.HTML

So [80-A0[ is valid in ISO-8859-1, though those bytes encode C1 control
codes rather than printable characters.
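
A quick way to check this is a small Python sketch. It assumes Python's
built-in "iso-8859-1" codec is a fair stand-in for the charset under
discussion; the variable names are only illustrative.

```python
import unicodedata

# Every byte value 0x00-0xFF decodes under ISO-8859-1;
# byte N simply becomes code point U+00NN.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
identity = all(ord(ch) == i for i, ch in enumerate(decoded))

# The half-open range [80-A0[ (128-159) lands on control codes:
# collect the Unicode general categories of those 32 code points.
c1_categories = {unicodedata.category(ch) for ch in decoded[0x80:0xA0]}

print(identity, c1_categories)
```

No byte raises a decoding error here, and the 128-159 block comes out
as category "Cc" (control), not as unassigned.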

They aren't ruled out as characters in XML either
(Char    ::=    #x9 | #xA | #xD | 
               [#x20-#xD7FF] | 
               [#xE000-#xFFFD] | 
               [#x10000-#x10FFFF])
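
The Char production above can be transcribed directly into a predicate.
This is just a sketch for checking the range by hand; is_xml_char is a
made-up helper name, not anything from libxml2.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Check the code points that Latin-1 bytes [80-A0[ map to.
c1_allowed = all(is_xml_char(cp) for cp in range(0x80, 0xA0))
print(c1_allowed)
```

U+0080 to U+009F fall inside [#x20-#xD7FF], so they pass, whereas a
code point such as U+0000 does not.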

So IMHO one correct way to handle these is simply to convert them to
UTF-8 (if that's the output encoding), or to leave them as they are if
the output encoding is ISO-8859-1.
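
To illustrate both options, here is a minimal Python sketch, again
assuming Python's codecs rather than whatever libxml2/iconv does
internally. The sample bytes are made up for the example.

```python
# 'A', one byte from [80-A0[, and e-acute, all as ISO-8859-1 input.
raw = bytes([0x41, 0x85, 0xE9])
text = raw.decode("iso-8859-1")

# Option 1: UTF-8 output. Each code point above U+007F becomes two bytes.
as_utf8 = text.encode("utf-8")

# Option 2: ISO-8859-1 output. The bytes pass through unchanged.
as_latin1 = text.encode("iso-8859-1")

print(as_utf8.hex(" "), as_latin1.hex(" "))
```

Byte 0x85 becomes the sequence C2 85 in UTF-8 and stays 0x85 when the
output is ISO-8859-1, so neither direction needs to drop or reject it.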

greetings
        Morus


