Re: [xml] HTMLparser: UTF-8 byte order mark



* Daniel Veillard wrote:
 Hm, I don't know how it should be processed in theory! In XML
the BOM is fine at the beginning of a document entity in UTF-8 or UTF-16
but will usually mess things up in different encodings. For HTML I don't
know what the theory suggests. For compatibility I guess the character
should be dropped if detected.

HTML character encoding detection is a terrible mess, and last time I
checked, libxml2 was not a compliant implementation: it treated <meta>
elements as encoding switches and did not re-parse the content preceding
the <meta> element (quite unlike browsers). Browsers typically treat the
BOM here as they would for XML documents.
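
To illustrate the "drop it if detected" approach, here is a minimal
sketch in Python (not libxml2's C code, and not how libxml2 actually
implements it). The helper name strip_utf8_bom is hypothetical. It
assumes the whole document is available as a byte buffer and only
handles the UTF-8 BOM, not UTF-16.

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Hypothetical helper: return (payload, had_bom) for a raw byte buffer.

    The UTF-8 BOM is the three bytes EF BB BF (codecs.BOM_UTF8).
    """
    if data.startswith(codecs.BOM_UTF8):
        # Drop the BOM so it never reaches the parser as a stray character.
        return data[len(codecs.BOM_UTF8):], True
    return data, False


raw = codecs.BOM_UTF8 + b"<html><body>ok</body></html>"
payload, had_bom = strip_utf8_bom(raw)

# Assumption: a detected BOM is taken as the encoding signal, as it would
# be for XML; None means "fall back to other detection" (e.g. <meta>).
encoding = "utf-8" if had_bom else None
```

Whether a BOM should override a conflicting <meta> declaration is a
policy choice this sketch does not settle.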
-- 
Björn Höhrmann · mailto:bjoern hoehrmann de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


