Re: [xml] HTMLparser: UTF-8 byte order mark

From: Bjoern Hoehrmann <derhoermi gmx net>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] HTMLparser: UTF-8 byte order mark
Date: Tue, 03 Jan 2006 22:12:25 +0100

* Daniel Veillard wrote:

> Hum, I don't know how it should be processed in theory! In XML
> the BOM is fine at the beginning of a document entity in UTF-8 or UTF-16
> but will usually mess things up in different encodings. For HTML I don't
> know what the theory suggests. For compatibility I guess the character
> should be dropped if detected.


HTML character encoding detection is a terrible mess, and last time I
checked, libxml2 was not a compliant implementation: it treats <meta>
elements as encoding switches and does not re-parse the content preceding
the <meta> element (quite unlike browsers). Browsers typically treat the
BOM here as they would for XML documents.
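
To make the "drop it if detected" behaviour concrete, here is a minimal
sketch in Python. It is not libxml2 code and not what any particular
browser does internally; the function name strip_utf8_bom is made up for
illustration. It only checks for the three UTF-8 BOM bytes (EF BB BF) at
the very start of the input and removes them before the bytes would be
handed to a parser.

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Return (data without a leading UTF-8 BOM, whether a BOM was found).

    Hypothetical helper for illustration only; it does not look at
    <meta> elements or HTTP headers.
    """
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False


# Usage: a document that starts with a BOM is treated as UTF-8 and the
# BOM itself is dropped before parsing.
raw = codecs.BOM_UTF8 + b"<html><body>caf\xc3\xa9</body></html>"
body, had_bom = strip_utf8_bom(raw)
text = body.decode("utf-8") if had_bom else None
```

A BOM found this way could also be taken as the encoding declaration,
which is how it works for XML; whether it should win over a conflicting
<meta> element is a separate question that this sketch does not address.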
-- 
Björn Höhrmann · mailto:bjoern hoehrmann de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

