Re: [xml] HTMLparser: UTF-8 byte order mark
- From: Bjoern Hoehrmann <derhoermi gmx net>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] HTMLparser: UTF-8 byte order mark
- Date: Tue, 03 Jan 2006 22:12:25 +0100
* Daniel Veillard wrote:
> Hum, I don't know how it should be processed in theory! In XML
> the BOM is fine at the beginning of a document entity in UTF-8 or UTF-16
> but will usually mess things up in different encodings. For HTML I don't
> know what the theory suggests. For compatibility I guess the character
> should be dropped if detected.
HTML character encoding detection is a terrible mess, and last time I
checked libxml2 was not a compliant implementation: it treated <meta>
elements as encoding switches and did not re-parse the content preceding
the <meta> element (quite unlike browsers). Browsers typically treat the
BOM here as they would for XML documents.
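
To illustrate the "treat it as in XML" behaviour, here is a rough sketch in
Python (not libxml2 code): a BOM at the very start of the input is taken as
an encoding hint and dropped before parsing. The function name sniff_bom
and the choice to handle only UTF-8 and UTF-16 are assumptions made for
this example, not anything libxml2 or a browser actually exposes.

```python
import codecs
from typing import Optional, Tuple

# Hypothetical helper for illustration only; UTF-32 is deliberately ignored.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def sniff_bom(data: bytes) -> Tuple[Optional[str], bytes]:
    """Return (encoding, payload): the encoding implied by a leading BOM
    (or None if there is none) and the input with that BOM removed."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Drop the BOM so it never reaches the parser as a character.
            return encoding, data[len(bom):]
    return None, data
```

A caller would hand the returned payload and encoding to the HTML parser,
and fall back to its usual detection when the encoding is None.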
--
Björn Höhrmann · mailto:bjoern hoehrmann de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/