Re: [xml] Bug in encoding detection with document()

From: Chuck Bearden <cbearden rice edu>
To: Bjoern Hoehrmann <derhoermi gmx net>
Cc: xml gnome org
Subject: Re: [xml] Bug in encoding detection with document()
Date: Mon, 23 Mar 2009 16:21:12 -0500

Bjoern Hoehrmann wrote:

* Chuck Bearden wrote:
It appears that libxslt1.1 pays attention to the charset declaration in the Content-Type HTTP header when retrieving XML files with MIME types of application/xml or text/xml via the document() function. If a misconfigured web server sends "Content-Type: text/xml; charset=iso-8859-15" but the XML file itself has no encoding declaration in the XML prolog (and is thus to be taken as UTF-8), libxslt treats the incoming file as ISO-8859-15 and so mangles byte sequences that express e.g. many common vowels with diacritics.
The charset parameter takes precedence over internal labels and defaults
so it is the misconfigured server that mangles those sequences. See e.g.
RFC 3023 for a discussion.

Thanks for the information. So it looks like in this case Saxon 6.5.5 is not following the RFC.
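
For reference, the precedence rule quoted above (charset parameter over internal labels and defaults) can be sketched roughly as follows. This is only an illustration of my reading of RFC 3023, not how libxslt or Saxon actually implement it; the function name effective_encoding is made up for the example, and BOM detection is left out.

```python
import re

def effective_encoding(content_type, body):
    """Choose the encoding for an XML entity fetched over HTTP (hypothetical helper)."""
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # 1. An explicit charset parameter wins over anything inside the document.
    m = re.search(r'charset\s*=\s*"?([^";\s]+)"?', params, re.IGNORECASE)
    if m:
        return m.group(1).lower()

    # 2. text/xml with no charset parameter defaults to us-ascii.
    if media_type == "text/xml":
        return "us-ascii"

    # 3. application/xml with no charset: fall back to XML's own rules,
    #    i.e. the encoding declaration if present, otherwise UTF-8.
    #    (Assumption: BOM sniffing is omitted to keep the sketch short.)
    m = re.match(rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', body)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"
```

With the header from my test case, effective_encoding("text/xml; charset=iso-8859-15", b"<doc/>") returns "iso-8859-15" even though the document itself carries no encoding declaration.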

When you say that the misconfigured server mangles the bytes, I take it that you mean it does so by virtue of giving the wrong information to libxslt. The test files are byte-for-byte identical when retrieved with wget, so they aren't directly modified by the server.
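
A small sketch of what I mean, assuming a made-up sample string rather than my actual test files: the bytes stay the same, and only the label used to decode them changes the result.

```python
# Bytes as stored on the server: e-acute in UTF-8 is the two bytes 0xC3 0xA9.
raw = "<p>caf\u00e9</p>".encode("utf-8")

as_utf8 = raw.decode("utf-8")          # what the file actually contains
as_latin9 = raw.decode("iso-8859-15")  # what the HTTP header claims it is

print(as_utf8)
print(as_latin9)  # each UTF-8 byte is now read as a separate character
```

The payload is never altered in transit; the two-byte sequence is simply read as two ISO-8859-15 characters instead of one UTF-8 character.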


Thanks again for the info.  I appreciate the pointers.
Chuck
--
Chuck Bearden (cbearden rice edu ; 713.348.3661)
XML Engineer, Connexions
http://cnx.org/
