[xml] Bug in encoding detection with document()



It appears that libxslt1.1 honors the charset parameter of the Content-Type HTTP header when it retrieves XML files served as application/xml or text/xml via the document() function. If a misconfigured web server sends "Content-Type: text/xml; charset=iso-8859-15" but the XML file itself has no encoding declaration in its XML prolog (and so should be read as UTF-8), libxslt decodes the incoming file as ISO-8859-15 and mangles every multi-byte UTF-8 sequence, which includes most common vowels with diacritics. libxslt does not exhibit this behavior when the MIME type is 'text/html'. Saxon 6.5.5 does not exhibit it with any MIME type/charset combination.
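To make the mangling concrete, here is a minimal Python sketch of what happens when UTF-8 bytes are decoded as ISO-8859-15. It is only an illustration of the encoding mismatch, not of libxslt internals, and the sample string is made up for the example.

```python
# Illustration only: UTF-8 bytes misread as ISO-8859-15 (Latin-9).
# The sample text is hypothetical, not taken from the attached test.
text = "café crème"
utf8_bytes = text.encode("utf-8")

# What a consumer ends up with if it trusts "charset=iso-8859-15"
# from the HTTP header instead of treating the bytes as UTF-8.
mangled = utf8_bytes.decode("iso-8859-15")

# Each accented vowel (two bytes in UTF-8) becomes two unrelated characters.
print(mangled)

# The damage is reversible only if the wrong decoding is undone explicitly.
restored = mangled.encode("iso-8859-15").decode("utf-8")
print(restored == text)
```

Each accented vowel occupies two bytes in UTF-8, and ISO-8859-15 maps each of those bytes to its own character, so every such vowel comes out as two characters.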

I am attaching a test stylesheet that takes itself as input, retrieves a simple file in UTF-8 and Latin-9 encodings from a web server, and outputs the results with the MIME types and charsets noted. I have confirmed the bug in libxslt 1.1.24. Would anyone care to check it in more recent versions before I log the bug?
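For anyone who wants to reproduce the server side locally, the following is a rough Python sketch of a test server that sends the same UTF-8 document under several Content-Type headers. It is not the setup I used: the paths, the document content, and the helper names (start_server, fetch) are all made up for illustration, and it serves only a UTF-8 body rather than both encodings.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# UTF-8 body with no encoding declaration in the prolog (hypothetical content).
BODY = '<?xml version="1.0"?>\n<doc>café crème</doc>\n'.encode("utf-8")

# Hypothetical paths, each mapped to the Content-Type header to send.
CONTENT_TYPES = {
    "/xml-utf8": "text/xml; charset=utf-8",
    "/xml-latin9": "text/xml; charset=iso-8859-15",
    "/appxml-latin9": "application/xml; charset=iso-8859-15",
    "/html-latin9": "text/html; charset=iso-8859-15",
}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ctype = CONTENT_TYPES.get(self.path)
        if ctype is None:
            self.send_error(404)
            return
        # Same bytes every time; only the declared charset differs.
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet


def start_server(port=0):
    """Serve on localhost in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server, path):
    """Return (Content-Type header, raw body bytes) for one path."""
    conn = http.client.HTTPConnection(
        "127.0.0.1", server.server_address[1], timeout=5
    )
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.getheader("Content-Type"), resp.read()
    finally:
        conn.close()
```

With the server started via start_server(8000), a stylesheet could point document() at http://127.0.0.1:8000/xml-latin9 and at the other paths to compare the results.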

Thanks,
Chuck
--
Chuck Bearden (cbearden rice edu ; 713.348.3661)
XML Engineer, Connexions
http://cnx.org/

Attachment: test.xsl
Description: application/xml


