Re: [xml] Possible bug with Byte Order Marks



On Mon, Jun 02, 2003 at 10:29:58AM +0100, Mark Itzcovitz wrote:
Using xmlDocFormatDump to output a document that is encoded in UTF-16 to a
file, the BOM is initially created in the output buffer but then overwritten
by the start of the document. I believe this is because of a bug in the
function xmlCharEncOutFunc in encoding.c when the input buffer pointer is
NULL. The output handler returns the number of bytes written (maybe not the
case for all output handlers?) so the line that reads:

          if (ret == 0) { /* Gennady: check return value */

should read:

          if (ret >= 0) { /* Gennady: check return value */

  Hum, this looks right, will do.
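
  For readers of the archive, the effect described above can be sketched with a small stand-alone model. This is not the libxml2 source: demo_buf, demo_write_bom, demo_init_output and demo_append are hypothetical names, and the only assumption carried over from the report is that the output handler returns the number of bytes it wrote rather than 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an output buffer; not libxml2's own type. */
typedef struct {
    unsigned char data[16];
    size_t use;                 /* number of committed bytes in data */
} demo_buf;

/* Hypothetical init handler: writes a UTF-16LE BOM and returns the
 * number of bytes written (>= 0), or -1 on error. On entry *outlen is
 * the available space, on exit it is the number of bytes produced. */
static int demo_write_bom(unsigned char *out, int *outlen)
{
    if (*outlen < 2)
        return -1;
    out[0] = 0xFF;
    out[1] = 0xFE;
    *outlen = 2;
    return 2;
}

/* Initialization step taken when there is no input buffer.
 * strict_zero != 0 models the old `ret == 0` test,
 * strict_zero == 0 models the proposed `ret >= 0` test. */
static int demo_init_output(demo_buf *buf, int strict_zero)
{
    int written = (int)(sizeof(buf->data) - buf->use);
    int ret = demo_write_bom(buf->data + buf->use, &written);
    int ok = strict_zero ? (ret == 0) : (ret >= 0);

    if (ok)
        buf->use += (size_t)written;   /* commit the BOM bytes */
    return ret;
}

/* Append document bytes at the current write position. */
static void demo_append(demo_buf *buf, const unsigned char *src, size_t n)
{
    assert(buf->use + n <= sizeof(buf->data));
    memcpy(buf->data + buf->use, src, n);
    buf->use += n;
}
```

  With the `ret == 0` test the handler's positive return value is treated as "nothing to commit", so the write position stays at 0 and the first document bytes land on top of the BOM. With `ret >= 0` the two BOM bytes are committed and the document follows them.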

I notice that the iconv code just below doesn't have this check at all - I
wonder if it should. I'm not using iconv at the moment so I can't easily
test it.

  Hum, let's assume it's not broken :-)

I also have a query for documents that have a BOM and where the encoding
declaration specifies UTF-16 (not UTF-16LE or UTF-16BE). There is no problem
reading in the document, but xmlDocDumpMemory fails, I think because it
can't decide what encoding handler to use. xmlDocFormatDump doesn't fail but
outputs the document in UTF-8.

Should an encoding declaration of UTF-16 work?

  Yes, and it should default to Windows (little-endian) byte order, since
Windows users are the main users of UTF-16. I take patches!
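
  One way a patch could resolve a bare "UTF-16" label is sketched below. It assumes the byte order of the original BOM is still known when the document is written out, and it falls back to little-endian when it is not. The names demo_utf16_order and demo_utf16_byte_order are hypothetical and are not part of libxml2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the sketch. */
typedef enum { DEMO_UTF16_LE, DEMO_UTF16_BE } demo_utf16_order;

/* Pick a byte order for a document labelled plain "UTF-16".
 * p/len are the first bytes of the original input, if available.
 * FF FE means little-endian, FE FF means big-endian; with no BOM the
 * sketch defaults to little-endian, as suggested in the reply above. */
static demo_utf16_order demo_utf16_byte_order(const unsigned char *p,
                                              size_t len)
{
    assert(p != NULL || len == 0);

    if (len >= 2) {
        if (p[0] == 0xFF && p[1] == 0xFE)
            return DEMO_UTF16_LE;
        if (p[0] == 0xFE && p[1] == 0xFF)
            return DEMO_UTF16_BE;
    }
    return DEMO_UTF16_LE;   /* assumption: no BOM, use the default */
}
```

  The result could then be used to look up a "UTF-16LE" or "UTF-16BE" handler instead of failing or silently falling back to UTF-8.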

I'm using version 2.5.7 on Windows (and Solaris (and OpenVMS)).

  okay, thanks !

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


