
Re: [xml] Possible bug with Byte Order Marks



On Tue, Jun 03, 2003 at 03:04:13PM +0100, Mark Itzcovitz wrote:
> On reflection, I think I'm wrong in what I say above. RFC2781 is about MIME
> types - the xml spec seems to say that the encoding declaration should just
> say UTF-16 and is used in conjunction with the BOM.

  okay,

> >   That should work, with the caveat you saw. I'm a bit concerned about
> > the requirement to add a field to record the encoding, this should already
> > be stored somewhere on the context or in the inputStream block.
> 
> The encoding from the encoding declaration is stored, but I can't see where
> the encoding derived from xmlDetectCharEncoding is stored.

  okay, now the problem is that this is not parser information but
entity information, so ideally it should be saved in the input structure
block. But I think the encoding="" value is always finer grained than
the result of xmlDetectCharEncoding, except in the case of UTF-16.
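
To illustrate why UTF-16 is the exception: the declaration encoding="UTF-16" says nothing about byte order, while the first two bytes of the entity do. The following is a minimal sketch of that BOM check, not libxml2 code. The utf16_bom type and detect_utf16_bom function are hypothetical names made up for this example, and the sketch deliberately ignores the UTF-32LE case (FF FE 00 00), which a real detector would have to rule out first.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch; not a libxml2 enum. */
typedef enum {
    UTF16_BOM_NONE, /* buffer does not start with a UTF-16 BOM */
    UTF16_BOM_BE,   /* bytes FE FF: big endian */
    UTF16_BOM_LE    /* bytes FF FE: little endian */
} utf16_bom;

/* Inspect the first two bytes of an entity and report the byte order
 * announced by its BOM, if any. Assumption: UTF-32 input has already
 * been excluded by the caller. */
utf16_bom detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return UTF16_BOM_NONE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return UTF16_BOM_BE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return UTF16_BOM_LE;
    return UTF16_BOM_NONE;
}
```

The byte order found this way is the piece of per-entity information that the plain "UTF-16" name cannot carry.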

> >   Seems one way, the other way would be in case of just "UTF-16" being
> > passed to actually serialize a BOM on output to keep something similar,
> > except we would always dump big endian.
> >   Either solution should work, the second one is slightly more
> > conservative.
> > 
> 
> Returning to my original query, which was that xmlDocDumpMemory and
> xmlDocFormatDump don't work correctly for "UTF-16", and having looked more
> closely at the code for those functions, I think that my proposed changes
> have too broad a scope. I can see a different solution that can easily be
> applied to those two functions but I am confused by what seems to me to be
> an inconsistency, as follows:
> 
> A call to xmlFindCharEncodingHandler for "UTF-16" fails.
> A call to xmlParseCharEncoding for "UTF-16" followed by a call to
> xmlGetCharEncodingHandler returns the handler for XML_CHAR_ENCODING_UTF16LE.

  the problem is that you add some state information; if you can keep this in
the local variables of the serialization routine then that's fine.

> The two Dump functions call xmlParseCharEncoding followed by
> xmlFindCharEncodingHandler. I propose putting a call to
> xmlGetCharEncodingHandler (using the result from the call to
> xmlParseCharEncoding), and only calling Find if the Get fails. This is
> hopefully a safe change.

  Hmm, sounds better. Could you send a patch?
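
For reference, the lookup order proposed above can be modelled in a few lines. This is a self-contained sketch, not the libxml2 implementation: enc_id, enc_handler, parse_encoding, get_handler, find_handler and lookup_handler are all hypothetical stand-ins for xmlParseCharEncoding, xmlGetCharEncodingHandler and xmlFindCharEncodingHandler, and the tables only reproduce the behaviour reported in this thread (the by-name lookup fails for "UTF-16", while the parsed id resolves to the little-endian handler).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; none of these are libxml2 types. */
typedef enum { ENC_NONE, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE } enc_id;

typedef struct {
    const char *name;
} enc_handler;

/* Registered handlers. "ISO-8859-2" is reachable by name only. */
static const enc_handler handlers[] = {
    { "UTF-8" }, { "UTF-16LE" }, { "UTF-16BE" }, { "ISO-8859-2" }
};

/* Models xmlParseCharEncoding: plain "UTF-16" maps to the LE id,
 * as reported in the thread. */
enc_id parse_encoding(const char *name)
{
    if (name == NULL)
        return ENC_NONE;
    if (strcmp(name, "UTF-8") == 0)
        return ENC_UTF8;
    if (strcmp(name, "UTF-16") == 0 || strcmp(name, "UTF-16LE") == 0)
        return ENC_UTF16LE;
    if (strcmp(name, "UTF-16BE") == 0)
        return ENC_UTF16BE;
    return ENC_NONE;
}

/* Models xmlGetCharEncodingHandler: lookup by parsed id. */
const enc_handler *get_handler(enc_id id)
{
    switch (id) {
    case ENC_UTF8:    return &handlers[0];
    case ENC_UTF16LE: return &handlers[1];
    case ENC_UTF16BE: return &handlers[2];
    default:          return NULL;
    }
}

/* Models xmlFindCharEncodingHandler: exact-name lookup, so "UTF-16" fails. */
const enc_handler *find_handler(const char *name)
{
    size_t i;

    if (name == NULL)
        return NULL;
    for (i = 0; i < sizeof handlers / sizeof handlers[0]; i++)
        if (strcmp(handlers[i].name, name) == 0)
            return &handlers[i];
    return NULL;
}

/* Proposed order: try Get on the parsed id first, call Find only if Get fails. */
const enc_handler *lookup_handler(const char *name)
{
    const enc_handler *h = get_handler(parse_encoding(name));

    if (h == NULL)
        h = find_handler(name);
    return h;
}
```

In this model, lookup_handler("UTF-16") yields the UTF-16LE handler where find_handler("UTF-16") alone yields NULL, and names unknown to the parse step still reach the by-name lookup, so existing behaviour for other encodings is unchanged.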

    thanks,

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


