RE: [xml] Perl module XML::LibXML not encoding UTF-8 properly [SOLUTION]



Daniel Veillard wrote:
UTF-8 makes certain assertions about how multi-byte characters are
represented.  This code change doesn't check all of those
assumptions, but it does ensure that all the non-first bytes have
their high bits set correctly.  This is likely to catch similar
errors, at least for Latin characters.  If you are feeling ambitious,
feel free to also check that code points are encoded in the fewest
bytes possible.  This patch is untested, so I would prefer that a
developer more familiar with the libxml2 library give it a more
thorough once-over.

  The problem is that you add this check in only one API. I am not
sure it makes sense to do this on one entry point and not all the
others. I am also not sure it makes sense to add the checking to all
tree APIs; this could be extremely costly at runtime.

Yes, I was expecting such a reaction, but I felt justified putting the
check where I did because there was already a correctness check there. I
simply refined it a bit.  Whether this type of correctness check should
be enforced on all entry points is certainly an efficiency concern that
should be considered by libxml2's architects, but I simply wanted to
submit a code sample to demonstrate how this could be done.
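
For readers without the patch at hand, here is a rough sketch of the
kind of check described above. It is not the patch itself and not
libxml2 code: the function name utf8_continuation_ok is made up for
illustration, and it assumes a NUL-terminated byte string. It only
verifies that each lead byte is followed by the right number of
continuation bytes of the form 10xxxxxx; it does not reject overlong
encodings.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2.
 * Returns 1 if every multi-byte sequence in the NUL-terminated string
 * has the expected number of continuation bytes, each of the form
 * 10xxxxxx; returns 0 otherwise.
 * Overlong encodings and surrogate code points are NOT rejected. */
static int utf8_continuation_ok(const unsigned char *s)
{
    while (*s != 0) {
        unsigned char c = *s++;
        int trailing;

        if (c < 0x80)
            trailing = 0;               /* 0xxxxxxx: plain ASCII */
        else if ((c & 0xE0) == 0xC0)
            trailing = 1;               /* 110xxxxx: 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            trailing = 2;               /* 1110xxxx: 3-byte sequence */
        else if ((c & 0xF8) == 0xF0)
            trailing = 3;               /* 11110xxx: 4-byte sequence */
        else
            return 0;                   /* stray continuation or bad lead byte */

        while (trailing-- > 0) {
            /* A NUL byte also fails this test, so a truncated
             * sequence at the end of the string is rejected. */
            if ((*s & 0xC0) != 0x80)
                return 0;
            s++;
        }
    }
    return 1;
}
```

With this sketch, a lone Latin-1 byte such as 0xE9 is rejected because
it looks like the start of a 3-byte sequence with no continuation
bytes after it, while the proper UTF-8 form 0xC3 0xA9 passes.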

Thanks,

-Loren


