Re: [xml] UTF-8 validation



On Fri, Oct 05, 2007 at 04:10:56PM -0700, Norbert Lindenberg wrote:
Hi there,

Can you tell me whether libxml2 does complete validation of UTF-8  
when input is provided in this character encoding? By complete  
validation I mean:

- Verifying that each character is represented by a byte sequence  
that matches one of the patterns described in section 3 of RFC 3629.

- Verifying that each character is represented by the shortest
possible byte sequence (ruling out, for example, the use of 0xC0 0x80
for U+0000).

- Verifying that supplementary characters are represented by a 4-byte  
sequence, not by a pair of surrogate characters.

- Verifying that illegal code points, such as the not-a-character  
characters, U+FFFE, U+FFFF, etc., do not occur.
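
For illustration, the four checks listed above could be sketched in C
roughly as follows. This is not libxml2 code; the function names
utf8_is_valid and utf8_is_noncharacter are hypothetical and exist only
for this example. Note that RFC 3629 itself does not forbid
noncharacters such as U+FFFE, so that check is kept as a separate
helper that a caller may or may not apply.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if cp is a Unicode noncharacter
 * (U+FDD0..U+FDEF, or any code point ending in FFFE or FFFF). */
int utf8_is_noncharacter(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8
 * in the sense of RFC 3629 section 3, otherwise 0. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        uint32_t cp;   /* decoded code point */
        uint32_t min;  /* smallest code point allowed for this length */
        size_t n;      /* number of continuation bytes expected */
        size_t k;

        if (b < 0x80) {                 /* single byte, U+0000..U+007F */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) { /* 2-byte sequence */
            cp = b & 0x1F; n = 1; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) { /* 3-byte sequence */
            cp = b & 0x0F; n = 2; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) { /* 4-byte sequence */
            cp = b & 0x07; n = 3; min = 0x10000;
        } else {
            return 0;  /* stray continuation byte, or 0xF8..0xFF */
        }

        if (len - i <= n)
            return 0;  /* sequence is truncated */

        for (k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;  /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(buf[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;  /* overlong form, e.g. 0xC0 0x80 for U+0000 */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;  /* UTF-8 encoded surrogate */
        if (cp > 0x10FFFF)
            return 0;  /* beyond the Unicode code space */

        i += n + 1;
    }
    return 1;
}
```

With this sketch, the bytes 0xC0 0x80 and an encoded surrogate such as
0xED 0xA0 0x80 are rejected, while 0xEF 0xBF 0xBF (U+FFFF) is accepted
as well-formed UTF-8 and only flagged by utf8_is_noncharacter.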

Bug report 305333 implies that some of this validation occurs, but  
the references to the obsolete RFC 2044 in the documentation worry me  
a bit.

  libxml2 checks UTF-8 sequences when parsing documents. It does not
perform checks in the APIs used to modify or create documents; xmlChar *
strings passed to them are assumed to be correct UTF-8.
  The checks themselves are based on the XML character ranges,
 see http://www.w3.org/TR/REC-xml/#NT-Char
This ensures that U+0000 or surrogates, for example, generate
fatal errors if encountered.
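
  As a rough illustration of what a range check against the Char
production amounts to, here is a minimal standalone sketch in C. The
function name xml10_is_char is hypothetical and this is not the actual
libxml2 code; the ranges are those of the XML 1.0 Char production at
the URL above.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if cp matches the XML 1.0 production
 *   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
 *          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * and 0 otherwise. */
int xml10_is_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)      /* excludes U+0000 and most C0 controls */
        || (cp >= 0xE000 && cp <= 0xFFFD)    /* excludes surrogates, U+FFFE, U+FFFF */
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

  A check of this kind rejects U+0000, the surrogate range, U+FFFE and
U+FFFF, but it still accepts other noncharacters such as U+FDD0 or
U+1FFFE. It also works on decoded code points, so whether overlong
byte sequences are rejected depends on the decoder in front of it, not
on the range check.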
  If my answer sounds incomplete to you, could you explain your
concerns in terms of the XML character range framework?

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


