Re: [xml] Bug in the regular expression parser; character range with escaped characters not working



On Sun, Jul 14, 2013 at 10:15:17AM +0800, Daniel Veillard wrote:
[apparently my mail bounced off the list, sending again !]

On Thu, Jul 11, 2013 at 06:06:33PM +0200, Dominik Skanda wrote:
Hi,

I think there is a bug in the regular expression parser for character
ranges, i.e. the character class [\]-a] with an escaped character (here
\]) is not recognized by the libXML regular expression parser.

E.g., the following simple type does not work:

<xs:simpleType name="LimitedString">
        <xs:restriction base="xs:string">
                <xs:pattern value="[\]-a]*"/>
        </xs:restriction>
</xs:simpleType>

but this one does:

<xs:simpleType name="LimitedString">
        <xs:restriction base="xs:string">
                <xs:pattern value="[Z-a]*"/>
        </xs:restriction>
</xs:simpleType>

If one looks at the ASCII table:

 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~ 

one sees that Z is the last character before ] that does not have to be
escaped in a character range definition. This indicates that there is a
bug.

  I think it comes from the specification :-)

http://www.w3.org/TR/xmlschema-2/#nt-SingleCharEsc

  [24]            SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]

if you look, in the specification productions, the [ and ] are used
consistently to express a choice of characters, e.g.

  [21]            XmlChar    ::=      [^\#x2D#x5B#x5D]

*but* in production 24 they meant to add [ and ] to that set of
characters and forgot to add them (escaped as \[ and \] of course).

The fact that they should be allowed is actually shown by the last two
examples in the table below production [24].

 Someone (Liam ?) may be interested to check if there is an errata for
 it :)

  Whoops, I'm the one confused now. #x5B and #x5D do refer to [ and ]

 and the libxml2 code is driven directly by the productions from the spec;
I only thought I had forgotten to add them.

  And the patch clearly does not fix anything, because (cur == 0x5B) ||
  (cur == 0x5D) are already explicitly handled in the condition.

I will have to really investigate :-)
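
The point about 0x5B and 0x5D can be checked in isolation. The following is
only a sketch: is_single_char_esc() and check_brackets_already_accepted() are
hypothetical helpers written for this illustration, not libxml2 functions. The
first row of the condition (n, r, t, backslash) is taken from production [24];
the remaining rows copy the condition visible in the diff context. It assumes
an ASCII-compatible execution character set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper for illustration only; not part of libxml2.
 * Mirrors production [24] SingleCharEsc as a plain character test. */
bool is_single_char_esc(int cur) {
    return (cur == 'n') || (cur == 'r') || (cur == 't') || (cur == '\\') ||
           (cur == '|') || (cur == '.') || (cur == '?') || (cur == '*') ||
           (cur == '+') || (cur == '(') || (cur == ')') || (cur == '{') ||
           (cur == '}') || (cur == 0x2D) || (cur == 0x5B) || (cur == 0x5D) ||
           (cur == 0x5E);
}

/* Hypothetical self-check: with ASCII, '[' and ']' are the same values as
 * 0x5B and 0x5D, so the unpatched condition already accepts both. */
void check_brackets_already_accepted(void) {
    assert('[' == 0x5B);
    assert(']' == 0x5D);
    assert(is_single_char_esc('['));
    assert(is_single_char_esc(']'));
}
```

If these assertions hold, adding (cur == '[') || (cur == ']') to the condition
cannot change its result, so the cause of the failure with [\]-a] would have
to lie elsewhere in the parser.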

Daniel

I have prepared some example files to demonstrate the shortcoming:

test_not_validating.xsd is using the first simple type definition and
test_validating.xsd the second one, respectively.

You can try to validate test.xml with: 

xmllint --noout --schema test_not_validating.xsd ./test.xml

and

xmllint --noout --schema test_validating.xsd ./test.xml

respectively.

It would be nice if anyone could confirm the BUG and possibly solve it.

Regards,

 Could you test the patch provided as attachment ?

   thanks,

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

diff --git a/xmlregexp.c b/xmlregexp.c
index 1f9911c..2f6d155 100644
--- a/xmlregexp.c
+++ b/xmlregexp.c
@@ -4883,7 +4883,7 @@ xmlFAParseCharClassEsc(xmlRegParserCtxtPtr ctxt) {
      (cur == '|') || (cur == '.') || (cur == '?') || (cur == '*') ||
      (cur == '+') || (cur == '(') || (cur == ')') || (cur == '{') ||
      (cur == '}') || (cur == 0x2D) || (cur == 0x5B) || (cur == 0x5D) ||
-     (cur == 0x5E)) {
+     (cur == 0x5E) || (cur == '[') || (cur == ']')) {
      if (ctxt->atom == NULL) {
          ctxt->atom = xmlRegNewAtom(ctxt, XML_REGEXP_CHARVAL);
          if (ctxt->atom != NULL) {


