Re: [xml] xmllint reports non-determinist content model for schema



On Thu, Aug 23, 2012 at 09:03:44PM +0800, Daniel Veillard wrote:
On Wed, Aug 22, 2012 at 10:40:20PM +0200, Johan Corveleyn wrote:
[...]
Is there some (preferable concise) reading about basic concepts like
how xml schemas automata work, how these are built, etc? What should
the automata in this case look like (in terms of the data structures
inside the code)? How can I see that the automata is non-determinist,
and what would a determinist automata look like, ... ?

    Johan,

  sorry for the slow answer,

I will look at this, promise, but all that code is rooted in automata
theory, there is quite some litterature about it, and well unless you
go though the process of learning that code and possibly the theory
behind it, it's not something that can be hacked trivially. Basically
I run this in debug mode, look at the automata generated, there is pro
bably an error there and then this need fixing and a lot of checking.
To be honnest that one of the hardest part of the library to change
with the core of the parser.
  Theory is usually learnt though the 'Dragon Book', by Aho and Al.
and the code is actually more complex since using extended automata
operations not found in basic automata ...

    If you have time looking at this may be interesting, but
it's really not trivial at least more complex than other parts of the
code :-)

  Okay simce people may get curious:
    1/ I fixed the issue see the tiny patch attached
    2/ following is an explanation of what I did to debug the issue
       in case people may want to fix other issues there of if they are
       just curious how a developper may spend half an hour of his time :-)

Steps:
0/ make minimal xsd and xml the one given were quite good
1/ think about the pattern that is being expressed, it's
   rule * (specialRule rule *) ?
   that should be determinist actually so yes that looks like a bug
2/ run xmllint --schema tst.xsd tst.xml
   clearly libxml2 regexp think it's not determinist, and there is
   a construction problem as hinted before
3/ edit xmlregexp.c comment off the 4 debugging macros at the top
   recompile xmllint and run it on the example
4/ start to study the nice output given:

    -------------------------------
    thinkpad:~/XML -> xmllint --schema tst.xsd tst.xml
    Add trans from 0 to 1  atom: string once 'rule'
    Add trans from 1 to 1  atom: string once 'rule'
    Add trans from 0 to 1 epsilon transition
    Add trans from 1 to 2  atom: string once 'specialRule'
    Add trans from 2 to 3  atom: string once 'rule'
    Add trans from 3 to 3  atom: string once 'rule'
    Add trans from 2 to 3 epsilon transition
    Add trans from 1 to 3 epsilon transition
    Found epsilon trans 1 from 2 to 3
    xmlFAReduceEpsilonTransitions(2, 3)
    State 3 is final, so 2 becomes final
    Add trans from 2 to 3  atom: string once 'rule'
    Found epsilon trans 2 from 1 to 3
    xmlFAReduceEpsilonTransitions(1, 3)
    State 3 is final, so 1 becomes final
    Add trans from 1 to 3  atom: string once 'rule'
    Found epsilon trans 1 from 0 to 1
    xmlFAReduceEpsilonTransitions(0, 1)
    State 1 is final, so 0 becomes final
    Add trans from 0 to 1  atom: string once 'rule'
    Add trans from 0 to 2  atom: string once 'specialRule'
    Add trans from 0 to 3  atom: string once 'rule'
    xmlFAComputesDeterminism
     ctxt: '(null)'
    5 atoms:
     00  atom: string once 'rule'
     01  atom: string once 'rule'
     02  atom: string once 'specialRule'
     03  atom: string once 'rule'
     04  atom: string once 'rule'
    4 states: start: 0
     state: FINAL 0, 5 transitions:
      trans: atom 0, to 1
      trans: removed
      trans: atom 1, to 1
      trans: atom 2, to 2
      trans: atom 4, to 3
     state: FINAL 1, 4 transitions:
      trans: atom 1, to 1
      trans: atom 2, to 2
      trans: removed
      trans: atom 4, to 3
     state: FINAL 2, 3 transitions:
      trans: atom 3, to 3
      trans: removed
      trans: atom 4, to 3
     state: FINAL 3, 1 transitions:
      trans: atom 4, to 3
    0 counters:
    Final: 4 states
    Final: 2 atoms
    Indet: state 0 trans 4, atom 4 to 3 : 0 to 3
           previous to is 2
    tst.xsd:4: element complexType: Schemas parser error : local complex
    type: The content model is not determinist.
    ----------------

5/ make the graph of the automata as it is created and then transformed
   by epsilon rules reductions, c.f. the picture P1020568.JPG attached
   something is wrong when adding the 3rd epsilon transition from state
   1 to state 3, as the indeterminism is created there, suddenly a rule
   while in state 0 or 1 could be consumed to go to 1 or to 3.

6/ there is a need to separate the state from the end of the sequence
   from state 3 to avoid the problem

7/ make the patch adding an extra state at end of sequences in that
   configuration and connected to the last state of the sequence by
   an espilon

8/ notice we did just that in another sequence case with a nice comment
   copy the comment too

9/ rebuild xmllint, and rerun, watch the new automata being generated
   and the output, looks it fixes the issue, good !

10/ recomment the verbose debug macros, rebuild everything, watch 'make check'
    with a bit of anxiety

11/ run runsuite for good measure, no divergence, so one can run
    30000+ tests without hitting such a basic regexp construction
    error. Actually that had been around for nearly a decade without
    anybody noticing, isn't code amazing !

12/ Prepare to push to git

13/ prepare a mail to the list with explanations.

14/ commit, push and send mail

Daniel

--
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

Attachment: P1020568.JPG
Description: JPEG image

Attachment: tmp
Description: Text document



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]