On Thu, Aug 23, 2012 at 09:03:44PM +0800, Daniel Veillard wrote:
On Wed, Aug 22, 2012 at 10:40:20PM +0200, Johan Corveleyn wrote:
[...]
Is there some (preferable concise) reading about basic concepts like how xml schemas automata work, how these are built, etc? What should the automata in this case look like (in terms of the data structures inside the code)? How can I see that the automata is non-determinist, and what would a determinist automata look like, ... ?Johan, sorry for the slow answer, I will look at this, promise, but all that code is rooted in automata theory, there is quite some litterature about it, and well unless you go though the process of learning that code and possibly the theory behind it, it's not something that can be hacked trivially. Basically I run this in debug mode, look at the automata generated, there is pro bably an error there and then this need fixing and a lot of checking. To be honnest that one of the hardest part of the library to change with the core of the parser. Theory is usually learnt though the 'Dragon Book', by Aho and Al. and the code is actually more complex since using extended automata operations not found in basic automata ... If you have time looking at this may be interesting, but it's really not trivial at least more complex than other parts of the code :-)
Okay simce people may get curious: 1/ I fixed the issue see the tiny patch attached 2/ following is an explanation of what I did to debug the issue in case people may want to fix other issues there of if they are just curious how a developper may spend half an hour of his time :-) Steps: 0/ make minimal xsd and xml the one given were quite good 1/ think about the pattern that is being expressed, it's rule * (specialRule rule *) ? that should be determinist actually so yes that looks like a bug 2/ run xmllint --schema tst.xsd tst.xml clearly libxml2 regexp think it's not determinist, and there is a construction problem as hinted before 3/ edit xmlregexp.c comment off the 4 debugging macros at the top recompile xmllint and run it on the example 4/ start to study the nice output given: ------------------------------- thinkpad:~/XML -> xmllint --schema tst.xsd tst.xml Add trans from 0 to 1 atom: string once 'rule' Add trans from 1 to 1 atom: string once 'rule' Add trans from 0 to 1 epsilon transition Add trans from 1 to 2 atom: string once 'specialRule' Add trans from 2 to 3 atom: string once 'rule' Add trans from 3 to 3 atom: string once 'rule' Add trans from 2 to 3 epsilon transition Add trans from 1 to 3 epsilon transition Found epsilon trans 1 from 2 to 3 xmlFAReduceEpsilonTransitions(2, 3) State 3 is final, so 2 becomes final Add trans from 2 to 3 atom: string once 'rule' Found epsilon trans 2 from 1 to 3 xmlFAReduceEpsilonTransitions(1, 3) State 3 is final, so 1 becomes final Add trans from 1 to 3 atom: string once 'rule' Found epsilon trans 1 from 0 to 1 xmlFAReduceEpsilonTransitions(0, 1) State 1 is final, so 0 becomes final Add trans from 0 to 1 atom: string once 'rule' Add trans from 0 to 2 atom: string once 'specialRule' Add trans from 0 to 3 atom: string once 'rule' xmlFAComputesDeterminism ctxt: '(null)' 5 atoms: 00 atom: string once 'rule' 01 atom: string once 'rule' 02 atom: string once 'specialRule' 03 atom: string once 'rule' 04 atom: string once 'rule' 4 states: start: 0 state: FINAL 0, 5 transitions: trans: atom 0, to 1 trans: removed trans: atom 1, to 1 trans: atom 2, to 2 trans: atom 4, to 3 state: FINAL 1, 4 transitions: trans: atom 1, to 1 trans: atom 2, to 2 trans: removed trans: atom 4, to 3 state: FINAL 2, 3 transitions: trans: atom 3, to 3 trans: removed trans: atom 4, to 3 state: FINAL 3, 1 transitions: trans: atom 4, to 3 0 counters: Final: 4 states Final: 2 atoms Indet: state 0 trans 4, atom 4 to 3 : 0 to 3 previous to is 2 tst.xsd:4: element complexType: Schemas parser error : local complex type: The content model is not determinist. ---------------- 5/ make the graph of the automata as it is created and then transformed by epsilon rules reductions, c.f. the picture P1020568.JPG attached something is wrong when adding the 3rd epsilon transition from state 1 to state 3, as the indeterminism is created there, suddenly a rule while in state 0 or 1 could be consumed to go to 1 or to 3. 6/ there is a need to separate the state from the end of the sequence from state 3 to avoid the problem 7/ make the patch adding an extra state at end of sequences in that configuration and connected to the last state of the sequence by an espilon 8/ notice we did just that in another sequence case with a nice comment copy the comment too 9/ rebuild xmllint, and rerun, watch the new automata being generated and the output, looks it fixes the issue, good ! 10/ recomment the verbose debug macros, rebuild everything, watch 'make check' with a bit of anxiety 11/ run runsuite for good measure, no divergence, so one can run 30000+ tests without hitting such a basic regexp construction error. Actually that had been around for nearly a decade without anybody noticing, isn't code amazing ! 12/ Prepare to push to git 13/ prepare a mail to the list with explanations. 14/ commit, push and send mail Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel veillard com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/
Attachment:
P1020568.JPG
Description: JPEG image
Attachment:
tmp
Description: Text document