[xml] dtd and relaxng performance

Hash: SHA1

Hi Daniel, All,

I realized that as my XML file grows, the time needed by xmllint to
validate it against a (fixed) DTD schema increases in a much steeper
proportion than O(n) which is what I would expect from an
automata-based implementation (rather O(n^2)). I did the same test
with RelaxNG (using trang to convert dtd to rng). The result was
little better, but from the complexity proportion nearly the
same. Surprisingly, James Clark's Java tool jing did the relaxng
validation much faster on large documents and in liner proportion to
the size of a document.

The following listing compares the times needed by xmllint to validate 
6 documents, where n'th document has approx n hundred thousands nodes.

Columns with * contain a ratio of the preceding column to the
value in that column for the 1st file, so you see, that 6.xml has
5.6 times as many nodes as 1.xml but the validation takes 111 times
longer. The situation is slightly better for rng. 

file   elems   nodes    * parse    *    dtd     *   rng    *  jing    *

1.xml  50903  131091  1.0   295  1.0    528   1.0  1253  1.0  1087  1.0
2.xml  95442  246427  1.9   527  1.8   3291   6.2  4515  3.6  1567  1.4
3.xml 140034  361728  2.8   761  2.6  10957  20.8 13108 10.5  2122  2.0
4.xml 187177  483129  3.7  1007  3.4  22060  41.8 27174 21.7  2609  2.4
5.xml 234940  606493  4.6  1258  4.3  38247  72.4 43041 34.4  3117  2.9
6.xml 282620  729674  5.6  1842  6.2  58986 111.7 63793 50.9  3714  3.4

The files used are "dictionary-like" XML files taken from a real
application. It is a research-related data, so I replaced most usable
values in it but preserved the structure. If you wish, you can get the
files and the DTD from http://pajas.matfyz.cz/dtdvalid-test.zip

For DTD validation I used xmllint --timing --notime with either
- --dtdvalid or --relaxng and put the parsing and validating times to
parse, dtd and rng columns. Parsing and compilation of the DTD was so
quick that it's not worth mentioning.

I measured jing with simple 'time java -jar jing.jar', so it includes
some time to load java VM, etc. That's probably why the ratio is even
better then O(n).

My question is if you are aware of something in the DTD/RelaxNG
validation code that could make it scale O(n^2) or so.

/ Petr 
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]