[xml] dtd and relaxng performance
- From: Petr Pajas <pajas ufal ms mff cuni cz>
- To: libxml2 <xml gnome org>
- Subject: [xml] dtd and relaxng performance
- Date: Tue, 17 Feb 2004 16:09:38 +0100
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi Daniel, All,
I realized that as my XML file grows, the time needed by xmllint to
validate it against a (fixed) DTD schema increases in a much steeper
proportion than O(n) which is what I would expect from an
automata-based implementation (rather O(n^2)). I did the same test
with RelaxNG (using trang to convert dtd to rng). The result was
little better, but from the complexity proportion nearly the
same. Surprisingly, James Clark's Java tool jing did the relaxng
validation much faster on large documents and in liner proportion to
the size of a document.
The following listing compares the times needed by xmllint to validate
6 documents, where n'th document has approx n hundred thousands nodes.
Columns with * contain a ratio of the preceding column to the
value in that column for the 1st file, so you see, that 6.xml has
5.6 times as many nodes as 1.xml but the validation takes 111 times
longer. The situation is slightly better for rng.
file elems nodes * parse * dtd * rng * jing *
1.xml 50903 131091 1.0 295 1.0 528 1.0 1253 1.0 1087 1.0
2.xml 95442 246427 1.9 527 1.8 3291 6.2 4515 3.6 1567 1.4
3.xml 140034 361728 2.8 761 2.6 10957 20.8 13108 10.5 2122 2.0
4.xml 187177 483129 3.7 1007 3.4 22060 41.8 27174 21.7 2609 2.4
5.xml 234940 606493 4.6 1258 4.3 38247 72.4 43041 34.4 3117 2.9
6.xml 282620 729674 5.6 1842 6.2 58986 111.7 63793 50.9 3714 3.4
The files used are "dictionary-like" XML files taken from a real
application. It is a research-related data, so I replaced most usable
values in it but preserved the structure. If you wish, you can get the
files and the DTD from http://pajas.matfyz.cz/dtdvalid-test.zip
For DTD validation I used xmllint --timing --notime with either
- --dtdvalid or --relaxng and put the parsing and validating times to
parse, dtd and rng columns. Parsing and compilation of the DTD was so
quick that it's not worth mentioning.
I measured jing with simple 'time java -jar jing.jar', so it includes
some time to load java VM, etc. That's probably why the ratio is even
better then O(n).
My question is if you are aware of something in the DTD/RelaxNG
validation code that could make it scale O(n^2) or so.
/ Petr
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>
iD8DBQFAMi6yQfxdLDi03+IRArflAJ4lElDkVi16Y4DcrGgUE5sd579JjQCfSXyx
uTF3wdhGgvSXgETF38bCTs4=
=Oyef
-----END PGP SIGNATURE-----
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]