Re: [xml] external DTD validation of large XML's



Jon <jon forums gmail com> writes:

I am new to the libxml2 API and am looking to use it to create a simple tool that can validate large XML 
files via external DTDs, and eventually XSDs. I've successfully built libxml2 on win7 using a mingw 
toolchain and plan to build the tool as a statically linked exe for Windows.

I've found http://mail.gnome.org/archives/xml/2004-July/msg00055.html and 
http://mail.gnome.org/archives/xml/2009-November/msg00039.html and would appreciate pointers in the right 
direction, either sections in xmllint.c to review or ideas on how to use the Reader api to do this.

XMLStarlet does this too, maybe it will be useful for you:
http://xmlstar.git.sourceforge.net/git/gitweb.cgi?p=xmlstar/xmlstar;a=blob;f=src/xml_validate.c;hb=HEAD


I'm more concerned about memory usage and speed and have no preference between using the SAX2 or Reader 
apis.


After skimming xmllint.c I want to confirm that my understanding of the following is correct.

1) The only way to use xmllint to validate against an external DTD file is

   xmllint --dtdvalid luddite.dtd file1.xml file2.xml ...

and the following will not work, as neither `testSAX()` nor `streamFile()` validates against an external DTD 
file:

   xmllint --sax --dtdvalid luddite.dtd file1.xml ...
   xmllint --stream --dtdvalid luddite.dtd file1.xml ...

Yes, as a consequence of 4).


2) Does the following mean that when using libxml2's SAX functionality a document representation of the 
entire input XML is created in memory?

   http://git.gnome.org/browse/libxml2/tree/xmllint.c#n1711

No, it depends on the handler in use. The code you reference there is
checking for unexpected creation of a DOM tree: unexpected because neither
the emptySAXHandler nor the debugSAXHandler creates a DOM tree.


3) As of v2.7.8 and using the Reader API, there is no way to validate using an external DTD similar to

   http://git.gnome.org/browse/libxml2/tree/xmllint.c#n1881
   http://git.gnome.org/browse/libxml2/tree/xmllint.c#n1896


Yes, see https://bugzilla.gnome.org/show_bug.cgi?id=169375


4) As of v2.7.8 and using the Reader API, there is no way to a posteriori validate using an external DTD 
similar to the following. A posteriori DTD validation is only available after parsing a full DOM into memory.

   http://git.gnome.org/browse/libxml2/tree/xmllint.c#n2759

Yes, which in addition to the memory usage also has the problem that the
DOM structure uses 2 bytes to hold line numbers, so error messages don't
have the right line number after 65535.

https://bugzilla.gnome.org/show_bug.cgi?id=143739
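
To make the line-number limit concrete, here is a rough sketch of what a
2-byte line field does to large line numbers. The struct and function names
are made up for illustration and are not libxml2 code, and it assumes the
value is clamped at the maximum rather than wrapping around, which the
messages above do not spell out.

```c
#include <stdint.h>

/* Hypothetical stand-in for a DOM node that keeps its line in 2 bytes. */
struct node_sketch {
    uint16_t line;
};

/* Store a parser line number into the 16-bit field.
 * Assumption: values that do not fit are clamped to 65535. */
uint16_t store_line(struct node_sketch *node, unsigned long parser_line)
{
    if (parser_line < UINT16_MAX)
        node->line = (uint16_t)parser_line;
    else
        node->line = UINT16_MAX;   /* every later line reports 65535 */
    return node->line;
}
```

Under that assumption, an error on line 70000 of a large file would be
reported as line 65535, just like an error on any other line past the limit.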



If the above are correct, what do you suggest to people who want to use libxml2 to validate large XMLs with 
external DTD files?  Re-write the input XML file?

Pretty much, yeah. It's not so bad, just a tiny DOCTYPE referring to the DTD.
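
A minimal sketch of such a rewrite in plain C, with no libxml2 calls, might
look like the following. The function names are hypothetical, and so is the
root element name "luddite" in the usage note (it has to match the document's
actual root element). The sketch assumes the input is in an ASCII-compatible
encoding without a byte order mark, that any XML declaration fits in the
first 4 KB, and that the file has no DOCTYPE of its own yet.

```c
#include <stdio.h>
#include <string.h>

/* Format <!DOCTYPE root SYSTEM "dtd_path"> plus a newline into buf.
 * Returns the length written, or -1 if it does not fit or the path
 * contains a double quote (which would end the quoted literal early). */
int format_doctype(char *buf, size_t size, const char *root, const char *dtd_path)
{
    if (buf == NULL || size == 0 || root == NULL || dtd_path == NULL)
        return -1;
    if (strchr(dtd_path, '"') != NULL)
        return -1;
    int n = snprintf(buf, size, "<!DOCTYPE %s SYSTEM \"%s\">\n", root, dtd_path);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}

/* Stream `in` to `out`, inserting `doctype` right after the XML
 * declaration if there is one, otherwise at the very start.
 * Only one small buffer is held in memory. Returns 0 on success. */
int copy_with_doctype(FILE *in, FILE *out, const char *doctype)
{
    char buf[4096];
    size_t n = fread(buf, 1, sizeof buf - 1, in);
    size_t head = 0;

    buf[n] = '\0';
    if (strncmp(buf, "<?xml", 5) == 0) {
        const char *end = strstr(buf, "?>");
        if (end == NULL)
            return -1;              /* declaration not closed in first chunk */
        head = (size_t)(end - buf) + 2;
    }
    if (fwrite(buf, 1, head, out) != head)
        return -1;
    if (head > 0 && fputc('\n', out) == EOF)
        return -1;
    if (fputs(doctype, out) == EOF)
        return -1;
    if (fwrite(buf + head, 1, n - head, out) != n - head)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

For example, format_doctype(d, sizeof d, "luddite", "luddite.dtd") followed by
copy_with_doctype(in, out, d) produces a copy that names its own DTD. As far
as I can tell from streamFile(), that copy can then be checked with
xmllint --stream --valid, without building a DOM.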

Noam


