[xml] On processing large documents
- From: Daniel Veillard <veillard redhat com>
- To: xml gnome org
- Subject: [xml] On processing large documents
- Date: Thu, 17 Apr 2003 10:56:58 -0400
FYI, the version of libxml2 in CVS has what I consider serious
functional improvements, especially when manipulating large documents.
They are built on top of the xmlReader interface:
- First, I augmented the reader API so that it can expand a given node's subtree
and return it as an ordinary xmlNodePtr. This makes it possible to scan large
database files for pieces of information and, once found, to extract and format
that information using the usual interfaces of the tree and XPath APIs:
http://xmlsoft.org/xmlreader.html#Mixing
- Second, I glued together the RelaxNG validation code and the reader API,
which allows thorough validation of large documents that cannot fit
into memory. Basically, if the RelaxNG content model of an element can be
expressed as a deterministic regexp, then the validator can stream over those
elements:
http://xmlsoft.org/xmlreader.html#L1142
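The scan-then-expand pattern from the first item can be sketched as follows. This is only an illustration using Python's standard xml.etree.ElementTree.iterparse as a stand-in. It is not the libxml2 reader API, and the names find_rows and SAMPLE are hypothetical. The idea is the same: stream through the file, materialize one record as an ordinary tree, query it with the usual tree interface, then throw it away.

```python
import io
import xml.etree.ElementTree as ET

# Two records in the shape of the db10000.xml sample (illustrative data).
SAMPLE = b"""<?xml version="1.0"?>
<table>
  <row><id>0000</id><firstname>Al</firstname><lastname>Aranow</lastname></row>
  <row><id>0001</id><firstname>Bob</firstname><lastname>Aranow</lastname></row>
</table>
"""

def find_rows(source, field, value):
    """Stream over <row> records and yield each matching one as a dict."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        # The complete <row> subtree is now an ordinary element, so the
        # normal tree queries can be used on it.
        if elem.findtext(field) == value:
            yield {child.tag: child.text for child in elem}
        # Detach processed rows from the root so they can be freed.
        root.clear()

if __name__ == "__main__":
    for row in find_rows(io.BytesIO(SAMPLE), "firstname", "Bob"):
        print(row)
```

Only the record currently being examined is kept as a tree, so memory use is bounded by the size of one record rather than by the size of the file.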
This streaming validation is available in the xmllint tool through the
combination of the --stream and --relaxng flags. Here is a simple example:
paphio:~/XML -> /usr/bin/xmllint --stream --timing --noout --relaxng db.rng ~/XSLT/tests/XSLTMark/db10000.xml
Compiling the schemas took 1 ms
Parsing and validating took 726 ms
/u/veillard/XSLT/tests/XSLTMark/db10000.xml validates
paphio:~/XML -> cat .memdump
04:13:32 PM
MEMORY ALLOCATED : 0, MAX was 89942
BLOCK NUMBER SIZE TYPE
paphio:~/XML -> ls -l /u/veillard/XSLT/tests/XSLTMark/db10000.xml
-rw-rw-r-- 1 veillard www 2009240 Mar 26 00:27 /u/veillard/XSLT/tests/XSLTMark/db10000.xml
paphio:~/XML -> cat db.rng
<element name="table" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
    <element name="row">
      <element name="id">
        <data type="integer"/>
      </element>
      <element name="firstname">
        <data type="NCName"/>
      </element>
      <element name="lastname">
        <data type="NCName"/>
      </element>
      <element name="street">
        <text/>
      </element>
      <element name="city">
        <data type="NCName"/>
      </element>
      <element name="state">
        <data type="string">
          <param name="length">2</param>
        </data>
      </element>
      <element name="zip">
        <data type="integer"/>
      </element>
    </element>
  </oneOrMore>
</element>
paphio:~/XML -> more /u/veillard/XSLT/tests/XSLTMark/db10000.xml
<?xml version="1.0"?>
<table>
  <row>
    <id>0000</id>
    <firstname>Al</firstname>
    <lastname>Aranow</lastname>
    <street>1 Any St.</street>
    <city>Anytown</city>
    <state>AL</state>
    <zip>22000</zip>
  </row>
  <row>
    <id>0001</id>
    <firstname>Bob</firstname>
    <lastname>Aranow</lastname>
    <street>2 Any St.</street>
    <city>Anytown</city>
    <state>AL</state>
    <zip>22000</zip>
  </row>
[...]
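To illustrate why a content model like the one in db.rng can be checked in a streaming fashion, here is a rough sketch in Python's standard library. It is not how libxml2 implements RelaxNG validation. It hard-codes the row content model as a regular expression over child element names, and it approximates the datatypes (integer as an optional sign plus digits, a two-character state; the NCName checks are skipped). All names in it (ROW_MODEL, FIELD_CHECKS, row_is_valid, validate_stream) are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET

# The <row> content model from db.rng as a regexp over child element names.
ROW_MODEL = re.compile(r"id firstname lastname street city state zip")

# Simplified stand-ins for the XML Schema datatypes used in db.rng.
FIELD_CHECKS = {
    "id": re.compile(r"\s*[+-]?\d+\s*"),
    "zip": re.compile(r"\s*[+-]?\d+\s*"),
    "state": re.compile(r".{2}", re.DOTALL),
}

def row_is_valid(row):
    """Check one fully parsed <row> against the model above."""
    if row.tag != "row":
        return False
    names = " ".join(child.tag for child in row)
    if not ROW_MODEL.fullmatch(names):
        return False
    for child in row:
        check = FIELD_CHECKS.get(child.tag)
        if check is not None and not check.fullmatch(child.text or ""):
            return False
    return True

def validate_stream(source):
    """Validate <table> one row at a time, discarding each row afterwards."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # start of the root element
    if root.tag != "table":
        return False
    depth, rows = 1, 0
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        if depth == 2:  # a direct child of <table> has just closed
            if not row_is_valid(elem):
                return False
            rows += 1
            root.clear()  # forget the rows validated so far
        depth -= 1
    return rows >= 1  # <oneOrMore> requires at least one row

GOOD = (b"<table><row><id>0000</id><firstname>Al</firstname>"
        b"<lastname>Aranow</lastname><street>1 Any St.</street>"
        b"<city>Anytown</city><state>AL</state><zip>22000</zip></row></table>")

if __name__ == "__main__":
    print(validate_stream(io.BytesIO(GOOD)))
```

Because each row is checked and then dropped, the memory needed does not grow with the number of rows, which is the property the xmllint run above demonstrates on the real validator.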
libxml2 is now able to validate the 10,000 records using RelaxNG and
XML Schema datatypes while requiring constant memory (a little under 100 KB)
and at a speed of approximately 3 MBytes/s.
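These figures can be cross-checked against the transcript above. The short calculation below uses only numbers shown there (the file size from ls -l, the "Parsing and validating" time, and the MAX value from .memdump). The variable names are illustrative.

```python
size_bytes = 2009240      # db10000.xml size reported by ls -l
elapsed_s = 0.726         # "Parsing and validating took 726 ms"
peak_alloc_bytes = 89942  # "MAX was 89942" in .memdump

throughput = size_bytes / elapsed_s  # bytes per second

print(f"throughput: {throughput / 1e6:.2f} MB/s")
print(f"peak memory: {peak_alloc_bytes / 1000:.1f} KB")
```

The throughput comes out a little below 3 MB/s and the peak allocation a little under 100 KB, consistent with the rounded figures quoted.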
I expect to ship libxml2-2.5.7 with those enhancements once I have fixed
some of the bugs waiting on bugzilla.
Daniel
--
Daniel Veillard | Red Hat Network https://rhn.redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/