[xml] On processing large documents



  FYI, the version of libxml2 in CVS has what I consider serious
functional improvements, especially when manipulating large documents.
They are built on top of the xmlReader interface:

- First, I augmented the reader API so that it can expand a given node's
  subtree and return a regular xmlNodePtr. This makes it possible to scan
  large database files for pieces of information and, once they are found,
  to extract and format that information using the usual interfaces of the
  tree and XPath APIs:
  http://xmlsoft.org/xmlreader.html#Mixing

- Second, I glued together the RelaxNG validation code and the reader API,
  which allows thorough validation of large documents that can't fit into
  memory. Basically, if the RelaxNG content model of an element can be
  expressed as a deterministic regexp, then the validator can stream over
  those elements:
  http://xmlsoft.org/xmlreader.html#L1142
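
The scan-then-expand pattern from the first point can be sketched as follows.
This is only an analogy written with Python's standard xml.etree.ElementTree
module, not the libxml2 reader API itself (there the corresponding call is
xmlTextReaderExpand). The sample data, the find_rows name and the choice of
<row>/<id> as the lookup key are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny in-memory stand-in for a large database file (made-up sample data).
SAMPLE = b"""<table>
  <row><id>0000</id><firstname>Al</firstname><city>Anytown</city></row>
  <row><id>0001</id><firstname>Bob</firstname><city>Anytown</city></row>
  <row><id>0002</id><firstname>Cy</firstname><city>Elsewhere</city></row>
</table>"""

def find_rows(source, wanted_id):
    """Stream through source, yielding each complete <row> whose <id> matches."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        # At the "end" event the row's subtree is fully built, so ordinary
        # tree-style lookups work on it.
        if elem.findtext("id") == wanted_id:
            yield {child.tag: child.text for child in elem}
        # Detach processed rows from the root so memory stays bounded.
        root.clear()

matches = list(find_rows(io.BytesIO(SAMPLE), "0001"))
print(matches)  # [{'id': '0001', 'firstname': 'Bob', 'city': 'Anytown'}]
```

Only one row is held as a tree at any time; everything else is seen as a
stream of events, which is the property the reader-based approach relies on.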

This has been added to the xmllint tool and is enabled by combining the
--stream and --relaxng flags. Here is a simple example:

paphio:~/XML -> /usr/bin/xmllint --stream --timing --noout  --relaxng db.rng ~/XSLT/tests/XSLTMark/db10000.xml
Compiling the schemas took 1 ms
Parsing and validating took 726 ms
/u/veillard/XSLT/tests/XSLTMark/db10000.xml validates
paphio:~/XML -> cat .memdump
      04:13:32 PM
 
      MEMORY ALLOCATED : 0, MAX was 89942
BLOCK  NUMBER   SIZE  TYPE
paphio:~/XML -> ls -l /u/veillard/XSLT/tests/XSLTMark/db10000.xml
-rw-rw-r--    1 veillard www       2009240 Mar 26 00:27 /u/veillard/XSLT/tests/XSLTMark/db10000.xml
paphio:~/XML -> cat db.rng
<element name="table" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
    <element name="row">
      <element name="id">
        <data type="integer"/>
      </element>
      <element name="firstname">
        <data type="NCName"/>
      </element>
      <element name="lastname">
        <data type="NCName"/>
      </element>
      <element name="street">
        <text/>
      </element>
      <element name="city">
        <data type="NCName"/>
      </element>
      <element name="state">
        <data type="string">
          <param name="length">2</param>
        </data>
      </element>
      <element name="zip">
        <data type="integer"/>
      </element>
    </element>
  </oneOrMore>
</element>
paphio:~/XML -> more /u/veillard/XSLT/tests/XSLTMark/db10000.xml
<?xml version="1.0"?>
 
<table>
  <row>
    <id>0000</id>
    <firstname>Al</firstname>
    <lastname>Aranow</lastname>
    <street>1 Any St.</street>
    <city>Anytown</city>
    <state>AL</state>
    <zip>22000</zip>
  </row>
  <row>
    <id>0001</id>
    <firstname>Bob</firstname>
    <lastname>Aranow</lastname>
    <street>2 Any St.</street>
    <city>Anytown</city>
    <state>AL</state>
    <zip>22000</zip>
  </row>
[...]
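
The content model of <row> in db.rng is a fixed sequence of child elements, so
it is the kind of model that can be written as a deterministic regexp. The toy
sketch below illustrates that idea only: it is not how libxml2 implements
validation, it checks element order but not the datatypes, and it matches each
row once it is complete rather than stepping an automaton child by child. It
uses Python's standard library; ROW_MODEL, validate_rows and the sample data
are hypothetical names and values.

```python
import io
import re
import xml.etree.ElementTree as ET

# The <row> content model from db.rng, written as a regexp over the
# space-separated names of a row's child elements.
ROW_MODEL = re.compile(r"id firstname lastname street city state zip")

# Made-up sample: the second row lacks <street> and should be rejected.
SAMPLE = b"""<table>
  <row><id>0000</id><firstname>Al</firstname><lastname>Aranow</lastname>
       <street>1 Any St.</street><city>Anytown</city><state>AL</state>
       <zip>22000</zip></row>
  <row><id>0001</id><firstname>Bob</firstname><lastname>Aranow</lastname>
       <city>Anytown</city><state>AL</state><zip>22000</zip></row>
</table>"""

def validate_rows(source):
    """Return (rows_seen, invalid_row_numbers), holding one row at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    seen, bad = 0, []
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            seen += 1
            names = " ".join(child.tag for child in elem)
            if ROW_MODEL.fullmatch(names) is None:
                bad.append(seen)
            # Discard the checked row so memory does not grow with the file.
            root.clear()
    return seen, bad

result = validate_rows(io.BytesIO(SAMPLE))
print(result)  # (2, [2])
```

Because each row is checked and discarded before the next one is read, the
memory needed does not depend on the number of rows.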

  libxml2 is now able to validate the 10,000 records using Relax-NG and
XML Schemas datatypes in constant memory (a little under 100 KB) at a
speed of approximately 3 MBytes/s.
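
As a quick check, both figures can be recomputed from the numbers in the
xmllint transcript (file size from ls -l, elapsed time from --timing, peak
allocation from .memdump). The variable names are chosen here for
illustration.

```python
# Figures copied from the transcript.
file_size_bytes = 2009240   # size reported by `ls -l`
elapsed_ms = 726            # "Parsing and validating took 726 ms"
peak_alloc_bytes = 89942    # "MAX was 89942" in .memdump

throughput = file_size_bytes / (elapsed_ms / 1000)  # bytes per second
print(f"{throughput / 1e6:.2f} MB/s")               # 2.77 MB/s
print(f"{peak_alloc_bytes / 1024:.1f} KiB peak")    # 87.8 KiB peak
```
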
  I expect to ship libxml2-2.5.7 with those enhancements once I have fixed
some of the bugs waiting on bugzilla.

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


