[xml] Latest optimisation: XML_PARSE_COMPACT

From: Daniel Veillard <veillard redhat com>
To: xml gnome org
Subject: [xml] Latest optimisation: XML_PARSE_COMPACT
Date: Thu, 25 Aug 2005 11:35:41 -0400
 Hi all,

This should affect users of the tree and the reader parser APIs.
It comes from two facts: generating a text node wastes a lot of
allocated memory in the node structure, and memory allocation is now
the main bottleneck of libxml2 parsing speed when not using SAX.
The principle of the optimization is that for a small string we just
keep the node value within the node structure, by using the two
consecutive pointer fields properties and nsDef, which obviously
should not be used for text nodes. However it does break code:
   - which modifies the node content without going through the API
 or
   - which doesn't check the node type before accessing the properties
     or nsDef fields

 This of course was breaking libxml2 and libxslt, so I cleaned up the
related code, but I assume it will also break some user applications so
this can't be made a default behaviour, hence the new XML_PARSE_COMPACT
parser option !
 I made the reader use it by default, as the user application should never
modify the reader's private tree. This can lead to significant improvements
for regular, data-like content. Note that the change is more interesting
on 64-bit boxes, as it will be able to cache all strings shorter than
16 bytes instead of all strings shorter than 8 bytes on 32-bit ones.
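
The storage trick described above can be sketched in a few lines of C. This
is only an illustration under stated assumptions: sketch_node and the
sketch_* functions are hypothetical names, the layout is not the real
xmlNode, and a union is used here to overlay the two pointer slots, whereas
libxml2 reuses the properties and nsDef fields directly. It also assumes the
terminating NUL is stored inline, which is what gives the "shorter than 16
bytes" (or 8 bytes) limit.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified node; NOT the real xmlNode layout. */
typedef struct sketch_node {
    int   is_text;               /* 1 for text nodes, 0 for elements */
    char *content;               /* text value; may point into the node */
    union {
        struct {                 /* meaningful for element nodes only */
            void *properties;
            void *ns_def;
        } elem;
        /* Same storage seen as bytes: 8 on 32-bit, 16 on 64-bit. */
        char compact[2 * sizeof(void *)];
    } u;
} sketch_node;

/* Store a copy of s in the node.
 * Returns 1 if kept inline, 0 if heap-allocated, -1 if malloc failed. */
int sketch_set_text(sketch_node *n, const char *s)
{
    size_t len = strlen(s);

    n->is_text = 1;
    if (len < sizeof(n->u.compact)) {      /* len bytes plus NUL fit */
        memcpy(n->u.compact, s, len + 1);
        n->content = n->u.compact;
        return 1;
    }
    n->content = malloc(len + 1);
    if (n->content == NULL)
        return -1;
    memcpy(n->content, s, len + 1);
    return 0;
}

/* A text node is compact when content points at its own inline bytes. */
int sketch_is_compact(const sketch_node *n)
{
    return n->is_text && n->content == n->u.compact;
}

/* Release the text; inline storage must never be passed to free(). */
void sketch_clear_text(sketch_node *n)
{
    if (!sketch_is_compact(n))
        free(n->content);
    n->content = NULL;
}
```

The sketch also shows why the two kinds of code listed earlier break: code
that frees or replaces content by hand may call free() on a pointer into the
node itself, and code that reads properties or nsDef on a text node would
see string bytes instead of pointers.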

Let's see from an example how this works:

paphio:~/XML -> cat tst.xml
<doc>
    <a>
        <b attr="hello">folks</b>
        <c>too long a string for compact</c>
    </a>
</doc>
paphio:~/XML ->  xmllint --debug tst.xml
DOCUMENT
version=1.0
URL=tst.xml
standalone=true
  ELEMENT doc
    TEXT compact      <- formatting blank 5 chars so compacted
      content=
    ELEMENT a
      TEXT interned   <- formatting blank 9 chars so in the dictionary
        content=
      ELEMENT b
        ATTRIBUTE attr
          TEXT compact    <- the attribute value is 5 chars so compacted
            content=hello
        TEXT compact      <- the text content is 5 chars so compacted
          content=folks
      TEXT compact    <- formatting blank 9 chars so in the dictionary
        content=
      ELEMENT c
        TEXT          <- that one is too long not compacted
          content=too long a string for compact
      TEXT compact    <- formatting blank 5 chars so compacted
        content=
    TEXT compact      <- formatting blank 5 chars so compacted
      content=
paphio:~/XML ->

  This leads to two remarks:
   - formatting blanks are a finite set so they are always interned in the
     document dictionary
   - the options --nodict and --nocompact of xmllint allow you to see the
     impact on an instance

The impact can be drastic on some data oriented files:

before:
localhost:~/XML -> ltrace ./xmllint --stream test/att4 2>res;grep malloc res|wc -l
14642
after:
paphio:~/XML -> ltrace ./xmllint --stream test/att4 2>res;grep malloc res|wc -l
430

Or nearly negligible on others:

before:
localhost:~/XML -> ltrace ./xmllint --stream test/valid/REC-xml-19980210.xml 2>res;grep malloc res|wc -l
13218
after:
paphio:~/XML -> ltrace ./xmllint --stream test/valid/REC-xml-19980210.xml 2>res;grep malloc res|wc -l
12326
after on a 64bit box:
localhost:~/XML -> ltrace ./xmllint --stream test/valid/REC-xml-19980210.xml 2>res;grep malloc res|wc -l
11081
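
To put the malloc counts above in relative terms, here is a small helper that
computes the percentage of allocations saved. The function name is
hypothetical, and the 64-bit figure is compared against the same "before"
count on the assumption that the count does not depend on the box.

```c
/* Percentage of malloc calls saved going from `before` to `after`. */
double malloc_reduction_percent(long before, long after)
{
    return 100.0 * (double)(before - after) / (double)before;
}
```

With the numbers from the transcripts, test/att4 drops by about 97%, while
REC-xml-19980210.xml drops by about 7%, or about 16% on the 64-bit box.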


  The usefulness of this option is really on a case by case basis; remember
you will have to activate it in your apps (after checking :-).

  This has been committed to CVS, and the CVS snapshot on ftp://xmlsoft.org/
should have it too. This should not break applications, but well, I prefer
to give some advance warning so that people can check ahead of time before
the next release (probably end of next week).

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/