[xml] Latest optimisation: XML_PARSE_COMPACT
- From: Daniel Veillard <veillard redhat com>
- To: xml gnome org
- Subject: [xml] Latest optimisation: XML_PARSE_COMPACT
- Date: Thu, 25 Aug 2005 11:35:41 -0400
Hi all,
This should affect users of the tree and the reader parser APIs.
This comes from the fact that when generating a text node a lot of the
memory allocated in the node structure is wasted, and that memory
allocation is now the main bottleneck of libxml2 parsing speed when
not using SAX.
The principle of the optimization is that for small strings we just
keep the node value within the node structure, by using the two
consecutive pointer fields properties and nsDef, which obviously
should not be used for text nodes. However it does break code:
- which modifies the node content without going through the API
or
- which doesn't check the node type before accessing the properties
or nsDef fields
This of course was breaking libxml2 and libxslt, so I cleaned up the
related code, but I assume it will also break some user applications so
this can't be made a default behaviour, hence the new XML_PARSE_COMPACT
parser option !
I made the reader use it by default, as the user application should never
modify the reader's private tree. This can lead to significant improvements
for regular, data-like content. Note that the change is more
interesting on 64-bit boxes, as it can then keep inline all strings shorter
than 16 bytes instead of all strings shorter than 8 bytes on 32-bit ones.
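To make the size arithmetic concrete, here is a small standalone sketch of the idea. It is not the libxml2 implementation: the sketch_node layout and the sketch_* functions are hypothetical, and only illustrate how two adjacent pointer slots give 2 * sizeof(void *) bytes of inline storage (8 on a typical 32-bit target, 16 on a 64-bit one, including the terminating NUL).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout, NOT libxml2's xmlNode: the two pointer
 * slots an element needs are reused as a character buffer for text. */
typedef struct sketch_node {
    int is_text;                 /* 1 for text nodes, 0 for elements */
    char *content;               /* points at inline_buf when compact */
    union {
        struct {                 /* element view */
            void *properties;
            void *ns_def;
        } elem;
        char inline_buf[2 * sizeof(void *)];   /* text view */
    } u;
} sketch_node;

/* Bytes available inline, terminating NUL included. */
static size_t sketch_inline_capacity(void) {
    return 2 * sizeof(void *);
}

/* Store s in the node. Returns 1 if it was kept inline, 0 if it was
 * heap-allocated, -1 if the allocation failed. */
static int sketch_set_text(sketch_node *n, const char *s) {
    size_t len;

    assert(n != NULL && s != NULL);
    len = strlen(s);
    n->is_text = 1;
    if (len < sketch_inline_capacity()) {      /* leaves room for NUL */
        memcpy(n->u.inline_buf, s, len + 1);
        n->content = n->u.inline_buf;
        return 1;
    }
    n->content = malloc(len + 1);
    if (n->content == NULL)
        return -1;
    memcpy(n->content, s, len + 1);
    return 0;
}

/* A text node is compact when content points into the node itself. */
static int sketch_is_compact(const sketch_node *n) {
    return n->is_text && n->content == n->u.inline_buf;
}

/* Release the content; inline storage must never be passed to free(). */
static void sketch_clear_text(sketch_node *n) {
    if (n->is_text && !sketch_is_compact(n))
        free(n->content);
    n->content = NULL;
}
```

The check in sketch_clear_text is the reason direct manipulation is dangerous: code that frees or reassigns the content pointer by hand has no way of knowing the string may live inside the node.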
Let's see from an example how this works:
paphio:~/XML -> cat tst.xml
<doc>
    <a>
        <b attr="hello">folks</b>
        <c>too long a string for compact</c>
    </a>
</doc>
paphio:~/XML -> xmllint --debug tst.xml
DOCUMENT
version=1.0
URL=tst.xml
standalone=true
  ELEMENT doc
    TEXT compact <- formatting blank 5 chars so compacted
      content=
    ELEMENT a
      TEXT interned <- formatting blank 9 chars so in the dictionary
        content=
      ELEMENT b
        ATTRIBUTE attr
          TEXT compact <- the attribute value is 5 chars so compacted
            content=hello
        TEXT compact <- the text content is 5 chars so compacted
          content=folks
      TEXT compact <- formatting blank 9 chars so in the dictionary
        content=
      ELEMENT c
        TEXT <- that one is too long not compacted
          content=too long a string for compact
      TEXT compact <- formatting blank 5 chars so compacted
        content=
    TEXT compact <- formatting blank 5 chars so compacted
      content=
paphio:~/XML ->
This leads to 2 remarks:
- formatting blanks are a finite set so they are always interned in the
document dictionary
- the options --nodict and --nocompact of xmllint allow you to see the
impact on an instance
The impact can be drastic on some data oriented files:
before:
localhost:~/XML -> ltrace ./xmllint --stream test/att4 2>res;grep malloc res|wc -l
14642
after:
paphio:~/XML -> ltrace ./xmllint --stream test/att4 2>res;grep malloc res|wc -l
430
Or nearly negligible on others:
before:
localhost:~/XML -> ltrace ./xmllint --stream test/valid/REC-xml-19980210.xml 2>res;grep malloc res|wc -l
13218
after:
paphio:~/XML -> ltrace ./xmllint --stream test/valid/REC-xml-19980210.xml 2>res;grep malloc res|wc -l
12326
after on a 64-bit box:
localhost:~/XML -> ltrace ./xmllint --stream test/valid/REC-xml-19980210.xml 2>res;grep malloc res|wc -l
11081
The usefulness of this option is really on a case by case basis; remember
you will have to activate it in your apps (after checking :-).
This has been committed to CVS, and the CVS snapshot on ftp://xmlsoft.org/
should have it too. This should not break applications, but
I prefer to give some advance warning so that people can check
ahead of time before the next release (probably end of next week).
Daniel
--
Daniel Veillard | Red Hat Desktop team http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/