Re: [xml] Better hash function for dict.c

On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote:

  Hi Stefan,

long mail, bottom line being: 30% to multiple times faster parsing with a
different hash function in dict.c. Keep reading.

I did some profiling on lxml using callgrind. lxml configures the parser to
use a global per-thread dictionary, so everything it parses ends up in one
dictionary instead of tons of calls to malloc(). That by itself is quite an
impressive performance improvement.

  I think this is one of the dangers of the approach taken in lxml of reusing
the same dictionary for all parsing (at least within a thread): the
dictionary can only grow, and this leads to false expectations about performance
which may just blow up once your module has run for a few million documents.
The hash function is only a small factor in the performance; reusing the
same dict over and over should IMHO be done only if you are parsing a single
kind of document, basically working with a fixed-size vocabulary.

  I don't see how to fix this problem in general for an infinite vocabulary.

  I'm not sure from the description whether you used the xmlDict for anything
outside of what libxml2 uses it for, i.e. markup, indentation, and tiny strings.
I hope you didn't try to use it for content, because that would indeed be
a performance disaster in practice: as random strings get added to the dict,
you would lose all the performance gain of the dictionary.
What I found in callgrind was that a huge part of the overall time is spent in
xmlDictLookup, and that this time is almost completely wasted on calls to
memcmp(), i.e. on walking through the contents of one hash bucket and comparing
each entry.

Next, I did some testing on the current hash function (against
/usr/share/dict/words). For small dict sizes (64,128,256), the distribution is
fine. However, it seems that the distribution degrades with increasing size of
the hash table. Given a dict size of 8192, for example, the "words" file leads
to 80% of the buckets being larger than 4, but with an almost linearly
increasing distribution. To make things worse, a larger dict size is
encouraged by the size maintenance algorithm, which increases the size when it
has to look through more than 4 values sequentially (if I understood that
correctly).

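As an aside, the kind of measurement described above is easy to reproduce with a
toy harness along these lines. The hash below is a simple multiplicative
stand-in, not libxml2's xmlDictComputeKey(), and the function names are made up
for illustration:

```c
#include <stdlib.h>

/* Stand-in hash, NOT libxml2's xmlDictComputeKey(). */
static unsigned long toy_hash(const char *s) {
    unsigned long h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* Hash `nwords` words into `size` buckets and return how many buckets
   end up with a chain longer than 4 -- the threshold that triggers a
   dictionary resize in dict.c. Feed it /usr/share/dict/words to get
   the distribution Stefan describes. */
size_t count_overfull(const char **words, size_t nwords, size_t size) {
    size_t *counts = calloc(size, sizeof *counts);
    size_t i, overfull = 0;

    for (i = 0; i < nwords; i++)
        counts[toy_hash(words[i]) % size]++;
    for (i = 0; i < size; i++)
        if (counts[i] > 4)
            overfull++;
    free(counts);
    return overfull;
}
```

A histogram of the per-bucket counts, over a range of table sizes, shows whether
the distribution degrades as the table grows.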
  Yes, to avoid degrading into linear lookup time, the dictionary size
increases if it finds a clash list longer than a predefined size (might be 4,
I can't remember right now). This is costly, but since in general most of the
words found in the markup are reused, it was worth it in the experiments.
  For the profiling I used real-world XML with very large markup, like the TEI
(and, for a smaller set, DocBook) DTDs, and used callgrind too at the time
(around 2.5.0).

According to the profiling data, there appears to be a 50% chance
of having to grow the dictionary when a new entry gets added to the dict.

A little web search offers a couple of completely unreadable hash functions
that come with some obviously biased benchmarks. :) However, this one

is quite short and readable and seems to do what I was looking for. In a (non
representative) benchmark, I get a speedup of a factor of 7 in lxml when
parsing from memory ("bench_XML" benchmark in lxml's suite).
That's from over 300 msecs down to 40!

  I would prefer to see benchmarks done with xmllint directly, to avoid the
side effects of lxml interning more strings than libxml2 does.

However, that benchmark uses generated tag names, so it is not necessarily
representative of an average XML file. More realistic benchmarks, like
parsing and XPath-searching an XML file containing the Old Testament, run
between 2% and 30% faster. Also, the number of cases where the dictionary size
is checked for an increase (i.e. where a bucket held more than 4 entries)
drops quite dramatically according to callgrind, from 49% to below 2%.
Hmmm, I have difficulty believing those numbers myself...

I attached a patch against libxml2 2.6.32 that replaces the current
xmlDictComputeKey() implementation with the SuperFastHash() function from the
web page.
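For readers without the attachment: the function in question, Paul Hsieh's
SuperFastHash, looks roughly like the following. This is reconstructed from the
published version, so the attached patch may differ in details; get16bits() is
written with memcpy here to avoid unaligned reads:

```c
#include <stdint.h>
#include <string.h>

/* Portable 16-bit little-chunk read. */
static uint16_t get16bits(const char *d) {
    uint16_t v;
    memcpy(&v, d, sizeof v);
    return v;
}

uint32_t SuperFastHash(const char *data, int len) {
    uint32_t hash = (uint32_t)len, tmp;
    int rem;

    if (len <= 0 || data == NULL)
        return 0;

    rem = len & 3;
    len >>= 2;

    /* main loop: mix 32 bits (two 16-bit halves) per iteration */
    for (; len > 0; len--) {
        hash += get16bits(data);
        tmp   = ((uint32_t)get16bits(data + 2) << 11) ^ hash;
        hash  = (hash << 16) ^ tmp;
        data += 4;
        hash += hash >> 11;
    }

    /* handle the 0-3 trailing bytes */
    switch (rem) {
    case 3:
        hash += get16bits(data);
        hash ^= hash << 16;
        hash ^= (uint32_t)(signed char)data[2] << 18;
        hash += hash >> 11;
        break;
    case 2:
        hash += get16bits(data);
        hash ^= hash << 11;
        hash += hash >> 17;
        break;
    case 1:
        hash += (signed char)*data;
        hash ^= hash << 10;
        hash += hash >> 1;
    }

    /* final avalanche of the remaining bits */
    hash ^= hash << 3;
    hash += hash >> 5;
    hash ^= hash << 4;
    hash += hash >> 17;
    hash ^= hash << 25;
    hash += hash >> 6;

    return hash;
}
```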

Note that this is not a production-ready patch, not even a "ready for
inclusion" patch. It is just meant to let others give it a try and see if they
get similar results. The problem with hash functions is that they may work
great for some input but poorly for others. This hash function is slower than
the current one, actually a lot slower. But the hash distribution it achieves
seems to be so much better that it wins the contest by a wide margin. And there
may still be room for improvement before final inclusion.

So I would really like to get some feedback from others.

  I'm not against changing the hash function if one can prove it works well
in the libxml2 context and not just the lxml one ;-) . Basically I'm afraid of
changes which would try to fix an abuse of the dictionary but might penalize
normal libxml2 users.
  If you tried to intern all strings passed to the user in lxml, I really think
you need to revisit this, because that's untenable; I hope I misunderstood
(very probable ;-) . And I'm all for improving the hash function. I wrote it
at the time after finding that a number of 'classic' algorithms actually had
abominable behaviour in libxml2's use case, so it's certainly not perfect. I
just want the evaluation to be done in a pure libxml2 context; as long as there
is no degradation, I'm fine with this. Of course, running the full regression
tests for libxml2 and libxslt without problems is also important before any
patch is accepted, but hash function changes should only affect performance.

  Since you seem to be interested in the performance of the hash
algorithm: I tried at one point to drop the string comparisons on lookup when
possible. I have an old patch for this which I'm enclosing, but I never
applied it since I had problems at the time (can't remember why/where; it's
just a FYI patch ;-)
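The idea, as I read it from the description (the field and function names
below are made up for illustration, not libxml2's actual ones), is to keep the
full hash value in each entry so a lookup can reject non-matching entries with
two integer compares, and only fall back to memcmp() when both hash and length
match:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry layout: the full (pre-modulo) hash is cached. */
typedef struct Entry {
    struct Entry *next;
    unsigned long okey;   /* full hash of the string, before % table size */
    unsigned int  len;
    const char   *name;
} Entry;

/* djb2-style stand-in hash, not libxml2's xmlDictComputeKey(). */
static unsigned long toy_hash(const char *s, unsigned int len) {
    unsigned long h = 5381;
    while (len--)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

const char *lookup(Entry **buckets, size_t nbuckets,
                   const char *name, unsigned int len) {
    unsigned long okey = toy_hash(name, len);
    Entry *e;
    for (e = buckets[okey % nbuckets]; e != NULL; e = e->next) {
        if (e->okey != okey || e->len != len)
            continue;                 /* cheap reject, no memcmp() call */
        if (memcmp(e->name, name, len) == 0)
            return e->name;           /* interned copy found */
    }
    return NULL;
}
```

Since entries in the same bucket share only the low bits of the hash, comparing
the full okey filters out most chain neighbours before any byte-wise work.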


Red Hat Virtualization group
Daniel Veillard      | virtualization library
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
