[xml] Better hash function for dict.c



Hi,

long mail; the bottom line is: parsing gets 30% to several times faster with a
different hash function in dict.c. Keep reading.

I did some profiling on lxml using callgrind. lxml configures the parser to
use a global per-thread dictionary, so everything it parses ends up in one
dictionary instead of tons of calls to malloc(). That by itself is quite an
impressive performance improvement.
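
Roughly, what that looks like at the C level is the sketch below: one shared
xmlDict that a parser context references instead of using its own private
one. This is simplified, the helper name is made up for illustration, and
lxml's actual per-thread setup is more involved; it just shows the idea of
funnelling all parsed names into a single dictionary.

#include <libxml/parser.h>
#include <libxml/dict.h>

/* Simplified sketch: parse a buffer while letting the parser context use
 * a shared dictionary instead of its private one.  Error handling is kept
 * minimal on purpose. */
static xmlDocPtr
parse_with_shared_dict(xmlDictPtr shared, const char *buf, int size)
{
    xmlParserCtxtPtr ctxt = xmlNewParserCtxt();
    xmlDocPtr doc;

    if (ctxt == NULL)
        return(NULL);
    if (ctxt->dict != NULL)
        xmlDictFree(ctxt->dict);     /* drop the context's private dict */
    xmlDictReference(shared);        /* the context now holds a reference */
    ctxt->dict = shared;

    doc = xmlCtxtReadMemory(ctxt, buf, size, "in_memory.xml", NULL, 0);
    xmlFreeParserCtxt(ctxt);         /* releases the context's reference */
    return(doc);
}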

What I found in callgrind was that a huge part of the overall time is spent in
xmlDictLookup, and that almost all of that time is wasted on calls to
memcmp(), i.e. on walking through the contents of a single hash bucket and
comparing strings.
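
To illustrate where that time goes, here is a minimal sketch (not libxml2's
actual code, just the general shape of a chained hash lookup): the longer the
collision chain of a bucket, the more string comparisons every single lookup
has to do.

#include <string.h>

typedef struct entry {
    struct entry *next;   /* collision chain within one bucket */
    const char   *name;
    int           len;
} entry;

static const char *
lookup(entry **buckets, int size, unsigned long key,
       const char *name, int len)
{
    entry *e;

    /* walk the whole chain of the bucket the key maps to */
    for (e = buckets[key % size]; e != NULL; e = e->next) {
        /* every extra entry in the bucket costs one more comparison */
        if ((e->len == len) && (memcmp(e->name, name, len) == 0))
            return(e->name);
    }
    return(NULL);   /* not found; the caller would insert a new entry */
}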

Next, I did some testing on the current hash function (against
/usr/share/dict/words). For small dict sizes (64, 128, 256), the distribution
is fine. However, the distribution seems to degrade as the hash table grows.
With a dict size of 8192, for example, the "words" file ends up with 80% of
the buckets holding more than 4 entries, and the bucket sizes follow an almost
linearly increasing distribution. To make things worse, a larger dict size is
encouraged by the size maintenance algorithm, which grows the table whenever
it has to look through more than 4 values sequentially (if I understood that
correctly). According to the profiling data, there is roughly a 50% chance of
triggering that growth check whenever a new entry gets added to the dict in
xmlDictLookup().
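
For anyone who wants to reproduce the distribution test, here is a small
stand-alone harness. oldComputeKey() is a verbatim copy of the current
xmlDictComputeKey(), and the 8192 table size and the "more than 4 entries"
threshold are the numbers quoted above; everything else (file name, output
format) is just for illustration.

#include <stdio.h>
#include <string.h>

#define DICT_SIZE 8192

/* copy of the current xmlDictComputeKey(), fall-throughs intentional */
static unsigned long
oldComputeKey(const unsigned char *name, int namelen) {
    unsigned long value = 0L;

    if (name == NULL) return(0);
    value = *name;
    value <<= 5;
    if (namelen > 10) {
        value += name[namelen - 1];
        namelen = 10;
    }
    switch (namelen) {
        case 10: value += name[9];
        case 9: value += name[8];
        case 8: value += name[7];
        case 7: value += name[6];
        case 6: value += name[5];
        case 5: value += name[4];
        case 4: value += name[3];
        case 3: value += name[2];
        case 2: value += name[1];
        default: break;
    }
    return(value);
}

int main(void) {
    static int buckets[DICT_SIZE];   /* zero-initialized */
    char line[256];
    FILE *f = fopen("/usr/share/dict/words", "r");
    int i, words = 0, overfull = 0;

    if (f == NULL) return(1);
    while (fgets(line, sizeof(line), f) != NULL) {
        int len = (int) strcspn(line, "\n");
        buckets[oldComputeKey((const unsigned char *) line, len) % DICT_SIZE]++;
        words++;
    }
    fclose(f);
    for (i = 0; i < DICT_SIZE; i++)
        if (buckets[i] > 4)
            overfull++;
    printf("%d words: %d of %d buckets hold more than 4 entries (%.1f%%)\n",
           words, overfull, DICT_SIZE, 100.0 * overfull / DICT_SIZE);
    return(0);
}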

A little web search turns up a couple of completely unreadable hash functions
that come with some obviously biased benchmarks. :) However, this one

http://www.azillionmonkeys.com/qed/hash.html

is quite short and readable and seems to do what I was looking for. In a
(non-representative) benchmark, I get a speedup of a factor of 7 in lxml when
parsing from memory (the "bench_XML" benchmark in lxml's bench_etree.py
suite). That's from over 300 msecs down to 40!

However, that benchmark uses generated tag names, so it is not necessarily
representative of an average XML file. More realistic benchmarks, like
parsing and XPath searching an XML file containing the old testament, run
between 2% and 30% faster. Also, the number of cases where the dictionary size
is checked for an increase (i.e. where there were more than 4 entries in a
bucket) drops quite dramatically according to callgrind, from 49% to below 2%.
Hmmm, I have difficulty believing those numbers myself...

I attached a patch against libxml2 2.6.32 that replaces the current
xmlDictComputeKey() implementation with the SuperFastHash() function from the
web page.

Note that this is not a production-ready patch, not even a "ready for
inclusion" patch. It is just meant to let others give it a try and see if they
get similar results. The problem with hash functions is that they may work
great for some input but much worse for other input. This hash function is
slower than the current one, actually a lot slower. But the hash distribution
it achieves seems to be so much better that it wins the contest by a wide
margin. And there may still be room for improvement before a final inclusion.

So I would really like to get some feedback from others.

Stefan

--- dict.c.ORIG 2008-04-16 21:59:19.000000000 +0200
+++ dict.c      2008-04-16 20:56:16.000000000 +0200
@@ -222,34 +222,72 @@
     return(ret);
 }
 
+
+#include <stdint.h>
+#undef get16bits
+#if (defined(__GNUC__) && defined(__i386__)) || defined(__WATCOMC__) \
+  || defined(_MSC_VER) || defined (__BORLANDC__) || defined (__TURBOC__)
+#define get16bits(d) (*((const uint16_t *) (d)))
+#endif
+
+#if !defined (get16bits)
+#define get16bits(d) ((((uint32_t)(((const uint8_t *)(d))[1])) << 8)\
+                       +(uint32_t)(((const uint8_t *)(d))[0]) )
+#endif
+
+static uint32_t
+SuperFastHash (const xmlChar* data, int len, uint32_t hash) {
+    uint32_t tmp;
+    int rem;
+
+    if (len <= 0 || data == NULL) return hash;
+
+    rem = len & 3;
+    len >>= 2;
+
+    /* Main loop */
+    for (;len > 0; len--) {
+        hash  += get16bits (data);
+        tmp    = (get16bits (data+2) << 11) ^ hash;
+        hash   = (hash << 16) ^ tmp;
+        data  += 2*sizeof (uint16_t);
+        hash  += hash >> 11;
+    }
+
+    /* Handle end cases */
+    switch (rem) {
+        case 3: hash += get16bits (data);
+                hash ^= hash << 16;
+                hash ^= data[sizeof (uint16_t)] << 18;
+                hash += hash >> 11;
+                break;
+        case 2: hash += get16bits (data);
+                hash ^= hash << 11;
+                hash += hash >> 17;
+                break;
+        case 1: hash += *data;
+                hash ^= hash << 10;
+                hash += hash >> 1;
+    }
+
+    /* Force "avalanching" of final 127 bits */
+    hash ^= hash << 3;
+    hash += hash >> 5;
+    hash ^= hash << 4;
+    hash += hash >> 17;
+    hash ^= hash << 25;
+    hash += hash >> 6;
+
+    return hash;
+}
+
 /*
  * xmlDictComputeKey:
  * Calculate the hash key
  */
 static unsigned long
 xmlDictComputeKey(const xmlChar *name, int namelen) {
-    unsigned long value = 0L;
-    
-    if (name == NULL) return(0);
-    value = *name;
-    value <<= 5;
-    if (namelen > 10) {
-        value += name[namelen - 1];
-        namelen = 10;
-    }
-    switch (namelen) {
-        case 10: value += name[9];
-        case 9: value += name[8];
-        case 8: value += name[7];
-        case 7: value += name[6];
-        case 6: value += name[5];
-        case 5: value += name[4];
-        case 4: value += name[3];
-        case 3: value += name[2];
-        case 2: value += name[1];
-        default: break;
-    }
-    return(value);
+  return SuperFastHash(name, namelen, namelen);
 }
 
 /*
@@ -259,56 +297,14 @@
 static unsigned long
 xmlDictComputeQKey(const xmlChar *prefix, const xmlChar *name, int len)
 {
-    unsigned long value = 0L;
     int plen;
     
     if (prefix == NULL)
-        return(xmlDictComputeKey(name, len));
+        return(SuperFastHash(name, len, len));
 
     plen = xmlStrlen(prefix);
-    if (plen == 0)
-       value += 30 * (unsigned long) ':';
-    else
-       value += 30 * (*prefix);
-    
-    if (len > 10) {
-        value += name[len - (plen + 1 + 1)];
-        len = 10;
-       if (plen > 10)
-           plen = 10;
-    }
-    switch (plen) {
-        case 10: value += prefix[9];
-        case 9: value += prefix[8];
-        case 8: value += prefix[7];
-        case 7: value += prefix[6];
-        case 6: value += prefix[5];
-        case 5: value += prefix[4];
-        case 4: value += prefix[3];
-        case 3: value += prefix[2];
-        case 2: value += prefix[1];
-        case 1: value += prefix[0];
-        default: break;
-    }
-    len -= plen;
-    if (len > 0) {
-        value += (unsigned long) ':';
-       len--;
-    }
-    switch (len) {
-        case 10: value += name[9];
-        case 9: value += name[8];
-        case 8: value += name[7];
-        case 7: value += name[6];
-        case 6: value += name[5];
-        case 5: value += name[4];
-        case 4: value += name[3];
-        case 3: value += name[2];
-        case 2: value += name[1];
-        case 1: value += name[0];
-        default: break;
-    }
-    return(value);
+    return SuperFastHash(name, len,
+                        SuperFastHash(prefix, plen, plen));
 }
 
 /**

