Re: [xml] Better hash function for dict.c

From: Stefan Behnel <stefan_ml behnel de>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] Better hash function for dict.c
Date: Sun, 20 Apr 2008 10:51:38 +0200

Hi again (and sorry for all the noise),

Stefan Behnel wrote:

If an application benefits from a different hash function depends on the
vocabulary it uses in its XML files. A slow but well distributing hash
function performs much better for large vocabularies (or many different
vocabularies), while small vocabularies will not fill the dict enough to make
a difference, in which case the faster hash function wins.


So the obvious solution is to combine the two. Here is a patch that uses the
original hash function to start with (but lowers the bucket fill limit a
little from 4 down to 3) and when it reaches the grow barrier for the first
time, switches to the new hash function. You will find a performance
comparison below, based on xmllint.

I decreased the bucket fill barrier for two reasons: to trigger an early
switch between the two functions, and because the second function has much
better load balancing, so a high bucket size in one place really means that
most buckets are at least close to that fill rate. As you can see from the
numbers, it works pretty well over a wide range of vocabulary sizes from small
to medium, and as I've shown before, it performs much (much!) better for
larger sizes.

BTW, Bob Jenkins did a comparison of a couple of hash functions, including the
additive hash (a variant of which is currently used) and the hash function
used in the patch.

http://burtleburtle.net/bob/hash/doobs.html

The hash function itself was written by Paul Hsieh and published on his web
site. According to Bob Jenkins, it's public domain (although I didn't ask
directly).

http://www.azillionmonkeys.com/qed/hash.html

Any objections to getting this patch merged?

Stefan


# original hash function, increasing number of tag names
 5: 100 iterations took 720 ms
10: 100 iterations took 720 ms
15: 100 iterations took 718 ms
20: 100 iterations took 724 ms
25: 100 iterations took 739 ms
30: 100 iterations took 735 ms
35: 100 iterations took 750 ms
40: 100 iterations took 748 ms
45: 100 iterations took 760 ms
50: 100 iterations took 766 ms
55: 100 iterations took 775 ms
60: 100 iterations took 778 ms
65: 100 iterations took 789 ms
70: 100 iterations took 782 ms
75: 100 iterations took 840 ms
80: 100 iterations took 813 ms
85: 100 iterations took 829 ms
90: 100 iterations took 821 ms
95: 100 iterations took 822 ms

# combined hash functions, increasing number of tag names
 5: 100 iterations took 725 ms
10: 100 iterations took 718 ms
15: 100 iterations took 723 ms
20: 100 iterations took 717 ms
25: 100 iterations took 742 ms
30: 100 iterations took 764 ms
35: 100 iterations took 773 ms
40: 100 iterations took 743 ms
45: 100 iterations took 762 ms
50: 100 iterations took 766 ms
55: 100 iterations took 768 ms
60: 100 iterations took 778 ms
65: 100 iterations took 741 ms
70: 100 iterations took 757 ms
75: 100 iterations took 742 ms
80: 100 iterations took 743 ms
85: 100 iterations took 757 ms
90: 100 iterations took 741 ms
95: 100 iterations took 743 ms

--- libxml2-2.6.32-orig/dict.c  2008-02-08 10:52:13.000000000 +0100
+++ libxml2-2.6.32/dict.c       2008-04-20 10:30:00.000000000 +0200
@@ -20,16 +20,30 @@
 #include "libxml.h"
 
 #include <string.h>
+#include <stdint.h>
 #include <libxml/tree.h>
 #include <libxml/dict.h>
 #include <libxml/xmlmemory.h>
 #include <libxml/xmlerror.h>
 #include <libxml/globals.h>
 
-#define MAX_HASH_LEN 4
+#define MAX_HASH_LEN 3
 #define MIN_DICT_SIZE 128
 #define MAX_DICT_HASH 8 * 2048
 
+#define xmlDictComputeKey(dict, name, len)             \
+    (((dict)->size == MIN_DICT_SIZE) ?                \
+     xmlDictComputeFastKey(name, len) :               \
+     xmlDictComputeBigKey(name, len, len))            \
+
+#define xmlDictComputeQKey(dict, prefix, name, len)                 \
+    (((prefix) == NULL) ?                                           \
+      (xmlDictComputeKey(dict, name, len)) :                        \
+      (((dict)->size == MIN_DICT_SIZE) ?                            \
+       xmlDictComputeFastQKey(prefix, name, len) :                  \
+       xmlDictComputeBigKey(prefix, xmlStrlen(prefix),              \
+                           xmlDictComputeBigKey(name, len, len))))
+
 /* #define ALLOW_REMOVAL */
 /* #define DEBUG_GROW */
 
@@ -223,11 +237,80 @@
 }
 
 /*
- * xmlDictComputeKey:
- * Calculate the hash key
+ * xmlDictComputeBigKey:
+ *
+ * Calculate a hash key using a good hash function that works well for
+ * larger hash table sizes.
+ *
+ * Hash function by Paul Hsieh, see
+ * http://www.azillionmonkeys.com/qed/hash.html
+ * http://burtleburtle.net/bob/hash/doobs.html
+ */
+#undef get16bits
+#if (defined(__GNUC__) && defined(__i386__)) || defined(__WATCOMC__) \
+  || defined(_MSC_VER) || defined (__BORLANDC__) || defined (__TURBOC__)
+#define get16bits(d) (*((const uint16_t *) (d)))
+#endif
+
+#if !defined (get16bits)
+#define get16bits(d) ((((uint32_t)(((const uint8_t *)(d))[1])) << 8)\
+                       +(uint32_t)(((const uint8_t *)(d))[0]) )
+#endif
+
+static uint32_t
+xmlDictComputeBigKey(const xmlChar* data, int len, uint32_t hash) {
+    uint32_t tmp;
+    int rem;
+
+    if (len <= 0 || data == NULL) return hash;
+
+    rem = len & 3;
+    len >>= 2;
+
+    /* Main loop */
+    for (;len > 0; len--) {
+        hash  += get16bits (data);
+        tmp    = (get16bits (data+2) << 11) ^ hash;
+        hash   = (hash << 16) ^ tmp;
+        data  += 2*sizeof (uint16_t);
+        hash  += hash >> 11;
+    }
+
+    /* Handle end cases */
+    switch (rem) {
+        case 3: hash += get16bits (data);
+                hash ^= hash << 16;
+                hash ^= data[sizeof (uint16_t)] << 18;
+                hash += hash >> 11;
+                break;
+        case 2: hash += get16bits (data);
+                hash ^= hash << 11;
+                hash += hash >> 17;
+                break;
+        case 1: hash += *data;
+                hash ^= hash << 10;
+                hash += hash >> 1;
+    }
+
+    /* Force "avalanching" of final 127 bits */
+    hash ^= hash << 3;
+    hash += hash >> 5;
+    hash ^= hash << 4;
+    hash += hash >> 17;
+    hash ^= hash << 25;
+    hash += hash >> 6;
+
+    return hash;
+}
+
+/*
+ * xmlDictComputeFastKey:
+ *
+ * Calculate a hash key using a fast hash function that works well
+ * for low hash table fill.
  */
 static unsigned long
-xmlDictComputeKey(const xmlChar *name, int namelen) {
+xmlDictComputeFastKey(const xmlChar *name, int namelen) {
     unsigned long value = 0L;
     
     if (name == NULL) return(0);
@@ -253,17 +336,18 @@
 }
 
 /*
- * xmlDictComputeQKey:
- * Calculate the hash key
+ * xmlDictComputeFastQKey:
+ *
+ * Calculate a hash key for two strings using a fast hash function
+ * that works well for low hash table fill.
+ *
+ * Neither of the two strings must be NULL.
  */
 static unsigned long
-xmlDictComputeQKey(const xmlChar *prefix, const xmlChar *name, int len)
+xmlDictComputeFastQKey(const xmlChar *prefix, const xmlChar *name, int len)
 {
     unsigned long value = 0L;
     int plen;
-    
-    if (prefix == NULL)
-        return(xmlDictComputeKey(name, len));
 
     plen = xmlStrlen(prefix);
     if (plen == 0)
@@ -435,7 +519,7 @@
     for (i = 0; i < oldsize; i++) {
        if (olddict[i].valid == 0) 
            continue;
-       key = xmlDictComputeKey(olddict[i].name, olddict[i].len) % dict->size;
+       key = xmlDictComputeKey(dict, olddict[i].name, olddict[i].len) % dict->size;
        memcpy(&(dict->dict[key]), &(olddict[i]), sizeof(xmlDictEntry));
        dict->dict[key].next = NULL;
 #ifdef DEBUG_GROW
@@ -452,7 +536,7 @@
             * put back the entry in the new dict
             */
 
-           key = xmlDictComputeKey(iter->name, iter->len) % dict->size;
+           key = xmlDictComputeKey(dict, iter->name, iter->len) % dict->size;
            if (dict->dict[key].valid == 0) {
                memcpy(&(dict->dict[key]), iter, sizeof(xmlDictEntry));
                dict->dict[key].next = NULL;
@@ -570,7 +654,7 @@
     /*
      * Check for duplicate and insertion location.
      */
-    okey = xmlDictComputeKey(name, len);
+    okey = xmlDictComputeKey(dict, name, len);
     key = okey % dict->size;
     if (dict->dict[key].valid == 0) {
        insert = NULL;
@@ -687,7 +771,7 @@
     /*
      * Check for duplicate and insertion location.
      */
-    okey = xmlDictComputeKey(name, len);
+    okey = xmlDictComputeKey(dict, name, len);
     key = okey % dict->size;
     if (dict->dict[key].valid == 0) {
        insert = NULL;
@@ -783,7 +867,7 @@
     /*
      * Check for duplicate and insertion location.
      */
-    okey = xmlDictComputeQKey(prefix, name, len);
+    okey = xmlDictComputeQKey(dict, prefix, name, len);
     key = okey % dict->size;
     if (dict->dict[key].valid == 0) {
        insert = NULL;

Follow-Ups:
- Re: [xml] Better hash function for dict.c
  - From: Daniel Veillard

References:
- [xml] Better hash function for dict.c
  - From: Stefan Behnel
- Re: [xml] Better hash function for dict.c
  - From: Daniel Veillard
- Re: [xml] Better hash function for dict.c
  - From: Stefan Behnel

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]