Re: [xml] Better hash function for dict.c



On Tue, Apr 22, 2008 at 04:19:57AM -0400, Daniel Veillard wrote:
On Sun, Apr 20, 2008 at 10:51:38AM +0200, Stefan Behnel wrote:
Hi again (and sorry for all the noise),

  Sorry for getting back to this issue.

numbers, it works pretty well over a wide range of vocabulary sizes from small
to medium, and as I've shown before, it performs much (much!) better for
larger sizes.

BTW, Bob Jenkins did a comparison of a couple of hash functions, including the
additive hash (a variant of which is currently used) and the hash function
used in the patch.

http://burtleburtle.net/bob/hash/doobs.html

The hash function itself was written by Paul Hsieh and published on his web
site. According to Bob Jenkins, it's public domain (although I didn't ask
directly).

http://www.azillionmonkeys.com/qed/hash.html

Any objections to getting this patch merged?

  None, looks very good, thanks a lot for assembling all of this and providing
full and convincing numbers :-)

  I will apply this to SVN head now,

  After looking at some weird regression failures raised by libxslt,
I'm afraid I will have to revert this patch at least temporary until we
fix 2 problems I found recently:
  - the first one is a bug in the change, and affects only the case where
    you use subdictionaries, in the 3 lookup functions we reused the same okey
    when accessing the subdictionary, but both dictionaries may use different
    functions depending on their side. So sometimes the key value need to be
    recomputed when looking for the string in the subdict. the fix is by
    doing the recomputation if needed as in:

     if (dict->subdict) {
-       key = okey % dict->subdict->size;
+        unsigned long skey;
+
+        /* we cannot always reuse the same okey for the subdict */
+        if (((dict->size == MIN_DICT_SIZE) &&
+            (dict->subdict->size != MIN_DICT_SIZE)) ||
+            ((dict->size != MIN_DICT_SIZE) &&
+            (dict->subdict->size == MIN_DICT_SIZE)))
+           skey = xmlDictComputeKey(dict->subdict, name, len);
+       else
+           skey = okey;
+
+       key = skey % dict->subdict->size;

  - the second one is unfortunately not fixeable as is it comes from the
    key hash definitions themselves:

-#define xmlDictComputeKey(dict, name, len)             \
-    (((dict)->size == MIN_DICT_SIZE) ?                \
-     xmlDictComputeFastKey(name, len) :               \
-     xmlDictComputeBigKey(name, len, len))            \
-
-#define xmlDictComputeQKey(dict, prefix, name, len)                 \
-    (((prefix) == NULL) ?                                           \
-      (xmlDictComputeKey(dict, name, len)) :                        \
-      (((dict)->size == MIN_DICT_SIZE) ?                            \
-       xmlDictComputeFastQKey(prefix, name, len) :                  \
-                           xmlDictComputeBigKey(name, len, len))))

    the problem is that basically if you compute the key for a QName
as "a:b" you can get 2 different answers, one if you accessed it using
"a:b" directly and hence xmlDictComputeKey() or if using "a" prefix and
"b" name, given the algorithm the key are not the same, and it is a key
property of the dictionary to always return the same exact pointer for
the same string. This breaks that property.
  The old algorithm allows to make that computation without building the
assembled full string. In the new one that's harder (doable but harder)
I'm afraid there would be quite a bit of nastyness implementing this as
the new algorithm really like to access things 4bytes by 4bytes, we risk
unaligned accesses, and a very complex algorithm to build this.

  I will try to look at this again tomorrow but that's not trivial.

Daniel


-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]