Re: hash table freeze is useless

On Fri, 5 Nov 1999, Derek Simkowiak wrote:

> > > I believe the intent is so you can freeze a hashtable that is 
> > > large, delete half the items, insert a bunch of new items
> > > without resizing the table extraneously.
> [...]
> > Ah yes, it works for that. I do think the function is intended (or at
> > least documented) to work in the case where the table was not already
> > large, however. And even in the above case it breaks if you insert more
> > items than you delete.
> 	Why does g_hash_table_insert perform a lookup (by calling
> g_hash_table_lookup_node)?
> 	It does so because if the node already exists, then we simply want
> to overwrite the value (of the key/value pair) that is already there.  As
> the source says:

On a different topic, this is a waste for the sake of convenience. It means
that every insertion of a *new* key is at least as expensive as a search for a
non-existent node, which requires traversing a chain all the way to the end.
The task of checking for a duplicate could be delegated to the application,
which could then replace the data value if a matching node is found, else
allocate a new one.  Hrm. It's a matter of interface religion, I suppose.
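For instance, the delegation could look something like this (a hypothetical sketch, nothing like the actual GLib API): the table exposes a raw chain insert that trusts the caller to have already done any duplicate check, so inserting a known-new key is O(1) instead of a full chain walk.

```c
/* Hypothetical sketch of the "delegate duplicate checking to the
 * application" idea; not GLib's API. */
#include <stddef.h>

struct node {
    struct node *next;
    const char  *key;
    void        *value;
};

/* Prepend without searching the chain: the caller guarantees the key
 * is not already present (e.g. because it just did a failed lookup). */
static void chain_insert_new(struct node **chain, struct node *n)
{
    n->next = *chain;
    *chain = n;
}
```

The application that wants replace-on-duplicate semantics does lookup first and overwrites the value itself; the application that knows the key is fresh skips the search entirely.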

>   if (*node)
>     {
>       /* do not reset node->key in this place, keeping
>        * the old key might be intended.
>        * a g_hash_table_remove/g_hash_table_insert pair
>        * can be used otherwise.
>        *
>        * node->key = key; */
>       (*node)->value = value;
>     }
> 	...but if the node does *not* exist, then we must create a new
> node with g_hash_node_new.  And after adding a new node, we should see if
> the hash table needs resizing:
>   else
>     {
>       *node = g_hash_node_new (key, value);
>       hash_table->nnodes++;
>       if (!hash_table->frozen)
>         g_hash_table_resize (hash_table);

Ah, so if a table was just thawed after massive deletion, an insertion could,
paradoxically, trigger shrinkage. Thus this algorithm has to check for both
possibilities (growth or shrinkage) after each insert or delete operation.

>     }  
> 	Notice that the g_hash_table_resize only gets called if the table
> is not frozen.  g_hash_table_resize does a few calculations to determine
> whether or not a resizing of the table is actually necessary.  Here
> are those calculations:
>   if ((nodes_per_list > 0.3 || hash_table->size <= HASH_TABLE_MIN_SIZE) &&
>       (nodes_per_list < 3.0 || hash_table->size >= HASH_TABLE_MAX_SIZE))
>     return;
> 	When the table is frozen, you save those calculations (and any
> actual resizing that would occur) every time you insert an item.

Why do they do all the floating point math? The trigger load factors could be
converted to integers each time a shrinkage or growth occurs.  Then to decide
whether to grow or shrink, all you need is to compare the number of nodes
against the integral trigger bounds. 

The nodes_per_list is simply the load factor, or nodes / chains. Thus

	nodes_per_list > 0.3

can be rewritten as

	nodes / chains > 0.3

Then multiplying both sides by chains yields

	nodes > 0.3 * chains

and of course, the right hand side does not vary with insertions and
deletions, since the number of chains changes only during a growth or
shrinking! Thus the right hand side can be pre-computed into an integer.
So you never have to compute or maintain nodes_per_list (the load factor) at all.

My current implementation uses load factors of 0.5 and 2.0: a load greater than
or equal to 2.0 triggers growth, and a load of 0.5 or less triggers shrinking.
The table size, and hence the trigger bounds, are always powers of two, so they
are simply shifted left or right during growth and shrinking.
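To illustrate, here is a minimal sketch of what I mean (hypothetical names, not my actual code or GLib's): the integer bounds are recomputed only when the table resizes, so the hot insert/delete path compares two integers and never touches floating point.

```c
/* Sketch: pre-computed integer resize triggers for a power-of-two
 * table, assuming load factors 0.5 (shrink) and 2.0 (grow). */
#include <stddef.h>

struct htab {
    size_t size;       /* number of chains, always a power of two */
    size_t nnodes;     /* number of stored nodes */
    size_t grow_at;    /* pre-computed: 2 * size */
    size_t shrink_at;  /* pre-computed: size / 2 */
};

/* Called only when the table actually resizes, so the trigger bounds
 * are just shifted, never computed with floating point. */
static void set_triggers(struct htab *t)
{
    t->grow_at   = t->size << 1;  /* load factor 2.0 */
    t->shrink_at = t->size >> 1;  /* load factor 0.5 */
}

static int needs_grow(const struct htab *t)
{
    return t->nnodes >= t->grow_at;
}

static int needs_shrink(const struct htab *t, size_t min_size)
{
    return t->size > min_size && t->nnodes <= t->shrink_at;
}
```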

I also resize the table in place.

> 	When the table is thawed, the table is resized one time only.  A
> resize involves going through this loop:
>   for (i = 0; i < hash_table->size; i++)
>     for (node = hash_table->nodes[i]; node; node = next)
>       {
>         next = node->next;
>         hash_val = (* hash_table->hash_func) (node->key) % new_size;

This could be improved by storing the hash value in each node instead of
calling the hashing function again. That would speed up not only grow and
shrink operations, but insert and lookup operations too.

If you have the hash value stored in each node, you can use it to quickly
reject keys that do not match, simply by comparing hash values. A failed
search will then often perform no key comparisons at all, and a successful
search will usually perform only one, to verify the match. The rare
exceptions occur when two distinct keys happen to collide on the full hash
value (a false positive).
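A sketch of such a chain walk, assuming a hypothetical node layout that caches the full (unmodded) hash (this is not what GLib 1.x actually stores):

```c
/* Sketch: chain lookup with a cached full hash value per node. */
#include <stddef.h>
#include <string.h>

struct node {
    struct node *next;
    unsigned     hash;   /* full hash value, cached at insert time */
    const char  *key;
    void        *value;
};

/* The cheap integer compare rejects almost every non-matching node
 * before the (possibly expensive) key comparison runs. */
static struct node *chain_lookup(struct node *chain, unsigned hash,
                                 const char *key)
{
    for (struct node *n = chain; n; n = n->next)
        if (n->hash == hash && strcmp(n->key, key) == 0)
            return n;
    return NULL;
}
```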

Of course, these two advantages must be weighed against the extra storage per node.

>       node->next = new_nodes[hash_val];
>       new_nodes[hash_val] = node;

The resizing could actually be done in place with a little cleverness, if the
table sizes are constrained to powers of two.  Note that when you double the
size of a table, you expose one extra bit in each hash value.
What happens is that the nodes from each chain either stay in the same
chain C, or they go to chain N + C, where N is the old table size.
(Let's call this the cousin chain).

So you can grow the table in place (at least potentially) using realloc(),
and then do this chain splitting using the newly exposed bit as a
discriminator: zero means the node stays put, one means it moves to the
cousin chain in the upper half of the table.
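A sketch of that splitting step, again assuming power-of-two sizes and a cached hash per node (hypothetical layout, not GLib's):

```c
/* Sketch: double a power-of-two table in place.  Nodes whose newly
 * exposed hash bit is 0 stay in chain C; nodes with bit 1 move to the
 * cousin chain C + old_size.  No key is examined and the hash function
 * is never called. */
#include <stdlib.h>

struct node {
    struct node *next;
    unsigned     hash;   /* full hash value, cached at insert time */
};

static struct node **grow_in_place(struct node **tab, size_t old_size)
{
    size_t new_size = old_size * 2;
    struct node **t = realloc(tab, new_size * sizeof *t);
    if (!t)
        return NULL;
    for (size_t c = 0; c < old_size; c++) {
        struct node *stay = NULL, *move = NULL, *next;
        for (struct node *n = t[c]; n; n = next) {
            next = n->next;
            if (n->hash & old_size) {        /* the newly exposed bit */
                n->next = move; move = n;
            } else {
                n->next = stay; stay = n;
            }
        }
        t[c] = stay;
        t[c + old_size] = move;              /* the cousin chain */
    }
    return t;
}
```

Note that splitting this way reverses the order within each chain, which a hash table does not care about.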

Shrinking is then nice too. Merge cousin chains from the upper and lower
half of the table, so that the upper half of the table is vacated. Then realloc
the table to half the original size. Note that you never have to examine the
keys (never mind calling the hashing function). Depending on the chain
implementation, the merge might even be done without tail chasing.
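And the corresponding merge for shrinking might look like this (same hypothetical layout; this particular version does chase the tail, which a tail pointer per chain would avoid):

```c
/* Sketch: halve a power-of-two table in place.  Each upper-half cousin
 * chain is spliced onto its lower-half partner, vacating the upper
 * half, and then the array is shrunk with realloc.  Again no key is
 * examined and the hash function is never called. */
#include <stdlib.h>

struct node {
    struct node *next;
    unsigned     hash;
};

static struct node **shrink_in_place(struct node **tab, size_t old_size)
{
    size_t new_size = old_size / 2;
    for (size_t c = 0; c < new_size; c++) {
        struct node *upper = tab[c + new_size];
        if (upper) {
            /* Walk to the upper chain's tail and hang the lower chain
             * off it; order within a chain does not matter. */
            struct node *tail = upper;
            while (tail->next)
                tail = tail->next;
            tail->next = tab[c];
            tab[c] = upper;
        }
    }
    struct node **t = realloc(tab, new_size * sizeof *t);
    return t ? t : tab;   /* a shrinking realloc should not fail */
}
```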

Anyway, those are just my rants based on the presented code snippets. ;)
