Re: hash table freeze is useless



Havoc Pennington <hp@redhat.com> wrote:

> I was just noticing that g_hash_table_insert() performs a hash lookup.  
> This means that if you freeze the hash table, then insert a lot of items,
> This means that if you freeze the hash table, then insert a lot of items,
> the performance of g_hash_table_insert() will degenerate into O(n)
> over time, so your insert-items loop will have O(n^2) performance. Without
> the freeze, it is basically O(n) (with the hash resizes amortized, and
> each insertion constant time).
> 
> For small values of n, it's probably a toss-up; for large values of n,
> freeze() almost certainly causes a performance hit.
> 
> Unless I'm wrong of course, please comment. ;-)
> 
> Recommended solution would be to make freeze/thaw no-ops and say in the
> docs that they are deprecated.
> 
> A better solution than freeze/thaw might be a way to set the expected
> number of items, g_hash_table_resize_for(hash, n_items) which would resize
> the table in advance to accommodate that number of additional items, saving 
> n_items re-insertions and perhaps reducing the total number of resizes.

Did anybody ever notice a performance problem when inserting
or removing a large number of items without freezing the
table?  I believe that there won't be any performance
problem and that table freeze is indeed useless.
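
(If anyone wants to check this, something like the following
micro-benchmark should do.  It is only a sketch: it assumes
GLib 1.x, where g_hash_table_freeze() and g_hash_table_thaw()
still exist, and the item count N is arbitrary.)

  #include <stdio.h>
  #include <glib.h>

  #define N 100000

  int
  main (void)
  {
    GHashTable *table;
    GTimer *timer = g_timer_new ();
    gint i;

    /* Insert without freeze: the table resizes as it grows. */
    table = g_hash_table_new (g_direct_hash, NULL);
    g_timer_start (timer);
    for (i = 1; i <= N; i++)
      g_hash_table_insert (table, GINT_TO_POINTER (i),
                           GINT_TO_POINTER (i));
    printf ("without freeze: %g s\n",
            g_timer_elapsed (timer, NULL));
    g_hash_table_destroy (table);

    /* Insert into a frozen table: no resizes happen, so each
       insert walks an ever longer chain in the small initial
       table. */
    table = g_hash_table_new (g_direct_hash, NULL);
    g_hash_table_freeze (table);
    g_timer_start (timer);
    for (i = 1; i <= N; i++)
      g_hash_table_insert (table, GINT_TO_POINTER (i),
                           GINT_TO_POINTER (i));
    g_hash_table_thaw (table);
    printf ("with freeze:    %g s\n",
            g_timer_elapsed (timer, NULL));
    g_hash_table_destroy (table);

    g_timer_destroy (timer);
    return 0;
  }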

The reason for this claim is that table resizing is done in
exponential steps, i.e., in the current implementation the
table size is tripled when growing and divided by three when
shrinking.  Suppose we insert `n' elements into an empty
table.  There will be about `ld3 n' resizes, where `ld3'
denotes the logarithm to base 3.  The table sizes at the
successive resizes will be 3^1, 3^2, 3^3, 3^4, etc., or in
other words, the total number of items touched during all
resize operations together is

    \sum_{i=1}^{ld3 n} 3^i 
  <
    2 * n

(This calculation neglects that hash table sizes are
required to be prime numbers, but that won't change the
qualitative result, I believe.)
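
(To make the constant explicit: writing k = ld3 n, the
geometric series has the closed form

    \sum_{i=1}^{k} 3^i = (3^{k+1} - 3) / 2 < (3/2) * 3^k = (3/2) * n

so the sum is in fact bounded by 1.5 * n.)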

This means that doing all the intermediate resize operations
is less than twice as much work as doing only a single
resize in the end, or in other words, the last resize will
dominate the overall cost anyway.

Even g_hash_table_resize_for() won't help much.  Compared to 
inserting `n' elements, resizing a table with `n' elements
(the dominant cost of resizing) is relatively cheap.
Resizing is done in a compact nested loop:

  GHashNode **new_nodes;
  GHashNode *node, *next;
  guint hash_val, i;

  /* Allocate the new bucket array and relink every existing
     node into it; no nodes are allocated or freed here. */
  new_nodes = g_new0 (GHashNode*, new_size);
  for (i = 0; i < hash_table->size; i++)
    for (node = hash_table->nodes[i]; node; node = next)
      {
        next = node->next;

        /* Recompute the bucket for the new table size. */
        hash_val = (* hash_table->hash_func) (node->key) % new_size;

        node->next = new_nodes[hash_val];
        new_nodes[hash_val] = node;
      }

In contrast, inserting requires a function call for each
element and allocates the nodes one at a time.  In fact,
resizing may even be beneficial for performance, because the
new bucket array is a single consecutive memory area, which
will usually improve the cache hit rate of subsequent hash
table operations.
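
For illustration, a hypothetical g_hash_table_resize_for()
could be little more than the same loop with a precomputed
size.  The following is a sketch only, not GLib API: it
assumes the private fields (size, nnodes, nodes, hash_func)
used above and the existing g_spaced_primes_closest():

  void
  g_hash_table_resize_for (GHashTable *hash_table,
                           guint       n_items)
  {
    GHashNode **new_nodes;
    GHashNode *node, *next;
    guint new_size, hash_val, i;

    /* Hypothetical: pick a prime size large enough that the
       next n_items insertions trigger no further resizes. */
    new_size = g_spaced_primes_closest (hash_table->nnodes + n_items);
    if (new_size <= hash_table->size)
      return;

    /* The same relinking loop as above. */
    new_nodes = g_new0 (GHashNode*, new_size);
    for (i = 0; i < hash_table->size; i++)
      for (node = hash_table->nodes[i]; node; node = next)
        {
          next = node->next;
          hash_val = (* hash_table->hash_func) (node->key) % new_size;
          node->next = new_nodes[hash_val];
          new_nodes[hash_val] = node;
        }
    g_free (hash_table->nodes);
    hash_table->nodes = new_nodes;
    hash_table->size = new_size;
  }

All it would save over the normal growth path is the
intermediate relinking passes, which the estimate above
already bounds by a small constant times `n'.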

Manuel


