Re: Proposal for a new ghash table



On Mon, 29 Nov 1999, Emmanuel DELOGET wrote:

> > The only problem I see, the way I understand it, is that unsuccessful
> > lookups are now O(log n), so this neat algorithm is paid for by an
> > order increase in one of the hash operations.
>     Not really. If the hash function is good enough to have a O(1) search in
>     a classic hash table, then you'll end up with a O(32) [:)] search in
>     this implementation since you do not have more than 32 tables to search
>     (well, actually you may have more than 32 tables if you have more
>     than 2^32 bytes to store the tables :)

Heh heh. Thus, by this logic, binary trees have constant time lookup too.
Assuming my nodes are 12 bytes <left, right, data>, I can fill up most of the
4GB address space with a full, balanced tree having 2^28-1 nodes. Therefore any
search is O(28), therefore constant time! ;)
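For the record, here is that arithmetic as a throwaway program (my own toy,
nothing to do with glib):

    #include <stdio.h>

    int main(void)
    {
        double space     = 4294967296.0;  /* 2^32 bytes of address space */
        double node_size = 12.0;          /* <left, right, data>         */
        double max_nodes = space / node_size;
        double nodes     = 1.0;
        int    depth     = 0;

        /* a full, balanced tree of depth d holds 2^d - 1 nodes */
        while (nodes * 2.0 - 1.0 <= max_nodes) {
            nodes *= 2.0;
            depth++;
        }
        printf("%.0f nodes fit; deepest full tree: 2^%d - 1 = %.0f nodes\n",
               max_nodes, depth, nodes - 1.0);
        return 0;
    }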

>     By using a base size of 2^5 and reducing the max size to 2^20
>     (these are good limits since you do not have to alloc too many tables

You might just about as well use a balanced binary search tree then.

>     for a small set of nodes, but you still have the possibility to store a
>     large number of nodes in the two or three last tables), you limit
>     the number of searches in the worst case to 16.

The worst case for a search remains bounded by O(n), because it's still
a hash table. ;)
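To make that concrete, here is a toy single-table lookup (my own sketch, not
glib's code); feed it a hash function that dumps every key into one bucket
and a lookup degenerates into a plain list walk over all n entries:

    #include <stddef.h>
    #include <string.h>

    struct node { const char *key; struct node *next; };

    /* pathological hash: every key lands in bucket 0 */
    static unsigned bad_hash(const char *key) { (void) key; return 0; }

    static struct node *lookup(struct node **buckets, unsigned size,
                               const char *key)
    {
        struct node *n = buckets[bad_hash(key) % size];

        while (n != NULL && strcmp(n->key, key) != 0)  /* O(n) worst case */
            n = n->next;
        return n;
    }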

>     Well. We can compare this to the current implementation:

My concern was whether the lazy movement of nodes after a growth is superior
to doing immediate movement, not whether any of this is better than the
current implementation (which is obviously less than state of the art in
some areas).

The conclusion is that the lazy transfer raises the *lower* bound on the time
for an *unsuccessful* lookup from Omega(1) to Omega(log n) (the upper bound
remains O(n)), which could adversely impact the performance of some hash
table uses.
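To spell out where the log n comes from, here is roughly how I picture a
lookup under lazy movement (my own sketch with made-up names, not the
proposed patch): an entry inserted before a growth may still be sitting in
any of the older, smaller tables, so a miss can only be reported after
falling through every level.

    /* Sketch of a lookup over lazily-migrated tables; the names are mine,
     * not the proposal's.  Each growth adds a level, so with doubling
     * there are about log2(n) of them. */
    #include <stddef.h>

    struct node  { unsigned key; struct node *next; };
    struct level { struct node **buckets; unsigned size; };

    static struct node *multi_level_lookup(struct level *levels, int nlevels,
                                           unsigned key)
    {
        int i;

        for (i = nlevels - 1; i >= 0; i--) {   /* newest, largest level first */
            struct node *n = levels[i].buckets[key % levels[i].size];

            for (; n != NULL; n = n->next)
                if (n->key == key)
                    return n;                  /* a hit can stop early */
        }
        return NULL;    /* a miss has walked a chain in every one of the
                         * nlevels (roughly log n) levels */
    }

A hit can bail out at whatever level it finds the key; a miss always pays for
all of them.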

Thus, for instance, the operation of inserting n nodes into a table, with
duplicate checking for each insertion, now has a lower bound of
Omega(n log n) on its running time, which is also the average case
assuming a well-behaved hashing function.

If you are going to pay Omega(n log n) time to construct a dictionary, you
might as well have something to show for it, like having it in sorted order.
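For instance, glib's GTree already buys sorted order for roughly that price.
A quick sketch (my code; whether GTree's balancing behaves well for this
workload is something I have not checked):

    #include <glib.h>
    #include <stdio.h>
    #include <string.h>

    static gint cmp_keys(gconstpointer a, gconstpointer b)
    {
        return strcmp((const char *) a, (const char *) b);
    }

    static gint print_entry(gpointer key, gpointer value, gpointer data)
    {
        (void) data;
        printf("%s -> %s\n", (char *) key, (char *) value);
        return FALSE;                 /* FALSE means keep traversing */
    }

    int main(void)
    {
        GTree *tree = g_tree_new(cmp_keys);

        g_tree_insert(tree, (gpointer) "banana", (gpointer) "yellow");
        g_tree_insert(tree, (gpointer) "apple",  (gpointer) "red");
        g_tree_insert(tree, (gpointer) "cherry", (gpointer) "dark red");

        /* in-order traversal comes out sorted: apple, banana, cherry */
        g_tree_traverse(tree, print_entry, G_IN_ORDER, NULL);
        g_tree_destroy(tree);
        return 0;
    }

The in-order traversal hands the entries back sorted, which is more than the
hash table gives you for the same n log n.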

With immediate movement (nodes are transferred to their proper chain on
each growth) the cost of populating the hash table has an Omega(n) lower
bound rather than Omega(n log n), because a search for a non-existent entry
can be declared unsuccessful after traversing just one chain. That's without
any freezing/thawing, which would actually be detrimental by slowing
down the dupe checks.
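For contrast with the sketch above, with a single live table the dup check
inside an insert is decided after walking one chain (again a toy of mine, not
glib's code); keep the table grown in step with n, which is exactly what
immediate movement does, and the expected cost per insert stays constant:

    #include <stddef.h>

    struct node { unsigned key; struct node *next; };

    /* one live table: a miss is decided after walking a single chain */
    static struct node *single_level_lookup(struct node **buckets,
                                            unsigned size, unsigned key)
    {
        struct node *n;

        for (n = buckets[key % size]; n != NULL; n = n->next)
            if (n->key == key)
                return n;
        return NULL;
    }

    /* insert with a dup check: one chain walk, then a constant-time splice */
    static int insert_unique(struct node **buckets, unsigned size,
                             struct node *fresh, unsigned key)
    {
        if (single_level_lookup(buckets, size, key) != NULL)
            return 0;                          /* duplicate, reject */
        fresh->key = key;
        fresh->next = buckets[key % size];
        buckets[key % size] = fresh;
        return 1;
    }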

You might not think that unsuccessful hash table lookups are important.  But
the fact is that g_hash_insert actually performs a lookup before inserting; the
authors apparently thought that checking for duplicates is so important that it
should be buried in the interface.


