Asp Forum - String Hashing Algorithms

Gavin Kistner

5/10/2005 7:58:00 PM

Summary
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.

Background
=============================================================
At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

As a result, I wanted to do a preliminary investigation into various
string hashing algorithms, to see how limiting that would be. (How
often would we need to rename a bunch of our strings just to get around
collisions?) Given the nature of hashing algorithms, it was important
that this be based on ~real world strings.

Methodology
=============================================================
I wrote a ruby script to scan all our (C++) source code and find ANY
words that were valid C++ identifiers. This also meant that I grabbed
all such words from inside comment blocks, as well as C++ keywords, but
that was OK for this back-of-the-envelope testing. I ended up with
almost 43,000 unique identifiers.

I then ran through the list of words and used 7 different string
hashing algorithms (which I ported from the C code I found on a web
page[1]) to compute different hash keys. I used a Set (per hashing
algorithm) to store these keys, and at the end compared the number of
keys in the Set with the number of identifiers fed in.

The Results
=============================================================
Three of the hashing algorithms ended up with no collisions whatsoever,
and one of these (SDBM) is really simple and fast (in C).

Here is the output of my test:
djb :: 99.8601 percent coverage (60 collisions out of 42884)
elf :: 99.5430 percent coverage (196 collisions out of 42884)
sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
js :: 99.9044 percent coverage (41 collisions out of 42884)
ap :: 100.0000 percent coverage (0 collisions out of 42884)
bkdr :: 99.9953 percent coverage (2 collisions out of 42884)
rs :: 100.0000 percent coverage (0 collisions out of 42884)

The Algorithms
=============================================================
module HashEm
SIGNEDSHORT = 0x7FFFFFFF

def self.rs( str, len=str.length )
a,b = 63689,378551
hash = 0
len.times{ |i|
hash = hash*a + str[i]
a *= b
}
hash & SIGNEDSHORT
end

def self.js( str, len=str.length )
hash = 1315423911
len.times{ |i|
hash ^= ( ( hash << 5 ) + str[i] + ( hash >> 2 ) )
}
hash & SIGNEDSHORT
end

def self.elf( str, len=str.length )
hash = 0
x = 0
len.times{ |i|
hash = (hash << 4) + str[i]
if (x = hash & 0xF0000000) != 0
hash ^= (x >> 24)
hash &= ~x
end
}
hash & SIGNEDSHORT
end

def self.bkdr( str, len=str.length )
seed = 131 # 31 131 1313 13131 131313 etc..
hash = 0
len.times{ |i|
hash = ( hash*seed )+str[i]
}
hash & SIGNEDSHORT
end

def self.sdbm( str, len=str.length )
hash = 0
len.times{ |i|
hash = str[i] + ( hash << 6 ) + ( hash << 16 ) - hash
}
hash & SIGNEDSHORT
end

def self.djb( str, len=str.length )
hash = 5381
len.times{ |i|
hash = ((hash << 5) + hash) + str[i]
}
hash & SIGNEDSHORT
end

def self.ap( str, len=str.length )
hash = 0
len.times{ |i|
if (i & 1) == 0
hash ^= (hash << 7) ^ str[i] ^ (hash >> 3)
else
hash ^= ~( (hash << 11) ^ str[i] ^ (hash >> 5) )
end
}
hash & SIGNEDSHORT
end
end

[1] http://www.code-source.org/algorithm...

15 Answers

Joel VanderWerf

5/10/2005 8:18:00 PM

Phrogz wrote:
> Summary
> =============================================================
> A brief analysis of Ruby versions of a few string hashing algorithms,
> and my conclusions about their efficacy at preventing collisions.

Is ruby's own String#hash one of the ones you tested? If not, how does
it compare?

Gavin Kistner

5/10/2005 9:25:00 PM

Joel VanderWerf wrote:
> Phrogz wrote:
> > A brief analysis of Ruby versions of a few string hashing
algorithms,
> > and my conclusions about their efficacy at preventing collisions.
>
> Is ruby's own String#hash one of the ones you tested? If not, how
does
> it compare?

Because I was interested in C algorithms, I didn't consider it. But, I
just ran it again, and ruby came out with the best of them:

ruby :: 100.0000 percent coverage (0 collisions out of 42884)

Joel VanderWerf

5/10/2005 9:33:00 PM

Phrogz wrote:
> Joel VanderWerf wrote:
>
>>Phrogz wrote:
>>
>>>A brief analysis of Ruby versions of a few string hashing
>
> algorithms,
>
>>>and my conclusions about their efficacy at preventing collisions.
>>
>>Is ruby's own String#hash one of the ones you tested? If not, how
>
> does
>
>>it compare?
>
>
> Because I was interested in C algorithms, I didn't consider it. But, I
> just ran it again, and ruby came out with the best of them:
>
> ruby :: 100.0000 percent coverage (0 collisions out of 42884)
>

It is coded in C, and looking at rb_str_hash() in string.h, I see that
it is one of HASH_ELFHASH, HASH_PERL, or a third (very simple) one.

Gavin Kistner

5/10/2005 10:17:00 PM

How can I figure out which hash it's using? It's not the ELF hash,
since that was the worst performer of the lot. Searching the source for
HASH_PERL I don't see it #defined anywhere, but my C is really, really
rusty. Does this mean that it's using the 'fallback' hash algorithm?
Does it vary by build per platform?

Zed A. Shaw

5/10/2005 10:39:00 PM

On Wed, 2005-05-11 at 05:00 +0900, Phrogz wrote:
>
> Background
> =============================================================
> At work, we have some (young) code which is currently doing a lot of
> string compares, which needs to be really fast. We're about to try
> using hashes of the strings to represent each string instead, but
> instead of a traditional hash table (which 'chains' collisions for the
> same key) we're going to require that no two (different) strings may
> share the same hash key.

Just out of curiosity, but have you considered some other
structures/algorithms which might be alternatives depending on your
usage? Off the top of my head I can think of:

* Trie -- Should find other strings really fast, but gets pretty big the
more strings you need to store. There's a C library for this at
http://www.octavian.org/cs/sof...
* PATRICIA -- Basically a compacted Trie which takes less space.
Couldn't find a library for this one.
* Suffix Array -- I have a Ruby binding for one of the faster C
libraries which I use in FastCST. The big advantage of a Suffix Array
is that you can store it so that you only need to calculate the suffix
array once. A suffix array is really more useful for matching and
exclusion.
* Suffix Tree -- There's really not much of a reason to use suffix trees
these days since newer suffix array construction algorithms are faster.
The main advantage of a suffix tree is that searching for a result can
be faster. The main disadvntages are that they are fat memory pigs.
* Bloom Filter -- These are not as acurate, but they can be fast as all
hell if you just want to match strings with some probability.

Anyway, just thought I'd throw out some alternatives to hash tables for
certain situations. I'd say if you need to match C++ keywords to a
stored index then take a look at the libtst Trie implementation. It's
quite fast for that application. If you just need to see if a certain
word is "included" or "excluded" than a suffix array could do that
really fast. Trick there is to build the suffix array by joining the
strings with a | character (or something else) between them. Nice thing
about a suffix array is that you can build it offline and then just load
it directly for the searching.

Zed A. Shaw
http://www.ze...

Clifford Heath

5/10/2005 11:25:00 PM

Phrogz wrote:
> Summary

Good summary and test, Phrogz.

Don't forget that in many hashing algorithms, the cost of computing
the hash function far outweighs the occasional cost of traversing a
chain. If you need guaranteed lookup time, you need a "perfect hash"
(try Googling on that) but otherwise you just need a fast hash function
with an excellent spread and a low load factor. Neither of those is
hard to achieve.

We use linear hashing (note, this is not the same as linear open
addressing, a much older technique), which achieves excellent
performance with any size hash table. Unfortunately I have no
implementation I can publish.

Clifford Heath.

Yukihiro Matsumoto

5/10/2005 11:59:00 PM

Hi,

In message "Re: String Hashing Algorithms"
on Wed, 11 May 2005 07:20:26 +0900, "Phrogz" <gavin@refinery.com> writes:

|How can I figure out which hash it's using?

We are using the last one, unless specifically defined.

matz.

Charles Mills

5/11/2005 5:16:00 AM

Zed A. Shaw wrote:
> On Wed, 2005-05-11 at 05:00 +0900, Phrogz wrote:
> >
> > Background
> > =============================================================
> > At work, we have some (young) code which is currently doing a lot
of
> > string compares, which needs to be really fast. We're about to try
> > using hashes of the strings to represent each string instead, but
> > instead of a traditional hash table (which 'chains' collisions for
the
> > same key) we're going to require that no two (different) strings
may
> > share the same hash key.
>
> Just out of curiosity, but have you considered some other
> structures/algorithms which might be alternatives depending on your
> usage? Off the top of my head I can think of:
>
> * Trie -- Should find other strings really fast, but gets pretty big
the
> more strings you need to store. There's a C library for this at
> http://www.octavian.org/cs/sof...
> * PATRICIA -- Basically a compacted Trie which takes less space.
> Couldn't find a library for this one.
> * Suffix Array -- I have a Ruby binding for one of the faster C
> libraries which I use in FastCST. The big advantage of a Suffix
Array
> is that you can store it so that you only need to calculate the
suffix
> array once. A suffix array is really more useful for matching and
> exclusion.
> * Suffix Tree -- There's really not much of a reason to use suffix
trees
> these days since newer suffix array construction algorithms are
faster.
> The main advantage of a suffix tree is that searching for a result
can
> be faster. The main disadvntages are that they are fat memory pigs.
> * Bloom Filter -- These are not as acurate, but they can be fast as
all
> hell if you just want to match strings with some probability.
>
> Anyway, just thought I'd throw out some alternatives to hash tables
for
> certain situations. I'd say if you need to match C++ keywords to a
> stored index then take a look at the libtst Trie implementation.
It's
> quite fast for that application. If you just need to see if a
certain
> word is "included" or "excluded" than a suffix array could do that
> really fast. Trick there is to build the suffix array by joining the
> strings with a | character (or something else) between them. Nice
thing
> about a suffix array is that you can build it offline and then just
load
> it directly for the searching.
>
> Zed A. Shaw
> http://www.ze...

You may be interested in this project:
http://www.rubyforge.org/pro...

Has an implementation of something like what you describe as a PATRICIA
above and a sparse hash.

-Charlie

martinus

5/11/2005 6:05:00 AM

> At work, we have some (young) code which is currently doing a lot of
> string compares, which needs to be really fast. We're about to try
> using hashes of the strings to represent each string instead, but
> instead of a traditional hash table (which 'chains' collisions for the
> same key) we're going to require that no two (different) strings may
> share the same hash key.

If you know all possible strings in advance, you might want to generate
a perfect hash:
http://burtleburtle.net/bob/hash/pe...

martinus

Gavin Kistner

5/11/2005 1:08:00 PM

On May 11, 2005, at 12:05 AM, Martin Ankerl wrote:
> If you know all possible strings in advance, you might want to
> generate a perfect hash:
> http://burtleburtle.net/bob/hash/pe...

Ooh, that's quite helpful, thanks!

comp.lang.ruby

String Hashing Algorithms

Gavin Kistner

Joel VanderWerf

Gavin Kistner

Joel VanderWerf

Gavin Kistner

Zed A. Shaw

Clifford Heath

Yukihiro Matsumoto

Charles Mills

martinus

Gavin Kistner

x Login to ForumsZone