
comp.lang.ruby

Hash optimization question

Roger Pack

9/3/2008 5:31:00 AM

Are there any libraries that overcome this problem:

(slow example)

>> Benchmark.measure { a = {:abc => :def}; 100_000.times { a.each{}}}
=> #<... @real=0.113940954208374, @utime=0.0999999999999999,
@cstime=0.0>

(fast example)

>> Benchmark.measure { a = {:abc => :def}.to_a; 100_000.times { a.each{}}}
=> #<...@real=0.0591599941253662, @utime=0.0600000000000001,
@cstime=0.0>


I.e., are there libraries that optimize hashes so that their .each_xxx
methods run as fast as Array's?
My gut instinct is that this is a problem that could be overcome with
some hacking in the core, but I thought I'd ask.

-=R
--
Posted via http://www.ruby-....

10 Answers

Robert Klemme

9/3/2008 9:24:00 AM


2008/9/3 Roger Pack <rogerpack2005@gmail.com>:
> Are there any libraries that overcome this problem:
>
> (slow example)
>
>>> Benchmark.measure { a = {:abc => :def}; 100_000.times { a.each{}}}
> => #<... @real=0.113940954208374, @utime=0.0999999999999999,
> @cstime=0.0>
>
> (fast example)
>
>>> Benchmark.measure { a = {:abc => :def}.to_a; 100_000.times { a.each{}}}
> => #<...@real=0.0591599941253662, @utime=0.0600000000000001,
> @cstime=0.0>
>
>
> I.e., are there libraries that optimize hashes so that their .each_xxx
> methods run as fast as Array's?
> My gut instinct is that this is a problem that could be overcome with
> some hacking in the core, but I thought I'd ask.

No idea, you would have to look at the sources.

A few thoughts nevertheless: if you have to do the Array conversion
prior to every iteration, the numbers suffer dramatically:

irb(main):003:0> Benchmark.measure { a = {:abc => :def}.to_a;
1_000_000.times { a.each{}}}
=> #<Benchmark::Tms:0x7fed1744 @label="", @real=0.641000032424927,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.641, @total=0.641>

irb(main):004:0> Benchmark.measure { a = {:abc => :def};
1_000_000.times { a.each{}}}
=> #<Benchmark::Tms:0x10013be0 @label="", @real=2.58999991416931,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=2.578, @total=2.578>

irb(main):005:0> Benchmark.measure { a = {:abc => :def};
1_000_000.times { a.to_a.each{}}}
=> #<Benchmark::Tms:0x7fee2184 @label="", @real=4.41000008583069,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=4.422, @total=4.422>

In other words: if the speedup relies on building an Array-like
structure on each iteration, iterating will be dramatically slower -
unless you want to spend the added space of a permanent Array
representation and add the logic to keep it in sync with changes.

I haven't looked into Ruby's Hash iteration in detail, but it might be
slower because of a general property of hash tables: they usually have
more buckets than the number of elements in the hash, to allow for
efficient hashing. If there are no links between hash entries (whose
maintenance, I believe, would increase insertion overhead), iteration
has to visit every bucket, including those with no data in them. That
might account for the slower runtimes.
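
[Editor's sketch: if that is right, the effect should be visible by
comparing a fresh one-entry hash against one that grew large and then
shrank - assuming MRI 1.8 does not shrink the bucket array when entries
are deleted:

require 'benchmark'

fresh = { 1 => 1 }

shrunk = {}
100_000.times { |i| shrunk[i] = i }
shrunk.delete_if { |k, v| k != 1 }  # one entry left, many empty buckets

Benchmark.bm(8) do |bm|
  bm.report("fresh")  { 1_000.times { fresh.each {} } }
  bm.report("shrunk") { 1_000.times { shrunk.each {} } }
end

If iteration has to walk empty buckets, the second report should be much
slower even though both hashes hold a single pair.]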

I guess it will be difficult to get both: fast hash lookup and
insertion on one hand and fast iteration on the other hand. Is it
worth the effort? I don't know. This depends on your application.
Where is this a problem for you? Maybe there is a more efficient data
structure for your particular problem.

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Robert Dober

9/3/2008 1:27:00 PM


On Wed, Sep 3, 2008 at 11:24 AM, Robert Klemme
<shortcutter@googlemail.com> wrote:

> I guess it will be difficult to get both: fast hash lookup and
> insertion on one hand and fast iteration on the other hand. Is it
> worth the effort? I don't know. This depends on your application.
> Where is this a problem for you? Maybe there is a more efficient data
> structure for your particular problem.
Long time since we've been on the same thread, Robert :)
Well indeed I would suggest to anyone having problems with performance
of small hashes to have a look at Array#assoc and #rassoc. The only
problem is I do not know up to which number of pairs this behaves
well. I guess n<20 should be quite safe. Maybe I will find some time
to benchmark this later.
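
[Editor's illustration of the pair-Array approach - assoc scans on the
first element of each pair, rassoc on the second:

pairs = [[:abc, 1], [:def, 2]]
pairs.assoc(:def)   # => [:def, 2]  (linear scan on pair[0])
pairs.rassoc(1)     # => [:abc, 1]  (linear scan on pair[1])
]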
Cheers
Robert
--
"It is truly useful, since it is pretty."

Antoine de Saint-Exupéry

Robert Klemme

9/3/2008 1:58:00 PM


2008/9/3 Robert Dober <robert.dober@gmail.com>:
> On Wed, Sep 3, 2008 at 11:24 AM, Robert Klemme
> <shortcutter@googlemail.com> wrote:
>
>> I guess it will be difficult to get both: fast hash lookup and
>> insertion on one hand and fast iteration on the other hand. Is it
>> worth the effort? I don't know. This depends on your application.
>> Where is this a problem for you? Maybe there is a more efficient data
>> structure for your particular problem.
> Long time since we've been on the same thread, Robert :)

:-)

> Well indeed I would suggest to anyone having problems with performance
> of small hashes to have a look at Array#assoc and #rassoc. The only
> problem is I do not know up to which number of pairs this behaves
> well. I guess n<20 should be quite safe. Maybe I will find some time
> to benchmark this later.

Binary search might be an additional option for small to medium sized
collections if retrieval speed is not paramount but iteration speed is
important. (Btw, do we have binary search in the std lib by now?
This should probably go into Array...)
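
[Editor's note: binary search did reach the std lib later, as
Array#bsearch in Ruby 2.0. A hand-rolled sketch over an Array of
[key, value] pairs sorted by key:

def pair_bsearch(pairs, key)
  lo, hi = 0, pairs.size - 1
  while lo <= hi
    mid = (lo + hi) / 2
    cmp = pairs[mid][0] <=> key
    return pairs[mid] if cmp == 0
    if cmp < 0
      lo = mid + 1
    else
      hi = mid - 1
    end
  end
  nil
end

sorted = [[1, :a], [2, :b], [3, :c]]  # must stay sorted by key
pair_bsearch(sorted, 2)               # => [2, :b] in O(log n)
]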

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Roger Pack

9/4/2008 3:14:00 PM


> I guess it will be difficult to get both: fast hash lookup and
> insertion on one hand and fast iteration on the other hand. Is it
> worth the effort? I don't know. This depends on your application.
> Where is this a problem for you? Maybe there is a more efficient data
> structure for your particular problem.

Yeah I guess it's hard to have the best of both worlds. Hmm. I suppose
that for small hashes which call .each a fix would definitely help, for
large hashes that call .each "a lot" it would probably also be quicker,
and for any hashes which don't call .each it wouldn't be helpful. I
wonder if the slowdown is small enough to make it worth it anyway
[though it would use more RAM too--then again, Ruby isn't known for its
low memory usage] :)

In the example I was thinking of, I was using a hash to parse lines in
a file, à la

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}

string.each_line do |l|
  identifiers.each { |reg, description| do_something if l =~ reg }
end

So... in my particular example I'm only using a hash because of the
clean syntactic look :) [not for the functionality at all].

Rails uses it quite a bit, too, hence my enquiring about it.

I think that for now I'll just write a Hash monkey patch, à la

a = {}
a.fast_each_ify
a[:abc] = :def # elements from here on are now wrapped internally
               # to allow for a quicker .each
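
[Editor's sketch of one possible shape for that patch - fast_each_ify is
the poster's invented name, not an existing API. It caches the pairs in
an Array and invalidates the cache on insertion:

class Hash
  def fast_each_ify
    @pair_cache = nil
    # per-instance override: writing invalidates the cached pair Array
    def self.[]=(key, value)
      @pair_cache = nil
      super
    end
    # per-instance override: iterate the flat Array copy, not the table
    def self.each(&block)
      (@pair_cache ||= to_a).each(&block)
    end
    self
  end
end

Writers not overridden here (delete, merge!, store, ...) would leave the
cache stale - exactly the bookkeeping cost discussed earlier in the
thread.]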

Thanks!
-=R
--
Posted via http://www.ruby-....

Robert Klemme

9/4/2008 3:42:00 PM


2008/9/4 Roger Pack <rogerpack2005@gmail.com>:
>> I guess it will be difficult to get both: fast hash lookup and
>> insertion on one hand and fast iteration on the other hand. Is it
>> worth the effort? I don't know. This depends on your application.
>> Where is this a problem for you? Maybe there is a more efficient data
>> structure for your particular problem.
>
> Yeah I guess it's hard to have the best of both worlds. Hmm. I suppose
> that for small hashes which call .each a fix would definitely help, for
> large hashes that call .each "a lot" it would probably also be quicker,
> and for any hashes which don't call .each it wouldn't be helpful. I
> wonder if the slowdown is small enough to make it worth it anyway
> [though it would use more RAM too--then again, Ruby isn't known for its
> low memory usage] :)

The critical issue is not (only) the size of the Hash but rather the
frequency of change. Personally, since a Hash is primarily for
efficient lookups, I would not want to invest in any iteration speedup
if there was a chance of a lookup and insertion slowdown. If you are
iterating through a Hash all the time then you are probably not using
the proper data structure.

> In the example I was thinking of, I was using a hash to parse lines in a
> file
>
> identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}
>
> string.each_line do |l|
>   identifiers.each { |reg, description| do_something if l =~ reg }
> end
>
> So... in my particular example I'm only using a hash because of the
> clean syntactic look :) [not for the functionality at all].

Ah!

> I think that for now I'll just write a Hash monkey patcher a la

Why do that? Your identifiers is constant anyway - as far as I can
see. Why not just do

identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a

Cheers

robert

--
use.inject do |as, often| as.you_can - without end

Roger Pack

9/4/2008 5:03:00 PM


> Why do that? Your identifiers is constant anyway - as far as I can
> see. Why not just do
>
> identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a

Yeah--works like a charm.

This makes me wish that there were a data structure that has
object-based O(1) lookup/insertion/deletion, and O(n) .each calls.
Apparently Hash doesn't :)

My theory is that some people use hashes just for the convenience of
object-based lookup [ex: scope[:include] in rails] and not for the real
strength of hashes, which, as you mentioned, is "efficient
lookup/insertion/deletion"
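
[Editor's sketch of a structure with roughly those properties - a Hash
index into a dense Array of pairs. Not an existing library; deletion is
the awkward O(n) case, and as a later reply notes, Ruby 1.9's Hash ends
up offering much the same thing:

class OrderedMap
  def initialize
    @index = {}   # key => position in @pairs, for O(1) lookup
    @pairs = []   # [key, value] pairs in insertion order, for fast #each
  end

  def [](key)
    pos = @index[key]
    pos && @pairs[pos][1]
  end

  def []=(key, value)
    if pos = @index[key]
      @pairs[pos][1] = value
    else
      @index[key] = @pairs.size
      @pairs << [key, value]
    end
  end

  def delete(key)
    pos = @index.delete(key) or return nil
    pair = @pairs.delete_at(pos)
    pos.upto(@pairs.size - 1) { |i| @index[@pairs[i][0]] = i }  # reindex tail
    pair[1]
  end

  def each(&block)
    @pairs.each(&block)  # dense Array, so iteration is Array-fast
  end
end
]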

Thanks!

-=R
--
Posted via http://www.ruby-....

Robert Klemme

9/5/2008 10:42:00 AM


2008/9/4 Roger Pack <rogerpack2005@gmail.com>:
>> Why do that? Your identifiers is constant anyway - as far as I can
>> see. Why not just do
>>
>> identifiers = {/abc/ => 'an abc line', /def/ => 'a def line'}.to_a
>
> Yeah--works like a charm.

Great!

> This makes me wish that there were a data structure that has
> object-based O(1) lookup/insertion/deletion, and O(n) .each calls.
> Apparently Hash doesn't :)

Not so fast! Keep in mind that these figures are *asymptotic*: there
are always constants involved, and it might well be that for the N's we
deal with, the results look very different. That's why we do all the
benchmarking when we want to find out which approach is best. :-)
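
[Editor's sketch of the kind of benchmark that finds the crossover point
between Array#assoc and Hash lookup; the N where the lines cross depends
on the interpreter:

require 'benchmark'

[2, 8, 32, 128].each do |n|
  pairs = (1..n).map { |i| [i, i] }
  hash  = Hash[*pairs.flatten]   # same data as a real Hash
  key   = n                      # worst case for assoc: the last pair
  Benchmark.bm(12) do |bm|
    bm.report("assoc n=#{n}") { 100_000.times { pairs.assoc(key) } }
    bm.report("hash  n=#{n}") { 100_000.times { hash[key] } }
  end
end
]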

> My theory is that some people use hashes just for the convenience of
> object-based lookup [ex: scope[:include] in rails] and not for the real
> strength of hashes, which, as you mentioned, is "efficient
> lookup/insertion/deletion"

I am not sure what you mean by "object-based lookup", especially how
it is not "efficient lookup/insertion/deletion" or differs from that.
I mean, people do use these lookups *because* they are efficient.

> Thanks!

You're welcome.

Cheers

robert


--
use.inject do |as, often| as.you_can - without end

Roger Pack

9/24/2008 9:08:00 PM


> Well indeed I would suggest to anyone having problems with performance
> of small hashes to have a look at Array#assoc and #rassoc. The only
> problem is I do not know up to which number of pairs this behaves
> well. I guess n<20 should be quite safe. Maybe I will find some time
> to benchmark this later.

So our goal is to see if Array#assoc is "about as fast as hash-based
lookup".


>> a = {1 => 2, 2 => 3 }
>> b = [[1, 2], [2, 3]]

>> Benchmark.measure { 1000000.times { a[2] }}
=> #<Benchmark::Tms:0x5d93dc @total=0.26, @cutime=0.0, @label="",
@stime=0.0, @real=0.270273208618164, @utime=0.26, @cstime=0.0>


>> Benchmark.measure { 1000000.times { b.assoc 1 }}
=> #<Benchmark::Tms:0x5b2e1c @total=0.26, @cutime=0.0, @label="",
@stime=0.0, @real=0.264580965042114, @utime=0.26, @cstime=0.0>

Slightly faster than the hash :)


>> Benchmark.measure { 1000000.times { b.assoc 2 }}
=> #<Benchmark::Tms:0x5e4408 @total=0.35, @cutime=0.0, @label="",
@stime=0.01, @real=0.35266900062561, @utime=0.34, @cstime=0.0>

Quite a bit slower - #assoc scans the pairs linearly, so a lookup of
the last key pays for every earlier pair.

So even with only 2 entries #assoc can be slower.

Oh well, we do what we can.
-=R
--
Posted via http://www.ruby-....

Brian Candler

9/25/2008 9:26:00 AM


> I haven't looked into Ruby's Hash iteration in detail, but it might be
> slower because of a general property of hash tables: they usually have
> more buckets than the number of elements in the hash, to allow for
> efficient hashing. If there are no links between hash entries (whose
> maintenance, I believe, would increase insertion overhead), iteration
> has to visit every bucket, including those with no data in them. That
> might account for the slower runtimes.

Aside: ruby-1.9 keeps links between hash entries, exactly for the
purpose of faster iteration. It also has the potentially useful
side-effect that hashes iterate in the same order that the entries were
inserted.

(However, depending on this feature is a good way of writing code which
won't run properly in 1.8)
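
[Editor's demonstration of that ordering:

h = {}
h[:b] = 1
h[:a] = 2
h[:c] = 3
h.keys   # => [:b, :a, :c] on 1.9 (insertion order; unspecified on 1.8)
]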
--
Posted via http://www.ruby-....

Roger Pack

9/26/2008 8:31:00 PM


> Aside: ruby-1.9 keeps links between hash entries, exactly for the
> purpose of faster iteration. It also has the potentially useful
> side-effect that hashes iterate in the same order that the entries were
> inserted.

Interesting.
It appears that whatever the internal linking mechanism is, it still
differs from Array#each speed-wise:

1.8.6 Hash:

>> Benchmark.measure { a = {:a => :b, :c => :d}; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x26c2f8 @total=0.17, @cutime=0.0, @label="",
@stime=0.0, @real=0.173249959945679, @utime=0.17, @cstime=0.0>

1.8.6 Array:
>> Benchmark.measure { a = [[:a, :b], [:c, :d]]; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x5de7c4 @total=0.07, @cutime=0.0, @label="",
@stime=0.0, @real=0.0752890110015869, @utime=0.07, @cstime=0.0>

so 0.17 s vs 0.07 s

and on 1.9 Hash:
>> Benchmark.measure { a = {:a => :b, :c => :d}; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x130efc @label="", @real=0.142009973526001,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.14, @total=0.14>

1.9 Array:
>> Benchmark.measure { a = [[:a, :b], [:c, :d]]; 100_000.times { a.each {} }}
=> #<Benchmark::Tms:0x13379c @label="", @real=0.0471560955047607,
@cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.05, @total=0.05>

so 0.14 s vs 0.05 s

The Hash-to-Array ratio looks much the same on both versions.

-=R
--
Posted via http://www.ruby-....