[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

Re: array comparison

Robert Klemme

10/30/2008 7:48:00 AM

2008/10/30 Chad Perrin <perrin@apotheon.com>:
> I can easily write a program to compare the contents of arrays, of
> course. Ruby's great that way. In a matter of a minute or so, I could
> write a program that compares small numbers of items in a list with small
> numbers of items in another list and give me output that consists of
> things that appear in both, or those that don't appear in both, or those
> that appear in one and not the other.
>
> I find myself contemplating doing much the same thing, but with lists
> that contain millions of entries. I tend to guess that loading each list
> into an array and running a direct comparison of them:
>
> array_1 = [millions of things]
> array_2 = [millions of things]
> array_3 = array1 & array2
>
> . . . would fill up RAM in a hurry and drag system performance on a
> typical desktop computer to a standstill. What sort of approach would
> the expert Ruby hackers suggest for achieving much the same ends without
> taking all week and risking a stack overflow?

First of all I would use Set which is far more efficient for these
amounts. You need to make sure that your items in the Set implement
#hash and #eql? properly (see requirements for Hash keys).

Then, I would write a simple program using Sets and see what happens
before I even start to speculate whether it would work or not. This
program is written in a matter of minutes and if it works you are done
- if not, you may be able to tweak the simple solution to work (e.g.
for example by making sure that there is only one instance of string
values that are frequently seen in the input).

Btw, if you compare two "lists" you only need to keep one in memory to
make some of the checks.

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end

5 Answers

Pit Capitain

10/30/2008 7:55:00 AM

0

2008/10/30 Robert Klemme <shortcutter@googlemail.com>:
> First of all I would use Set which is far more efficient for these
> amounts.

+1

If the performance still isn't adequate, I'd use a database.

Regards,
Pit

William James

10/30/2008 5:35:00 PM

0

On Oct 30, 2:47 am, Robert Klemme <shortcut...@googlemail.com> wrote:
> 2008/10/30 Chad Perrin <per...@apotheon.com>:
>
>
>
> > I can easily write a program to compare the contents of arrays, of
> > course.  Ruby's great that way.  In a matter of a minute or so, I could
> > write a program that compares small numbers of items in a list with small
> > numbers of items in another list and give me output that consists of
> > things that appear in both, or those that don't appear in both, or those
> > that appear in one and not the other.
>
> > I find myself contemplating doing much the same thing, but with lists
> > that contain millions of entries.  I tend to guess that loading each list
> > into an array and running a direct comparison of them:
>
> >  array_1 = [millions of things]
> >  array_2 = [millions of things]
> >  array_3 = array1 & array2
>
> > . . . would fill up RAM in a hurry and drag system performance on a
> > typical desktop computer to a standstill.  What sort of approach would
> > the expert Ruby hackers suggest for achieving much the same ends without
> > taking all week and risking a stack overflow?

Array intersection is pretty efficient in Ruby.

N = 4_000_000
a = Array.new(N){ rand N }.uniq
b = Array.new(N){ rand N }.uniq
p N, a.size, b.size

time = Time.now
ha = {}
a.each{|x| ha[x] = true }
hb = {}
b.each{|x| hb[x] = true }
intersect = []
ha.each_key{|x| intersect << x if hb.include? x}
puts intersect.size
print Time.now - time, " seconds\n"

time = Time.now
puts (a & b).size
print Time.now - time, " seconds\n"


--- output ---
4000000
2528918
2527239
1597752
12.64 seconds
1597752
5.063 seconds

Robert Klemme

10/30/2008 6:32:00 PM

0

On 30.10.2008 18:35, William James wrote:
> On Oct 30, 2:47 am, Robert Klemme <shortcut...@googlemail.com> wrote:
>> 2008/10/30 Chad Perrin <per...@apotheon.com>:
>>
>>
>>
>>> I can easily write a program to compare the contents of arrays, of
>>> course. Ruby's great that way. In a matter of a minute or so, I could
>>> write a program that compares small numbers of items in a list with small
>>> numbers of items in another list and give me output that consists of
>>> things that appear in both, or those that don't appear in both, or those
>>> that appear in one and not the other.
>>> I find myself contemplating doing much the same thing, but with lists
>>> that contain millions of entries. I tend to guess that loading each list
>>> into an array and running a direct comparison of them:
>>> array_1 = [millions of things]
>>> array_2 = [millions of things]
>>> array_3 = array1 & array2
>>> . . . would fill up RAM in a hurry and drag system performance on a
>>> typical desktop computer to a standstill. What sort of approach would
>>> the expert Ruby hackers suggest for achieving much the same ends without
>>> taking all week and risking a stack overflow?
>
> Array intersection is pretty efficient in Ruby.
>
> N = 4_000_000
> a = Array.new(N){ rand N }.uniq
> b = Array.new(N){ rand N }.uniq
> p N, a.size, b.size
>
> time = Time.now
> ha = {}
> a.each{|x| ha[x] = true }
> hb = {}
> b.each{|x| hb[x] = true }
> intersect = []
> ha.each_key{|x| intersect << x if hb.include? x}
> puts intersect.size
> print Time.now - time, " seconds\n"
>
> time = Time.now
> puts (a & b).size
> print Time.now - time, " seconds\n"
>
>
> --- output ---
> 4000000
> 2528918
> 2527239
> 1597752
> 12.64 seconds
> 1597752
> 5.063 seconds

The comparison is a bit unfair since you include the build time for the
structure in case of Hash but not for the Array. You should at least
also compare just the intersection time. And while you're at it you can
also add Set to the mix. :-)

Kind regards

robert

Sarcar, Shourya C (GE Healthcare)

10/31/2008 10:22:00 AM

0


> -----Original Message-----
> From: Robert Klemme [mailto:shortcutter@googlemail.com]=20
> Sent: Friday, October 31, 2008 12:04 AM
> To: ruby-talk ML
> Subject: Re: array comparison
>=20
> The comparison is a bit unfair since you include the build=20
> time for the structure in case of Hash but not for the Array.=20
> You should at least also compare just the intersection time.=20
> And while you're at it you can also add Set to the mix. :-)


Interesting discussion.

I wrote a small program to compare Array intersection to Set
intersection.
I noticed that for smaller sets (~10 ** 3), Array is much faster that
Set, sometimes upto 5x.
As the set size grows, they seem to converge (~10 ** 7, 100 million)

# Program to compare timings for intersections
require 'set'

n =3D 10000
while (n <=3D 10 ** 7) do=20
c =3D Array.new(n) {rand n}.uniq
d =3D Array.new(n) {rand n}.uniq

# Native array intersection
t0 =3D Time.now
cd =3D c & d
t1 =3D Time.now
printf("Array | n =3D %d : %d intersects found in %f
seconds\n",n,cd.size,t1-t0)
=09
# Convert array to Set objects and perform intersection
c =3D c.to_set
d =3D d.to_set
t2 =3D Time.now
cd =3D c & d
t3 =3D Time.now
printf("Set | n =3D %d : %d intersects found in %f
seconds\n",n,cd.size,t3-t2)
printf("Ratio of Set/Array intersect time =3D
%.2f\n",(t3-t2)/(t1-t0))

n =3D 2 * n
end


=3D=3D OUTPUT =3D=3D [Dell Latitude 620 , Win XP SP2, 2 GB RAM, Intel =
Core2 CPU
@ 1.66 GHz, normal loading]

D:\Shourya\DEV\RUBY\MISC>ruby intersect.rb
Array | n =3D 10000 : 3980 intersects found in 0.000000 seconds
Set | n =3D 10000 : 3980 intersects found in 0.016000 seconds
Ratio of Set/Array intersect time =3D Inf
Array | n =3D 20000 : 7956 intersects found in 0.000000 seconds
Set | n =3D 20000 : 7956 intersects found in 0.032000 seconds
Ratio of Set/Array intersect time =3D Inf
Array | n =3D 40000 : 15940 intersects found in 0.016000 seconds
Set | n =3D 40000 : 15940 intersects found in 0.078000 seconds
Ratio of Set/Array intersect time =3D 4.88
Array | n =3D 80000 : 31835 intersects found in 0.047000 seconds
Set | n =3D 80000 : 31835 intersects found in 0.125000 seconds
Ratio of Set/Array intersect time =3D 2.66
Array | n =3D 160000 : 63959 intersects found in 0.125000 seconds
Set | n =3D 160000 : 63959 intersects found in 0.281000 seconds
Ratio of Set/Array intersect time =3D 2.25
Array | n =3D 320000 : 128062 intersects found in 0.391000 seconds
Set | n =3D 320000 : 128062 intersects found in 0.562000 seconds
Ratio of Set/Array intersect time =3D 1.44
Array | n =3D 640000 : 255466 intersects found in 0.672000 seconds
Set | n =3D 640000 : 255466 intersects found in 1.156000 seconds
Ratio of Set/Array intersect time =3D 1.72
Array | n =3D 1280000 : 511590 intersects found in 1.905000 seconds
Set | n =3D 1280000 : 511590 intersects found in 2.640000 seconds
Ratio of Set/Array intersect time =3D 1.39
Array | n =3D 2560000 : 1023031 intersects found in 4.124000 seconds
Set | n =3D 2560000 : 1023031 intersects found in 5.296000 seconds
Ratio of Set/Array intersect time =3D 1.28
Array | n =3D 5120000 : 2047355 intersects found in 8.857000 seconds
Set | n =3D 5120000 : 2047355 intersects found in 9.589000 seconds
Ratio of Set/Array intersect time =3D 1.08

--
Shourya
http://www.shouryaliv...




Robert Klemme

10/31/2008 5:19:00 PM

0

On 31.10.2008 11:21, Sarcar, Shourya C (GE Healthcare) wrote:
>> -----Original Message-----
>> From: Robert Klemme [mailto:shortcutter@googlemail.com]
>> Sent: Friday, October 31, 2008 12:04 AM
>> To: ruby-talk ML
>> Subject: Re: array comparison
>>
>> The comparison is a bit unfair since you include the build
>> time for the structure in case of Hash but not for the Array.
>> You should at least also compare just the intersection time.
>> And while you're at it you can also add Set to the mix. :-)
>
>
> Interesting discussion.
>
> I wrote a small program to compare Array intersection to Set
> intersection.
> I noticed that for smaller sets (~10 ** 3), Array is much faster that
> Set, sometimes upto 5x.
> As the set size grows, they seem to converge (~10 ** 7, 100 million)
>
> # Program to compare timings for intersections
> require 'set'
>
> n = 10000
> while (n <= 10 ** 7) do
> c = Array.new(n) {rand n}.uniq
> d = Array.new(n) {rand n}.uniq
>
> # Native array intersection
> t0 = Time.now
> cd = c & d
> t1 = Time.now
> printf("Array | n = %d : %d intersects found in %f
> seconds\n",n,cd.size,t1-t0)
>
> # Convert array to Set objects and perform intersection
> c = c.to_set
> d = d.to_set
> t2 = Time.now
> cd = c & d
> t3 = Time.now
> printf("Set | n = %d : %d intersects found in %f
> seconds\n",n,cd.size,t3-t2)
> printf("Ratio of Set/Array intersect time =
> %.2f\n",(t3-t2)/(t1-t0))
>
> n = 2 * n
> end

Some things are noteworthy about your program:

- It has a large intersection, i.e. roughly 50% on average. This may
be different for other scenarios.

- You use Fixnums which compare pretty fast, you can expect to reach
break even earlier when using other types.

I have taken the liberty to extend your test program a bit changing
those two parameters mentioned above and you can see significant
differences. :-)

Kind regards

robert


require 'set'

constr = [
lambda do |n|
[
Array.new(n) {rand n}.uniq,
Array.new(n) {rand(n) + (n*0.9).to_i}.uniq
]
end,
lambda do |n|
[
Array.new(n) {"#{rand n}" << ("*" * 100_000)}.uniq,
Array.new(n) {"#{rand(n) + (n*0.9).to_i}" << ("*" * 100_000)}.uniq
]
end
]

2.times do |run|
puts "=== run #{run + 1} ========================================================"
constr.each_with_index do |creat, idx|
puts "--- construct #{idx} --------------------------------------------------------"
n = 1
loop do
c, d = creat[n]

# Native array intersection
t0 = Time.now
cd = c & d
t1 = Time.now
printf("Array | n = %d : %d intersects found in %f seconds\n",n,cd.size,t1-t0)

# Convert array to Set objects and perform intersection
c = c.to_set
d = d.to_set
t2 = Time.now
cd = c & d
t3 = Time.now
printf("Set | n = %d : %d intersects found in %f seconds\n",n,cd.size,t3-t2)

ratio = (t3-t2)/(t1-t0)
printf("Ratio of Set/Array intersect time = %.2f\n",ratio)

n *= 2
break if n > 100 && ratio > 0.1 && ratio < 0.9
end
end
end
=== run 1 ========================================================
--- construct 0 --------------------------------------------------------
Array | n = 1 : 1 intersects found in 0.000000 seconds
Set | n = 1 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 2 : 0 intersects found in 0.000000 seconds
Set | n = 2 : 0 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 4 : 1 intersects found in 0.000000 seconds
Set | n = 4 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 8 : 0 intersects found in 0.000000 seconds
Set | n = 8 : 0 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 16 : 0 intersects found in 0.000000 seconds
Set | n = 16 : 0 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 32 : 1 intersects found in 0.000000 seconds
Set | n = 32 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 64 : 2 intersects found in 0.000000 seconds
Set | n = 64 : 2 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 128 : 3 intersects found in 0.000000 seconds
Set | n = 128 : 3 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 256 : 13 intersects found in 0.000000 seconds
Set | n = 256 : 13 intersects found in 0.001000 seconds
Ratio of Set/Array intersect time = inf
Array | n = 512 : 19 intersects found in 0.000000 seconds
Set | n = 512 : 19 intersects found in 0.001000 seconds
Ratio of Set/Array intersect time = inf
Array | n = 1024 : 38 intersects found in 0.001000 seconds
Set | n = 1024 : 38 intersects found in 0.001000 seconds
Ratio of Set/Array intersect time = 1.00
Array | n = 2048 : 87 intersects found in 0.001000 seconds
Set | n = 2048 : 87 intersects found in 0.003000 seconds
Ratio of Set/Array intersect time = 3.00
Array | n = 4096 : 147 intersects found in 0.002000 seconds
Set | n = 4096 : 147 intersects found in 0.005000 seconds
Ratio of Set/Array intersect time = 2.50
Array | n = 8192 : 344 intersects found in 0.004000 seconds
Set | n = 8192 : 344 intersects found in 0.010000 seconds
Ratio of Set/Array intersect time = 2.50
Array | n = 16384 : 674 intersects found in 0.008000 seconds
Set | n = 16384 : 674 intersects found in 0.021000 seconds
Ratio of Set/Array intersect time = 2.62
Array | n = 32768 : 1353 intersects found in 0.015000 seconds
Set | n = 32768 : 1353 intersects found in 0.046000 seconds
Ratio of Set/Array intersect time = 3.07
Array | n = 65536 : 2605 intersects found in 0.040000 seconds
Set | n = 65536 : 2605 intersects found in 0.096000 seconds
Ratio of Set/Array intersect time = 2.40
Array | n = 131072 : 5278 intersects found in 0.125000 seconds
Set | n = 131072 : 5278 intersects found in 0.203000 seconds
Ratio of Set/Array intersect time = 1.62
Array | n = 262144 : 10610 intersects found in 0.327000 seconds
Set | n = 262144 : 10610 intersects found in 0.416000 seconds
Ratio of Set/Array intersect time = 1.27
Array | n = 524288 : 21003 intersects found in 0.869000 seconds
Set | n = 524288 : 21003 intersects found in 0.827000 seconds
Ratio of Set/Array intersect time = 0.95
Array | n = 1048576 : 41912 intersects found in 2.059000 seconds
Set | n = 1048576 : 41912 intersects found in 1.674000 seconds
Ratio of Set/Array intersect time = 0.81
--- construct 1 --------------------------------------------------------
Array | n = 1 : 1 intersects found in 0.002000 seconds
Set | n = 1 : 1 intersects found in 0.001000 seconds
Ratio of Set/Array intersect time = 0.50
Array | n = 2 : 1 intersects found in 0.003000 seconds
Set | n = 2 : 1 intersects found in 0.003000 seconds
Ratio of Set/Array intersect time = 1.00
Array | n = 4 : 0 intersects found in 0.003000 seconds
Set | n = 4 : 0 intersects found in 0.001000 seconds
Ratio of Set/Array intersect time = 0.33
Array | n = 8 : 0 intersects found in 0.011000 seconds
Set | n = 8 : 0 intersects found in 0.002000 seconds
Ratio of Set/Array intersect time = 0.18
Array | n = 16 : 2 intersects found in 0.013000 seconds
Set | n = 16 : 2 intersects found in 0.008000 seconds
Ratio of Set/Array intersect time = 0.62
Array | n = 32 : 1 intersects found in 0.026000 seconds
Set | n = 32 : 1 intersects found in 0.009000 seconds
Ratio of Set/Array intersect time = 0.35
Array | n = 64 : 3 intersects found in 0.054000 seconds
Set | n = 64 : 3 intersects found in 0.020000 seconds
Ratio of Set/Array intersect time = 0.37
=== run 2 ========================================================
--- construct 0 --------------------------------------------------------
Array | n = 1 : 1 intersects found in 0.000000 seconds
Set | n = 1 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 2 : 1 intersects found in 0.000000 seconds
Set | n = 2 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 4 : 1 intersects found in 0.000000 seconds
Set | n = 4 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 8 : 1 intersects found in 0.000000 seconds
Set | n = 8 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 16 : 1 intersects found in 0.000000 seconds
Set | n = 16 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 32 : 1 intersects found in 0.000000 seconds
Set | n = 32 : 1 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 64 : 5 intersects found in 0.000000 seconds
Set | n = 64 : 5 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 128 : 7 intersects found in 0.000000 seconds
Set | n = 128 : 7 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 256 : 9 intersects found in 0.000000 seconds
Set | n = 256 : 9 intersects found in 0.000000 seconds
Ratio of Set/Array intersect time = nan
Array | n = 512 : 24 intersects found in 0.000000 seconds
Set | n = 512 : 24 intersects found in 0.001000 seconds
Ratio of Set/Array intersect time = inf
Array | n = 1024 : 47 intersects found in 0.000000 seconds
Set | n = 1024 : 47 intersects found in 0.001000 seconds
Ratio of Set/Array intersect time = inf
Array | n = 2048 : 79 intersects found in 0.000000 seconds
Set | n = 2048 : 79 intersects found in 0.003000 seconds
Ratio of Set/Array intersect time = inf
Array | n = 4096 : 178 intersects found in 0.002000 seconds
Set | n = 4096 : 178 intersects found in 0.004000 seconds
Ratio of Set/Array intersect time = 2.00
Array | n = 8192 : 322 intersects found in 0.004000 seconds
Set | n = 8192 : 322 intersects found in 0.010000 seconds
Ratio of Set/Array intersect time = 2.50
Array | n = 16384 : 650 intersects found in 0.008000 seconds
Set | n = 16384 : 650 intersects found in 0.022000 seconds
Ratio of Set/Array intersect time = 2.75
Array | n = 32768 : 1326 intersects found in 0.015000 seconds
Set | n = 32768 : 1326 intersects found in 0.046000 seconds
Ratio of Set/Array intersect time = 3.07
Array | n = 65536 : 2626 intersects found in 0.041000 seconds
Set | n = 65536 : 2626 intersects found in 0.095000 seconds
Ratio of Set/Array intersect time = 2.32
Array | n = 131072 : 5279 intersects found in 0.129000 seconds
Set | n = 131072 : 5279 intersects found in 0.205000 seconds
Ratio of Set/Array intersect time = 1.59
Array | n = 262144 : 10491 intersects found in 0.334000 seconds
Set | n = 262144 : 10491 intersects found in 0.403000 seconds
Ratio of Set/Array intersect time = 1.21
Array | n = 524288 : 21031 intersects found in 1.112000 seconds
Set | n = 524288 : 21031 intersects found in 0.814000 seconds
Ratio of Set/Array intersect time = 0.73
--- construct 1 --------------------------------------------------------
Array | n = 1 : 1 intersects found in 0.001000 seconds
Set | n = 1 : 1 intersects found in 0.002000 seconds
Ratio of Set/Array intersect time = 2.00
Array | n = 2 : 1 intersects found in 0.003000 seconds
Set | n = 2 : 1 intersects found in 0.002000 seconds
Ratio of Set/Array intersect time = 0.67
Array | n = 4 : 1 intersects found in 0.004000 seconds
Set | n = 4 : 1 intersects found in 0.002000 seconds
Ratio of Set/Array intersect time = 0.50
Array | n = 8 : 0 intersects found in 0.007000 seconds
Set | n = 8 : 0 intersects found in 0.002000 seconds
Ratio of Set/Array intersect time = 0.29
Array | n = 16 : 0 intersects found in 0.013000 seconds
Set | n = 16 : 0 intersects found in 0.005000 seconds
Ratio of Set/Array intersect time = 0.38
Array | n = 32 : 1 intersects found in 0.024000 seconds
Set | n = 32 : 1 intersects found in 0.009000 seconds
Ratio of Set/Array intersect time = 0.37
Array | n = 64 : 3 intersects found in 0.055000 seconds
Set | n = 64 : 3 intersects found in 0.020000 seconds
Ratio of Set/Array intersect time = 0.36