
Forums >

comp.lang.ruby

heavy loop functions slow

Michael Linfield

4/8/2008 5:17:00 AM

Alright so I was playing with my large amounts of data and ran into yet
another problem with shoving it into a loop that requires a substantial
amount of memory.



dataArray = []
output = arrayOut.to_s.chop!.split(",")

output1 = output[0..356130]
output2 = output[356131..712260]
output3 = output[712261..1068390]
output4 = output[1068391..1424521]

count = 0
output1.each do |out|
  out = out.to_i
  push = hashRange[out]
  dataArray << push
  count += 1
  puts "#{push} - #{count}" # Testing purposes
end

I broke 'output' up into several blocks for purposes other than just
this loop, but also to see what the effect would be. As you can see,
we're talking about almost 1,500,000 array elements.
-->hashRange is a hash, obviously

The problem: that test line I added, 'puts "#{push} - #{count}"',
confirms that it moves through 1 element every 5-6 sec...
Doing the math, that's about 86 days to finish 1,500,000 elements :(

Any ideas that would speed this up are much appreciated!! Otherwise I'll
be back in 3 months IF I don't get an error :D

Thanks,

- Mac
--
Posted via http://www.ruby-....

12 Answers

Paul McMahon

4/8/2008 6:17:00 AM


Try benchmarking further to find what causes the slowness; isolate the
code that is responsible. Also, without knowing what hashRange and
output contain, it is not obvious where the slowness comes from. For
instance, if hashRange = {} and output = (0..1_000_000).to_a, this code
takes relatively little time to execute.
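Paul's point can be sketched with Ruby's built-in Benchmark module; hash_range and output below are invented stand-ins for the thread's data:

```ruby
require 'benchmark'

# Invented stand-ins for the thread's hashRange and output.
hash_range = Hash[(0...1_000).map { |i| [i, i * 10] }]
output     = (0...100_000).map { |i| (i % 2_000).to_s }

# Time each candidate culprit separately to isolate the slow part.
Benchmark.bm(12) do |bm|
  bm.report("to_i only:")   { output.each { |s| s.to_i } }
  bm.report("hash lookup:") { output.each { |s| hash_range[s.to_i] } }
end
```

With a plain integer-keyed Hash like this, both reports finish in a small fraction of a second, which is Paul's point: the slowness must come from what hashRange and output actually contain.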

Jano Svitok

4/8/2008 7:44:00 AM


On Tue, Apr 8, 2008 at 7:17 AM, Michael Linfield <globyy3000@hotmail.com> wrote:
> Alright so I was playing with my large amounts of data and ran into yet
> another problem with shoving it into a loop that requires a substantial
> amount of memory.
>
>
>
> dataArray = []
> output = arrayOut.to_s.chop!.split(",")

set arrayOut to nil if you don't need it any more.

> output1 = output[0..356130]
> output2 = output[356131..712260]
> output3 = output[712261..1068390]
> output4 = output[1068391..1424521]

You don't need output here; set it to nil to allow for garbage collection.

> count = 0
> output1.each do |out|
> out = out.to_i
> push = hashRange[out]
> dataArray << push
> count+=1
> puts "#{push} - #{count}" #Testing purposes
> end

1. You can convert the output to numbers in one pass, though use
benchmark to see the actual gain:

output = arrayOut.to_s.chop!.split(",").map {|out| out.to_i }

2. If you are looking for numbers only, you can do something like:

output = []
arrayOut.to_s.chop!.scan(/\d+/) {|out| output << out.to_i }

(You can count the items and switch to output2 when output1 has
enough, thus 1. creating smaller arrays, and 2. doing two things in
one step.)

3. Even in this case, you still have both the original arrayOut and
the long string (.to_s) in memory.
It might be faster if you could iterate through the array without
creating the intermediate string. The question is: 1. will it help?
2. Is it worth it?
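Jano's third point — skipping the intermediate string entirely — can be sketched by streaming the file line by line (the temp file here is a small stand-in for the thread's data.txt):

```ruby
require 'tempfile'

# Stand-in for data.txt: comma-separated integers, as described upthread.
file = Tempfile.new('data')
file.write("12345,67423,97567\n45345,11\n")
file.close

# Stream line by line; no single huge string or arrayOut ever exists.
output = []
File.foreach(file.path) do |line|
  line.scan(/\d+/) { |m| output << m.to_i }
end

output  # => [12345, 67423, 97567, 45345, 11]
```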

Robert Klemme

4/8/2008 8:12:00 AM


2008/4/8, Michael Linfield <globyy3000@hotmail.com>:
> Alright so I was playing with my large amounts of data and ran into yet
> another problem with shoving it into a loop that requires a substantial
> amount of memory.
>
>
>
> dataArray = []
> output = arrayOut.to_s.chop!.split(",")
>
> output1 = output[0..356130]
> output2 = output[356131..712260]
> output3 = output[712261..1068390]
> output4 = output[1068391..1424521]
>
> count = 0
> output1.each do |out|
> out = out.to_i
> push = hashRange[out]
> dataArray << push
> count+=1
> puts "#{push} - #{count}" #Testing purposes
> end
>
> I broke 'output' up into several blocks for other purposes than just
> this loop, but also to see what the effect would be. As you can see
> we're talkin about almost 1,500,000 array elements.
> -->hashRange is a hash obviously
>
> Problem being: that test line I added 'puts "#{push} - #{count}"'
> solidifies the fact that it moves through 1 element every 5-6sec...
> After doing my math thats about 86 days to finish 1,500,000 elements :(
>
> Any ideas that would speed this up are much appreciated!! Otherwise I'll
> be back in 3 months IF I dont get an error :D

Obviously there is a lot of code missing from the piece above. Can
you explain, what you are trying to achieve? What is your input file
format and what kind of transformation do you want to do on it? I
looked through your other postings but it did not become clear to me.

Cheers

robert

--
use.inject do |as, often| as.you_can - without end

Michael Linfield

4/8/2008 9:32:00 AM


Robert Klemme wrote:
> 2008/4/8, Michael Linfield <globyy3000@hotmail.com>:
>> output2 = output[356131..712260]
>> end
>> Any ideas that would speed this up are much appreciated!! Otherwise I'll
>> be back in 3 months IF I dont get an error :D
>
> Obviously there is a lot of code missing from the piece above. Can
> you explain, what you are trying to achieve? What is your input file
> format and what kind of transformation do you want to do on it? I
> looked through your other postings but it did not become clear to me.
>
> Cheers
>
> robert

Alright, here's the breakdown of everything.


dataArray = []

# arrayOut consists of all the integer data stored in a text file.
# It's read in via IO.foreach("data.txt") {|x| dataArray << x},
# dataArray being just a predefined array, ie: dataArray = []

output = arrayOut.to_s.chop!.split(",")

# Each of these outputs breaks the huge array down into 4 smaller arrays
output1 = output[0..356130]
output2 = output[356131..712260]
output3 = output[712261..1068390]
output4 = output[1068391..1424521]

# hashRange[out] is basically calling a hash in the following context:
#   hash = { 1 => { 20000..30000 => 12345 } }
# so 'out' is calling the range key that contains its defined value;
# basically it's saying hashRange[25000] #=> 12345, as an example

# Everything imported to dataArray is a string, so it must be converted
# to an integer to correctly match the range key

# After benchmarking some elements of the loop below, it's the
# push = hashRange[out] line that's slowing everything down.
# Every time a nil 'out' is shoved into the query it takes about 8 sec;
# when it's a correct number, about 5 sec

# The hashRange file is about 78 MB, which I had to load in as
# 8 separate data files, then shove those into an eval to convert
# to a hash

count = 0
output1.each do |out|
  out = out.to_i
  push = hashRange[out]
  dataArray << push
  count += 1
  puts "#{push} - #{count}" # Testing purposes
end

# I guess what I need now is a faster way to access this pre-defined hash.
# SQL is one possibility, but that could be considered a whole other
# forum post :)
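For reference, whatever hashRange actually does, the description above points at a classic trap: a plain Hash#[] only matches keys by equality, so matching a number against Range keys needs a scan over every pair, O(n) per lookup. A minimal sketch of that scan, with invented ranges standing in for the 78 MB data set:

```ruby
# Invented stand-in for the thread's range-keyed data.
hash_range = { 12_000..15_000 => 100, 60_000..70_000 => 250 }

# Hash#[] matches keys by equality, not inclusion:
hash_range[12_345]  # => nil

# Matching by inclusion means scanning every pair -- O(n) per lookup.
def range_lookup(ranges, n)
  ranges.each { |range, label| return label if range.include?(n) }
  nil
end

range_lookup(hash_range, 12_345)  # => 100
range_lookup(hash_range, 99)      # => nil
```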

Any other questions, feel free to ask.
You guys' insight is much appreciated.

Thanks again,

- Mac
--
Posted via http://www.ruby-....

Robert Klemme

4/8/2008 12:57:00 PM


2008/4/8, Michael Linfield <globyy3000@hotmail.com>:
> Robert Klemme wrote:
> > 2008/4/8, Michael Linfield <globyy3000@hotmail.com>:
>
> >> output2 = output[356131..712260]
> >> end
>
> >> Any ideas that would speed this up are much appreciated!! Otherwise I'll
> >> be back in 3 months IF I dont get an error :D
> >
> > Obviously there is a lot of code missing from the piece above. Can
> > you explain, what you are trying to achieve? What is your input file
> > format and what kind of transformation do you want to do on it? I
> > looked through your other postings but it did not become clear to me.
> >
> > Cheers
> >
> > robert
>
>
> Alright heres the breakdown of everything.
>
>
> dataArray = []
>
> # arrayOut consist of all integer data stored in a text file.
> # its called upon via IO.foreach("data.txt"){|x| dataArray << x}
> # dataArray being just a predefined array ie: dataArray = []
>
>
> output = arrayOut.to_s.chop!.split(",")
>
>
> #Each of these outputs breaks down this huge array into 4 smaller arrays
>
> output1 = output[0..356130]
> output2 = output[356131..712260]
> output3 = output[712261..1068390]
> output4 = output[1068391..1424521]
>
>
> #hashRange[out] is basically calling a hash in the following context.
> # hash = { 1=> { 20000..30000 => 12345 } }
> #so 'out' is calling the range of the key to which contains its defined
> value
> #basically its saying hashRange[25000] #=> 12345 as an example
>
> #everything imported to dataArray is a string, so it must be converted
> to an
> #integer to correctly match the range key
>
> #after benchmarking some elements of the loop below its found to be
> #the push = hashRange[out] is whats slowing everything down.
> #everything a nil 'out' is shoved into the query it takes about 8sec.
> #when its a correct number, takes about 5sec
>
> #the hashRange file is about 78mb, to which I had to load in as
> #8 separate data files, then shove those into an eval to convert it
> #to a hash
>
>
> count = 0
> output1.each do |out|
> out = out.to_i
> push = hashRange[out]
> dataArray << push
> count+=1
> puts "#{push} - #{count}" #Testing purposes
> end
>
>
> #I guess what I need now is a faster way to access this pre-defined
> hash.
> #SQL is one possibility but that could be considered a whole other
> #forum post :)
>
> Any other questions feel free to ask,
> Your guy's insight is much appreciated.

Let's see whether I understood correctly: you have a file with
multiple integer numbers per line. You have defined a range mapping,
i.e. each interval an int can be in has a label. You want to read in
all ints and output their labels.

If this is correct, this is what I'd do:

$ ruby -e '20.times {|i| puts i}' >| x
14:54:37 /c/Temp
$ ./rl.rb x
low
low
medium
medium
medium
high
high
high
high
high
no label
no label
no label
no label
no label
no label
no label
no label
no label
no label
14:54:41 /c/Temp
$ cat rl.rb
#!/bin/env ruby

class RangeLabels
  def initialize(labels)
    @labels = labels.sort_by {|key, lab| key}
  end

  def lookup(val)
    # slow, this can be improved by binary search!
    @labels.each do |key, lab|
      return lab if val < key
    end
    "no label"
  end
end

rl = RangeLabels.new [
  [2, "low"],
  [5, "medium"],
  [10, "high"],
]

ARGF.each do |line|
  first = true
  line.scan(/\d+/) do |val|
    if first
      first = false
    else
      print ", "
    end

    print rl.lookup(val.to_i)
  end

  print "\n"
end
14:54:52 /c/Temp
$
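Robert's comment marks the linear scan as the place to optimize: with the boundaries sorted, the lookup can be a binary search. A sketch using Array#bsearch — a method from much later Rubies (2.0+), so an anachronism for this 2008 thread, but the same idea can be hand-rolled:

```ruby
labels = [[2, "low"], [5, "medium"], [10, "high"]].sort_by { |key, _| key }

# Find the first boundary strictly greater than val: O(log n) per lookup
# instead of the O(n) scan in RangeLabels#lookup.
def bsearch_lookup(labels, val)
  pair = labels.bsearch { |key, _| val < key }
  pair ? pair.last : "no label"
end

bsearch_lookup(labels, 0)   # => "low"
bsearch_lookup(labels, 7)   # => "high"
bsearch_lookup(labels, 42)  # => "no label"
```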

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Michael Linfield

4/8/2008 10:30:00 PM


That would work, but even with marshal dumping, the data set is just too
large for memory to handle quickly. I think I'm going to move the
project over to PostgreSQL and see if that doesn't speed things up a
considerable amount. Thanks, Robert.

- Mac
--
Posted via http://www.ruby-....

Robert Klemme

4/9/2008 6:15:00 AM


On 09.04.2008 00:30, Michael Linfield wrote:
> That would work, but even with marshal dumping the data set is just too
> large for memory to handle quickly.

Which data set - the range definitions or the output? I thought this
was a one-off process that transforms a large input file into a large
output file.

> I think I'm going to move the
> project over to PostgreSQL and see if that doesn't speed things up a
> considerable amount, Thanks Robert.

That's of course an option. But I still feel kind of at a loss about
what exactly you are doing. Is this just a single processing step in a
much larger application?

Cheers

robert

Michael Linfield

4/9/2008 5:53:00 PM


Robert Klemme wrote:
> On 09.04.2008 00:30, Michael Linfield wrote:
>> That would work, but even with marshal dumping the data set is just too
>> large for memory to handle quickly.
>
> Which data set - the range definitions or the output? I thought this is
> a one off process that transforms a large input file into a large output
> file.
>
>> I think I'm going to move the
>> project over to PostgreSQL and see if that doesn't speed things up a
>> considerable amount, Thanks Robert.
>
> That's of course an option. But I still feel kind of at a loss about
> what exactly you are doing. Is this just a single processing step in a
> much larger application?
>
> Cheers
>
> robert

The dump would be of the pre-defined hash, in order to retrieve the
information faster.

To answer your 2nd question: yes, this is just a single step in a very
large 12-step application. I'm hoping to condense it down to about 8
steps when I finish. This step alone involves transforming a large
dataset into a smaller one.

I'm trying to extract all the numbers that fall between the ranges and
push the keys of the hash results into a file. That file will then be
opened by another part of the process to be analyzed.

IE: if the transformation involved a file of:

12345
67423
97567
45345
etc.

I would want to pull all of those numbers and get the keys for those
hash ranges, IE:

12000..15000 => 100
60000..70000 => 250
etc.

So 12345 would fall in the range 12000..15000, so the output file would
get 100 added to it. Then the next step would be analyzing the results
(IE: 100).
Hope this explains things a bit better.
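The transformation Michael describes, run on his own sample values (the two ranges are the ones he gives; numbers outside both map to nothing):

```ruby
ranges = { 12_000..15_000 => 100, 60_000..70_000 => 250 }
inputs = [12_345, 67_423, 97_567, 45_345]

# For each input, find the range that contains it and take its value.
keys = inputs.map do |n|
  pair = ranges.find { |range, _| range.include?(n) }
  pair && pair.last
end

keys  # => [100, 250, nil, nil]
```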

Thanks,

- Mac
--
Posted via http://www.ruby-....

Robert Klemme

4/9/2008 7:45:00 PM


On 09.04.2008 19:53, Michael Linfield wrote:
> Robert Klemme wrote:
>> On 09.04.2008 00:30, Michael Linfield wrote:
>>> That would work, but even with marshal dumping the data set is just too
>>> large for memory to handle quickly.
>> Which data set - the range definitions or the output? I thought this is
>> a one off process that transforms a large input file into a large output
>> file.
>>
>>> I think I'm going to move the
>>> project over to PostgreSQL and see if that doesn't speed things up a
>>> considerable amount, Thanks Robert.
>> That's of course an option. But I still feel kind of at a loss about
>> what exactly you are doing. Is this just a single processing step in a
>> much larger application?
>
> The dump would be to the pre-defined hash, to hence retrieve the
> information faster.

I would not use the term "hash" here because this is an implementation
detail. Basically what you want to store is the mapping from input
numbers mapped to output numbers via ranges, don't you?

> To answer your 2nd question yes this is just a single step in a very
> large 12 step application. I'm hoping to condense it down to about 8
> steps when I finish. This step alone involves transforming a large
> dataset into a smaller dataset.
>
> I'm trying to extract all the numbers between ranges and push the keys
> of the hash results into a file. This file will then be opened by
> another part of the step process to be analyzed.
>
> IE:
> if the transformation involved the file of:
> 12345
> 67423
> 97567
> 45345
> ect.
> I would want to pull all of those numbers and get the keys for those
> hash ranges
> IE:
> 12000..15000 => 100
> 60000..70000 => 250
> ect.

How many of those ranges do you have? Is there any mathematical
relation between each input range and its output value?

> So 12345 would fall in the range of 12000.15000 so the output file would
> get 100 added to it. Then the next step would be analyzing the results
> (IE: 100).

So let me rephrase to make sure I understood properly: you are
reading a large amount of numbers and mapping each number to another
one (via ranges). The mapped numbers are input to the next processing
stage. It seems you would want to output each mapped value only once;
this immediately suggests set semantics.

> Hope this explains things a bit better.

Yes, we're getting there. :-) Actually I find this a nice exercise in
requirements extrapolation. In this case I try to extract the
requirements from you (aka the customer). :-)

Kind regards

robert


How about

#!/bin/env ruby

require 'set'

class RangeLabels
  def initialize(labels, fallback = nil)
    @labels = labels.sort_by {|key, lab| key}
    @fallback = fallback
  end

  def lookup(val)
    # slow if there are many ranges
    # this can be improved by binary search!
    @labels.each do |key, lab|
      return lab if val < key
    end
    @fallback
  end

  alias [] lookup
end

rl = RangeLabels.new [
  [12000, 50],
  [15000, 100],
  [60000, nil],
  [70000, 250],
]

output = Set.new

ARGF.each do |line|
  line.scan(/\d+/) do |val|
    x = rl[val.to_i] and output << x
  end
end

puts output.to_a
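The set semantics Robert mentions can be seen in miniature (the mapped values here are invented):

```ruby
require 'set'

# Labels as they might come off the input, with repeats and unlabeled nils.
mapped = [100, 250, 100, nil, 250]

output = Set.new
mapped.each { |x| x and output << x } # nil (no label) values are dropped

output.to_a.sort  # => [100, 250]
```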

ara.t.howard

4/9/2008 9:28:00 PM




> The dump would be to the pre-defined hash, to hence retrieve the
> information faster.
>
> To answer your 2nd question yes this is just a single step in a very
> large 12 step application. I'm hoping to condense it down to about 8
> steps when I finish. This step alone involves transforming a large
> dataset into a smaller dataset.
>
> I'm trying to extract all the numbers between ranges and push the keys
> of the hash results into a file. This file will then be opened by
> another part of the step process to be analyzed.
>
> IE:
> if the transformation involved the file of:
> 12345
> 67423
> 97567
> 45345
> ect.
> I would want to pull all of those numbers and get the keys for those
> hash ranges
> IE:
> 12000..15000 => 100
> 60000..70000 => 250
> ect.
>
> So 12345 would fall in the range of 12000.15000 so the output file
> would
> get 100 added to it. Then the next step would be analyzing the results
> (IE: 100).
> Hope this explains things a bit better.
>
> Thanks,
>



cfp:~ > cat a.rb
#
# use narray for fast ruby numbers
#
require 'rubygems'
require 'narray'

#
# ton-o-data
#
huge = NArray.int(2 ** 25).indgen * 100 # 0, 100, 200, 300, etc

#
# bin data
#
# 0...100 -> 0
# 100...200 -> 1
# 200...300 -> 2
# etc...
#

a = Time.now.to_f

p huge

huge.div! 100 # 42 -> 0, 127 -> 1, 2227 -> 22

b = Time.now.to_f

elapsed = b - a

p elapsed

p huge



cfp:~ > ruby a.rb
NArray.int(33554432):
[ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200,
1300, ... ]
0.202844142913818
NArray.int(33554432):
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, ... ]



so that's doing about 33 million elements in around 2/10ths of a
second....




a @ http://codeforp...
--
we can deny everything, except that we have the possibility of being
better. simply reflect on that.
h.h. the 14th dalai lama