[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

Enumerable#serially - those nifty functions w/o memory footprint

SonOfLilit

5/30/2007 1:35:00 PM

Hello,

I've seen many people write things like:

(1..1000000).to_a. # anything that is a huge enumerable can work as an example
collect{|x| num_appearances(x)}.select{|x|
x.is_norwegian?}.collect{|x| x.paycheck}.each{|x| p x}

I almost wrote something like that myself today.

The problem, of course, is the huge memory footprint - collect,
select, collect each creates a new temporary array to store the
results in.

Here's a solution to that problem, exchanging it for a bit of speed
(anything that uses those methods could obviously live with that,
otherwise another language or method would be used):

module Enumerable
def serially(&b)
self.collect{|x| *([x].instance_eval(&b)}
end
end

Just beware not to use it with sort or any of the "!" methods or with
sort or sort_by.

Aur

12 Answers

SonOfLilit

5/30/2007 1:47:00 PM

0

Sorry, found a bug, I'll need to make it into this:

module Enumerable
def serially(&b)
a = []
self.each{|x| t = ([x].instance_eval(&b); a << b[0] unless b.empty?}
# notice, relevance to "it" discussion that's currently going on, just btw
end
end


On 5/30/07, SonOfLilit <sonoflilit@gmail.com> wrote:
> Hello,
>
> I've seen many people write things like:
>
> (1..1000000).to_a. # anything that is a huge enumerable can work as an example
> collect{|x| num_appearances(x)}.select{|x|
> x.is_norwegian?}.collect{|x| x.paycheck}.each{|x| p x}
>
> I almost wrote something like that myself today.
>
> The problem, of course, is the huge memory footprint - collect,
> select, collect each creates a new temporary array to store the
> results in.
>
> Here's a solution to that problem, exchanging it for a bit of speed
> (anything that uses those methods could obviously live with that,
> otherwise another language or method would be used):
>
> module Enumerable
> def serially(&b)
> self.collect{|x| *([x].instance_eval(&b)}
> end
> end
>
> Just beware not to use it with sort or any of the "!" methods or with
> sort or sort_by.
>
> Aur
>
>

Robert Klemme

5/30/2007 1:56:00 PM

0

On 30.05.2007 15:35, SonOfLilit wrote:
> Hello,
>
> I've seen many people write things like:
>
> (1..1000000).to_a. # anything that is a huge enumerable can work as an
> example
> collect{|x| num_appearances(x)}.select{|x|
> x.is_norwegian?}.collect{|x| x.paycheck}.each{|x| p x}

Just guessing what all this does, but there is of course the obvious
solution:

some_large_enum.each do |x|
x = num_appearances(x)
p x.paycheck if x.norwegian?
end

> I almost wrote something like that myself today.
>
> The problem, of course, is the huge memory footprint - collect,
> select, collect each creates a new temporary array to store the
> results in.

Yes, that's really an issue.

> Here's a solution to that problem, exchanging it for a bit of speed
> (anything that uses those methods could obviously live with that,
> otherwise another language or method would be used):
>
> module Enumerable
> def serially(&b)
> self.collect{|x| *([x].instance_eval(&b)}
> end
> end
>
> Just beware not to use it with sort or any of the "!" methods or with
> sort or sort_by.

I am not exactly sure I understand what this is supposed to do.´ This
isn't even compiling properly.

Regards

robert

Brian Candler

5/30/2007 2:02:00 PM

0

On Wed, May 30, 2007 at 10:46:33PM +0900, SonOfLilit wrote:
> Sorry, found a bug, I'll need to make it into this:
>
> module Enumerable
> def serially(&b)
> a = []
> self.each{|x| t = ([x].instance_eval(&b); a << b[0] unless b.empty?}
> # notice, relevance to "it" discussion that's currently going on, just
> btw
> end
> end

Somehow I don't think you tested this :-)

* You have mismatched parentheses
* You assign to t, but never use the value
* you call b twice, once with no arguments, and once with 0 as an argument
* b is a block, but you call #empty? on it

Can you give an example of how this is supposed to be used?

Brian.

Brad Phelan

5/30/2007 5:18:00 PM

0

Brian Candler wrote:
> On Wed, May 30, 2007 at 10:46:33PM +0900, SonOfLilit wrote:
>> Sorry, found a bug, I'll need to make it into this:
>>
>> module Enumerable
>> def serially(&b)
>> a = []
>> self.each{|x| t = ([x].instance_eval(&b); a << b[0] unless b.empty?}
>> # notice, relevance to "it" discussion that's currently going on, just
>> btw
>> end
>> end
>
> Somehow I don't think you tested this :-)
>
> * You have mismatched parentheses
> * You assign to t, but never use the value
> * you call b twice, once with no arguments, and once with 0 as an argument
> * b is a block, but you call #empty? on it
>
> Can you give an example of how this is supposed to be used?
>
> Brian.
>

I think I get his point. I wrote a little lib to try out my
interpretation of what he is trying to do.


a = (1..20)

b = a.serially do
select do |x|
if x > 10
true
else
false
end
end
collect do |x|
"0x" + x.to_s
end
select do |x|
if x == "0x15"
false
else
true
end
end
end

puts b



# generates

0x11
0x12
0x13
0x14
0x16
0x17
0x18
0x19
0x20

The lib is at

http://xtargets.co.../pos...

--
Brad Phelan
http://xtargets.co...

Brad Phelan

5/30/2007 5:51:00 PM

0

Brad Phelan wrote:
> Brian Candler wrote:
>> On Wed, May 30, 2007 at 10:46:33PM +0900, SonOfLilit wrote:
>>> Sorry, found a bug, I'll need to make it into this:
>>>
>>> module Enumerable
>>> def serially(&b)
>>> a = []
>>> self.each{|x| t = ([x].instance_eval(&b); a << b[0] unless b.empty?}
>>> # notice, relevance to "it" discussion that's currently going on,
>>> just btw
>>> end
>>> end
>>
>> Somehow I don't think you tested this :-)
>>
>> * You have mismatched parentheses
>> * You assign to t, but never use the value
>> * you call b twice, once with no arguments, and once with 0 as an
>> argument
>> * b is a block, but you call #empty? on it
>>
>> Can you give an example of how this is supposed to be used?
>>
>> Brian.
>>
>
> I think I get his point. I wrote a little lib to try out my
> interpretation of what he is trying to do.
>
>
> a = (1..20)
>
> b = a.serially do
> select do |x|
> if x > 10
> true
> else
> false
> end
> end
> collect do |x|
> "0x" + x.to_s
> end
> select do |x|
> if x == "0x15"
> false
> else
> true
> end
> end
> end
>
> puts b
>
>
>
> # generates
>
> 0x11
> 0x12
> 0x13
> 0x14
> 0x16
> 0x17
> 0x18
> 0x19
> 0x20
>
> The lib is at
>
> http://xtargets.co.../pos...
>
> --
> Brad Phelan
> http://xtargets.co...
>

A quick profile with ruby-prof find the above technique about 10 times
slower in 50000 elements than for the traditional method.

--
Brad Phelan
http://xt...

Brad Phelan

5/30/2007 6:21:00 PM

0

Brad Phelan wrote:
> Brad Phelan wrote:
>> Brian Candler wrote:
>>> On Wed, May 30, 2007 at 10:46:33PM +0900, SonOfLilit wrote:
>>>> Sorry, found a bug, I'll need to make it into this:
>>>>
>>>> module Enumerable
>>>> def serially(&b)
>>>> a = []
>>>> self.each{|x| t = ([x].instance_eval(&b); a << b[0] unless b.empty?}
>>>> # notice, relevance to "it" discussion that's currently going on,
>>>> just btw
>>>> end
>>>> end
>>>
>>> Somehow I don't think you tested this :-)
>>>
>>> * You have mismatched parentheses
>>> * You assign to t, but never use the value
>>> * you call b twice, once with no arguments, and once with 0 as an
>>> argument
>>> * b is a block, but you call #empty? on it
>>>
>>> Can you give an example of how this is supposed to be used?
>>>
>>> Brian.
>>>
>>
>> I think I get his point. I wrote a little lib to try out my
>> interpretation of what he is trying to do.
>>
>>
>> a = (1..20)
>>
>> b = a.serially do
>> select do |x|
>> if x > 10
>> true
>> else
>> false
>> end
>> end
>> collect do |x|
>> "0x" + x.to_s
>> end
>> select do |x|
>> if x == "0x15"
>> false
>> else
>> true
>> end
>> end
>> end
>>
>> puts b
>>
>>
>>
>> # generates
>>
>> 0x11
>> 0x12
>> 0x13
>> 0x14
>> 0x16
>> 0x17
>> 0x18
>> 0x19
>> 0x20
>>
>> The lib is at
>>
>> http://xtargets.co.../pos...
>>
>> --
>> Brad Phelan
>> http://xtargets.co...
>>
>
> A quick profile with ruby-prof find the above technique about 10 times
> slower in 50000 elements than for the traditional method.
>
> --
> Brad Phelan
> http://xt...

This little bit of code has got me interested. Turns out I can create a
version with comparable time execution by slicing the the input
enumerable into memory friendly chunks and then performing all
operations on those chunks. This kind of reminds me of C++
expression templating tricks

See the link

http://xtargets.co.../pos...

for the latest highspeed version and the ruby-prof results

--
Brad Phelan
http://xt...

SonOfLilit

5/30/2007 7:23:00 PM

0

module Enumerable
def serially(&b)
a = []
self.each{|x| t = [x].instance_eval(&b); a << t[0] unless t.empty?}
a
end
end

p (0..20).serially{select{|x| x>10}.collect{|x|"0x" <<
x.to_s}.select{|x| x != "0x15"}}

#=> ["0x11", "0x12", "0x13", "0x14", "0x16", "0x17", "0x18", "0x19", "0x20"]

The original version was written in another computer, and re-typing
wuthout testing took it's toll. I'm really sorry for wasting all of
your times.

On the other hand, I seem to have inspired someone to think of a
really clever hack, that is exactly orthogonal to my method:

While I call all methods on each member of the enumerable,
incrementally building the results array (a bit like using Generator
iterators, but without the continuations :P), Brad evaluates each
function in the chain on 100 members at a time, building the array for
the next function. If he'd use slices of size 1, our code would be
very similar in idea. My attempt to combine our methods ended in
finding that my installed Enumerable implementation is braindead,
probably a collision between the 1.8.5 and 1.8.2 ruby versions
installed here. I don't care enough to go into it right now.

Profiling data is on the way and will be here soon.

Aur

On 5/30/07, Brad Phelan <phelan@tttech.ttt> wrote:
> Brad Phelan wrote:
> > Brad Phelan wrote:
> >> Brian Candler wrote:
> >>> On Wed, May 30, 2007 at 10:46:33PM +0900, SonOfLilit wrote:
> >>>> Sorry, found a bug, I'll need to make it into this:
> >>>>
> >>>> module Enumerable
> >>>> def serially(&b)
> >>>> a = []
> >>>> self.each{|x| t = ([x].instance_eval(&b); a << b[0] unless b.empty?}
> >>>> # notice, relevance to "it" discussion that's currently going on,
> >>>> just btw
> >>>> end
> >>>> end
> >>>
> >>> Somehow I don't think you tested this :-)
> >>>
> >>> * You have mismatched parentheses
> >>> * You assign to t, but never use the value
> >>> * you call b twice, once with no arguments, and once with 0 as an
> >>> argument
> >>> * b is a block, but you call #empty? on it
> >>>
> >>> Can you give an example of how this is supposed to be used?
> >>>
> >>> Brian.
> >>>
> >>
> >> I think I get his point. I wrote a little lib to try out my
> >> interpretation of what he is trying to do.
> >>
> >>
> >> a = (1..20)
> >>
> >> b = a.serially do
> >> select do |x|
> >> if x > 10
> >> true
> >> else
> >> false
> >> end
> >> end
> >> collect do |x|
> >> "0x" + x.to_s
> >> end
> >> select do |x|
> >> if x == "0x15"
> >> false
> >> else
> >> true
> >> end
> >> end
> >> end
> >>
> >> puts b
> >>
> >>
> >>
> >> # generates
> >>
> >> 0x11
> >> 0x12
> >> 0x13
> >> 0x14
> >> 0x16
> >> 0x17
> >> 0x18
> >> 0x19
> >> 0x20
> >>
> >> The lib is at
> >>
> >> http://xtargets.co.../pos...
> >>
> >> --
> >> Brad Phelan
> >> http://xtargets.co...
> >>
> >
> > A quick profile with ruby-prof find the above technique about 10 times
> > slower in 50000 elements than for the traditional method.
> >
> > --
> > Brad Phelan
> > http://xt...
>
> This little bit of code has got me interested. Turns out I can create a
> version with comparable time execution by slicing the the input
> enumerable into memory friendly chunks and then performing all
> operations on those chunks. This kind of reminds me of C++
> expression templating tricks
>
> See the link
>
> http://xtargets.co.../pos...
>
> for the latest highspeed version and the ruby-prof results
>
> --
> Brad Phelan
> http://xt...
>
>

SonOfLilit

5/30/2007 7:55:00 PM

0

Following is ruby-prof output. First, a few words:

4.75 against 1.56 isn't too good, but it isn't ten times slower. I
think it can be excused (and maybe the code can be optimized a bit) in
cases where you'd process huge lists with ruby of all languages.

This is the code used to profile:

require 'rubygems'
require 'ruby-prof'

module Enumerable
def serially(&b)
a = []
self.each{|x| t = [x].instance_eval(&b); a << t[0] unless t.empty?}
a
end
end

# Profile the code
result = RubyProf.profile{
a = (1..50000)
b = a.serially{select{|x| x>280}.collect{|x|"0x" <<
x.to_s}.select{|x| x != "0x15"}}
}

# Print a graph profile to text
printer = RubyProf::GraphPrinter.new(result)
printer.print(STDOUT, 0)

result = RubyProf.profile{
a = (1..50000)
b = a.select{|x| x>280}.collect{|x|"0x" << x.to_s}.select{|x| x != "0x15"}
}

# Print a graph profile to text
printer = RubyProf::GraphPrinter.new(result)
printer.print(STDOUT, 0)

%total %self total self children calls Name
--------------------------------------------------------------------------------
0.16 0.16 0.00 49720/49720
Array#select
3.37% 3.37% 0.16 0.16 0.00 49720 String#==
--------------------------------------------------------------------------------
0.19 0.19 0.00 49720/49720
Array#collect
4.00% 4.00% 0.19 0.19 0.00 49720 String#<<
--------------------------------------------------------------------------------
4.75 0.79 3.96 1/1
Enumerable#serially
100.00% 16.63% 4.75 0.79 3.96 1
Range#each
3.46 1.01 2.45 50000/50000
Kernel#instance_eval
0.18 0.18 0.00 49720/49720 Array#[]
0.15 0.15 0.00 50000/50000
Array#empty?
0.17 0.17 0.00 49720/49720 Array#<<
--------------------------------------------------------------------------------
3.46 1.01 2.45 50000/50000
Range#each
72.84% 21.26% 3.46 1.01 2.45 50000
Kernel#instance_eval
1.08 0.65 0.43 50000/50000
Array#collect
1.37 1.02 0.35 100000/100000
Array#select
--------------------------------------------------------------------------------
0.24 0.24 0.00 49720/49720
Array#collect
5.05% 5.05% 0.24 0.24 0.00 49720
Fixnum#to_s
--------------------------------------------------------------------------------
0.19 0.19 0.00 50000/50000
Array#select
4.00% 4.00% 0.19 0.19 0.00 50000 Fixnum#>
--------------------------------------------------------------------------------
4.75 0.00 4.75 1/1 #toplevel
100.00% 0.00% 4.75 0.00 4.75 1
Enumerable#serially
4.75 0.79 3.96 1/1
Range#each
--------------------------------------------------------------------------------
1.37 1.02 0.35 100000/100000
Kernel#instance_eval
28.84% 21.47% 1.37 1.02 0.35 100000
Array#select
0.16 0.16 0.00 49720/49720 String#==
0.19 0.19 0.00 50000/50000 Fixnum#>
--------------------------------------------------------------------------------
0.15 0.15 0.00 50000/50000
Range#each
3.16% 3.16% 0.15 0.15 0.00 50000
Array#empty?
--------------------------------------------------------------------------------
1.08 0.65 0.43 50000/50000
Kernel#instance_eval
22.74% 13.68% 1.08 0.65 0.43 50000
Array#collect
0.19 0.19 0.00 49720/49720 String#<<
0.24 0.24 0.00 49720/49720
Fixnum#to_s
--------------------------------------------------------------------------------
0.18 0.18 0.00 49720/49720
Range#each
3.79% 3.79% 0.18 0.18 0.00 49720 Array#[]
--------------------------------------------------------------------------------
0.17 0.17 0.00 49720/49720
Range#each
3.58% 3.58% 0.17 0.17 0.00 49720 Array#<<
--------------------------------------------------------------------------------
100.00% 0.00% 4.75 0.00 4.75 1 #toplevel
4.75 0.00 4.75 1/1
Enumerable#serially


Thread ID: 1020900
%total %self total self children calls Name
--------------------------------------------------------------------------------
0.16 0.16 0.00 49720/49720
Array#select
10.00% 10.00% 0.16 0.16 0.00 49720 String#==
--------------------------------------------------------------------------------
0.14 0.14 0.00 49720/49720
Array#collect
8.75% 8.75% 0.14 0.14 0.00 49720 String#<<
--------------------------------------------------------------------------------
0.41 0.22 0.19 1/1
Enumerable#select
25.62% 13.75% 0.41 0.22 0.19 1
Range#each
0.19 0.19 0.00 50000/50000 Fixnum#>
--------------------------------------------------------------------------------
0.24 0.24 0.00 49720/49720
Array#collect
15.00% 15.00% 0.24 0.24 0.00 49720
Fixnum#to_s
--------------------------------------------------------------------------------
0.19 0.19 0.00 50000/50000
Range#each
11.88% 11.88% 0.19 0.19 0.00 50000 Fixnum#>
--------------------------------------------------------------------------------
0.41 0.00 0.41 1/1 #toplevel
25.62% 0.00% 0.41 0.00 0.41 1
Enumerable#select
0.41 0.22 0.19 1/1
Range#each
--------------------------------------------------------------------------------
0.38 0.22 0.16 1/1 #toplevel
23.75% 13.75% 0.38 0.22 0.16 1
Array#select
0.16 0.16 0.00 49720/49720 String#==
--------------------------------------------------------------------------------
0.81 0.43 0.38 1/1 #toplevel
50.62% 26.88% 0.81 0.43 0.38 1
Array#collect
0.14 0.14 0.00 49720/49720 String#<<
0.24 0.24 0.00 49720/49720
Fixnum#to_s
--------------------------------------------------------------------------------
100.00% 0.00% 1.60 0.00 1.60 1 #toplevel
0.41 0.00 0.41 1/1
Enumerable#select
0.81 0.43 0.38 1/1
Array#collect
0.38 0.22 0.16 1/1
Array#select

Robert Klemme

5/31/2007 8:43:00 AM

0

On 30.05.2007 21:54, SonOfLilit wrote:
> Following is ruby-prof output. First, a few words:
>
> 4.75 against 1.56 isn't too good, but it isn't ten times slower. I
> think it can be excused (and maybe the code can be optimized a bit) in
> cases where you'd process huge lists with ruby of all languages.
>
> This is the code used to profile:
>
> require 'rubygems'
> require 'ruby-prof'
>
> module Enumerable
> def serially(&b)
> a = []
> self.each{|x| t = [x].instance_eval(&b); a << t[0] unless t.empty?}
> a
> end
> end
>
> # Profile the code
> result = RubyProf.profile{
> a = (1..50000)
> b = a.serially{select{|x| x>280}.collect{|x|"0x" <<
> x.to_s}.select{|x| x != "0x15"}}
> }
>
> # Print a graph profile to text
> printer = RubyProf::GraphPrinter.new(result)
> printer.print(STDOUT, 0)
>
> result = RubyProf.profile{
> a = (1..50000)
> b = a.select{|x| x>280}.collect{|x|"0x" << x.to_s}.select{|x| x != "0x15"}
> }
>
> # Print a graph profile to text
> printer = RubyProf::GraphPrinter.new(result)
> printer.print(STDOUT, 0)

You are changing subject in between: Originally you started out with
memory consumption being the issue. Now you talk about speed. Your
solution is neither fast nor easy on the memory. For speed see the
benchmark below. It's not easy on the memory because you still create a
*ton* of temporary one element arrays (four per iteration if I'm not
mistaken) - this avoids the single large copy but allocating and GC'ing
all these temporaries imposes a significant overhead. The obvious and
simple solution here is to do it all in *one* iteration. So why invent
something new if there is an easy and efficient solution available?

Kind regards

robert


10:38:35 [Temp]: ./ser.rb
Rehearsal --------------------------------------------------
classic 3.359000 0.000000 3.359000 ( 3.338000)
serially 13.719000 0.016000 13.735000 ( 13.747000)
serially! 13.328000 0.000000 13.328000 ( 13.361000)
inject 4.297000 0.000000 4.297000 ( 4.339000)
each 3.297000 0.000000 3.297000 ( 3.273000)
---------------------------------------- total: 38.016000sec

user system total real
classic 3.203000 0.000000 3.203000 ( 3.265000)
serially 12.891000 0.000000 12.891000 ( 13.233000)
serially! 12.438000 0.000000 12.438000 ( 12.617000)
inject 3.937000 0.000000 3.937000 ( 3.985000)
each 2.937000 0.000000 2.937000 ( 3.008000)


#!ruby

require 'benchmark'

ITER = 10

module Enumerable
def serially(&b)
a = []
each{|x| t = [x].instance_eval(&b); a << t[0] unless t.empty?}
a
end
end

a = (1..50000)

Benchmark.bmbm(15) do |bench|
bench.report("classic") do
ITER.times do
# a = (1..50000)
b = a.select {|x| x>280}.collect{|x|"0x" << x.to_s}.select{|x| x != "0x15"}
end
end

bench.report("serially") do
ITER.times do
# a = (1..50000)
b = a.serially { select {|x| x>280}.collect{|x|"0x" << x.to_s}.select{|x| x != "0x15"} }
end
end

bench.report("serially!") do
ITER.times do
# a = (1..50000)
b = a.serially { select {|x| x>280}.collect! {|x|"0x" << x.to_s}.select {|x| x != "0x15"} }
end
end

bench.report("inject") do
ITER.times do
# a = (1..50000)
b = a.inject([]) do |acc, x|
if x > 280
x = "0x" << x.to_s
acc << x unless x == "0x15"
end
acc
end
end
end

bench.report("each") do
ITER.times do
# a = (1..50000)
b = []
a.each do |x|
if x > 280
x = "0x" << x.to_s
b << x unless x == "0x15"
end
end
end
end
end

Brad Phelan

5/31/2007 9:05:00 AM

0


>
> You are changing subject in between: Originally you started out with
> memory consumption being the issue. Now you talk about speed. Your
> solution is neither fast nor easy on the memory. For speed see the
> benchmark below. It's not easy on the memory because you still create a
> *ton* of temporary one element arrays (four per iteration if I'm not
> mistaken) - this avoids the single large copy but allocating and GC'ing
> all these temporaries imposes a significant overhead. The obvious and
> simple solution here is to do it all in *one* iteration. So why invent
> something new if there is an easy and efficient solution available?

I can get comparable results for my solution

Rehearsal --------------------------------------------------
serially 7.410000 0.120000 7.530000 ( 7.731576)
classic 5.890000 0.080000 5.970000 ( 6.123102)
---------------------------------------- total: 13.500000sec

user system total real
serially 6.500000 0.000000 6.500000 ( 6.581172)
classic 3.160000 0.010000 3.170000 ( 3.199095)


See http://xtargets.com/snippets/pos... for details

What kills me and makes the final overhead is a fast way to
*non-recursively* flatten an array of arrays. My solution
is

def self.flatten(o)
length = o.inject(0) {|m,i|m+i.length}
out = Array.new(length)
base = 0
o.each do |sub|
top = base + sub.length
out[base..top]=sub
base = top
end

end

It is much faster than the inject and concat simple
method but still too slow.

Any ideas ( apart from code it in C )

Brad