Asp Forum - ruby performance

absurd

6/26/2006 6:22:00 AM

Hello,
I am relatively new to both ruby and perl. I like a lot about ruby.
But I found ruby is about 5 - 8 times slower than perl when it comes
to large text processing. I don't know if this is a well known fact or
it just happens to me.

Thanks,
Nan

15 Answers

Robert Klemme

6/26/2006 6:46:00 AM

Nan Li wrote:
> Hello,
> I am relatively new to both ruby and perl. I like a lot about ruby.
> But I found ruby is about 5 - 8 times slower than perl when it comes
> to large text processing. I don't know if this is a well known fact or
> it just happens to me.

It's known to be slower although I'd doubt the factor you mentioned.
What piece of code did you benchmark?

Kind regards

robert

Kenosis

6/26/2006 4:51:00 PM

I concur. Please post your code so we can have a look. There are a few
key gotchas you need to look out for. Also, you could try re-benchmarking
with YARV to see if that makes any significant difference in
your case.

Ken

absurd

6/27/2006 4:45:00 AM

> Robert Klemme wrote:
> > Nan Li wrote:
> > > Hello,
> > > I am relatively new to both ruby and perl. I like a lot about ruby.
> > > But I found ruby is about 5 - 8 times slower than perl when it comes
> > > to large text processing. I don't know if this is a well known fact or
> > > it just happens to me.
> >
> > It's known to be slower although I'd doubt the factor you mentioned.
> > What piece of code did you benchmark?
> >
> > Kind regards
> >
> > robert

Kenosis wrote:
> I concur. Please post your code so we can have a look. There are a few
> key gotchas you need to look out for. Also, you could try re-benchmarking
> with YARV to see if that makes any significant difference in
> your case.
>
> Ken
>

Here is how I did my test:

I have 3 files:
1) genLog.pl

my $key = 'Start Start Start Start';
my @s = ( 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );

for ( $i = 0; $i < 1024 * 1024; $i++ ) {
    print $key, "\n";
    foreach ( @s ) {
        print $_, "\n";
    }
}

2) test.pl
my $log = 'log';

my @block = ();

open( FD, $log );

while( <FD> ) {
    chomp;
    if ( m/Start Start Start Start/ ) {
        push @block, $_;
    }
}

print scalar @block, "\n";

3) test.rb

log = 'log'

block = []
File.open( log ) { |f|
  f.each_line { |line|
    line.chomp!
    if ( line =~ /Start Start Start Start/ ) then
      block << line
    end
  }
}

puts block.size

I used genLog.pl to generate a large text file, and then timed test.pl
and test.rb.
My test ran as follows:

[nan@athena test]$ perl genLog.pl > log
[nan@athena test]$ ls -lh log
-rw-rw-r-- 1 nan nan 78M Jun 27 00:25 log
[nan@athena test]$ time perl test.pl
1048576

real 0m3.469s
user 0m3.252s
sys 0m0.136s
[nan@athena test]$ time ruby test.rb
1048576

real 0m18.775s
user 0m16.525s
sys 0m0.336s

The ruby program is about 6 times slower. The above 2 scripts use the
same language constructs and the same algorithm. The problem lies in the
language itself or the implementation of the language.
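
For readers without Perl at hand, the generator could be sketched in Ruby roughly as below. This is only an illustration of what genLog.pl does: the `generate_log` helper and its `records` parameter are assumptions and do not appear in the thread, and the sample writes to an in-memory StringIO rather than the 78M 'log' file.

```ruby
require 'stringio'

KEY   = 'Start Start Start Start'
LINES = ['ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz']

# Writes `records` three-line records to `io`, mirroring genLog.pl.
# (Hypothetical helper; the thread only provides the Perl version.)
def generate_log(io, records)
  records.times do
    io.puts KEY
    LINES.each { |s| io.puts s }
  end
  io
end

# Small in-memory sample; the thread used 1024 * 1024 records
# redirected into a file named 'log'.
sample = generate_log(StringIO.new, 2)
puts sample.string
```

To reproduce the full-size input, the same helper could be pointed at a real file, e.g. File.open('log', 'w') { |f| generate_log(f, 1024 * 1024) }.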

Robert Klemme

6/27/2006 5:59:00 AM

Nan Li wrote:
> Here is how I did my test:
>
> I have 3 files:
> 1) genLog.pl
>
> my $key = 'Start Start Start Start';
> my @s = ( 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
>
> for ( $i =0; $i < 1024 * 1024; $i++ ) {
> print $key, "\n";
> foreach ( @s ) {
> print $_, "\n";
> }
> }
>
> 2) test.pl
> my $log = 'log';
>
> my @block = ();
>
> open( FD, $log );
>
> while( <FD> ) {
> chomp;
> if ( m/Start Start Start Start/ ) {
> push @block, $_;
> }
> }
>
> print scalar @block, "\n";
>
> 3) test.rb
>
> log = 'log'
>
> block = []
> File.open( log ) { |f|
> f.each_line { |line|
> line.chomp!
> if ( line =~ /Start Start Start Start/ ) then

Reversing the RX and the string is usually more efficient.

> block << line
> end
> }
> }
>
> puts block.size
>
> I used genLog.pl to generate a large text file, and then time test.pl
> and test.rb
> My test ran as belows:
>
> [nan@athena test]$ perl genLog.pl > log
> [nan@athena test]$ ls -lh log
> -rw-rw-r-- 1 nan nan 78M Jun 27 00:25 log
> [nan@athena test]$ time perl test.pl
> 1048576
>
> real 0m3.469s
> user 0m3.252s
> sys 0m0.136s
> [nan@athena test]$ time ruby test.rb
> 1048576
>
> real 0m18.775s
> user 0m16.525s
> sys 0m0.336s
>
> ruby program is about 6 times slower.

I see only a factor of 5 but anyway, that's still too much. Did you do
just a single run or did you run your scripts at least several times to
get statistically valid data? If not, I suggest you do each test 10 times
and see what happens.

> The above 2 scripts use the
> same language constructs, the same algorithm. The problem lies in the
> language itself or the implementation of the language.

One difference is that you don't close the IO handle properly in the
perl script. OTOH this test is quite artificial. If you just wanted to
count those lines a simple scalar would have sufficed.

Kind regards

robert
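
The two suggestions in the post above (repeat each measurement, and count with a simple scalar instead of storing lines) could be sketched as follows. The helper names `count_matches` and `time_runs` are assumptions made for this illustration, and the in-memory data is a tiny stand-in for the real 'log' file, so the absolute timings mean nothing by themselves.

```ruby
require 'benchmark'
require 'stringio'

# Counts matching lines with a plain integer instead of storing them.
def count_matches(io, rx = /Start Start Start Start/)
  count = 0
  io.each_line { |line| count += 1 if rx =~ line }
  count
end

# Runs the given block `runs` times and returns the wall-clock
# seconds taken by each run.
def time_runs(runs)
  Array.new(runs) { Benchmark.realtime { yield } }
end

# Tiny in-memory stand-in for the 78M 'log' file from the thread.
data = "Start Start Start Start\nABC\nabc\n" * 1000

timings = time_runs(10) { count_matches(StringIO.new(data)) }
puts "min %.4fs  max %.4fs" % [timings.min, timings.max]
```

Looking at the spread between the fastest and slowest run gives a rough idea of how much a single measurement can be trusted.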

Martin DeMello

6/27/2006 7:50:00 AM

Robert Klemme <bob.news@gmx.net> wrote:
> > 3) test.rb
> >
> > log = 'log'
> >
> > block = []
> > File.open( log ) { |f|
> > f.each_line { |line|
> > line.chomp!
> > if ( line =~ /Start Start Start Start/ ) then
>
> Reversing the RX and the string is usually more efficient.

I was pretty sure the problem was creating a regexp object from a
literal regexp each time, but oddly enough saying rx = /..../ before the
loop and rx =~ line inside made no difference. Does ruby already
optimise this case?

martin

Robert Klemme

6/27/2006 8:27:00 AM

Martin DeMello wrote:
> Robert Klemme <bob.news@gmx.net> wrote:
>>> 3) test.rb
>>>
>>> log = 'log'
>>>
>>> block = []
>>> File.open( log ) { |f|
>>> f.each_line { |line|
>>> line.chomp!
>>> if ( line =~ /Start Start Start Start/ ) then
>> Reversing the RX and the string is usually more efficient.
>
> I was pretty sure the problem was creating a regexp object from a
> literal regexp each time, but oddly enough saying rx = /..../ before the
> loop and rx =~ line inside made no difference. Does ruby already
> optimise this case?

Yes. It's usually more efficient to use the literal inside the code.

Cheers

robert
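
One way to check the "already optimised" point on a current MRI Ruby is to compare object identities, as in the sketch below. It assumes a modern interpreter rather than the 2006 one used in the thread; the variable names are made up for the illustration.

```ruby
# A static regexp literal evaluated repeatedly inside a block.
literal_objs = Array.new(3) { /Start Start Start Start/ }

# With interpolation the regexp has to be built on each evaluation...
word = 'Start'
dynamic_objs = Array.new(3) { /#{word} #{word}/ }

# ...unless the /o flag asks Ruby to interpolate only once.
once_objs = Array.new(3) { /#{word} #{word}/o }

# Count how many distinct objects each variant produced.
distinct = ->(objs) { objs.map(&:object_id).uniq.size }

puts "static literal: #{distinct.call(literal_objs)} distinct object(s)"
puts "interpolated:   #{distinct.call(dynamic_objs)} distinct object(s)"
puts "with /o:        #{distinct.call(once_objs)} distinct object(s)"
```

If the static literal reports a single object, hoisting it into a variable before the loop has nothing left to save, which would be consistent with the timings reported in this thread.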

Kenosis

6/27/2006 7:56:00 PM

Ran some tests on my 2.8GHz Pentium D Dual Core, 2GB, 160 GB S-ATA II.
Things were much more like Robert expected: 3.27 times slower.

544-> time perl test.pl
1048576
3.215u 0.143s 0:03.37 99.4% 0+0k 0+0io 0pf+0w
545-> time ruby test.rb
1048576
10.532u 0.350s 0:10.98 99.0% 0+0k 0+0io 8pf+0w

Now then, changing the regexp to a precreated one ran SLOWER for me
(huh?)

549-> time ruby test1.rb
1048576
11.006u 0.323s 0:11.36 99.6% 0+0k 0+0io 0pf+0w

Just for grins, presized the block array to the full size needed but
this had no impact whatsoever. Hmmm....

Decided to run the profiler over it. Does it seem strange to you that
IO#each_line would (appear?) to take so long on a system w/the disk I/O
of mine when sequentially accessing a file???

ruby -r profile test.rb
1048576
     % cumulative      self                 self      total
  time    seconds   seconds     calls    ms/call    ms/call  name
 78.91     455.14    455.14         1  455140.00  576810.00  IO#each_line
 15.91     546.91     91.77   3145728       0.03       0.03  String#chomp!
  5.18     576.81     29.90   1048576       0.03       0.03  Array#<<
  0.00     576.81      0.00         2       0.00       0.00  IO#write
  0.00     576.81      0.00         1       0.00       0.00  Array#size
  0.00     576.81      0.00         1       0.00       0.00  Kernel.puts
  0.00     576.81      0.00         1       0.00       0.00  Fixnum#to_s
  0.00     576.81      0.00         1       0.00  576810.00  IO#open
  0.00     576.81      0.00         1       0.00       0.00  File#initialize
  0.00     576.81      0.00         1       0.00  576810.00  #toplevel

Ken

Kenosis

6/27/2006 8:01:00 PM

And, not that it's practical in all cases but reading the file into
memory w/IO.readlines and then processing the result w/the block
provided cuts the time down to:

time ruby test4.rb
1048576
6.368u 0.807s 0:07.19 99.5% 0+0k 0+0io 0pf+0w

Seems like File.each_line has some issue?

Ken

Robert Klemme

6/27/2006 9:33:00 PM

Kenosis <kenosis@gmail.com> wrote:
> Ran some tests on my 2.8GHz Pentium D Dual Core, 2GB, 160 GB S-ATA
> II. Things were much more like Robert expected: 3.27 times slower.
>
> 544-> time perl test.pl
> 1048576
> 3.215u 0.143s 0:03.37 99.4% 0+0k 0+0io 0pf+0w
> 545-> time ruby test.rb
> 1048576
> 10.532u 0.350s 0:10.98 99.0% 0+0k 0+0io 8pf+0w
>
> Now then, changing the regexp to a precreated one ran SLOWER for me
> (huh?)
>
> 549-> time ruby test1.rb
> 1048576
> 11.006u 0.323s 0:11.36 99.6% 0+0k 0+0io 0pf+0w

Yes, that's generally so.

> Just for grins, presized the block array to the full size needed but
> this had no impact whatsoever. Hmmm....
>
> Decided to run the profiler over it. Does it seem strange to you that
> IO#each_line would (appear?) to take so long on a system w/the disk
> I/O of mine when sequentially accessing a file???

No, because each_line is called once but invokes the block multiple times.
This is not just the IO read time.

> ruby -r profile test.rb
> 1048576
>      % cumulative      self                 self      total
>   time    seconds   seconds     calls    ms/call    ms/call  name
>  78.91     455.14    455.14         1  455140.00  576810.00  IO#each_line
>  15.91     546.91     91.77   3145728       0.03       0.03  String#chomp!
>   5.18     576.81     29.90   1048576       0.03       0.03  Array#<<
>   0.00     576.81      0.00         2       0.00       0.00  IO#write
>   0.00     576.81      0.00         1       0.00       0.00  Array#size
>   0.00     576.81      0.00         1       0.00       0.00  Kernel.puts
>   0.00     576.81      0.00         1       0.00       0.00  Fixnum#to_s
>   0.00     576.81      0.00         1       0.00  576810.00  IO#open
>   0.00     576.81      0.00         1       0.00       0.00  File#initialize
>   0.00     576.81      0.00         1       0.00  576810.00  #toplevel

Cheers

robert
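
The point that each_line is called once while its block runs for every line can be made visible with a rough sketch like the one below. It times the block body separately from the whole each_line call; the `now` helper and the variable names are assumptions for the illustration, and the in-memory input is far smaller than the file used in the thread.

```ruby
require 'stringio'

# Monotonic clock helper (hypothetical name) for interval timing.
def now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Small in-memory stand-in for the 'log' file from the thread.
io = StringIO.new("Start Start Start Start\nABC\nabc\n" * 1000)

block = []
block_calls   = 0
time_in_block = 0.0

started = now
io.each_line do |line|            # each_line itself is called exactly once
  t = now
  block_calls += 1                # ...but the block body runs once per line
  line.chomp!
  block << line if line =~ /Start Start Start Start/
  time_in_block += now - t
end
total = now - started

puts "block invocations: #{block_calls}"
puts "time inside block: %.4fs of %.4fs total" % [time_in_block, total]
```

The total measured for the single each_line call therefore includes all the work done in the block, not only the time spent reading from the file.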

Robert Klemme

6/27/2006 9:35:00 PM

Kenosis <kenosis@gmail.com> wrote:
> And, not that it's practical in all cases but reading the file into
> memory w/IO.readlines and then processing the result w/the block
> provided cuts the time down to:
>
> time ruby test4.rb
> 1048576
> 6.368u 0.807s 0:07.19 99.5% 0+0k 0+0io 0pf+0w
>
> Seems like File.each_line has some issue?

Hm.... I don't think so. You should repeat your tests several times in
order to get meaningful results. In an application like this (i.e. all
lines are read but not all need to be stored in memory) I would use the
each_line or File.foreach approach regardless of your benchmarks because
this scales better with regard to file size. You cannot slurp a 10GB file
into memory on a 32-bit system but you can crunch it away line by line.

Kind regards

robert
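
The two approaches contrasted above might look like this side by side. This is a minimal sketch: the method names `matches_streaming` and `matches_slurping` are assumptions, and a small temporary file stands in for the real log.

```ruby
require 'tempfile'

RX = /Start Start Start Start/

# Streams the file: only the current line (plus the kept matches)
# needs to be in memory at any time.
def matches_streaming(path)
  block = []
  File.foreach(path) do |line|
    line.chomp!
    block << line if RX =~ line
  end
  block
end

# Slurps the file: every line is loaded before filtering starts.
def matches_slurping(path)
  File.readlines(path).map(&:chomp).grep(RX)
end

# Small temporary file as a stand-in for the 78M 'log' file.
sample = Tempfile.new('log')
sample.write("Start Start Start Start\nABC\nabc\n" * 3)
sample.close

streamed = matches_streaming(sample.path)
slurped  = matches_slurping(sample.path)
puts streamed.size
puts streamed == slurped
sample.unlink
```

Both variants return the same lines; the difference is that the slurping one needs memory proportional to the whole file, which is the scaling concern raised in the post.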

comp.lang.ruby
