comp.lang.ruby

Rio: which is the slow one?

Oliver Cromm

2/27/2006 11:35:00 PM

The speed difference looks too extreme to me:


caps = []
File.open('caps_u8.dic').each {|line| caps << line.split(';')[0]}

=> 1.8 seconds



require 'rio'
caps = rio('caps_u8.dic').csv(";").columns(0)[].flatten
p caps

=> 50.9 seconds


What exactly is so slow here? It's not the /flatten/.
Am I using Rio in a particularly dumb way?
--
Bug: An elusive creature living in a program that makes it incorrect.
The activity of "debugging," or removing bugs from a program, ends
when people get tired of doing it, not when the bugs are removed.
7 Answers

Meinrad Recheis

3/1/2006 6:43:00 AM

You could use the Ruby profiler to find out exactly where the time goes.
-- henon

On 3/1/06, Oliver Cromm <lispamateur@internet.uqam.ca> wrote:
>
> The speed difference looks too extreme to me:
>
>
> caps = []
> File.open('caps_u8.dic').each {|line| caps << line.split(';')[0]}
>
> => 1.8 seconds
>
>
>
> require 'rio'
> caps = rio('caps_u8.dic').csv(";").columns(0)[].flatten
> p caps
>
> => 50.9 seconds
>
>
> What exactly is so slow here? It's not the /flatten/.
> Am I using Rio in a particularly dumb way?
> --
> Bug: An elusive creature living in a program that makes it incorrect.
> The activity of "debugging," or removing bugs from a program, ends
> when people get tired of doing it, not when the bugs are removed.
>
>

Christian Neukirchen

3/1/2006 11:20:00 AM

Oliver Cromm <lispamateur@internet.uqam.ca> writes:

> The speed difference looks too extreme to me:

Just guessing:

> caps = []
> File.open('caps_u8.dic').each {|line| caps << line.split(';')[0]}

One line read to memory at a time.
One line split at a time.
One element inserted at a time.

> => 1.8 seconds
>
> require 'rio'
> caps = rio('caps_u8.dic').csv(";").columns(0)[].flatten
> p caps
>
> => 50.9 seconds

All lines read to memory at once.
All lines split at once.
All fields except the first thrown away.
Another copy due to flatten.

> What exactly is so slow here? It's not the /flatten/.
> Am I using Rio in a particularly dumb way?

No, it's just more memory usage and heavier calculations.
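To make the guess above concrete, here is a minimal sketch of the two access patterns in plain Ruby. It does not reproduce Rio's internals; the sample file and its contents are made up for illustration.

```ruby
require 'tempfile'

# Build a small semicolon-separated sample file (hypothetical data).
sample = Tempfile.new(['caps', '.dic'])
5.times { |i| sample.puts "WORD#{i};field2;field3" }
sample.close

# Streaming: one line in memory at a time, only the first field is kept.
streamed = []
File.foreach(sample.path) { |line| streamed << line.chomp.split(';', 2)[0] }

# Slurping: every line and every field is built before column 0 is picked.
rows    = File.readlines(sample.path, chomp: true).map { |line| line.split(';') }
slurped = rows.map { |row| row[0] }

sample.unlink
```

Both variants produce the same list; the second one simply holds the whole file, split into fields, in memory before discarding most of it.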

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneuk...


James Gray

3/1/2006 2:20:00 PM

On Feb 28, 2006, at 7:59 PM, Oliver Cromm wrote:

> The speed difference looks too extreme to me:
>
>
> caps = []
> File.open('caps_u8.dic').each {|line| caps << line.split(';')[0]}
>
> => 1.8 seconds

Here you are rolling your own split.

> require 'rio'
> caps = rio('caps_u8.dic').csv(";").columns(0)[].flatten
> p caps
>
> => 50.9 seconds

And here Rio is using CSV, which is known to be plenty slow. If we
could get Rio patched to use FasterCSV when it is available, that
would help quite a bit...
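A rough sketch of what "use FasterCSV when it is available" could look like. This is not Rio's code; the helper name `csv_parser` is hypothetical, and the sketch assumes both libraries accept a `col_sep` option on `parse_line`.

```ruby
# Hypothetical helper: prefer FasterCSV when the gem is installed,
# otherwise fall back to the standard library's CSV.
def csv_parser
  require 'fastercsv'
  FasterCSV
rescue LoadError
  require 'csv'
  CSV
end

parser = csv_parser
fields = parser.parse_line('ABC;def;ghi', col_sep: ';')
```

The caller only ever sees `parser`, so the rest of the code does not need to know which library was loaded.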

James Edward Gray II



rio4ruby

3/5/2006 8:54:00 PM

James Edward Gray II wrote:
> On Feb 28, 2006, at 7:59 PM, Oliver Cromm wrote:
>
> > The speed difference looks too extreme to me:
> >
> >
> > caps = []
> > File.open('caps_u8.dic').each {|line| caps << line.split(';')[0]}
> >
> > => 1.8 seconds
>
> Here you are rolling your own split.
>
> > require 'rio'
> > caps = rio('caps_u8.dic').csv(";").columns(0)[].flatten
> > p caps
> >
> > => 50.9 seconds
>

This is a false comparison. The speedy code will not properly parse
many CSV files.

For example, the following is a legal line from a CSV file:

"Field 1","Hello, World", "Field 3"

The comma embedded in the second field precludes the use of a simple
+split+. In addition, the speedy version would include the double
quotes in the field value -- which is incorrect.
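A short illustration of the difference, using the CSV library that ships with current Ruby versions. The sample line is the one above without the space before the third field, since a strict parser may reject whitespace outside the quotes.

```ruby
require 'csv'

line = '"Field 1","Hello, World","Field 3"'

# Naive split: breaks on the embedded comma and keeps the quotes.
naive = line.split(',')

# CSV-aware parse: three fields, quotes removed.
parsed = CSV.parse_line(line)

p naive   # ["\"Field 1\"", "\"Hello", " World\"", "\"Field 3\""]
p parsed  # ["Field 1", "Hello, World", "Field 3"]
```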

-Christopher

Oliver Cromm

3/6/2006 4:18:00 AM

rio4ruby wrote:

> James Edward Gray II wrote:
>> On Feb 28, 2006, at 7:59 PM, Oliver Cromm wrote:
>>
>>> The speed difference looks too extreme to me:
>>>
>>>
>>> caps = []
>>> File.open('caps_u8.dic').each {|line| caps << line.split(';')[0]}
>>>
>>> => 1.8 seconds
>>
>> Here you are rolling your own split.
>>
>>> require 'rio'
>>> caps = rio('caps_u8.dic').csv(";").columns(0)[].flatten
>>> p caps
>>>
>>> => 50.9 seconds
>>
>
> This is a false comparison. The speedy code will not properly parse
> many CSV files.

I didn't claim they are equivalent in principle; but for the purpose at
hand, they are. And in this case, I wouldn't have cared if one version
takes 5 times as long, but 25 times is not practicable - that speed
difference would easily justify, say, 15 minutes more time for
programming, so I could cover a lot of cases.

> For example, the following is a legal line from a CSV file:
>
> "Field 1","Hello, World", "Field 3"

I doubt that a split(/\",\s*\"/) (plus necessary adjustments) would be
much slower.
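One way the "necessary adjustments" might look, as a sketch only. The helper name `split_quoted` is made up, and it assumes every field is double-quoted and contains no escaped quotes.

```ruby
# Hypothetical helper: split a line whose fields are ALL double-quoted.
def split_quoted(line)
  line.chomp
      .sub(/\A\s*"/, '')   # drop the opening quote of the first field
      .sub(/"\s*\z/, '')   # drop the closing quote of the last field
      .split(/",\s*"/)     # split on the quote-comma-quote boundary
end

fields = split_quoted(%q{"Field 1","Hello, World", "Field 3"} + "\n")
p fields
```

This handles the example line, but unquoted fields, doubled quotes inside a field, and trailing empty fields would still need a real CSV parser.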
--
Oliver C.
The main demographic target group of "Focus" is upper-middle-class men
between 40 and 65 (IQ, not age).
Andreas Kabel in de.etc.sprache.deutsch

rio4ruby

3/7/2006 8:50:00 PM

Oliver Cromm wrote:
> The speed difference looks too extreme to me:
> ...
> What exactly is so slow here?

Good question. The first problem is that you are using a
general-purpose CSV parser to split strings. However, the difference you
report is too extreme for that to be the only issue.

I created four test cases:

ruby split:
caps = []
File.open(fn).each {|line| caps << line.chomp.split(',')[0] }

rio split:
caps = []
rio(fn).chomp.lines { |line| caps << line.split(',')[0] }

stdlib csv:
caps = []
File.open(fn).each {|line| caps << CSV.parse_line(line)[0] }

rio csv:
caps = rio(fn).csv.columns(0)[].flatten

Benchmarking these cases on a 10000 line CSV file yielded:
ruby split: 0.516000
rio split : 0.984000
stdlib csv: 3.047000
rio csv : 15.610000
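For anyone who wants to repeat the measurement without Rio installed, the two non-Rio cases can be timed roughly like this. This is a sketch, not the harness used for the numbers above; the generated file contents are made up, and absolute timings will differ by machine and Ruby version.

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# Generate a 10000-line sample file; the contents are made up for illustration.
file = Tempfile.new(['sample', '.csv'])
10_000.times { |i| file.puts "WORD#{i},alpha,beta,gamma" }
file.close
fn = file.path

split_caps = []
csv_caps   = []

Benchmark.bm(11) do |bm|
  bm.report('ruby split:') do
    File.foreach(fn) { |line| split_caps << line.chomp.split(',')[0] }
  end
  bm.report('stdlib csv:') do
    File.foreach(fn) { |line| csv_caps << CSV.parse_line(line)[0] }
  end
end

file.unlink
```

The two Rio cases can be added as further `bm.report` blocks when the rio gem is available.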

This shows that Rio incurs a 2x overhead when reading lines from a
file, which is reasonable, considering the features of Rio not
illustrated in this trivial example.

Using the standard library's CSV incurs a 6x overhead, which seems a bit
high but is not unreasonable, considering the difference in complexity
between splitting a string and parsing a CSV line. The CSV module
could probably be more efficient.

Using Rio to call the standard library's CSV incurs a 5x overhead
on top of calling the standard library's CSV directly. This yields an
overhead of 30x compared to the plain Ruby split, which is close to what
you report (28x).

The 5x overhead incurred when using Rio to call CSV does seem too
high. One would expect it to be closer to 2x.

The reason for the high overhead is the feature of Rio that extends
every Array returned from a CSV file with a custom +to_s+ method,
which will convert the Array back to a CSV line. Without this feature
the "rio csv" case yields:

rio csv : 5.750000

which is a 1.9x overhead over the stdlib CSV.
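To give a feel for why extending every row can be costly, here is a small sketch. `CsvLine` is a hypothetical stand-in, not Rio's actual module; the point is only that `extend` creates a singleton class for each object it is called on.

```ruby
require 'benchmark'

# Hypothetical stand-in for the per-row extension described above:
# a module whose to_s turns an Array back into a CSV-like line.
module CsvLine
  def to_s
    join(',')
  end
end

rows = Array.new(10_000) { |i| ["WORD#{i}", 'alpha', 'beta'] }

# Copy each row, with and without extending the copy.
plain    = Benchmark.realtime { rows.map { |row| row.dup } }
extended = Benchmark.realtime { rows.map { |row| row.dup.extend(CsvLine) } }

sample = rows.first.dup.extend(CsvLine)
puts sample.to_s
puts format('plain: %.4fs  extended: %.4fs', plain, extended)
```

How large the gap is depends on the Ruby version; the sketch only shows where the extra per-row work comes from.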

I was dubious that extending each Array was a good thing even if it
cost nothing. It is certainly not a good thing with such a high
performance penalty. I will remove this feature in the next release.
Beyond this, the only things that will make Rio's handling of CSV files
faster are a faster CSV module (FasterCSV perhaps) and performance
improvements in Rio, which will be addressed when Rio reaches Alpha.

Thanks for bringing this to my attention.

Cheers,
-Christopher

James Gray

3/7/2006 8:58:00 PM

On Mar 7, 2006, at 2:53 PM, rio4ruby wrote:

> Using the standard library's CSV incurs a 6x overhead, which seems a bit
> high but is not unreasonable, considering the difference in complexity
> between splitting a string and parsing a CSV line. The CSV module
> could probably be more efficient.

See FasterCSV. ;)

James Edward Gray II