Asp Forum - Why csv file processing is so slow?

mepython

1/28/2005 3:35:00 PM

I want to process a CSV file. Here are small programs in Python and Ruby:

[root@taamportable GMS]# cat x.py
import csv
reader = csv.reader(file('x.csv'))
header = reader.next()
count = 0
for data in reader:
    count += 1
print count

[root@taamportable GMS]# cat x.rb
require 'csv'
reader = CSV.open('x.csv', 'r')
header = reader.shift
count = 0
reader.each {|data|
  count += 1
}
p count

*******************************************************
Here is the processing time. As you can see, Ruby is way too slow. Is there
anything I can do about the Ruby code?
*******************************************************
[root@taamportable GMS]# time python x.py
26907

real 0m0.311s
user 0m0.302s
sys 0m0.009s

[root@taamportable GMS]# time ruby x.rb
26907

real 1m48.296s
user 1m36.853s
sys 0m11.188s

25 Answers

Robert Klemme

1/28/2005 3:46:00 PM

"mepython" <a@agni.us> wrote in message
news:1106926484.356041.45310@c13g2000cwb.googlegroups.com...
> I want to process csv file. Here is small program in python and ruby:
>
> [root@taamportable GMS]# cat x.py
> import csv
> reader = csv.reader(file('x.csv'))
> header = reader.next()
> count = 0
> for data in reader:
> count += 1
> print count
>
>
>
> [root@taamportable GMS]# cat x.rb
> require 'csv'
> reader = CSV.open('x.csv', 'r')
> header = reader.shift
> count = 0
> reader.each {|data|
> count += 1
> }
> p count
>
> *******************************************************
> Here is processing time: As you can see ruby is way to slow. Is there
> anything to do about ruby code?

First I'd try to figure whether it's IO that's slow or CSV. Did you test
with something like this:

File.open('x.csv') do |reader|
  count = 0
  reader.each {|data| count += 1}
  p count
end

Does it make a huge difference?

Kind regards

robert
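
A minimal sketch of the comparison suggested above, timing raw line reading against CSV parsing on the same file. It assumes a current Ruby where `CSV.foreach` and `Benchmark.realtime` are available (the 2005 csv.rb used `CSV.open(path, 'r')` instead), and the file contents and row count are made up for illustration:

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# Build a throwaway CSV file: one header row plus 5,000 data rows.
# The row contents and row count are arbitrary choices for this sketch.
file = Tempfile.new(['sample', '.csv'])
file.puts 'a,b,c'
5_000.times { |i| file.puts "#{i},foo,bar" }
file.close

line_count = 0
row_count = 0

# Raw line reading: measures I/O and line splitting only.
io_time = Benchmark.realtime do
  File.foreach(file.path) { line_count += 1 }
end

# CSV parsing: the same I/O plus field parsing.
csv_time = Benchmark.realtime do
  CSV.foreach(file.path) { row_count += 1 }
end

printf("raw lines: %d in %.4fs\n", line_count, io_time)
printf("csv rows:  %d in %.4fs\n", row_count, csv_time)

file.unlink
```

If the two times are close, the bottleneck is I/O; if the CSV time dominates, the parser is the place to look.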


mepython

1/28/2005 3:59:00 PM

It is the csv module. Plain file reading in Ruby runs at about half the
speed of Python, so the real slowness is coming from csv:

count = 0
File.open('x.csv') do |reader|
  reader.each {|data| count += 1}
end
p count

[root@taamportable GMS]# time ruby x1.rb
26908

real 0m0.077s
user 0m0.060s
sys 0m0.016s

[root@taamportable GMS]# time python x1.py
26908

real 0m0.041s
user 0m0.032s
sys 0m0.010s

Andrew Johnson

1/28/2005 4:01:00 PM

On 28 Jan 2005 07:34:44 -0800, mepython <a@agni.us> wrote:
[snip]

> Here is processing time: As you can see ruby is way to slow. Is there
> anything to do about ruby code?

Well, the python library csv.py uses the underlying _csv module which
is written in C ... Ruby's standard-lib csv.rb is all Ruby. I don't
know of any csv extensions for Ruby.

regards,
andrew

--
Andrew L. Johnson http://www.s...
It's kinda hard trying to remember Perl syntax *and* Occam's
razor at the same time :-)
-- Graham Patterson

mepython

1/28/2005 4:15:00 PM

Thanks, Andrew. I should have looked into the module before posting.

Robert Klemme

1/28/2005 4:57:00 PM

"mepython" <a@agni.us> wrote in message
news:1106927942.910013.321430@z14g2000cwz.googlegroups.com...
> It is csv module: reading file seems like half the speed of python. So
> real slowness is coming from csv
>
> count = 0
> File.open('x.csv') do |reader|
> reader.each {|data| count += 1}
> end
> p count
>
>
> [root@taamportable GMS]# time ruby x1.rb
> 26908
>
> real 0m0.077s
> user 0m0.060s
> sys 0m0.016s
>
>
> [root@taamportable GMS]# time python x1.py
> 26908
>
> real 0m0.041s
> user 0m0.032s
> sys 0m0.010s

As a simple CSV replacement you could try this:

count = 0   # must be initialized before the block increments it
File.open('x.csv') do |reader|
  reader.each {|line|
    count += 1
    data = line.split(/,/)
  }
end
p count

Depending on your data that might or might not be sufficient. Regexps can
be arbitrarily sophisticated. Here's another one:

data = []
line.scan( %r{
"((?:[^\\"]|\\")*)" |
'((?:[^\\']|\\')*)' |
([^,]+)
}x ){|m| data << m.find {|x|x}}

:-))

robert
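
To make "might or might not be sufficient" concrete, here is a small sketch that runs both approaches from this post on one line containing a quoted comma. The sample line is an assumption chosen for illustration:

```ruby
line = 'a,b,"foo, bar",c'

# Naive split treats every comma as a separator, including the one
# inside the quoted field, so that field is cut in two.
naive = line.split(/,/)
p naive

# The scan-based regexp from the post keeps quoted fields together
# and strips the surrounding quotes.
fields = []
line.scan(%r{
  "((?:[^\\"]|\\")*)" |
  '((?:[^\\']|\\')*)' |
  ([^,]+)
}x) { |m| fields << m.find { |x| x } }
p fields
```

The split version returns five pieces and the scan version returns four, so split is only safe when no field can contain the separator.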

William James

1/28/2005 10:24:00 PM

Robert Klemme wrote:
> Depending on your data that might or might not be sufficient. Regexps can
> be arbitrarily sophisticated. Here's another one:
>
> data = []
> line.scan( %r{
> "((?:[^\\"]|\\")*)" |
> '((?:[^\\']|\\')*)' |
> ([^,]+)
> }x ){|m| data << m.find {|x|x}}

I borrowed your regexp.

% class String
%   def parse_csv
%     a = self.scan(
%       %r{ "( (?: [^\\"] | \\")* )" |
%           '( (?: [^\\'] | \\')* )' |
%           ( [^,]+ )
%         }x ).flatten
%     a.delete(nil)
%     a
%   end
% end
%
% ARGF.each_line { | line |
%   p line.chomp.parse_csv
% }

With this input

a,b,"foo, bar",c
"foo isn't \"bar\"",a,b
a,'"just,my,luck"',b

the output is

["a", "b", "foo, bar", "c"]
["foo isn't \\\"bar\\\"", "a", "b"]
["a", "\"just,my,luck\"", "b"]
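
One caveat worth checking before relying on this approach: because the last alternative requires at least one character, empty fields produce no match at all. The sketch below wraps the same regexp in a standalone helper (`scan_fields` and `FIELD_RE` are hypothetical names used only here, chosen to avoid reopening String) to show the effect:

```ruby
# Same regexp as in the post, compiled once into a constant.
FIELD_RE = %r{ "( (?: [^\\"] | \\")* )" |
               '( (?: [^\\'] | \\')* )' |
               ( [^,]+ )
             }x

# Return the non-nil capture of every match on the line.
def scan_fields(line)
  line.scan(FIELD_RE).flatten.compact
end

p scan_fields('a,b,"foo, bar",c')  # the quoted comma is preserved
p scan_fields('a,,c')              # the empty middle field disappears
```

If column positions matter in your data, that silent shift is a reason to prefer a real CSV parser or to extend the regexp.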

William James

1/29/2005 6:25:00 AM

William James wrote:

> % class String
> % def parse_csv
> % a = self.scan(
> % %r{ "( (?: [^\\"] | \\")* )" |
> % '( (?: [^\\'] | \\')* )' |
> % ( [^,]+ )
> % }x ).flatten
> % a.delete(nil)
> % a
> % end
> % end

To test the method parse_csv, I created a 1 megabyte file consisting of
4228 copies of

a,b,"foo, bar",c
"foo isn't \"bar\"",a,b
a,'"just,my,luck"',b
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9

Processing it using parse_csv took about 7 seconds on my computer,
which has an 866 MHz Pentium processor.

Ruby's standard-lib csv.rb reported an error in the file's format.

So I made a file containing 26907 copies of

111,222,333,444,555,666,777,888,999

Ruby's standard-lib csv.rb took about 35 seconds to process it;
parse_csv, about 5 seconds.

mepython

1/29/2005 1:43:00 PM

I got similar results with your parse_csv. This raises another question:
this method is also written in Ruby, so why is there such a huge overhead
when we use the csv module compared with this method?

How can we modify it so that we can pass the field separator and record
separator as arguments?


William James

1/29/2005 7:16:00 PM

mepython wrote:
>
> How can we modify so that we can pass field separator and record
> separator as an argument?

This should do it. I found that not rebuilding the regular-expression
every time parse_csv is called made it even faster.

% # Record separator.
% RS = "\n"
%
% # Set regexp for parse_csv.
% # fs is the field-separator
% def fs_is( fs )
%   $csv_re = %r{ "( (?: [^\\"] | \\")* )" |
%                 '( (?: [^\\'] | \\')* )' |
%                 ( [^#{fs}]+ )
%               }x
% end
%
% class String
%   def parse_csv
%     raise "Method fs_is() wasn't called." if $csv_re.nil?
%     a = self.scan( $csv_re ).flatten
%     a.delete(nil)
%     a
%   end
% end
%
% fs_is( ',' )
%
% # Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
%   p line.chomp.parse_csv
% }

William James

1/29/2005 8:09:00 PM

Improved version:

% # Record separator.
% RS = "\n"
%
% class String
%   # Set regexp for parse_csv.
%   # self is the field-separator
%   def is_fs
%     $csv_re = %r{ "( (?: [^\\"] | \\")* )" |
%                   '( (?: [^\\'] | \\')* )' |
%                   ( [^#{self}]+ )
%                 }x
%   end
%   def parse_csv
%     raise "Method #is_fs wasn't called." if $csv_re.nil?
%     self.scan( $csv_re ).flatten.compact
%   end
% end
%
% ','.is_fs
%
% # Set Ruby's input record-separator.
% $/ = RS
%
% ARGF.each_line { | line |
%   p line.chomp.parse_csv
% }
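
For readers who would rather not use a global variable or reopen String, here is one possible variation on the same idea. It is only a sketch: `SimpleCsvScanner` and its method names are hypothetical, it assumes a Ruby recent enough for keyword arguments and `each_line(sep, chomp: true)`, and it supports single-character field separators only, since the separator goes inside a character class. `Regexp.escape` is used so that separators such as "|" are taken literally.

```ruby
# Hypothetical helper for this sketch; not part of any library.
class SimpleCsvScanner
  def initialize(field_sep: ',', record_sep: "\n")
    @record_sep = record_sep
    # Escape the separator so regexp metacharacters are taken literally.
    # The regexp is built once here and reused for every line.
    fs = Regexp.escape(field_sep)
    @re = %r{ "( (?: [^\\"] | \\")* )" |
              '( (?: [^\\'] | \\')* )' |
              ( [^#{fs}]+ )
            }x
  end

  # Split one record into fields (empty fields are dropped, as before).
  def parse_line(line)
    line.scan(@re).flatten.compact
  end

  # Yield the parsed fields of each record in a String or IO.
  def each_record(text)
    text.each_line(@record_sep, chomp: true) { |line| yield parse_line(line) }
  end
end

scanner = SimpleCsvScanner.new(field_sep: ';', record_sep: "\r\n")
scanner.each_record("a;b;\"x; y\"\r\n1;2;3\r\n") { |fields| p fields }
```

Keeping the compiled regexp in an instance variable preserves the speed-up from not rebuilding it on every call, while allowing several scanners with different separators to coexist.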

comp.lang.ruby
