
comp.lang.ruby

Splitting a CSV file into 40,000 line chunks

Drew Olson

11/29/2006 3:33:00 PM

All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent is to break the file down enough that
Excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down like so:

infile -> xyz.csv

output -> xyz_part_1.csv
xyz_part_2.csv
etc...

My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter variables, and I don't like that I had
to declare my header variable outside of the loop. Bear in mind that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell an error, as the input file is too big (250 MB). Any advice on
making the code nicer is appreciated. The current code is as follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

CSV.open(infile, 'r') do |row|
  if (index != 0 && index % 40000 == 0)
    writer.close
    counter += 1
    writer = CSV.open(infile.gsub(/\./, "_part_" + counter.to_s + "."), 'w')
    writer << header
  end
  if (index == 0)
    header = row
  end
  writer << row
  index += 1
end

writer.close()
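For comparison, the same split can be sketched without the explicit index
and counter bookkeeping by combining File#each_line with
Enumerable#each_slice (a sketch only, assuming records never contain
embedded newlines; the helper name is illustrative):

```ruby
# Sketch: split a headered file into fixed-size parts, repeating the
# header in each part. Assumes one record per physical line.
def split_file(infile, chunk_size)
  base = infile.sub(/\.csv\z/, "")
  File.open(infile) do |f|
    header = f.gets
    # each_slice yields the lines in chunk-sized groups; with_index(1)
    # numbers the parts starting from 1.
    f.each_line.each_slice(chunk_size).with_index(1) do |lines, part|
      File.open("#{base}_part_#{part}.csv", "w") do |out|
        out.write(header)
        lines.each { |line| out.write(line) }
      end
    end
  end
end
```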

--
Posted via http://www.ruby-....

41 Answers

Simon Strandgaard

11/29/2006 4:34:00 PM


On 11/29/06, Drew Olson <olsonas@gmail.com> wrote:
> My code is working but I don't find it very "rubyish". In particular, I
> hate having my index and counter variables, and I don't like that I had
> to declare my header variable outside of the loop. Bear in mind that I
> can not do something like "rows = CSV.open(infile)" because ruby will
> yell an error, as the input file is too big (250 MB). Any advice on
> making the code nicer is appreciated. The current code is as follows:


File.open("test", "w+") do |f|
  20.times do |i|
    f.write "abc|def|ghi#{i}\n"
  end
end

File.open("test", "r") do |fin|
  part = 0
  while row = fin.gets

    File.open("test_part#{part}", "w+") do |f|
      5.times do
        f.write(row)
        break unless row = fin.gets
      end
    end

    part += 1
  end
end


--
Simon Strandgaard
http://opc...

Paul Lutus

11/29/2006 4:47:00 PM


Drew Olson wrote:

> All -
>
> I've written a script to split a .csv file into smaller .csv files of
> 40,000 lines each. The intent here is to break the file down enough so
> that excel does not have issues reading each chunk. My code takes a
> filename from the command line and breaks it down as so:
>
> infile -> xyz.csv
>
> output -> xyz_part_1.csv
> xyz_part_2.csv
> etc...
>
> My code is working but I don't find it very "rubyish". In particular, I
> hate having my index and counter variables

Consider that the problem is one of counting input lines. In a case like
this, it is not possible to avoid using a counter. It's in the nature of
the problem to be solved.

> and I don't like that I had
> to declare my header variable outside of the loop. Bear in mind that I
> can not do something like "rows = CSV.open(infile)" because ruby will
> yell an error, as the input file is too big (250 MB). Any advice on
> making the code nicer is appreciated. The current code is as follows:
>
> require 'csv'
>
> infile = ARGV[0] if ARGV[0] != nil
>
> counter = 1
> index = 0
> header = ""
> writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
>
> CSV.open(infile, 'r') do |row|

Why are you using CSV for this? You aren't parsing the lines into fields, so
the fact that they contain CSV content has no bearing on the present task.
Your goal is to split the input file into groups of lines delimited by
linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series of
output files, until the input file is exhausted?

------------------------------------------

#!/usr/bin/ruby -w

max_output_lines = 1000

input_file = "test.txt"

output_base = "output"

n = 0

ifile = File.open(input_file,"r")

header = ifile.gets

until ifile.eof?
  ofn = output_base + sprintf("%03d", n) + ".txt"
  ofile = File.open(ofn, "w")
  ofile.write(header)
  line = 2
  until ifile.eof? || line > max_output_lines
    ofile.write(ifile.gets)
    line += 1
  end
  ofile.close
  n += 1
end

ifile.close

------------------------------------------

Just change the number for "max_output_lines" to suit your requirement.

--
Paul Lutus
http://www.ara...

will

11/29/2006 4:58:00 PM


Paul Lutus wrote:
>> CSV.open(infile, 'r') do |row|
>
> Why are you using CSV for this? You aren't parsing the lines into fields, so
> the fact that they contain CSV content has no bearing on the present task.
> Your goal is to split the input file into groups of lines delimited by
> linefeeds, not fields delimited by commas.
>
> Why not simply read lines from the input file and write them to a series of
> output files, until the input file is exhausted?

Because CSV understands csv data with embedded newlines:

[will@localhost ~]$ cat csvfile.csv
some, field,"new
line"
other,field,here
[will@localhost ~]$ cat test.rb
require 'csv'

CSV.open('csvfile.csv', 'r') do |row|
  p row
end
[will@localhost ~]$ ruby test.rb
["some", " field", "new\nline"]
["other", "field", "here"]
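A quick way to see the difference: naive line-based reading splits that
quoted record across two physical lines, while a CSV-aware parse keeps it
as one logical record (a sketch using the standard csv library):

```ruby
require "csv"

# The same data as the csvfile.csv example above, inline.
data = "some, field,\"new\nline\"\nother,field,here\n"

physical = data.lines     # naive split on "\n": 3 physical lines
logical  = CSV.parse(data) # CSV-aware parse: 2 logical records

# The embedded newline survives intact inside the third field.
# logical[0][2] == "new\nline"
```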

will.

James Gray

11/29/2006 4:58:00 PM


On Nov 29, 2006, at 10:53 AM, Nick Sieger wrote:

> On 11/29/06, Drew Olson <olsonas@gmail.com> wrote:
>>
>> I've written a script to split a .csv file into smaller .csv files of
>> 40,000 lines each. The intent here is to break the file down
>> enough so
>> that excel does not have issues reading each chunk. My code takes a
>> filename from the command line and breaks it down as so:
>
>
> Not ruby, but why not use "split -l 40000"?

Because the sample code is also moving the headers to the new files.

James Edward Gray II

Chris Hulan

11/29/2006 4:59:00 PM


Nice Simon

But the 'while row = fin.gets' is skipping a row each time around and
you don't handle the header. My "fix":
Simon Strandgaard wrote:...
> File.open("test", "w+") do |f|
>   20.times do |i|
>     f.write "abc|def|ghi#{i}\n"
>   end
> end
>
> File.open("test", "r") do |fin|
>   part = 0
    header = fin.gets
    row = fin.gets
    while row
>
>     File.open("test_part#{part}", "w+") do |f|
        f.write(header)
>       5.times do
>         f.write(row)
>         break unless row = fin.gets
>       end
>     end
>
>     part += 1
>   end
> end
....


Cheers
Chris

Chris Hulan

11/29/2006 5:03:00 PM



James Edward Gray II wrote:
> On Nov 29, 2006, at 10:53 AM, Nick Sieger wrote:
>
> > Not ruby, but why not use "split -l 40000"?
>
> Because the sample code is also moving the headers to the new files.
>
> James Edward Gray II

Also since he's feeding Excel this is most likely on Windows rather
than *NIX

Cheers
Chris

Simon Strandgaard

11/29/2006 5:07:00 PM


On 11/29/06, ChrisH <chris.hulan@gmail.com> wrote:
> Nice Simon
>
> But the 'while row = fin.gets' is skipping a row each time around and
> you don't handle the header. My "fix":

I usually mess up in my loops like this :-(
Thanks :-)

--
Simon Strandgaard

Paul Lutus

11/29/2006 5:24:00 PM


Will Jessop wrote:

> Paul Lutus wrote:
>>> CSV.open(infile, 'r') do |row|
>>
>> Why are you using CSV for this? You aren't parsing the lines into fields,
>> so the fact that they contain CSV content has no bearing on the present
>> task. Your goal is to split the input file into groups of lines delimited
>> by linefeeds, not fields delimited by commas.
>>
>> Why not simply read lines from the input file and write them to a series
>> of output files, until the input file is exhausted?
>
> Because CSV understands csv data with embedded newlines:

A plain-text CSV file uses linefeeds as record delimiters. A program that
uses "readline" or "gets" splits the records just as a sane CSV parser
would. And IMHO a CSV file should never, ever have linefeeds embedded in
fields.

> [will@localhost ~]$ cat csvfile.csv
> some, field,"new
> line"
> other,field,here

Ah, yes, another terrific "improvement" in database protocols. Sometimes I
wonder what these standards committees are smoking.

--
Paul Lutus
http://www.ara...

James Gray

11/29/2006 5:41:00 PM


On Nov 29, 2006, at 9:32 AM, Drew Olson wrote:

> Any advice on making the code nicer is appreciated.

I'll let you decide if this is any better:

Firefly:~/Desktop$ ls
gen_csv.rb split_csv.rb
Firefly:~/Desktop$ ruby gen_csv.rb
Firefly:~/Desktop$ ls
data.csv gen_csv.rb split_csv.rb
Firefly:~/Desktop$ cat data.csv
rec_no,whatever
0,"blah, blah, blah, ..."
1,"blah, blah, blah, ..."
2,"blah, blah, blah, ..."
3,"blah, blah, blah, ..."
4,"blah, blah, blah, ..."
5,"blah, blah, blah, ..."
6,"blah, blah, blah, ..."
7,"blah, blah, blah, ..."
8,"blah, blah, blah, ..."
9,"blah, blah, blah, ..."
10,"blah, blah, blah, ..."
11,"blah, blah, blah, ..."
12,"blah, blah, blah, ..."
13,"blah, blah, blah, ..."
14,"blah, blah, blah, ..."
15,"blah, blah, blah, ..."
16,"blah, blah, blah, ..."
17,"blah, blah, blah, ..."
18,"blah, blah, blah, ..."
19,"blah, blah, blah, ..."
Firefly:~/Desktop$ ruby split_csv.rb data.csv
Firefly:~/Desktop$ ls
data.csv data_part_2.csv data_part_4.csv gen_csv.rb
data_part_1.csv data_part_3.csv data_part_5.csv split_csv.rb
Firefly:~/Desktop$ cat data_part_1.csv
rec_no,whatever
0,"blah, blah, blah, ..."
1,"blah, blah, blah, ..."
2,"blah, blah, blah, ..."
3,"blah, blah, blah, ..."
Firefly:~/Desktop$ cat data_part_3.csv
rec_no,whatever
8,"blah, blah, blah, ..."
9,"blah, blah, blah, ..."
10,"blah, blah, blah, ..."
11,"blah, blah, blah, ..."
Firefly:~/Desktop$ cat split_csv.rb
#!/usr/bin/env ruby -w

require "rubygems"
require "faster_csv"

original_file = ARGV.shift
counter = 1

FCSV.open(original_file.sub(".", "_part_#{counter}."), "w") do |out|
  FCSV.open(original_file, :headers => true).each_with_index do |row, index|
    if (index % 4).zero?
      if index.nonzero?
        counter += 1
        out.reopen(original_file.sub(".", "_part_#{counter}."), "w")
      end
      out << row.headers
    end
    out << row
  end
end

__END__
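FasterCSV was later merged into Ruby's standard library as CSV, so the
same headers-aware approach can be sketched against the current stdlib
API (a sketch; the chunk size, sample data, and filenames are
illustrative):

```ruby
require "csv"

# Generate a small sample file, standing in for the real input.
File.write("data.csv", "rec_no,whatever\n" + (0..9).map { |i| "#{i},blah\n" }.join)

chunk_size = 4
counter = 0
out = nil

# With headers: true, each yielded row is a CSV::Row that knows its headers,
# so each new part can start by re-emitting them.
CSV.foreach("data.csv", headers: true).with_index do |row, index|
  if (index % chunk_size).zero?
    out&.close
    counter += 1
    out = CSV.open("data_part_#{counter}.csv", "w")
    out << row.headers
  end
  out << row
end
out&.close
```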

James Edward Gray II

Drew Olson

11/29/2006 6:11:00 PM


Thanks for all the responses. As noted in a post above, I am trying to
place the headers at the beginning of each split file; I should have
mentioned that earlier. Also, regarding the above responses, I am
working on a Windows system, hence not being able to use nice shell
scripts.

Anyway, it seems that my logic was on the right track; the use of ugly
counters just made my inner rubyist upset. Thanks for all the feedback.
I will take a look at reading/writing the files using the File object
as opposed to the CSV object. In fact, I didn't even think about that.
In previous scripts I had been accessing the CSV fields individually
but, as that is unnecessary here, I will toy around with the
performance gains of using the File object.

Speaking of performance, the code I've described seems quite slow. This
is understandable given that the input file has 1 million+ records, but
I was wondering whether I would see a performance gain by using File
objects as opposed to CSV objects. Any other tips on increasing the
performance of this code would also be helpful.
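One way to answer the File-versus-CSV question empirically is the stdlib
benchmark module (a sketch; the filename and row count are illustrative,
not measurements from this thread):

```ruby
require "benchmark"
require "csv"

# Build a sample file to time against.
File.open("bench.csv", "w") do |f|
  f.puts "a,b,c"
  10_000.times { |i| f.puts "#{i},x,y" }
end

# Compare raw line reading against full CSV parsing of each row.
Benchmark.bm(12) do |b|
  b.report("File.foreach") { File.foreach("bench.csv") { |line| line } }
  b.report("CSV.foreach")  { CSV.foreach("bench.csv") { |row| row } }
end
```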

Again, thanks for all the responses. I am continually impressed by the
attitude and intelligence of this community.

--
Posted via http://www.ruby-....