Multiline (block) CSV file processing

Phil Rhoades

1/10/2008 1:47:00 PM

People,

I am looking for suggestions for Ruby utilities (and gems?) for a
flexible, easy method of processing multi-line blocks of CSV text eg in
a CSV file, lines 1-5 are the first block and lines 6-10 are the second
block etc. Then for each block I want to:

- print the first field of line 1
- second field of line 2
- fifth field of line 3
- tenth field of line 4
- twelfth field of line 5

as fields of a new line.

Of course I could do this myself from basics but I thought there might
be existing tools that would allow me to do things like this easily for
different block sizes, different fields on each line of the block, etc.

Thanks,

Phil.
--
Philip Rhoades

Pricom Pty Limited (ACN 003 252 275 ABN 91 003 252 275)
GPO Box 3411
Sydney NSW 2001
Australia
Fax: +61:(0)2-8221-9599
E-mail: phil@pricom.com.au

9 Answers

James Gray

1/10/2008 3:12:00 PM

On Jan 10, 2008, at 7:47 AM, Phil Rhoades wrote:

> I am looking for suggestions for Ruby utilities (and gems?) for a
> flexible, easy method of processing multi-line blocks of CSV text eg
> in
> a CSV file, lines 1-5 are the first block and lines 6-10 are the
> second
> block etc. Then for each block I want to:
>
> - print the first field of line 1
> - second field of line 2
> - fifth field of line 3
> - tenth field of line 4
> - twelfth field of line 5
>
> as fields of a new line.

Well, if I fully understand the request, I would use code like the
following with the fastercsv gem:

#!/usr/bin/env ruby -wKU

require "rubygems"
require "faster_csv"

input  = FCSV.open(ARGV.shift)
output = FCSV.open("filtered.csv", "w")

catch(:out_of_lines) do
  loop do
    lines = Array.new(5) { input.shift or throw :out_of_lines }
    output << lines.zip([0, 1, 4, 9, 11]).map { |line, i| line[i] }
  end
end

__END__

Hope that helps.

James Edward Gray II
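
For readers on a current Ruby: FasterCSV became the standard library's CSV class in Ruby 1.9, so the same idea can probably be written without the gem. The following is only a sketch under that assumption. The method name filter_blocks, the FIELD_INDEXES constant and the inline sample data are illustrative and not part of the original post. It groups parsed rows with each_slice and, like the code above, drops a trailing partial block.

```ruby
require "csv"

# Zero-based index of the field to take from each line of a block
# (the same numbers as in the FasterCSV code above).
FIELD_INDEXES = [0, 1, 4, 9, 11]

# Hypothetical helper: returns one output row per complete block.
# A trailing block with fewer lines than indexes is dropped.
def filter_blocks(csv_text, indexes = FIELD_INDEXES, col_sep: ",")
  rows = CSV.parse(csv_text, col_sep: col_sep)
  rows.each_slice(indexes.size)
      .select { |block| block.size == indexes.size }
      .map    { |block| block.zip(indexes).map { |line, i| line[i] } }
end

sample = <<~DATA
  1,2,3,4,5,6,7,8,9,a,b,c
  11,12,13,14,15,16,17,18,19,d,e,f
  21,22,23,24,25,26,27,28,29,g,h,i
  31,32,33,34,35,36,37,38,39,j,k,l
  41,42,43,44,45,46,47,48,49,m,n,o
DATA

# generate_line already ends each row with a newline
filter_blocks(sample).each { |row| print CSV.generate_line(row) }
```

Passing col_sep: " " or col_sep: "\t" should handle single-space or tab-separated input in the same way.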

Phil Rhoades

1/10/2008 4:19:00 PM

James,

Thanks but not quite - say my input file is:

1 2 3 4 5 6 7 8 9 a b c
11 12 13 14 15 16 17 18 19 d e f
21 22 23 24 25 26 27 28 29 g h i
31 32 33 34 35 36 37 38 39 j k l
41 42 43 44 45 46 47 48 49 m n o

The output should be:

1 12 25 j o

Regards,

Phil.

James Gray

1/10/2008 5:34:00 PM

On Jan 10, 2008, at 10:18 AM, Phil Rhoades wrote:

> Thanks but not quite - say my input file is:
>
> 1 2 3 4 5 6 7 8 9 a b c
> 11 12 13 14 15 16 17 18 19 d e f
> 21 22 23 24 25 26 27 28 29 g h i
> 31 32 33 34 35 36 37 38 39 j k l
> 41 42 43 44 45 46 47 48 49 m n o
>
> The output should be:
>
> 1 12 25 j o

Surely, I got you close enough to finish it off, right? ;)

If you're saying that your file is whitespace separated, as you show
above, set :col_sep for FasterCSV. If your data really doesn't contain
quoted fields, as shown above, you probably don't need a CSV parser at
all. You can read with split() and write with join().
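
A minimal sketch of the split()/join() route, assuming whitespace-separated input with no quoted fields. The names PICKS and pick_fields are made up for this illustration, and StringIO stands in for a real file so the example runs on its own.

```ruby
require "stringio"

# Zero-based index of the field wanted from each line of a block.
PICKS = [0, 1, 4, 9, 11]

# Hypothetical helper: reads whitespace-separated lines from io in blocks
# of picks.size lines and returns one joined output line per complete block.
def pick_fields(io, picks = PICKS)
  io.each_line.each_slice(picks.size).map { |block|
    next if block.size < picks.size   # skip a trailing partial block
    block.zip(picks).map { |line, i| line.split[i] }.join(" ")
  }.compact
end

input = StringIO.new(<<~DATA)
  1 2 3 4 5 6 7 8 9 a b c
  11 12 13 14 15 16 17 18 19 d e f
  21 22 23 24 25 26 27 28 29 g h i
  31 32 33 34 35 36 37 38 39 j k l
  41 42 43 44 45 46 47 48 49 m n o
DATA

puts pick_fields(input)   # prints: 1 12 25 j o
```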

If you run into problems or have more specific questions, ask and I'll
do my best to help.

James Edward Gray II

Brian Adkins

1/10/2008 6:40:00 PM

On Jan 10, 11:18 am, Phil Rhoades <p...@pricom.com.au> wrote:
> Thanks but not quite - say my input file is:
>
> 1 2 3 4 5 6 7 8 9 a b c
> 11 12 13 14 15 16 17 18 19 d e f
> 21 22 23 24 25 26 27 28 29 g h i
> 31 32 33 34 35 36 37 38 39 j k l
> 41 42 43 44 45 46 47 48 49 m n o
>
> The output should be:
>
> 1 12 25 j o

File.open("data.txt", "r") do |file|
  i = 0
  file.each_line do |line|
    print line.chomp.split[[1, 2, 5, 10, 12][i]-1] + ' '
    puts '' if (i = (i + 1) % 5) == 0
  end
end

Or, the following might be more fun:

def each_block file
  while !file.eof
    result = []
    5.times { result << file.readline.chomp }
    yield result
  end
rescue EOFError
  # readline raises EOFError on a short final block; stop quietly
end

File.open("data.txt", "r") do |file|
  each_block(file) do |block|
    [1, 2, 5, 10, 12].zip(block).each do |field, line|
      print line.split[field-1] + ' '
    end
    puts ''
  end
end

Brian Adkins

Brian Adkins

1/10/2008 7:17:00 PM

On Jan 10, 1:40 pm, Brian Adkins <lojicdot...@gmail.com> wrote:
> On Jan 10, 11:18 am, Phil Rhoades <p...@pricom.com.au> wrote:
>
> > Thanks but not quite - say my input file is:
>
> > 1 2 3 4 5 6 7 8 9 a b c
> > 11 12 13 14 15 16 17 18 19 d e f
> > 21 22 23 24 25 26 27 28 29 g h i
> > 31 32 33 34 35 36 37 38 39 j k l
> > 41 42 43 44 45 46 47 48 49 m n o
>
> > The output should be:
>
> > 1 12 25 j o

This is a little more general.

def block_extractor file, fields,
                    splitter = lambda {|line| line.split }
  while !file.eof
    result = []
    fields.each do |field|
      line = file.readline
      result << splitter.call(line.chomp)[field-1] if field
    end
    yield result
  end
end

File.open("data.txt", "r") do |file|
  block_extractor(file, [1,2,5,10,12]) do |fields|
    puts fields.join(' ')
  end
end
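
To make the moving parts visible: block_extractor reads one line per entry in fields, so the length of fields is the block size. A nil entry still consumes its line but collects nothing from it, and splitter can be any object that responds to call. The sketch below reuses the method unchanged apart from parentheses. The comma splitter, the StringIO input and its sample data are assumptions for illustration only.

```ruby
require "stringio"

# Brian's method, reproduced so the example runs on its own.
def block_extractor(file, fields, splitter = lambda { |line| line.split })
  while !file.eof?
    result = []
    fields.each do |field|
      line = file.readline
      result << splitter.call(line.chomp)[field - 1] if field
    end
    yield result
  end
end

# Two blocks of three comma-separated lines each (made-up data).
data  = StringIO.new("a,b,c\nd,e,f\ng,h,i\nj,k,l\nm,n,o\np,q,r\n")
comma = lambda { |line| line.split(",") }

rows = []
# Field 1 of line 1, nothing from line 2, field 3 of line 3.
block_extractor(data, [1, nil, 3], comma) { |fields| rows << fields }

rows.each { |r| puts r.join(" ") }
# prints:
# a i
# j r
```

Note that if the input does not contain a whole number of blocks, readline raises EOFError part-way through the last one.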

Phil Rhoades

1/10/2008 9:48:00 PM

Brian,

Thanks! - now I just need to work out how that actually works and then
work out how I can modify it to use command line parameters.

Regards,

Phil.

Phil Rhoades

1/10/2008 10:21:00 PM

Brian,

I apologise for replying to my own post but I have had a look at this
and read up about Procs and Lambdas and I can sorta see what you are
doing but would you be so kind as to elaborate on the code a bit? - I
think other people would find it useful as well . .

Also, to generalise the code further, how would you select two or more
fields from each line?

Thanks,

Phil.

Brian Adkins

1/12/2008 4:19:00 AM

On Jan 10, 5:21 pm, Phil Rhoades <p...@pricom.com.au> wrote:
> On Fri, 2008-01-11 at 06:47 +0900, Phil Rhoades wrote:
> > On Fri, 2008-01-11 at 04:19 +0900, Brian Adkins wrote:
> > > def block_extractor file, fields,
> > >                     splitter = lambda {|line| line.split }
> > >   while !file.eof
> > >     result = []
> > >     fields.each do |field|
> > >       line = file.readline
> > >       result << splitter.call(line.chomp)[field-1] if field
> > >     end
> > >     yield result
> > >   end
> > > end
>
> > > File.open("data.txt", "r") do |file|
> > >   block_extractor(file, [1,2,5,10,12]) do |fields|
> > >     puts fields.join(' ')
> > >   end
> > > end
>
> > Thanks! - now I just need to work out how that actually works and then
> > work out how I can modify it to use command line parameters.
>
> I apologise for replying to my own post but I have had a look at this
> and read up about Procs and Lambdas and I can sorta see what you are
> doing but would you be so kind as to elaborate on the code a bit? - I
> think other people would find it useful as well . .

I'd be glad to. What question do you have?

> Also, to generalise the code further, how would you select two or more
> fields from each line?

Well, this is very much a toy/example program, so I wouldn't build on
it too much. There is a cost and a benefit to generalization, so it
might be worthwhile to spend some time thinking about how general you
need the function to be.

A simple way to select two or more fields from each line would be to
change from:

[a, b, ...]

to:

[ [a1, a2, ...], [b1, b2, ...], ... ]

or possibly use a hash with the key being a block-relative line
number, and the value being a list of field numbers. Or you may want
an external specification of the fields to extract - kind of like HTML
templating in reverse.
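
The nested-array idea could look something like the following. This is a rough sketch rather than a finished design: the method name multi_field_extract, the spec layout (one list of 1-based field numbers per line of the block) and the StringIO input are assumptions made for the example.

```ruby
require "stringio"

# Hypothetical helper: spec[i] lists the 1-based field numbers wanted
# from line i of every block, so spec.size is the block size.
def multi_field_extract(io, spec)
  rows = []
  io.each_line.each_slice(spec.size) do |block|
    break if block.size < spec.size          # ignore a trailing partial block
    picked = block.zip(spec).flat_map do |line, wanted|
      fields = line.split
      wanted.map { |n| fields[n - 1] }
    end
    rows << picked
  end
  rows
end

input = StringIO.new(<<~DATA)
  1 2 3 4 5 6 7 8 9 a b c
  11 12 13 14 15 16 17 18 19 d e f
  21 22 23 24 25 26 27 28 29 g h i
  31 32 33 34 35 36 37 38 39 j k l
  41 42 43 44 45 46 47 48 49 m n o
DATA

# Two fields from lines 1 and 4, one field from the others.
result = multi_field_extract(input, [[1, 2], [2], [5], [10, 11], [12]])
result.each { |row| puts row.join(" ") }   # prints: 1 2 12 25 j k o
```

An empty list in the spec skips that line entirely, which covers the case the nil entries handled in the single-field version.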

Brian Adkins

Phil Rhoades

1/12/2008 4:54:00 AM

Brian,

On Sat, 2008-01-12 at 13:20 +0900, Brian Adkins wrote:
> I'd be glad to. What question do you have?

I'll come back to that after having another look at your code but see
below . .

> > Also, to generalise the code further, how would you select two or more
> > fields from each line?
>
> Well, this is very much a toy/example program, so I wouldn't build on
> it too much. There is a cost and a benefit to generalization, so it
> might be worthwhile to spend some time thinking about how general you
> need the function to be.
>
> A simple way to select two or more fields from each line would be to
> change from:
>
> [a, b, ...]
>
> to:
>
> [ [a1, a2, ...], [b1, b2, ...], ... ]
>
> or possibly use a hash with the key being a block-relative line
> number, and the value being a list of field numbers. Or you may want
> an external specification of the fields to extract - kind of like HTML
> templating in reverse.

While I was waiting I thought I would go ahead and produce something
that would do exactly what I wanted and then get some feedback on it. I
wanted to be able to run a program with parameters eg

multi_line_cvs.rb filename.txt #lines_in_block #fields_in_line arraycell1 arraycell2 arraycell3 . .

like:

./t070.rb infile.txt 5 12 0,0 1,1 2,4 3,9 4,11

So I have produced this:

#!/usr/bin/ruby

filename = ARGV.shift
lib = ARGV.shift.to_i   # No. Lines In Block
fil = ARGV.shift.to_i   # Max. No. of Fields to read In Line

infile = File::open( filename, 'r' )

count = 0
array = Array.new( lib ) { Array.new( fil ) }

infile.each { |line|
  fields = line.chomp.split( "\t" )   # split each line only once
  for field in 0..( fil-1 )
    array[ count ][ field ] = fields[ field ]
  end

  count += 1

  if count == lib
    output = ''

    ARGV.each { |cell|
      row, col = cell.split( ',' )
      output << array[ row.to_i ][ col.to_i ].to_s
      output << "\t"
    }

    output.chop!   # chop! strips the trailing tab in place; plain chop discards its result
    puts output

    count = 0
    array = Array.new( lib ) { Array.new( fil ) }
  end
}

infile.close

and this actually does just what I want and the output is correct on the
example above ie: "1 12 25 j o"

It obviously needs error handling and there are probably other
suggestions people can make to improve/replace it . .
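
As one possible starting point for the error handling, the cell arguments could be checked before any input is read. This is only a sketch: parse_cells is a made-up name, and it assumes the zero-based ROW,COL format used on the command line above.

```ruby
# Hypothetical helper: turns arguments such as "3,9" into [row, col]
# integer pairs and checks them against the block dimensions.
def parse_cells(args, lines_in_block, fields_in_line)
  args.map do |cell|
    m = /\A(\d+),(\d+)\z/.match(cell)
    raise ArgumentError, "expected ROW,COL but got #{cell.inspect}" unless m

    row, col = m[1].to_i, m[2].to_i
    if row >= lines_in_block || col >= fields_in_line
      raise ArgumentError,
            "cell #{cell} is outside a #{lines_in_block}x#{fields_in_line} block"
    end
    [row, col]
  end
end

cells = parse_cells(%w[0,0 1,1 2,4 3,9 4,11], 5, 12)
p cells   # prints: [[0, 0], [1, 1], [2, 4], [3, 9], [4, 11]]
```

The script could then loop over the parsed pairs instead of calling split on every cell for every block, and rescue ArgumentError at the top level to print a usage message.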

The original question was whether something that would do this already
existed as a gem or library but it appears not . .

Regards,

Phil.
