Asp Forum - Re: text processing

James Gray

3/30/2007 11:59:00 AM

On Mar 29, 2007, at 10:37 PM, Stephen Smith wrote:

> So I have about a half million records in a non-standard,
> non-delimited format. The fields are fixed widths, each record is
> separated by a new line.

Here's another option:

>> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
=> ["XXX", "YYYYY", "ZZ"]
>> xs
=> "XXX"
>> ys
=> "YYYYY"
>> zs
=> "ZZ"
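
For a whole file of such records, the same call can be applied line by line. This is a minimal sketch, assuming a hypothetical three-field layout (widths 3, 5 and 2) and a made-up helper name `fixed_to_csv`; neither comes from the original question. Note that the `A` directive strips trailing spaces from each field, and that a plain `join(",")` does not quote fields that themselves contain commas.

```ruby
require "stringio"

# Hypothetical layout: three fields of widths 3, 5 and 2.
FIELD_FORMAT = "A3A5A2"

# Reads fixed-width records from +input+ (any IO-like object) and
# writes one comma-separated line per record to +output+.
def fixed_to_csv(input, output, fmt = FIELD_FORMAT)
  input.each_line do |line|
    output.puts line.chomp.unpack(fmt).join(",")
  end
end

# StringIO stands in for real files so the sketch runs anywhere.
source = StringIO.new("XXXYYYYYZZ\nab cd   ef\n")
result = StringIO.new
fixed_to_csv(source, result)
puts result.string
```

With real data, the two StringIO objects would be replaced by File objects opened for reading and writing.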

Hope that helps.

James Edward Gray II


Andrew Stewart

3/30/2007 2:14:00 PM

On 30 Mar 2007, at 12:59, James Edward Gray II wrote:
> Here's another option:
>
> >> xs, ys, zs = "XXXYYYYYZZ".unpack("A3A5A2")
> => ["XXX", "YYYYY", "ZZ"]
> >> xs
> => "XXX"
> >> ys
> => "YYYYY"
> >> zs
> => "ZZ"

I like the way your mind works. Nice!

Regards,
Andy Stewart

James Gray

3/30/2007 7:48:00 PM

On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:

> But I think that the format Harry and Gary suggested clearly
> represents the pattern I'm matching.

Just FYI, when speed matters it may be worth using unpack:

#!/usr/bin/env ruby -w

require "benchmark"

TESTS = 100_000
LINE = "XYZ" * 100

Benchmark.bmbm do |results|
  results.report("regex:") do
    TESTS.times do
      /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(LINE).captures.join(",")
    end
  end
  results.report("unpack:") do
    TESTS.times do
      LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
    end
  end
end
# >> Rehearsal -------------------------------------------
# >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
# >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
# >> ---------------------------------- total: 2.820000sec
# >>
# >> user system total real
# >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
# >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

__END__

James Edward Gray II
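
One thing to keep in mind when swapping the regex for unpack: the two are not strictly equivalent on padded data. The sketch below uses a hypothetical record (a 5-character name followed by a 3-digit code) and made-up helper names to show that `A` strips trailing spaces while the regex captures keep them; the lowercase `a` directive keeps the padding if that is what you need.

```ruby
# Hypothetical helpers for a record with a 5-character name and a 3-digit code.
def split_with_regex(line)
  /(.{5})(.{3})/.match(line).captures
end

def split_with_unpack(line, fmt = "A5A3")
  line.unpack(fmt)
end

line = "Bob  042"
p split_with_regex(line)           # padding kept in the first field
p split_with_unpack(line)          # "A" strips the trailing spaces
p split_with_unpack(line, "a5a3")  # "a" keeps them, like the regex
```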

Gary Wright

3/30/2007 8:15:00 PM

On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
> line = ""
> record = ""
> oldLog = File.open("filename.txt")
> newLog = File.new("filename_formatted.csv", "w")
> hdr = "field1, field2, field3,...."
> newLog << hdr
> arr = oldLog.readlines
> arr.each do |line|
> record << /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(',') << "\n"
> newLog << record
> record = ""
> end

How about making it so you don't have to type that pattern next time:

def columns(*widths)
  Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
end
pattern = columns(8,6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
  newLog.puts pattern.match(line).captures.join(',')
}
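
The same "list the widths once" idea carries over to unpack. This is only a sketch: `unpack_format` and `convert` are hypothetical names, and the file names in the commented-out call are the ones from the quoted code. It streams the input with File.foreach, so the whole file is never held in memory.

```ruby
# Builds an unpack format such as "A8A6A15" from a list of widths.
def unpack_format(*widths)
  widths.map { |count| "A#{count}" }.join
end

# Streams +in_path+ line by line and writes comma-joined fields to +out_path+.
def convert(in_path, out_path, fmt)
  File.open(out_path, "w") do |out|
    File.foreach(in_path) do |line|
      out.puts line.chomp.unpack(fmt).join(",")
    end
  end
end

fmt = unpack_format(8, 6, 15, 3, 30, 4, 15, 1, 12, 9)
puts fmt
# convert("filename.txt", "filename_formatted.csv", fmt)
```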

Bernard Kenik

3/30/2007 9:08:00 PM

On Mar 30, 3:47 pm, James Edward Gray II <j...@grayproductions.net>
wrote:
> On Mar 30, 2007, at 2:31 PM, Stephen Smith wrote:
>
> > But I think that the format Harry and Gary suggested clearly
> > represents the pattern I'm matching.
>
> Just FYI, when speed matters it may be worth using unpack:
>
> #!/usr/bin/env ruby -w
>
> require "benchmark"
>
> TESTS = 100_000
> LINE = "XYZ" * 100
> Benchmark.bmbm do |results|
> results.report("regex:") do
> TESTS.times do
> /(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(LINE).captures.join(",")
> end
> end
> results.report("unpack:") do
> TESTS.times do
> LINE.unpack("A8A6A15A3A30A4A15A1A12A9").join(",")
> end
> end
> end
> # >> Rehearsal -------------------------------------------
> # >> regex: 2.060000 0.000000 2.060000 ( 2.067150)
> # >> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
> # >> ---------------------------------- total: 2.820000sec
> # >>
> # >> user system total real
> # >> regex: 2.040000 0.000000 2.040000 ( 2.046938)
> # >> unpack: 0.770000 0.000000 0.770000 ( 0.763620)
>
> __END__
>
> James Edward Gray II

Your LINE is 300 characters, but you are only unpacking 103?

James Gray

3/30/2007 9:32:00 PM

On Mar 30, 2007, at 4:10 PM, bbiker wrote:

> Your LINE is 300 characters, but you are only unpacking 103?

Yeah, I was too lazy to count it, so I just picked something big
enough. ;)

James Edward Gray II

Brian Candler

3/31/2007 8:20:00 AM

On Sat, Mar 31, 2007 at 04:31:56AM +0900, Stephen Smith wrote:
> So I've cleaned up the regular expression, and I like the simplicity of the
> unpack method.
>
> But I think that the format Harry and Gary suggested clearly represents the
> pattern I'm matching.
>
> Since the pattern may change, and/or we may get other data dumps from this
> supplier in the future, I think keeping it represented clearly in one place
> will help with maintenance.

Which could be a constant at the top of the source:

LINE_PATTERN = /^(...)(..)(....)/

Then later on in your code you can say:

record << LINE_PATTERN.match(line).captures.join(',') << "\n"
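
If the fields also have names, the same single-constant idea can carry them. A rough sketch, assuming made-up field names (`:id`, `:code`, `:year`) for the 3/2/4 widths in the pattern above, and using unpack rather than the regex:

```ruby
# Hypothetical layout: field names and widths kept together in one constant.
LAYOUT = [[:id, 3], [:code, 2], [:year, 4]].freeze
LINE_FORMAT = LAYOUT.map { |_name, width| "A#{width}" }.join

# Pairs each field name with the matching slice of the line and
# returns the result as a hash keyed by field name.
def parse_record(line)
  LAYOUT.map(&:first).zip(line.unpack(LINE_FORMAT)).to_h
end

p parse_record("abcde2007")
```

When the supplier changes the format, only LAYOUT needs to be edited.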

Regards,

Brian.

Ruby Kk

3/7/2009 2:39:00 AM

Hi there,
I am new to Ruby and trying to create a flat file with fixed column
lengths. I have a header, footer, and records that I want in the file.

Any ideas?

Cheers!

Gary Wright wrote:
> On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
>> {9})/.match(line).captures.join(',')
>> << "\n"
>> newLogg << record
>> record = ""
>> end
>
> How about making it so you don't have to type that pattern next time:
>
> def columns(*widths)
> Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
> end
> pattern = columns(6,15,3,30,4,15,1,12,9)
>
> You also can get rid of the intermediate strings and the potentially
> huge
> internal array (from slurping up your oldLog into an array):
>
> oldLog.each { |line|
> newLog.print pattern.match(line).captures.join(',')
> }

--
Posted via http://www.ruby-....

Chris Hulan

3/7/2009 4:37:00 AM

On Mar 6, 9:38 pm, Ruby Kk <kumar...@gmail.com> wrote:
> Hi there,
> I am new to ruby and trying to create a flat file with fixed column
> length. i have a header,footer and records that i want in the file.
>
> any ideas?
>
> Cheers!
>
>
>
> Gary Wright wrote:
> > On Mar 30, 2007, at 3:31 PM, Stephen Smith wrote:
> >> {9})/.match(line).captures.join(',')
> >> << "\n"
> >> newLogg << record
> >> record = ""
> >> end
>
> > How about making it so you don't have to type that pattern next time:
>
> > def columns(*widths)
> > Regexp.new(widths.map {|count| "(.{#{count}})" }.join)
> > end
> > pattern = columns(6,15,3,30,4,15,1,12,9)
>
> > You also can get rid of the intermediate strings and the potentially
> > huge
> > internal array (from slurping up your oldLog into an array):
>
> > oldLog.each { |line|
> > newLog.print pattern.match(line).captures.join(',')
> > }
>
> --
> Posted via http://www.ruby-....

sprintf lets you specify formatting that can be used to print fixed-width
fields.
An example from the docs (ruby-doc.org) is:
sprintf("%08b '%4s'", 123, 123) #=> "01111011 ' 123'"
Note how the %4s puts 3 characters in a space 4 characters wide.
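
Applied to the header/records/footer question, that might look something like the sketch below. Everything about the layout is an assumption for illustration: a 16-character line, a 10-character left-aligned name, a 6-character right-aligned amount, and "HDR"/"TRL" markers with a record count in the footer. `format` is the same method as sprintf. The `%-10.10s` precision truncates names longer than 10 characters; an amount wider than 6 digits would still overflow its column.

```ruby
# Hypothetical record layout: 10-char left-aligned name, 6-char right-aligned amount.
RECORD_FORMAT = "%-10.10s%6d"

# Returns the header line, one line per record, and a footer line
# carrying the record count, each 16 characters wide.
def fixed_width_lines(records)
  lines = [format("%-16s", "HDR")]
  records.each { |name, amount| lines << format(RECORD_FORMAT, name, amount) }
  lines << format("%-10s%06d", "TRL", records.size)
  lines
end

lines = fixed_width_lines([["Alice", 42], ["Bob", 7]])
puts lines
# To save them: File.write("out.txt", lines.join("\n") + "\n")
```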

cheers

comp.lang.ruby
