Asp Forum - Newbie regexp question

James Calivar

9/14/2006 11:38:00 PM

Hello,

I'm trying to split a formatted text file into four separate columns.
The data is comprised of lines of text that are bundled into four
distinct columns, corresponding to a "Required versus Optional"
variable, a requirement number, a requirement classification (R1=Rev 1,
F=Future, I=Internal), and a textual description of the requirement.

My raw data looks like this in the input text file:

R [01] R1 The system shall support "emergency call processing"
R [02] R1 The system shall support "local call processing"
R [08] F The system shall provide a command-line user interface
R [723] F The system shall provide 6 10/100/1000 Ethernet interfaces
R [11] F The system shall support VoIP networks
R [398] R1 The system shall contain 2 control boards
O [327] I The system should support hotswapping of all internal boards
R [19] I The system shall be able to detect transmission errors
R [631] F The system shall continue processing data as long as a call is active.

I've set up a loop to process each line in the input file, and what I'd
like to get is four separate variables containing on a line-by-line
basis the data corresponding to the four distinct columns. The problem
is my regexp experience is next to nothing, and I can't figure out how
to extract the data I want since my fourth column contains whitespace
(I'd have used that as my column separator otherwise).

Here's my loop:

File.open(textfile, "r") do |input_file|
  while line = input_file.gets
    output_file << line
  end
end

What can I replace the simple copy statement (output_file << line) with
in order to get what I want?

Thanks in advance, I hope this question makes some sense.

James

--
Posted via http://www.ruby-....

5 Answers

Marcin Mielzynski

9/15/2006 12:08:00 AM

James Calivar wrote:
> Hello,
>
> I'm trying to split a formatted text file into four separate columns.
> The data is comprised of lines of text that are bundled into four
> distinct columns, corresponding to a "Required versus Optional"
> variable, a requirement number, a requirement classification (R1=Rev 1,
> F=Future, I=Internal), and a textual description of the requirement.
>
> My raw data looks like this in the input text file:
>
> R [01] R1 The system shall support "emergency call processing"
> R [02] R1 The system shall support "local call processing"
> R [08] F The system shall provide a command-line user interface
> R [723] F The system shall provide 6 10/100/1000 Ethernet interfaces
> R [11] F The system shall support VoIP networks
> R [398] R1 The system shall contain 2 control boards
> O [327] I The system should support hotswapping of all internal boards
> R [19] I The system shall be able to detect transmission errors
> R [631] F The system shall continue processing data as long as a call is
> active.
>

try this one

open("file").read.scan(/(\w)\s+(.+?)\s+(\w+)\s+(.*?)\n?$/){|req,num,cls,dsc|
....}

lopex

Marcin Mielzynski

9/15/2006 12:11:00 AM

Marcin Mielżyński wrote:

Ooops,

the newline in regexp is not needed...
>
> try this one
>
> open("file").read.scan(/(\w)\s+(.+?)\s+(\w+)\s+(.*?)$/){|req,num,cls,dsc|
> ...}
>
> lopex

lopex
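
For readers new to regexps, here is a rough per-line sketch of the same idea, written in extended (/x) mode so that each capture group can carry a comment. The names REQ_LINE and parse_requirement are made up for illustration and appear nowhere in the thread, and the pattern assumes the requirement number is always digits in square brackets, as in the sample data.

```ruby
# Hypothetical names (REQ_LINE, parse_requirement), for illustration only.
REQ_LINE = /
  \A
  (?<req>\w)        # column 1: one word character, such as R or O
  \s+
  (?<num>\[\d+\])   # column 2: digits inside square brackets (assumed format)
  \s+
  (?<cls>\w+)       # column 3: classification, such as R1, F or I
  \s+
  (?<dsc>.*?)       # column 4: the rest of the line, inner spaces included
  \s*\z
/x

# Returns the four columns as an array, or nil when the line does not match.
def parse_requirement(line)
  m = REQ_LINE.match(line)
  m && [m[:req], m[:num], m[:cls], m[:dsc]]
end

p parse_requirement(%q{R [01] R1 The system shall support "emergency call processing"})
```

Only the first three columns are matched strictly; the last group simply takes whatever is left, which is why whitespace inside the description is not a problem.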

James Gray

9/15/2006 12:14:00 AM

On Sep 14, 2006, at 6:37 PM, James Calivar wrote:

> What can I replace the simple copy statement (output_file << line)
> with
> in order to get what I want?

My wife, Dana Gray, is still learning Ruby so I gave her this problem
as a test. ;) She suggests the code below.

James Edward Gray II

DATA.each do |line|
  line =~ /^(\w)\s+(\S+)\s+(\S+)\s+(.+)/
  p [$1, $2, $3, $4]
end

__END__
R [01] R1 The system shall support "emergency call processing"
R [02] R1 The system shall support "local call processing"
R [08] F The system shall provide a command-line user interface
R [723] F The system shall provide 6 10/100/1000 Ethernet interfaces
R [11] F The system shall support VoIP networks
R [398] R1 The system shall contain 2 control boards
O [327] I The system should support hotswapping of all internal boards
R [19] I The system shall be able to detect transmission errors
R [631] F The system shall continue processing data as long as a call is active.

Mike Stok

9/15/2006 12:16:00 AM

On 14-Sep-06, at 7:37 PM, James Calivar wrote:

> Hello,
>
> I'm trying to split a formatted text file into four separate columns.
> The data is comprised of lines of text that are bundled into four
> distinct columns, corresponding to a "Required versus Optional"
> variable, a requirement number, a requirement classification
> (R1=Rev 1,
> F=Future, I=Internal), and a textual description of the requirement.
>
> My raw data looks like this in the input text file:
>
> R [01] R1 The system shall support "emergency call processing"
> R [02] R1 The system shall support "local call processing"
> R [08] F The system shall provide a command-line user interface
> R [723] F The system shall provide 6 10/100/1000 Ethernet interfaces
> R [11] F The system shall support VoIP networks
> R [398] R1 The system shall contain 2 control boards
> O [327] I The system should support hotswapping of all internal boards
> R [19] I The system shall be able to detect transmission errors
> R [631] F The system shall continue processing data as long as a
> call is
> active.
>
> I've set up a loop to process each line in the input file, and what
> I'd
> like to get is four separate variables containing on a line-by-line
> basis the data corresponding to the four distinct columns. The
> problem
> is my regexp experience is next to nothing, and I can't figure out how
> to extract the data I want since my fourth column contains whitespace
> (I'd have used that as my column separator otherwise).
>
> Here's my loop:
>
> File.open(textfile, "r") do |input_file|
> while line = input_file.gets
> output_file << line
> end
> end
>
> What can I replace the simple copy statement (output_file << line)
> with
> in order to get what I want?
>
> Thanks in advance, I hope this question makes some sense.

You have a number of options - if your data is tab delimited (i.e.
the first "two" columns are really one):

s = "R [01]\tR1\tThe system shall support \"emergency call processing\""
p s.split(/\t/)

=> ["R [01]", "R1", "The system shall support \"emergency call processing\""]

or you can just split on whitespace and specify a limit on the number
of fields:

s = 'R [01] R1 The system shall support "emergency call processing"'
p s.split(/\s+/, 4)

=> ["R", "[01]", "R1", "The system shall support \"emergency call processing\""]

Or you can use a regex (ick ;-)

Hope this helps,

Mike

--

Mike Stok <mike@stok.ca>
http://www.stok...

The "`Stok' disclaimers" apply.
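
As a possible answer to the original "what do I replace output_file << line with" question, the limited split could be dropped into the posted loop roughly as sketched here. This is only an illustration: the StringIO object stands in for the real file so the snippet runs on its own, and the rows array and the tab-joined output are assumptions, not something the original poster asked for.

```ruby
require "stringio"

# Stand-in for the real input (assumption); with an actual file, keep the
# File.open(textfile, "r") do |input_file| ... end wrapper from the question.
input_file = StringIO.new(<<~DATA)
  R [01] R1 The system shall support "emergency call processing"
  O [327] I The system should support hotswapping of all internal boards
DATA

rows = []
while line = input_file.gets
  # With a limit of 4, only the first three runs of whitespace act as
  # separators; the remainder of the line becomes the fourth field intact.
  req, num, cls, dsc = line.chomp.split(/\s+/, 4)
  rows << [req, num, cls, dsc]
end

# Print the four columns separated by tabs.
rows.each { |row| puts row.join("\t") }
```

Note that a line starting with whitespace would yield an empty first field with this approach, so stripping the line first may be worthwhile if the input is not as clean as the sample.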

Steven Hansen

9/15/2006 2:47:00 PM

I suck at regex too. I tried this as an exercise and came up with the
code below. It's less concise than the previous solutions, but it works
as far as I can tell:

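Row = Struct.new(:col1, :col2, :col3, :col4)
rows = []
# One uppercase flag, a bracketed number, a classification, then the description.
regex = /([A-Z])\s(\[[0-9]+\])\s([A-Z0-9]+)\s(.+)/

File.open("file.txt") do |file|
  while (line = file.gets)
    m = line.match(regex)
    next unless m  # skip blank or malformed lines instead of failing on nil
    rows << Row.new(m[1], m[2], m[3], m[4])
  end
end

# Struct#to_s is the inspect form, so each row prints as #<struct Row ...>
puts rows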

#output =>

#<struct Row col1="R", col2="[01]", col3="R1", col4="The system shall
support \"emergency call processing\"">
#<struct Row col1="R", col2="[02]", col3="R1", col4="The system shall
support \"local call processing\"">
#<struct Row col1="R", col2="[08]", col3="F", col4="The system shall
provide a command-line user interface">
#<struct Row col1="R", col2="[723]", col3="F", col4="The system shall
provide 6 10/100/1000 Ethernet interfaces">
#<struct Row col1="R", col2="[11]", col3="F", col4="The system shall
support VoIP networks">
#<struct Row col1="R", col2="[398]", col3="R1", col4="The system shall
contain 2 control boards">
#<struct Row col1="O", col2="[327]", col3="I", col4="The system should
support hotswapping of all internal boards">
#<struct Row col1="R", col2="[19]", col3="I", col4="The system shall be
able to detect transmission errors">
#<struct Row col1="R", col2="[631]", col3="F", col4="The system shall
continue processing data as long as a call is active.">

-Steven


comp.lang.ruby
