Asp Forum - Regexp Parsing -- What's the right way?

skelastic

8/12/2006 5:46:00 AM

Greetings,

I'm trying to parse the following line:

"00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity 3
LAGUERRE"

i've constructed the following regexp:
/(\d{5}).(\D\s\w{2,4}).(\d{1,4}\s\D{3}).(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).(\w+).*(\d?).*(\w{1,14}).*/

With an input file I've produced the following output:
control# 00608 ---- correct
course#: P 135 ---- correct
section#: 001 LEC ---- correct
day-hour#: Tu 2 ---- missing '-5P
room#: 3 LAGUERRE, ---- should be 210 WHEELER
course-name#: M --- IT and Soceity
credits#: --- should be 3
prof#: 5 --- should be LAGUERRE

I'm a novice at Ruby and regexps, and I would like to know if I'm taking the
right approach.
I'll eventually nail it, but any hints or suggestions would be useful.

appreciate the help.

6 Answers

Simon Kröger

8/12/2006 7:33:00 AM

skelastic@gmail.com wrote:
> Greetings,
>
> I'm trying to parse the following line:
>
> "00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity 3
> LAGUERRE"
>
> i've constructed the following regexp:
> /(\d{5}).(\D\s\w{2,4}).(\d{1,4}\s\D{3}).(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).(\w+).*(\d?).*(\w{1,14}).*/
>
> with a input file i've successfully produced the following output:
> control# 00608 ---- correct
> course#: P 135 ---- correct
> section#: 001 LEC ---- correct
> day-hour#: Tu 2 ---- missing '-5P
> room#: 3 LAGUERRE, ---- should be 210 WHEELER
> course-name#: M --- IT and Soceity
> credits#: --- should be 3
> prof#: 5 --- should be LAGUERRE
>
> i'm a novice to ruby and regexp. i would like to know if i'm taking the
> right approach.
> i'll eventually nail it but any hints or suggestions would be useful.
>
> appreciate the help.

I would go with split in this case:

t = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity 3 LAGUERRE"
a = t.split
#strip from the beginning
control = a.shift
course = a.shift + ' ' + a.shift
section = a.shift + ' ' + a.shift
hour = a.shift + ' ' + a.shift
room = a.shift + ' ' + a.shift
#strip from behind
prof = a.pop
credits = a.pop
#the rest is the name
coursen = a.join(' ')

puts "control: #{control}"
puts "course: #{course}"
puts "section: #{section}"
puts "hour: #{hour}"
puts "room: #{room}"
puts "coursen: #{coursen}"
puts "credits: #{credits}"
puts "prof: #{prof}"

cheers

Simon

Jano Svitok

8/12/2006 9:10:00 AM

On 8/12/06, skelastic@gmail.com <skelastic@gmail.com> wrote:
> Greetings,
>
> I'm trying to parse the following line:
> ...

Hi,

Although in this case I'd prefer the split solution, here's how
it can be done if you really need a regex.
Below are incremental versions of the regex, and a test to check them.
Save to a file and enjoy!

Jano

#!/usr/bin/ruby
require 'test/unit'

DATA = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"

REGEX1 = /(\d{5}).(\D\s\w{2,4}).(\d{1,4}\s\D{3}).(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).(\w+).*(\d?).*(\w{1,14}).*/

# add /x and # comments,
REGEX2 = /
(\d{5}). # control
(\D\s\w{2,4}). # course
(\d{1,4}\s\D{3}). # section
(\D{1,4}\s\d+).* # day-hour
(\d{1,4}\s\D{1,9}). # room
(\w+).* # course-name
(\d?).* # credits
(\w{1,14}).* # professor
/x

# now we'll fix the day-hour: add -\d[AP], which will match a hyphen,
# a digit and either 'A' or 'P'
# and fix the room: replace \D{1,9} with \w+
REGEX3 = /
(\d{5}). # control
(\D\s\w{2,4}). # course
(\d{1,4}\s\D{3}). # section
(\D{1,4}\s\d+-\d[AP]). # day-hour
(\d{1,4}\s\w+). # room
(\w+).* # course-name
(\d?).* # credits
(\w{1,14}).* # professor
/x

# To fix the course name, the previous tricks aren't enough:
# it has several words of varying length. So what do we do?
# We parse the things at the end first: credits and professor.
#
# To see the results, temporarily comment out the lines
# that check the course name and credits in the test
# and run it with REGEX3.
#
# To fix the professor, we'll say that it's the last word on the line.
# Notice the \s+ before the professor group: there has to be something
# fixed that separates the name from the rest; .* won't do it.
REGEX4 = /
(\d{5}). # control
(\D\s\w{2,4}). # course
(\d{1,4}\s\D{3}). # section
(\D{1,4}\s\d+-\d[AP]). # day-hour
(\d{1,4}\s\w+). # room
(\w+).* # course-name
(\d?)\s+ # credits
(\w+)\s*$ # professor
/x

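# Now we can try the remaining two pieces: uncomment the credits and
# course name checks.
#
# Credits may still come out empty: the greedy .* in front of it can
# swallow the digit, and \d? is allowed to match nothing. Requiring at
# least one digit with \d+ fixes that.
#
# For the course name only the first word appears. So we'll move .*
# inside the parentheses and add a separating \s+
#
# Finally some small touches:
# replace separating . with \s+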
REGEX5 = /
(\d{5})\s+ # control
(\D\s\w{2,4})\s+ # course
(\d{1,4}\s\D{3})\s+ # section
(\D{1,4}\s\d+-\d[AP])\s+# day-hour
(\d{1,4}\s\w+)\s+ # room
(.*)\s+ # course-name
(\d+)\s+ # credits
(\w+)\s*$ # professor
/x

class TestRegex < Test::Unit::TestCase
def test_regex
assert DATA =~ REGEX1 # <--- change number here
assert_equal "00608", $1
assert_equal "P 135", $2
assert_equal "001 LEC", $3
assert_equal "Tu 2-5P", $4
assert_equal "210 WHEELER", $5
assert_equal "IT and Society", $6
assert_equal "3", $7
assert_equal "LAGUERRE", $8
end
end
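
As a follow-up sketch (not part of the post above): the final pattern could also be written with named groups, which Ruby supports from 1.9 on, so that fields are read by name instead of through $1..$8. The constant COURSE_LINE and the group names are made up for illustration; the sub-patterns are copied unchanged from REGEX5.

```ruby
# Same sub-patterns as REGEX5; only the (?<label>...) names are new.
COURSE_LINE = /
  (?<control>\d{5})\s+
  (?<course>\D\s\w{2,4})\s+
  (?<section>\d{1,4}\s\D{3})\s+
  (?<hour>\D{1,4}\s\d+-\d[AP])\s+
  (?<room>\d{1,4}\s\w+)\s+
  (?<name>.*)\s+
  (?<credits>\d+)\s+
  (?<prof>\w+)\s*$
/x

line = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"
m = COURSE_LINE.match(line)

if m
  # Each field is looked up by its group name.
  puts "control: #{m[:control]}"
  puts "hour:    #{m[:hour]}"
  puts "room:    #{m[:room]}"
  puts "name:    #{m[:name]}"
  puts "credits: #{m[:credits]}"
  puts "prof:    #{m[:prof]}"
else
  puts "line does not match the expected layout"
end
```

Reading fields by name means the calling code does not have to change if a group is later added or moved.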

Robert Klemme

8/12/2006 10:31:00 AM

skelastic@gmail.com wrote:
> Greetings,
>
> I'm trying to parse the following line:
>
> "00608 P 135 001 LEC Tu 2-5P 210 WHEELER Information Tech and Soceity
> 3 LAGUERRE"
>
> i've constructed the following regexp:
> /(\d{5}).(\D\s\w{2,4}).(\d{1,4}\s\D{3}).(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).(\w+).*(\d?).*(\w{1,14}).*/
>
> with a input file i've successfully produced the following output:
> control# 00608 ---- correct
> course#: P 135 ---- correct
> section#: 001 LEC ---- correct
> day-hour#: Tu 2 ---- missing '-5P
> room#: 3 LAGUERRE, ---- should be 210 WHEELER
> course-name#: M --- IT and Soceity
> credits#: --- should be 3
> prof#: 5 --- should be LAGUERRE
>
> i'm a novice to ruby and regexp. i would like to know if i'm taking
> the right approach.
> i'll eventually nail it but any hints or suggestions would be useful.
>
> appreciate the help.

Looks pretty OK to me, except that I'd use \s instead of . to match the
whitespace separating entries.

robert
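
A small sketch of why the separator matters, using only the first two groups of the original pattern. The constant names and the malformed input are made up for illustration: because . matches any character, the loose pattern also accepts a line with a stray character where the space should be.

```ruby
LOOSE  = /(\d{5}).(\D\s\w{2,4})/    # '.' as separator, as in the original pattern
STRICT = /(\d{5})\s(\D\s\w{2,4})/   # '\s' as separator

GOOD = "00608 P 135"
BAD  = "00608XP 135"   # made-up input: 'X' where the space should be

[GOOD, BAD].each do |s|
  loose_ok  = !(s =~ LOOSE).nil?
  strict_ok = !(s =~ STRICT).nil?
  puts "#{s.inspect}: loose=#{loose_ok} strict=#{strict_ok}"
end
# "00608 P 135": loose=true strict=true
# "00608XP 135": loose=true strict=false
```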

sukhchander

8/12/2006 8:26:00 PM

Hi,

I worked on the regexp some more before I saw everyone's response.
I was able to extract all parts except for the day hour. I was treating
- as "-" and the literals A as "A" and P as "P" so I didn't hit any
matches.

I see you created line breaks with each component of the REGEXP. I will
follow that convention from now on.

I also now understand the difference between .* and \s+ as many of you
have pointed out.

I'm new to Ruby as well and will continue to experiment with it some
more.

Thanks for your responses.
[sukhchander]
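
To illustrate the .* versus \s+ point, here is a minimal sketch on just the tail of the line. The constant names are made up; the first pattern is the end of the original regexp, the second is the end of the corrected one. A greedy .* first grabs everything up to the end of the line and only gives characters back when forced to, so the optional \d? ends up matching nothing and the last group gets a single leftover character.

```ruby
TAIL = "Society 3 LAGUERRE"

# End of the original pattern: groups separated by greedy .*
GREEDY   = /(\w+).*(\d?).*(\w{1,14}).*/
# End of the corrected pattern: explicit whitespace and an end-of-line anchor
ANCHORED = /(\w+)\s+(\d+)\s+(\w+)\s*$/

p GREEDY.match(TAIL).captures    # ["Society", "", "E"]
p ANCHORED.match(TAIL).captures  # ["Society", "3", "LAGUERRE"]
```

The empty credits field and the one-character professor field mirror the symptoms in the original question.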


sukhchander

8/12/2006 8:29:00 PM

Hi Simon,

That's pretty cool.
I was looking for a utility similar to Java's StringTokenizer. You just
pointed it out.
Ruby has so many things built in. It's very comprehensive.

For larger regexp I assume you prefer the split/tokenize method?

I went with the Regexp approach because it occurred to me first.

Thanks.
[sukhchander]


Robert Klemme

8/12/2006 10:11:00 PM

sukhchander wrote:
> Hi Simon,
>
> That's pretty cool.
> I was looking for a utility similar to Java's StringTokenizer. You just
> pointed it out.
> Ruby has so many things built in. It's very comprehensive.
>
> For larger regexp I assume you prefer the split/tokenize method?
>
> I went with the Regexp approach because it occurred to me first.

Personally I'd stick with the regexp approach as it has these advantages:

- probably faster because you don't have to split and then combine again

- more precise with regard to matching, i.e. you can better define
where to match plus you get the info whether the input string is
properly formatted
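
The second point can be sketched as follows. This is an illustration only: the helper names and the malformed line are made up, the pattern is copied from REGEX5 earlier in the thread, and the split helper follows the shift/pop idea from Simon's answer. The regexp returns nil for a line of the wrong shape, while split happily hands back eight fields.

```ruby
# Sub-patterns copied from REGEX5 earlier in the thread.
LINE_RE = /
  (\d{5})\s+
  (\D\s\w{2,4})\s+
  (\d{1,4}\s\D{3})\s+
  (\D{1,4}\s\d+-\d[AP])\s+
  (\d{1,4}\s\w+)\s+
  (.*)\s+
  (\d+)\s+
  (\w+)\s*$
/x

# Returns the eight fields, or nil when the line has the wrong shape.
def fields_by_regex(line)
  m = LINE_RE.match(line)
  m && m.captures
end

# Same shift/pop idea as the split answer; it cannot tell good input from bad.
def fields_by_split(line)
  a = line.split
  control = a.shift
  course, section, hour, room = Array.new(4) { a.shift(2).join(' ') }
  prof    = a.pop
  credits = a.pop
  [control, course, section, hour, room, a.join(' '), credits, prof]
end

good = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"
# Made-up malformed line with enough words to get through the split version.
bad  = "NOTE: section 001 of P 135 moved to 210 WHEELER per Prof LAGUERRE"

p fields_by_regex(good)   # the eight expected fields
p fields_by_regex(bad)    # nil: the line is rejected
p fields_by_split(bad)    # eight fields that look plausible but mean nothing
```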

Btw, if you want to dive into regexps I can recommend "Mastering Regular
Expressions". It's probably best to first get some basic knowledge of
RX, but if you want to know how to build efficient RX etc. then that book
is definitely a great help. Ah, I get carried away...

There are also tools that help in understanding RX visually, such as
RegexBuddy and Regex-Coach.

Kind regards

robert

comp.lang.ruby
