sukhchander
8/12/2006 8:26:00 PM
Hi,
I worked on the regexp some more before I saw everyone's response.
I was able to extract all parts except for the day hour. I was treating
- as "-" and the literals A as "A" and P as "P" so I didn't hit any
matches.
I see you created line breaks with each component of the REGEXP. I will
follow that convention from now on.
I also now understand the difference between .* and \s+ as many of you
have pointed out.
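
A minimal sketch of that difference, using the sample line from the quoted message. The variable names (`line`, `loose`, `tight`) are only for illustration. With `.*` between groups, the greedy `.*` eats nearly everything and `(\d?)` is free to match nothing. With a required `\s+` separator and a `$` anchor, each group is pinned to a whole field.

```ruby
# Sample line copied from the quoted message below.
line = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"

# Loose: .* between the groups. The first .* is greedy and only gives back
# a single character, and (\d?) is happy to match the empty string.
loose = line.match(/(\w+).*(\d?).*(\w{1,14}).*/)
puts loose[2].inspect   # credits group: empty string
puts loose[3].inspect   # professor group: only the last letter

# Tight: \s+ must match real whitespace, (\d+) needs at least one digit,
# and $ anchors the last word to the end of the line.
tight = line.match(/(.*)\s+(\d+)\s+(\w+)\s*$/)
puts tight[2].inspect   # credits
puts tight[3].inspect   # professor
```

Both patterns match the line, so a plain `assert line =~ regex` passes either way. Only the captured groups show the difference.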
I'm new to Ruby as well and will continue to experiment with it some
more.
Thanks for your responses.
[sukhchander]
Jan Svitok wrote:
> On 8/12/06, skelastic@gmail.com <skelastic@gmail.com> wrote:
> > Greetings,
> >
> > I'm trying to parse the following line:
> > ...
>
> Hi,
>
> although in this case I'd prefer the array.split solution, here's how
> it can be done in case you really need a regex:
> These are incremental versions of the regex, and a test to check them.
> Save to file and enjoy!
>
> Jano
>
> #!/usr/bin/ruby
> require 'test/unit'
>
> DATA = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"
>
> REGEX1 = /(\d{5}).(\D\s\w{2,4}).(\d{1,4}\s\D{3}).(\D{1,4}\s\d+).*(\d{1,4}\s\D{1,9}).(\w+).*(\d?).*(\w{1,14}).*/
>
> # add /x and # comments,
> REGEX2 = /
> (\d{5}). # control
> (\D\s\w{2,4}). # course
> (\d{1,4}\s\D{3}). # section
> (\D{1,4}\s\d+).* # day-hour
> (\d{1,4}\s\D{1,9}). # room
> (\w+).* # course-name
> (\d?).* # credits
> (\w{1,14}).* # professor
> /x
>
> # now we'll fix the day-hour: add -\d[AP], which will match a hyphen,
> # a digit and either 'A' or 'P'
> # and fix the room: replace \D{1,9} with \w+
> REGEX3 = /
> (\d{5}). # control
> (\D\s\w{2,4}). # course
> (\d{1,4}\s\D{3}). # section
> (\D{1,4}\s\d+-\d[AP]). # day-hour
> (\d{1,4}\s\w+). # room
> (\w+).* # course-name
> (\d?).* # credits
> (\w{1,14}).* # professor
> /x
>
> # To fix course name, the previous tricks aren't enough --
> # there are many words of different lengths. So what will we do?
> # We'll parse the things at the end: credits and professor
> #
> # To see the results, temporarily comment out the lines
> # that check the course name and credits in the test
> # and run it with REGEX3.
> #
> # To fix the professor, we'll say that it's the last word on the line:
> # notice the \s+ before the professor group - there has to be something
> # fixed that separates the name from the rest - .* won't do it.
> REGEX4 = /
> (\d{5}). # control
> (\D\s\w{2,4}). # course
> (\d{1,4}\s\D{3}). # section
> (\D{1,4}\s\d+-\d[AP]). # day-hour
> (\d{1,4}\s\w+). # room
> (\w+).* # course-name
> (\d?)\s+ # credits
> (\w+)\s*$ # professor
> /x
>
> # Now we can try the remaining two pieces: uncomment credits and
> # we'll see that they are already ok, so uncomment course name as well.
> #
> # Only the first word appears. So we'll move .* inside the parentheses
> # and add a separating \s+
> #
> # Finally some small touches:
> # replace separating . with \s+
> REGEX5 = /
> (\d{5})\s+ # control
> (\D\s\w{2,4})\s+ # course
> (\d{1,4}\s\D{3})\s+ # section
> (\D{1,4}\s\d+-\d[AP])\s+ # day-hour
> (\d{1,4}\s\w+)\s+ # room
> (.*)\s+ # course-name
> (\d+)\s+ # credits
> (\w+)\s*$ # professor
> /x
>
> class TestRegex < Test::Unit::TestCase
>   def test_regex
>     assert DATA =~ REGEX1 # <--- change number here
>     assert_equal "00608", $1
>     assert_equal "P 135", $2
>     assert_equal "001 LEC", $3
>     assert_equal "Tu 2-5P", $4
>     assert_equal "210 WHEELER", $5
>     assert_equal "IT and Society", $6
>     assert_equal "3", $7
>     assert_equal "LAGUERRE", $8
>   end
> end
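
For comparison, the split solution mentioned at the top of the quoted message is not shown in this thread. Here is a minimal sketch of what it might look like. It assumes the fields are separated by whitespace and that every field except the course name has a fixed number of tokens. The `fields` hash and its keys are illustrative names, not taken from the thread.

```ruby
# Sample line copied from the quoted message above.
line = "00608 P 135 001 LEC Tu 2-5P 210 WHEELER IT and Society 3 LAGUERRE"

# String#split with no argument splits on runs of whitespace.
tokens = line.split

# Assumption: the first nine tokens and the last two have fixed positions;
# whatever is left in the middle is the course name.
fields = {
  control:     tokens[0],
  course:      tokens[1..2].join(" "),
  section:     tokens[3..4].join(" "),
  day_hour:    tokens[5..6].join(" "),
  room:        tokens[7..8].join(" "),
  course_name: tokens[9..-3].join(" "),
  credits:     tokens[-2],
  professor:   tokens[-1]
}

fields.each { |name, value| puts "#{name}: #{value}" }
```

Counting from both ends sidesteps the variable-length course name, which is the same idea REGEX4 and REGEX5 use when they anchor the credits and professor to the end of the line. It would break on lines where the room or professor spans a different number of tokens.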