Asp Forum - Regex Question

seijin

7/24/2007 4:46:00 PM

I recently started studying Ruby (1.8.6) and things are going well.

However, something bothered me when I got to the regular expression
section. This is my first time deal with regular expressions directly
so I'm completely new to things. I understand the idea behind regular
expressions and most of the usage but two of them stuck out for me.

I'm studying the 2nd edition Pragmatic book and using their
show_regexp method...

def show_regexp(a, re)
if a =~ re
"#{$`}<<#{$&}>>#{$'}"
else
"no match"
end
end

....hopefully I don't get in trouble posting that method here. So here
are my two problems...

First, why does ... show_regexp("banana", /(an)*/) ... not match
"anan" ? I thought it was a greedy algorithm that tried to match as
much as possible?

Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
Why in the world does it match "ississ" rather than returning no
match? I think most of my problem with this one is not understanding
the underlying logic used when doing pattern matching with back
references. Does it go through and check "Mississippi", "Mississipp"
down to "M" and then "i", "is", "iss", etc? Like trying all possible
combinations in a lock? I would be extremely grateful if someone
would do a short step-by-step of how it matches "ississ". I
understand what the "\1" does, I just don't understand how the first
part even gets to the first "iss".

Thanks a lot!

6 Answers

Rob Biedenharn

7/24/2007 5:27:00 PM

On Jul 24, 2007, at 12:50 PM, seijin@gmail.com wrote:
> I recently started studying Ruby (1.8.6) and things are going well.
>
> However, something bothered me when I got to the regular expression
> section. This is my first time deal with regular expressions directly
> so I'm completely new to things. I understand the idea behind regular
> expressions and most of the usage but two of them stuck out for me.
>
> I'm studying the 2nd edition Pragmatic book and using their
> show_regexp method...
>
> def show_regexp(a, re)
> if a =~ re
> "#{$`}<<#{$&}>>#{$'}"
> else
> "no match"
> end
> end
>
> ...hopefully I don't get in trouble posting that method here. So here
> are my two problems...
>
> First, why does ... show_regexp("banana", /(an)*/) ... not match
> "anan" ? I thought it was a greedy algorithm that tried to match as
> much as possible?

Well, it would but the * is 'zero or more times' and it can match
zero "an"s before it ever needs to get past the 'b'. If you change
the '*' to a '+' and force 'one or more times', then it has to find
the first "an" and will then greedily consume the next "an" also.

> Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
> Why in the world does it match "ississ" rather than returning no
> match? I think most of my problem with this one is not understanding
> the underlying logic used when doing pattern matching with back
> references. Does it go through and check "Mississippi", "Mississipp"
> down to "M" and then "i", "is", "iss", etc? Like trying all possible
> combinations in a lock? I would be extremely grateful if someone
> would do a short step-by-step of how it matches "ississ". I
> understand what the "\1" does, I just don't understand how the first
> part even gets to the first "iss".
>
> Thanks a lot!

The '+' is greedy so I think it tries \w+ as Mississippi and finds
that there's no following copy, backs up to try Mississipp and finds
that 'i' isn't a copy, and so on back to 'M' before moving past 'M'
to 'ississippi', 'ississipp', 'ississip', etc. until 'iss' has a
following copy and a match is declared. Think of it this way: Until
the regexp gets to the end, it doesn't know that the string isn't
"MississippiMississipp".

-Rob

Rob Biedenharn http://agileconsult...
Rob@AgileConsultingLLC.com

Bill Kelly

7/24/2007 5:31:00 PM

From: <seijin@gmail.com>
>
> def show_regexp(a, re)
> if a =~ re
> "#{$`}<<#{$&}>>#{$'}"
> else
> "no match"
> end
> end
>
> First, why does ... show_regexp("banana", /(an)*/) ... not match
> "anan" ? I thought it was a greedy algorithm that tried to match as
> much as possible?

It is greedy, but it's giving you the *first* match it
can find. Since it's allowed to match zero-or-more times
(because of the *), it's finding a match at the beginning
of the string. From that standpoint, as greedy as it can
be is to match zero characters.

> Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
> Why in the world does it match "ississ" rather than returning no
> match? I think most of my problem with this one is not understanding
> the underlying logic used when doing pattern matching with back
> references. Does it go through and check "Mississippi", "Mississipp"
> down to "M" and then "i", "is", "iss", etc? Like trying all possible
> combinations in a lock? I would be extremely grateful if someone
> would do a short step-by-step of how it matches "ississ". I
> understand what the "\1" does, I just don't understand how the first
> part even gets to the first "iss".

Essentially it scans the string for the first match it can find
that satisfies the expression.

Maybe it's like trying all possible locks that fit the key. ;)

I don't know for sure what order it tries the combinations, but
it's probably something like:

MM
MiMi
MisMis
MissMiss
...
MississippiMississippi
<no match, so advance to next character in string>
ii
isis
ississ
<match found>

Hope this helps,

Bill

Martin DeMello

7/24/2007 5:33:00 PM

On 7/24/07, seijin@gmail.com <seijin@gmail.com> wrote:
>
> First, why does ... show_regexp("banana", /(an)*/) ... not match
> "anan" ? I thought it was a greedy algorithm that tried to match as
> much as possible?

It does, but it also returns the *first* available match, which in the
case of 'banana' is the empty string. Note that * matches 0 or more
iterations. + will work as expected:

>> show_regexp("banana", /(an)+/)
=> "b<<anan>>a"

> Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
> Why in the world does it match "ississ" rather than returning no
> match? I think most of my problem with this one is not understanding
> the underlying logic used when doing pattern matching with back
> references. Does it go through and check "Mississippi", "Mississipp"
> down to "M" and then "i", "is", "iss", etc? Like trying all possible
> combinations in a lock? I would be extremely grateful if someone
> would do a short step-by-step of how it matches "ississ". I
> understand what the "\1" does, I just don't understand how the first
> part even gets to the first "iss".

(\w+)\1 matches one or more letters, twice in a row. The empty string
doesn't match the "one or more bit". This is just the opposite case to
your last question - (\w*)\1 matches the empty string. What the engine
does is match (\w+) against M, then \1 against another M which fails.
Likewise for Mi, Mis, Miss etc, till it hits the end and moves on to
i, is, iss... when it suddenly succeeds. Illustrative examples:

>> show_regexp("Mississippi", /(\w*)\1/)
=> "<<>>Mississippi"
>> show_regexp("MississippiMississippi", /(\w*)\1/)
=> "<<MississippiMississippi>>"
>> show_regexp("bbMississippiMississippi", /(\w*)\1/)
=> "<<bb>>MississippiMississippi"

martin

martin

Martin DeMello

7/24/2007 5:35:00 PM

On 7/24/07, Martin DeMello <martindemello@gmail.com> wrote:
> your last question - (\w*)\1 matches the empty string. What the engine
> does is match (\w+) against M, then \1 against another M which fails.
> Likewise for Mi, Mis, Miss etc, till it hits the end and moves on to
> i, is, iss... when it suddenly succeeds. Illustrative examples:

Oops - Rob is right, of course - + being greedy, it tries \w+ =
Mississippi first, then shrinks from there:

>> show_regexp("MiMississippiMiMississippi", /(\w*)\1/)
=> "<<MiMississippiMiMississippi>>"

martin

seijin

7/24/2007 7:04:00 PM

Oh, I see. So using the "*" with just one ... errr... thing will
never match it? "banana" =~ /a*/ will always be 0 but "banana" =~ /
a*n/ will be 1 (one) ? It's just confusing because I think of
"greedy" as it trying to match as many characters as possible. So
that although "*" tell it to find 0 or more that it would work hard to
return the maximum number of matches. Perhaps the difference between
"Find as many a's as possible" and "Find as many a's or nothing as
possible" ? If you see what I'm trying to say. No, I think I just
figured out what my problem is. I'm taking it out of context. The
"*" was created for use other arguments. I keep getting stuck on
using it with just one argument which is not what it was intended for.

And for Mississippi - is that method of matching only used when using
back references via the \1-9 ? If so, I can set my mind at ease as
that makes sense. The thing that was bothering me was that I didn't
see why it would work so hard to find the match. I thought it was a
simple... (\w+) matches "Mississippi" but there is no
"MississippiMississippi" so I'll return nil... type of thing. I
didn't realize it'd start scanning the string aggresively.

Thanks for the reply!

On Jul 24, 10:27 am, Rob Biedenharn <R...@AgileConsultingLLC.com>
wrote:
> On Jul 24, 2007, at 12:50 PM, sei...@gmail.com wrote:
>
>
>
>
>
> > I recently started studying Ruby (1.8.6) and things are going well.
>
> > However, something bothered me when I got to the regular expression
> > section. This is my first time deal with regular expressions directly
> > so I'm completely new to things. I understand the idea behind regular
> > expressions and most of the usage but two of them stuck out for me.
>
> > I'm studying the 2nd edition Pragmatic book and using their
> > show_regexp method...
>
> > def show_regexp(a, re)
> > if a =~ re
> > "#{$`}<<#{$&}>>#{$'}"
> > else
> > "no match"
> > end
> > end
>
> > ...hopefully I don't get in trouble posting that method here. So here
> > are my two problems...
>
> > First, why does ... show_regexp("banana", /(an)*/) ... not match
> > "anan" ? I thought it was a greedy algorithm that tried to match as
> > much as possible?
>
> Well, it would but the * is 'zero or more times' and it can match
> zero "an"s before it ever needs to get past the 'b'. If you change
> the '*' to a '+' and force 'one or more times', then it has to find
> the first "an" and will then greedily consume the next "an" also.
>
> > Second, how does ... show_regexp("Mississippi", /(\w+)\1/) ... work?
> > Why in the world does it match "ississ" rather than returning no
> > match? I think most of my problem with this one is not understanding
> > the underlying logic used when doing pattern matching with back
> > references. Does it go through and check "Mississippi", "Mississipp"
> > down to "M" and then "i", "is", "iss", etc? Like trying all possible
> > combinations in a lock? I would be extremely grateful if someone
> > would do a short step-by-step of how it matches "ississ". I
> > understand what the "\1" does, I just don't understand how the first
> > part even gets to the first "iss".
>
> > Thanks a lot!
>
> The '+' is greedy so I think it tries \w+ as Mississippi and finds
> that there's no following copy, backs up to try Mississipp and finds
> that 'i' isn't a copy, and so on back to 'M' before moving past 'M'
> to 'ississippi', 'ississipp', 'ississip', etc. until 'iss' has a
> following copy and a match is declared. Think of it this way: Until
> the regexp gets to the end, it doesn't know that the string isn't
> "MississippiMississipp".
>
> -Rob
>
> Rob Biedenharn http://agileconsult...
> R...@AgileConsultingLLC.com- Hide quoted text -
>
> - Show quoted text -

dblack

7/24/2007 8:15:00 PM

comp.lang.ruby

Regex Question

seijin

Rob Biedenharn

Bill Kelly

Martin DeMello

Martin DeMello

seijin

dblack

x Login to ForumsZone