Asp Forum - Do You Understand Regular Expressions?

growlatoe

6/20/2007 10:17:00 PM

Hi all.

I'm pretty new to Ruby and that sort of thing, and I'm having a few
problems understanding regular expressions. I'm hoping one of you can
point me in the right direction.

I want to replace an entire string with another string. I know you
don't need regular expressions for that, but it's part of a more
generic approach. Anyway, the problem I'm having is that my regular
expressions are finding two matches instead of one, and I don't
understand why. I've narrowed down my confusion to the following code,
which shows some output from irb:

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

The same thing can be seen when substituting - this is closer to how
I'm using regular expressions in my code:

irb(main):001:0> "hello".gsub(/.*/, "P")
=> "PP"

Two substitutions are made and I expected one. So am I right or wrong
to expect one substitution?

Please help - this is driving me nuts!

And in case it helps...

$ ruby --version
ruby 1.8.5 (2006-08-25) [i486-linux]

Thanks in advance.

18 Answers

Tim Hunter

6/20/2007 10:31:00 PM

growlatoe@yahoo.co.uk wrote:
> irb(main):001:0> "hello".scan(/.*/)
> => ["hello", ""]
>
> I was expecting one match, not two, because .* matches everything,
> right? Can someone explain why an empty string is also matched?
>
Try anchoring the match: /^.*/

--
RMagick OS X Installer [http://rubyforge.org/project...]
RMagick Hints & Tips [http://rubyforge.org/forum/forum.php?for...]
RMagick Installation FAQ [http://rmagick.rubyforge.org/instal...]

Axel Etzold

6/20/2007 10:50:00 PM

> irb(main):001:0> "hello".scan(/.*/)
> => ["hello", ""]
>
> I was expecting one match, not two, because .* matches everything,
> right? Can someone explain why an empty string is also matched?

String#scan searches for all occurrences of the pattern (any number
of any character) here, so a run of zero characters also counts as a
match.

You can search for at least one occurrence like this:

"hello".scan(/.+/)

"hello".gsub(/.+/, "P") => 'P'

As an introduction, I find

http://www.regular-expressions.info...

quite instructive for the use of regexps in Ruby.
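
As a minimal sketch of the difference between the two quantifiers (the
variable names are only for illustration), the following compares *
and + under scan, gsub and sub on the same input:

```ruby
# Compare * (zero or more) with + (one or more) on the same input.
text = "hello"

star_matches = text.scan(/.*/)    # the whole string, then an empty match at the end
plus_matches = text.scan(/.+/)    # only the whole string

star_gsub = text.gsub(/.*/, "P")  # replaces both matches
plus_gsub = text.gsub(/.+/, "P")  # replaces the single match
star_sub  = text.sub(/.*/, "P")   # sub only ever replaces the first match

p star_matches  # => ["hello", ""]
p plus_matches  # => ["hello"]
p star_gsub     # => "PP"
p plus_gsub     # => "P"
p star_sub      # => "P"
```

Note that sub is unaffected either way, since it stops after the first
match.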

Best regards,

Axel

Daniel DeLorme

6/20/2007 11:16:00 PM

Axel Etzold wrote:
>> irb(main):001:0> "hello".scan(/.*/)
>> => ["hello", ""]
>>
>> I was expecting one match, not two, because .* matches everything,
>> right? Can someone explain why an empty string is also matched?
>
> String.scan searches for all occurrences of (any number of any
> character) here. So zero occurrences is one match.

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

Ryan Mcdonald

6/21/2007 12:49:00 AM

Daniel DeLorme wrote:
> Axel Etzold wrote:
>>> irb(main):001:0> "hello".scan(/.*/)
>>> => ["hello", ""]
>>>
>>> I was expecting one match, not two, because .* matches everything,
>>> right? Can someone explain why an empty string is also matched?
>>
>> String.scan searches for all occurrences of (any number of any
>> character) here. So zero occurrences is one match.
>
> That doesn't really explain why the regexp finds an extra empty string.
> I know that zero occurrences is one match but after a greedy match that
> matches everything, there should be (logically?) no other match. I am no
> stranger to regexps and the result is counter-intuitive to me; I would
> consider it a bug. Or at least a very very peculiar behavior.
>
> Daniel

I agree. Can someone explain why gsub, sub or scan matches with * are
different from =~ matches with *?

puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>
puts "hello".gsub(/.*/, '<\1>') # <><>
print "before: #{$`}\n" # before: hello
print "match: #{$&}\n" # match:
print "after: #{$'}\n" # after:

puts "hello" =~ (/.*/) # 0
print "before: #{$`}\n" # before:
print "match: #{$&}\n" # match: hello
print "after: #{$'}\n" # after:

thanks!


Karl-Heinz Wild

6/21/2007 8:43:00 AM

Hello Ryan

In message "Do You Understand Regular Expressions?"
on 21.06.2007, Ryan Mcdonald <ryemcdonald@gmail.com> writes:

RM> I agree. Can someone explain why gsub, sub or scan matches with * are
RM> different than =~ matches with *

RM> puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>

irb(main):024:0> "hello".gsub( /([aeiou])/, "<\\1>" )

Please note the () around the expression.
With that group in place, \\1 refers to the matched
letter.

RM> puts "hello".gsub(/.*/, '<\1>') # <><>

irb(main):029:0> "hello".gsub(/(.*)/, '<\1>')
=> "<hello><>"
irb(main):030:0> "hello".gsub(/(.+)/, '<\1>')
=> "<hello>"

RM> print "before: #{$`}\n" # before: hello

irb(main):031:0> $`
=> ""

RM> print "match: #{$&}\n" # match:

irb(main):032:0> $&
=> "hello"

RM> print "after: #{$'}\n" # after:

irb(main):033:0> $'
=> ""
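
One way to read the two sets of output in this exchange: =~ stops at
the first match, while gsub keeps going, so whatever is inspected
afterwards reflects the last match gsub found. A small sketch, with
illustrative variable names, that records where each match starts:

```ruby
text = "hello"

# =~ reports only the first match and returns its starting offset.
first_offset = (text =~ /.*/)

# gsub visits every match; record where each one starts and what it matched.
visited = []
text.gsub(/.*/) do |matched|
  visited << [Regexp.last_match.begin(0), matched]
  matched # return the match itself so the text is left unchanged
end

p first_offset  # => 0
p visited       # => [[0, "hello"], [5, ""]]
```

The second entry is the empty match at offset 5, the very end of the
string, which is the one that has "hello" before it and nothing after.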

hope this helps.

regards.
Karl-Heinz

Stephen Ball

6/21/2007 1:47:00 PM

On 6/20/07, Daniel DeLorme <dan-ml@dan42.com> wrote:
> That doesn't really explain why the regexp finds an extra empty string.
> I know that zero occurrences is one match but after a greedy match that
> matches everything, there should be (logically?) no other match. I am no
> stranger to regexps and the result is counter-intuitive to me; I would
> consider it a bug. Or at least a very very peculiar behavior.
>
> Daniel
>

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexes you can indeed
have tea and no tea at the same time. Certainly peculiar, but
occasionally useful.

So: since * matches "zero or more" characters when it starts the
search for .* it matches the absence (the 'zero') and then matches the
string (the 'or more').

To prevent this you need to indicate to your regular expression that
you only want the subset of 'everything' that is actually something.
Here are a couple ways to do this:

/.+/ will match 1 or more of something, so doesn't return the absence

/^.*/ will anchor the match at the start of a line, in a way
bypassing the match of zero (the pattern /^.*$/ makes this more
clear).

/..*/ will match everything after something, this is a modified form
of the above that isn't tied to the start of the string
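
The three alternatives above agree on "hello" but not on every input.
A quick sketch (the hash keys are just labels made up for this
example) that runs each pattern against "hello" and against an empty
string:

```ruby
# The three suggested patterns, labelled for the comparison.
patterns = { plus: /.+/, anchored: /^.*/, dot_then_star: /..*/ }

on_hello = patterns.transform_values { |re| "hello".scan(re) }
on_empty = patterns.transform_values { |re| "".scan(re) }

p on_hello  # every pattern returns ["hello"]
p on_empty  # only the anchored pattern still matches, returning [""]
```

So /.+/ and /..*/ behave alike, while /^.*/ still accepts zero
characters when the string itself is empty.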

-- Stephen

Rob Biedenharn

6/21/2007 2:13:00 PM

On Jun 21, 2007, at 9:47 AM, Stephen Ball wrote:

> On 6/20/07, Daniel DeLorme <dan-ml@dan42.com> wrote:
>> That doesn't really explain why the regexp finds an extra empty
>> string.
>> I know that zero occurrences is one match but after a greedy match
>> that
>> matches everything, there should be (logically?) no other match. I
>> am no
>> stranger to regexps and the result is counter-intuitive to me; I
>> would
>> consider it a bug. Or at least a very very peculiar behavior.
>>
>> Daniel
>
> It's because the pattern /.*/ matches everything, including the
> absence of everything. Yes, with the proper regexs you can indeed have
> tea and no tea at the same time. Certainly peculiar, but occasionally
> useful.
> ...
> -- Stephen

That still doesn't really explain why "hello".scan(/.*/) => ["hello",
""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "",
"", ... ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.
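
A rough model of what a scan loop has to do may help with this
question. The sketch below is not Ruby's actual implementation;
naive_scan is a hypothetical helper written for illustration. It
assumes that after an empty match the search position is moved one
character past that match, which is what would stop "" from being
returned forever:

```ruby
# Hypothetical re-implementation of a scan loop, for illustration only.
def naive_scan(text, pattern)
  matches = []
  pos = 0
  while pos <= text.length
    m = pattern.match(text, pos) # search from pos onwards
    break unless m

    matches << m[0]
    if m.end(0) == m.begin(0)
      pos = m.end(0) + 1 # empty match: step forward so we cannot loop forever
    else
      pos = m.end(0)     # non-empty match: resume right after it
    end
  end
  matches
end

p naive_scan("hello", /.*/)  # => ["hello", ""]
p naive_scan("hello", /.+/)  # => ["hello"]
```

Under this model the empty match at the end is found exactly once:
the search resumes at offset 5, matches zero characters there, and is
then pushed past the end of the string.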

-Rob

Rob Biedenharn http://agileconsult...
Rob@AgileConsultingLLC.com

dblack

6/21/2007 2:26:00 PM

Brian Adkins

6/21/2007 2:42:00 PM

On Jun 21, 4:43 am, Wild Karl-Heinz <kh.w...@wicom.li> wrote:
> Hello Ryan
>
> In message "Do You Understand Regular Expressions?"
> on 21.06.2007, Ryan Mcdonald <ryemcdon...@gmail.com> writes:
>
> RM> I agree. Can someone explain why gsub, sub or scan matches with * are
> RM> different than =~ matches with *
>
> RM> puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>
>
> irb(main):024:0> "hello".gsub( /([aeiou])/, "<\\1>" )
>
> Please note the () around the expression.
> After that you can refer with \\1 to the found
> letters.

Why not simply change the 1 to a 0?

irb(main):001:0> puts "hello".gsub(/[aeiou]/, '<\0>')
h<e>ll<o>
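
For comparison, here is a short sketch of the replacement styles
mentioned so far side by side (variable names are illustrative):

```ruby
text = "hello"

whole_match = text.gsub(/[aeiou]/, '<\0>')           # \0 is the entire match
first_group = text.gsub(/([aeiou])/, '<\1>')         # \1 needs a capture group
block_form  = text.gsub(/[aeiou]/) { |v| "<#{v}>" }  # the block receives the match
no_group    = text.gsub(/[aeiou]/, '<\1>')           # no group 1, so \1 is empty

p whole_match  # => "h<e>ll<o>"
p first_group  # => "h<e>ll<o>"
p block_form   # => "h<e>ll<o>"
p no_group     # => "h<>ll<>"
```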

Stephen Ball

6/21/2007 4:28:00 PM

On 6/21/07, dblack@wobblini.net <dblack@wobblini.net> wrote:
[snip]
> > So: since * matches "zero or more" characters when it starts the
> > search for .* it matches the absence (the 'zero') and then matches the
> > string (the 'or more').
>
> It's the other way around, though; it matches "hello" *first*, and
> then "". So the zero-matching (which I admit I'm among those who find
> unexpected) is happening at the end.
>

Ah, but notice:

"hello".scan(/.*$/)
=> ["hello", ""]

"hello".scan(/^.*/)
=> ["hello"]

Strange indeed, but it seems that's how it's working. Although I
suspect I'm not fully grasping the subtleties introduced by the *
character.

Hmm, the more I think on it I think I have an answer:

The /^.*/ pattern specifies that the string must start with anything
(i.e. it must have at least one character) and then zero or more
characters following.

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it's parsed as "zero or more of anything
before the end of the string".

So, if that's correct, you are right that the absence is matched last.
Verified by the fact that the absence follows the string in the
pattern match.
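
The anchoring explanation can be checked directly. In the sketch below
(match_offsets is a hypothetical helper, not part of Ruby), /^.*/
still matches an empty string, which suggests that ^ does not demand a
character. A plausible alternative reading is that the trailing empty
match would begin at offset 5, which is the end of the string but not
the start of a line, so ^ rejects it while $ accepts it:

```ruby
# Hypothetical helper: the offset at which each match of pattern starts.
def match_offsets(text, pattern)
  offsets = []
  text.scan(pattern) { offsets << Regexp.last_match.begin(0) }
  offsets
end

empty_case   = "".scan(/^.*/)                 # ^ matches even with no characters
unanchored   = match_offsets("hello", /.*/)   # matches start at 0 and at 5
end_anchor   = match_offsets("hello", /.*$/)  # $ holds at offset 5, so both survive
start_anchor = match_offsets("hello", /^.*/)  # offset 5 is not a line start
per_line     = "a\nb".scan(/^.*/)             # ^ matches at the start of each line

p empty_case    # => [""]
p unanchored    # => [0, 5]
p end_anchor    # => [0, 5]
p start_anchor  # => [0]
p per_line      # => ["a", "b"]
```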

-- Stephen

comp.lang.ruby
