Phrogz
11/22/2007 4:02:00 AM
On Nov 21, 7:32 pm, RichardOnRails
<RichardDummyMailbox58...@uscomputergurus.com> wrote:
> 1. "(.*?)", or specifically, the "?" in that expression.
The ? in this case makes the match non-greedy. For example:
irb(main):007:0> s = "aaaaaaae"
=> "aaaaaaae"
irb(main):008:0> s[ /a+[aeiou]/ ]
=> "aaaaaaae"
irb(main):009:0> s[ /a+?[aeiou]/ ]
=> "aa"
By default, the ?, *, +, and {n,m} quantifiers are all greedy,
attempting to match the longest substring possible while still
allowing the regular expression to succeed. As seen above, /a+/ keeps
consuming a's until it cannot find any more, and only then goes on to
try to match the rest of the pattern.
Adding a ? after one of those quantifiers makes it non-greedy. For
example:
a?? - match zero or one 'a' characters (prefer to match zero)
a*? - match zero or more 'a' characters (prefer as few as possible)
a+? - match one or more 'a' characters (prefer as few as possible)
a{3,}? - match at least 3 'a' characters (prefer as few as possible)
As seen in the irb example above, /a+?/ matched a single 'a', and then
checked to see if it could find a vowel afterwards.
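If you want to see each of those forms on its own, here is a small sketch using the same string as the irb session above. The variable names are just mine for illustration, and the greedy lines are included only for contrast:

```ruby
s = "aaaaaaae"   # seven a's followed by an e

# Greedy quantifiers take as many characters as they can.
greedy_plus  = s[/a+/]      # => "aaaaaaa"
greedy_range = s[/a{3,}/]   # => "aaaaaaa"

# Adding ? makes each quantifier take as few characters as it can.
lazy_opt   = s[/a??/]       # => ""   (zero is allowed, so it takes zero)
lazy_star  = s[/a*?/]       # => ""   (same: zero is enough)
lazy_plus  = s[/a+?/]       # => "a"  (needs at least one)
lazy_range = s[/a{3,}?/]    # => "aaa" (needs at least three)

p [greedy_plus, greedy_range, lazy_opt, lazy_star, lazy_plus, lazy_range]
```

Note that a lazy quantifier on its own will happily match nothing; it only grows when whatever follows it in the pattern forces it to.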
You'll often see this non-greedy matching used in simple non-nested
pairing, like with HTML tags.
%r{<p>(.*?)</p>}
will match "<p>", followed by the fewest number of characters until it
sees "</p>".
Without the non-greedy quantifier, the .* could skim right over other
closing "</p>" tags as long as it could find one at the end.
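To make the difference concrete, here is a rough sketch with a made-up two-paragraph string (the html value and variable names are my own, not from your code):

```ruby
html = "<p>first</p><p>second</p>"

# Non-greedy: stops at the first closing tag it can reach.
lazy = html[%r{<p>(.*?)</p>}, 1]     # => "first"

# Greedy: runs to the end, then backs up to the last closing tag.
greedy = html[%r{<p>(.*)</p>}, 1]    # => "first</p><p>second"

# With the non-greedy form, scan picks up each paragraph separately.
paragraphs = html.scan(%r{<p>(.*?)</p>}).flatten   # => ["first", "second"]

p lazy, greedy, paragraphs
```

This only holds up for simple, non-nested pairs on one line; by default . does not match a newline, so a paragraph spanning lines would need the /m option.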
> 2. "(?:\d+ [.]?)", or the two question marks in this case.
The first one is part of the (?:...) construct. While the parentheses
in /(xxx)/ will save the match group for later matching or
substitution, putting a ?: pair at the front tells the regexp to not
bother saving the contents as a numbered group. For example:
/(?:foo|fu)?bar/
will match "foobar", "fubar", or "bar", without saving "foo", "fu", or
"" as a group.
The second question mark follows a character set [...], which itself
matches a single character from the options inside the set. The
question mark in this case (and in my "fubar" example above) means
"match zero or one of the preceding characters/group expressions".
Since the character set has a single period inside it, this:
[.]?
means "And there may or may not be a period here."
This is identical to the regexp:
\.?
where the backslash escapes the traditional meaning of a period (match
any character [except possibly a newline]), and instead causes it to
mean a literal period.
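Both question marks from your second pattern can be checked quickly in Ruby. This is only a sketch: the sample strings and variable names are my own, and the unescaped-dot pattern is there purely to show what the escaping buys you.

```ruby
# (?:...) groups without saving anything; plain (...) saves a numbered group.
capturing     = /(foo|fu)?bar/
non_capturing = /(?:foo|fu)?bar/

with_group    = capturing.match("fubar")
without_group = non_capturing.match("fubar")

p with_group[0]           # => "fubar"
p with_group.captures     # => ["fu"]
p without_group[0]        # => "fubar"
p without_group.captures  # => []

# [.]? and \.? both mean "an optional literal period".
bracketed = /\d+[.]?/
escaped   = /\d+\.?/
bare_dot  = /\d+.?/       # unescaped: an optional "any character"

samples = ["12.", "12", "12x"]
bracketed_hits = samples.map { |s| s[bracketed] }  # => ["12.", "12", "12"]
escaped_hits   = samples.map { |s| s[escaped] }    # => ["12.", "12", "12"]
bare_hits      = samples.map { |s| s[bare_dot] }   # => ["12.", "12", "12x"]

p bracketed_hits, escaped_hits, bare_hits
```

Only the unescaped version swallows the "x", which is why the period needs either the brackets or the backslash.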