[Slightly OT] Case against ".*" regexp

Edgardo Hames

3/16/2007 7:59:00 PM

Some time ago I remember reading a post in this list against using
".*" regexp but I cannot find it -- I guess I don't remember the right
keywords :(
Does anyone remember that?

Thanks a lot,
Ed
--
Find "Your favorite psychopath" http://tuxmaniac.bl...

The future is not what it used to be.
-- Paul Valéry

6 Answers

Giles Bowkett

3/16/2007 8:05:00 PM

I didn't see that, but the .* regex can easily misfire. There's a
great book on this called "Mastering Regular Expressions" from
O'Reilly. The author's name I think is Jeff Friedl.

--
Giles Bowkett
http://www.gilesg...
http://gilesbowkett.bl...
http://giles.t...

On 3/16/07, Edgardo Hames <ehames@gmail.com> wrote:
> Some time ago I remember reading a post in this list against using
> ".*" regexp but I cannot find it -- I guess I don't remember the right
> keywords :(
> Does anyone remember that?
>
> Thanks a lot,
> Ed
> --
> Find "Your favorite psychopath" http://tuxmaniac.bl...
>
> The future is not what it used to be.
> -- Paul Valéry
>
>

Robert Klemme

3/16/2007 8:07:00 PM

On 16.03.2007 20:58, Edgardo Hames wrote:
> Some time ago I remember reading a post in this list against using
> ".*" regexp but I cannot find it -- I guess I don't remember the right
> keywords :(
> Does anyone remember that?

Not exactly. But you should not use ".*" nested in another starred
expression because that will usually lead to massive backtracking.
Another problem is that ".*" also matches the empty string, which is
often not what you want.
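
A minimal Ruby sketch of the empty-string point only (the variable names
are made up for illustration; it does not try to demonstrate the
backtracking cost):

```ruby
# ".*" succeeds even when there is nothing to match.
empty_match = "".match(/.*/)
puts empty_match[0].inspect

# With scan, ".*" also reports an extra empty match at the end of the string.
star_matches = "abc".scan(/.*/)
puts star_matches.inspect

# ".+" needs at least one character, so it does not match an empty string.
plus_match = "".match(/.+/)
puts plus_match.inspect
```

If an empty match would be a bug in your program, ".+" or a more specific
pattern is usually the safer choice.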

Kind regards

robert

Jan Friedrich

3/16/2007 8:27:00 PM

Edgardo Hames schrieb:
> Some time ago I remember reading a post in this list against using
> ".*" regexp but I cannot find it -- I guess I don't remember the right
> keywords :(
Did you perhaps mean this thread?
http://www.ruby-forum.com/to...

regards
Jan

Brian Candler

3/16/2007 10:13:00 PM

On Sat, Mar 17, 2007 at 05:10:08AM +0900, Robert Klemme wrote:
> On 16.03.2007 20:58, Edgardo Hames wrote:
> >Some time ago I remember reading a post in this list against using
> >".*" regexp but I cannot find it -- I guess I don't remember the right
> >keywords :(
> >Does anyone remember that?
>
> Not exactly. But you should not use ".*" nested in another starred
> expression because that will usually lead to massive backtracking.
> Another problem is, that ".*" matches the empty string which is often
> not what you want.

And it also eats as much of the string as it can without causing the regexp
to fail - .*? is the non-greedy form.
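
A short Ruby sketch of the difference, using a made-up sample string:

```ruby
html = "<a><b><c>"

# Greedy: .* runs on to the last ">" that still lets the match succeed.
greedy = html[/<.*>/]

# Non-greedy: .*? stops at the first ">" that lets the match succeed.
lazy = html[/<.*?>/]

puts greedy  # <a><b><c>
puts lazy    # <a>
```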

Robert Dober

3/16/2007 10:25:00 PM

On 3/16/07, Jan Friedrich <frdrch@gmail.com> wrote:
> Edgardo Hames schrieb:
> > Some time ago I remember reading a post in this list against using
> > ".*" regexp but I cannot find it -- I guess I don't remember the right
> > keywords :(
> Did you probably mean this thread?
> http://www.ruby-forum.com/to...
>
Hmm, I am not sure this is the one the OP meant. Was there not a problem
with the greedy match eating up "<"s in XML/HTML?

If I remember correctly, OPITOP (OP in the other post ;) asked why

%r{(<.*>)*} === "<a><b><c>"
Regexp.last_match.captures gave ['<a><b><c>'] instead of ['<a>', '<b>', '<c>']
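
A sketch of what happens there, assuming the goal was one entry per tag.
Note that a repeated group only keeps its last iteration, so even the
non-greedy version does not return all three tags; String#scan is one way
to collect them (the variable names are mine, not from the other thread):

```ruby
text = "<a><b><c>"

# Greedy: a single iteration of the group swallows the whole string.
greedy_captures = %r{(<.*>)*}.match(text).captures

# Non-greedy: the group iterates three times but keeps only the last one.
lazy_captures = %r{(<.*?>)*}.match(text).captures

# scan returns every non-overlapping match, one per tag.
tags = text.scan(/<.*?>/)

puts greedy_captures.inspect  # ["<a><b><c>"]
puts lazy_captures.inspect    # ["<c>"]
puts tags.inspect             # ["<a>", "<b>", "<c>"]
```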

> regards
> Jan
>
>

HTH
Robert

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

Tim X

3/17/2007 6:27:00 AM

"Edgardo Hames" <ehames@gmail.com> writes:

> Some time ago I remember reading a post in this list against using
> ".*" regexp but I cannot find it -- I guess I don't remember the right
> keywords :(
> Does anyone remember that?
>
> Thanks a lot,
> Ed

I don't remember seeing anything that explicitly states that you should not use
.* - however, many people do use it when they shouldn't. If we assume the
regexp engine is correct (and there have been some posts regarding the
correctness of Ruby's regexp in version 1.8), there shouldn't be any issue with
using .*, but there are some points to consider. Many of these relate to the
more general application of regular expressions. Possibly the main issue
relates to how the RE is anchored.

If you just have a RE of .*, well, you're pretty much matching against the whole
string and I guess you would say the match is pretty pointless. More often you
will use .* with some other constructs. In this situation, you do need to be
careful to ensure the RE is anchored in some way. If not adequately anchored,
your RE match can involve a lot of backtracking. The .* is greedy and will
attempt to match the biggest string possible, then the next biggest, then the
one after that, and so on. If you don't have adequate anchoring, in the worst
case it will backtrack to the very first character - for a long string, this
could be very inefficient. In fact, I remember seeing a post from someone in
the Perl group years ago who thought they'd found a bug in Perl that caused it
to go into an infinite loop. It turns out it wasn't infinite, just a poorly
anchored RE which was taking so long to do all the backtracking that it gave
the appearance of an infinite loop - if the user had waited long enough, the
program would have terminated eventually.
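
A small Ruby sketch of the "biggest string first, then back off" behaviour,
using a made-up sample string. It shows where the match ends up, not how
long the engine takes to get there:

```ruby
line = "abc123def456"

# Greedy: .* first grabs the whole line, then gives back one character at a
# time until \d+ can match, so \d+ is left with only the final digit.
greedy = line.match(/(.*)(\d+)/)
puts greedy[1]  # abc123def45
puts greedy[2]  # 6

# Non-greedy: .*? grows one character at a time and stops as soon as \d+
# can match, so \d+ gets the first run of digits.
lazy = line.match(/(.*?)(\d+)/)
puts lazy[1]    # abc
puts lazy[2]    # 123
```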

Often, people use .* rather than spending the time to analyse the real patterns
and strings they are processing. Anchoring to the beginning/end of the string
is often sufficient to drastically improve performance - but using non-greedy
modifiers where appropriate and anchoring to larger distinct patterns in your
string will help too. If you know all your strings will have certain
characteristics, like a sequence of 4 digits, then incorporate that information
into your RE - put \d\d\d\d (or whatever) rather than .* to represent that
pattern. Think about ways to give the regexp engine as many clues or as much
information as possible and you're unlikely to get poor performance or
unexpected results.
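
As a rough Ruby sketch of that advice, assume a hypothetical line format of
a four-digit id followed by a message (the format and names are invented
for illustration; \d{4} is just shorthand for \d\d\d\d):

```ruby
# Hypothetical input: "<4-digit id> <message>"
line = "2007 ruby-talk thread about regexps"

loose    = /(.*) (.*)/          # tells the engine almost nothing
specific = /\A(\d{4}) (.+)\z/   # anchored, and the id must be four digits

loose_match    = line.match(loose)
specific_match = line.match(specific)

# The loose pattern splits at the LAST space, which is probably not intended.
puts loose_match[1]     # 2007 ruby-talk thread about
puts loose_match[2]     # regexps

# The specific pattern splits where the data says it should.
puts specific_match[1]  # 2007
puts specific_match[2]  # ruby-talk thread about regexps

# It also rejects input that does not fit, instead of matching anyway.
bad_match = "20x7 oops".match(specific)
puts bad_match.inspect  # nil
```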

There is a very good book from O'Reilly called "Mastering Regular Expressions"
- can't remember the author's name, but I recommend it as a read. It's not a
thick book and has some really interesting background and explanation of
different regexp engines/approaches.

Tim

--
tcross (at) rapttech dot com dot au

comp.lang.ruby
