Tim X
3/17/2007 6:27:00 AM
"Edgardo Hames" <ehames@gmail.com> writes:
> Some time ago I remember reading a post in this list against using
> ".*" regexp but I cannot find it -- I guess I don't remember the right
> keywords :(
> Does anyone remember that?
>
> Thanks a lot,
> Ed
I don't remember seeing anything that explicitly states that you should not use
.* - however, many people do use it when they shouldn't. If we assume the
regexp engine is correct (and there have been some posts regarding the
correctness of ruby's regexp in version 1.8), there shouldn't be any issue with
using .*, but there are some points to consider. Many of these relate to the
more general application of regular expressions. Possibly the main issue
relates to how the RE is anchored.
If you just have a RE of .*, well, you're pretty much matching against the whole
string and I guess you would say the match is pretty pointless. More often you
will use .* with some other constructs. In this situation, you do need to be
careful to ensure the RE is anchored in some way. If not adequately anchored,
your RE match can involve a lot of backtracking. The .* is greedy and will
attempt to match the biggest string possible, then the next biggest and then
the next to next biggest and so on. If you don't have adequate anchoring, in
the worst case, it will backtrack to the very first character - for a long
string, this could be very inefficient. In fact, I remember seeing a post from
someone in the perl group years ago who thought they'd found a bug in perl that
caused it to go into an infinite loop. It turns out it wasn't infinite, just a
poorly anchored RE which was taking so long to do all the backtracking that it
gave the appearance of an infinite loop - if the user had waited long enough,
the program would have terminated eventually.
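As a rough illustration of the greedy behaviour described above, here is a
small Ruby sketch. The input string and variable names are made up for the
example; it only shows what each form ends up matching, not how long the
matching takes.

```ruby
# Hypothetical input line, invented for illustration.
line = 'name="Tim" list="ruby-talk"'

# Greedy: .* first runs to the end of the line, then gives back one
# character at a time until the closing quote can match.
greedy = line[/"(.*)"/, 1]

# Non-greedy: .*? grows one character at a time and stops at the
# first closing quote it can reach.
lazy = line[/"(.*?)"/, 1]

# Explicit class: "anything except a quote" says what we really mean,
# so the engine cannot run past the closing quote in the first place.
strict = line[/"([^"]*)"/, 1]

puts greedy   # Tim" list="ruby-talk
puts lazy     # Tim
puts strict   # Tim
```

On a short string the difference is only in the result, but on long input
the greedy form is the one that does all the extra backing up.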
Often, people use .* rather than spending the time to analyse the real patterns
and strings they are processing. Anchoring to the beginning/end of the string
is often sufficient to drastically improve performance - but using non-greedy
modifiers where appropriate and anchoring to larger distinct patterns in your
string will help. If you know all your strings will have certain
characteristics, like a sequence of 4 digits, then incorporate that information
into your RE - put \d\d\d\d (or whatever) rather than .* to represent that
pattern. Think about ways to give the regexp engine as many clues or as much
information as possible and you're unlikely to get poor performance or unexpected
results.
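To make the "4 digits" point concrete, here is a minimal Ruby sketch. The
sample strings and the names loose and strict are assumptions for the
example, not anything from a real program.

```ruby
# Both patterns are anchored to the start (\A) and end (\z) of the string.
loose  = /\Aorder .* shipped\z/        # accepts anything in the middle
strict = /\Aorder (\d{4}) shipped\z/   # insists on exactly four digits

# Hypothetical input lines, invented for illustration.
good = "order 2007 shipped"
bad  = "order pending shipped"

# The loose pattern lets both lines through; the strict one rejects
# the line that does not have the shape we actually expect.
loose_hits  = [good, bad].select { |s| s =~ loose }
strict_hits = [good, bad].select { |s| s =~ strict }

# With a capture group the four digits can be pulled out directly.
year = good[strict, 1]

puts loose_hits.length    # 2
puts strict_hits.length   # 1
puts year                 # 2007
```

The strict version is also easier to read later, since the pattern itself
documents what the data is supposed to look like.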
There is a very good book from O'Reilly called "Mastering Regular Expressions"
- can't remember the author's name, but recommend it as a read. It's not a thick
book and has some really interesting background and explanation of different
regexp engines/approaches.
Tim
--
tcross (at) rapttech dot com dot au