Asp Forum - Slow regular expressions :

RHaus

7/27/2006 4:41:00 PM

I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Just try this program:

str1 = "foo " * 70 + "foo ";
str2 = "foo " * 70 + "fo ";

strings = str1, str2;

strings.each { |s|
print "Matching #{s}\n";
if(s =~ /^(\s*foo\s*)*$/)
print "YES!\n";
else
print "NO!\n";
end
}

and compare it with this perl program:
my $str1 = ("foo " x 70) . "foo ";
my $str2 = ("foo " x 70) . "fo ";

my @strings = ( $str1, $str2 );

foreach $s ( @strings ) {
print "Matching $s\n";
if($s =~ /^(\s*foo\s*)*$/) {
print "YES!\n";
} else {
print "NO!\n";
}
}

On my machine, the perl version takes 0.229 seconds. The ruby version
was still running after 519.662 seconds when I killed it.

Is this being taken care of for the next release?

--
Posted via http://www.ruby-....

28 Answers

Caio Chassot

7/27/2006 4:52:00 PM

On 2006-07-27, at 13:41 , Roman Hausner wrote:

> I am disappointed to learn that Ruby obviously implements yet another
> regular expression library that does not avoid very slow matching
> failures.
>
> Is this being taken care of for the next release?

Ruby 1.9 uses a different regex engine, oniguruma. Not sure if it
addresses this particular problem.

Keith Gaughan

7/27/2006 5:36:00 PM

On Fri, Jul 28, 2006 at 01:41:05AM +0900, Roman Hausner wrote:

> I am disappointed to learn that Ruby obviously implements yet another
> regular expression library that does not avoid very slow matching
> failures.
>
> Just try this program:
>
> str1 = "foo " * 70 + "foo ";
> str2 = "foo " * 70 + "fo ";
>
> strings = str1, str2;
>
> strings.each { |s|
> print "Matching #{s}\n";
> if(s =~ /^(\s*foo\s*)*$/)
> print "YES!\n";
> else
> print "NO!\n";
> end
> }

Ah, now while I'm not saying that Ruby's regex engine is slow--it
is--I think it's more likely here that you hit a pathological edge
case in how it works, specifically the '\s*' on each side of 'foo'.

When the code is changed to

strings.each do |s|
print "Matching #{s}\n"
if s =~ /^(foo\s*)*$/
print "Yes!\n"
else
print "No!\n"
end
end

The problem disappears and the Ruby and Perl versions of that code
benchmark similarly.

Patterns in the form (x*y*x*)* have a habit of acting pathologically and
there's almost always a better and clearer way of writing them, mainly
because they have a habit of causing a nasty amount of backtracking.
This applies to a good number of regex engines, and not just Ruby's.

> On my machine, the perl version takes 0.229 seconds. The ruby version
> was still running after 519.662 seconds when I killed it.
>
> Is this being taken care of for the next release?

My suspicion is that Ruby's regex package doesn't account for this
pathological case, where as Perl's--which, to be frank, is pretty much
the best regex engine out there bar none except for its support for
Unicode properties--does.

Not a clue as to whether it'll be taken care of, though. JARH.

K.

--
Keith Gaughan - kmgaughan@eircom.net - http://tal...

Caleb Clausen

7/27/2006 5:41:00 PM

On 7/27/06, Roman Hausner <roman.hausner@gmail.com> wrote:
> I am disappointed to learn that Ruby obviously implements yet another
> regular expression library that does not avoid very slow matching
> failures.
>
> Just try this program:
>
> str1 = "foo " * 70 + "foo ";
> str2 = "foo " * 70 + "fo ";
>
> strings = str1, str2;
>
> strings.each { |s|
> print "Matching #{s}\n";
> if(s =~ /^(\s*foo\s*)*$/)
> print "YES!\n";
> else
> print "NO!\n";
> end
> }

If you take out the first \s* (or move it outside the parens), the
slowness goes away. That first \s* isn't needed. Another trick that
may help here is changing the *'s to +'s.

This is a case of pathological backtracking. Usually, these things are
solved by carefully tuning the Regexp, rather than expecting the
Regexp engine to fix it...

Yes, perl optimizes this case, but the fact is that you're asking for
a lot of silly extra backtracking, and that is what ruby gives you. I
imagine that you didn't know that you'd created a backtracking
monster...

Basically, as long as everything is matching, you're fine and matches
will be nice and zippy. But once that 'fo' causes a mismatch, the
engine has to go back and check every possible way that the spaces in
the string could be matched by either of the 2 whitespace matchers.
Since there are ~70 spaces in your string and each could be matched by
either \s*, there are about 2**70 possibilites to explore, total. You
can see how that might take a little time.

I'm afraid I'm not explaining very well why this Regexp is a
monster... maybe someone else will take a stab.

Berger, Daniel

7/27/2006 6:17:00 PM

Caleb Clausen wrote:
> On 7/27/06, Roman Hausner <roman.hausner@gmail.com> wrote:
>> I am disappointed to learn that Ruby obviously implements yet another
>> regular expression library that does not avoid very slow matching
>> failures.
>>
>> Just try this program:
>>
>> str1 = "foo " * 70 + "foo ";
>> str2 = "foo " * 70 + "fo ";
>>
>> strings = str1, str2;
>>
>> strings.each { |s|
>> print "Matching #{s}\n";
>> if(s =~ /^(\s*foo\s*)*$/)
>> print "YES!\n";
>> else
>> print "NO!\n";
>> end
>> }
>
>
> If you take out the first \s* (or move it outside the parens), the
> slowness goes away. That first \s* isn't needed. Another trick that
> may help here is changing the *'s to +'s.
>
> This is a case of pathological backtracking. Usually, these things are
> solved by carefully tuning the Regexp, rather than expecting the
> Regexp engine to fix it...
>
> Yes, perl optimizes this case, but the fact is that you're asking for
> a lot of silly extra backtracking, and that is what ruby gives you. I
> imagine that you didn't know that you'd created a backtracking
> monster...
>
> Basically, as long as everything is matching, you're fine and matches
> will be nice and zippy. But once that 'fo' causes a mismatch, the
> engine has to go back and check every possible way that the spaces in
> the string could be matched by either of the 2 whitespace matchers.
> Since there are ~70 spaces in your string and each could be matched by
> either \s*, there are about 2**70 possibilites to explore, total. You
> can see how that might take a little time.
>
> I'm afraid I'm not explaining very well why this Regexp is a
> monster... maybe someone else will take a stab.
>

It would take too long and I'm tired. Jeffrey Friedl's "Mastering Regular
Expressions" is the definitive source on the subject. That being said, I think
the OP needs a lesson on greedy vs non-greedy regular expressions.

Also note that Perl must have optimized for this sort of backtracking after
5.005_03, because it hangs as well. I'd be curious as to their reasons for
doing this, but I have my suspicions.

Regards,

Dan

This communication is the property of Qwest and may contain confidential or
privileged information. Unauthorized use of this communication is strictly
prohibited and may be unlawful. If you have received this communication
in error, please immediately notify the sender by reply e-mail and destroy
all copies of the communication and any attachments.

William James

7/27/2006 7:12:00 PM

Roman Hausner wrote:
> I am disappointed to learn that Ruby obviously implements yet another
> regular expression library that does not avoid very slow matching
> failures.
>
> Just try this program:
>
> str1 = "foo " * 70 + "foo ";
> str2 = "foo " * 70 + "fo ";
>
> strings = str1, str2;
>
> strings.each { |s|
> print "Matching #{s}\n";
> if(s =~ /^(\s*foo\s*)*$/)
> print "YES!\n";
> else
> print "NO!\n";
> end
> }
>
> and compare it with this perl program:
> my $str1 = ("foo " x 70) . "foo ";
> my $str2 = ("foo " x 70) . "fo ";
>
> my @strings = ( $str1, $str2 );
>
> foreach $s ( @strings ) {
> print "Matching $s\n";
> if($s =~ /^(\s*foo\s*)*$/) {
> print "YES!\n";
> } else {
> print "NO!\n";
> }
> }
>
> On my machine, the perl version takes 0.229 seconds. The ruby version
> was still running after 519.662 seconds when I killed it.
>
> Is this being taken care of for the next release?
>
> --
> Posted via http://www.ruby-....

Here's the way to do it.

strings = " foo " * 70 + "foo ",
"foo " * 70 + "foo ",
" foo " * 70 + "fo ",
""

strings.each { |s|
puts "Matching #{s}"
if s =~ /^ \s* # May start with whitespace.
# The rest of the string must be 0 or more
# occurrences of "foo" plus whitespace.
(?: foo \s+ ) *
$/x
puts "YES!"
else
puts "NO!"
end
}

Daniel Martin

7/27/2006 8:51:00 PM

Roman Hausner <roman.hausner@gmail.com> writes:

> I am disappointed to learn that Ruby obviously implements yet another
> regular expression library that does not avoid very slow matching
> failures.

Just a note that there are some patterns where I found (in my tests,
on a cygwin environment I can't recreate anymore) that ruby beat perl.

http://snowplow.org/marti...

I should note that the people at perlmonks.org disbelieve my results,
or rather, believe that my results are being skewed by some strange
cygwin issue. I haven't gotten off my butt and redone the same tests
on linux.

RHaus

7/27/2006 10:14:00 PM

Caleb Clausen wrote:
> If you take out the first \s* (or move it outside the parens), the
> slowness goes away. That first \s* isn't needed. Another trick that
> may help here is changing the *'s to +'s.
>
> This is a case of pathological backtracking. Usually, these things are
> solved by carefully tuning the Regexp, rather than expecting the
> Regexp engine to fix it...
>
No. I am not looking for help on regular expressions here and I know
that this particular expression could be optimized. I also know that
some regular expressions with backtracking can get exponentially slow.

However, regular expression can get generated automatically or there
could be other reasons why an optimization a la perl is helpful.

The example simply points out that ruby uses a strategy that does not do
the optimization that is present in perl. My question was, whether Ruby
is going to fix this.

If you can answer that question, I would apreciate it. If not, maybe
somebody
else can.

--
Posted via http://www.ruby-....

William James

7/28/2006 12:13:00 AM

Roman Hausner wrote:
> Caleb Clausen wrote:
> > If you take out the first \s* (or move it outside the parens), the
> > slowness goes away. That first \s* isn't needed. Another trick that
> > may help here is changing the *'s to +'s.
> >
> > This is a case of pathological backtracking. Usually, these things are
> > solved by carefully tuning the Regexp, rather than expecting the
> > Regexp engine to fix it...
> >
> No.

I.e., "No, I won't take out the first \s* (or move it outside the
parens). No, I won't tune the regular expression. No, this isn't
a case of pathological backtracking. No, I won't stop holding
my breath until I turn blue."

> I am not looking for help on regular expressions here and I know
> that this particular expression could be optimized.

No. This regular expression has been pessimized.

> I also know that
> some regular expressions with backtracking can get exponentially slow.
>
> However, regular expression can get generated automatically

Stop generating crappy regular expressions.

> or there
> could be other reasons why an optimization a la perl is helpful.
>
> The example simply points out that ruby uses a strategy that does not do
> the optimization that is present in perl. My question was, whether Ruby
> is going to fix this.

The arrogance. He refuses to listen to any advice about cleaning
up his crap and then demands to know when someone is going
to "fix" Ruby.

I hope that Ruby doesn't start pandering to those who obdurately
refuse to clean up their feculent waste.

Gene Tani

7/28/2006 12:16:00 AM

Daniel Berger wrote:
> Also note that Perl must have optimized for this sort of backtracking after
> 5.005_03, because it hangs as well. I'd be curious as to their reasons for
> doing this, but I have my suspicions.
>

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-05/msg...

http://perlmonks.org/index.pl?node...

Gene Tani

7/28/2006 12:53:00 AM

Gene Tani wrote:
> Daniel Berger wrote:
> > Also note that Perl must have optimized for this sort of backtracking after
> > 5.005_03, because it hangs as well. I'd be curious as to their reasons for
> > doing this, but I have my suspicions.
> >
>
> http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-05/msg...
>
> http://perlmonks.org/index.pl?node...

(give credt where credit due) this blog pointed to Ilya Z explanation

http://cwilliams.textdriven.com/articles/2005/11/21/ruby-can-learn...

comp.lang.ruby

Slow regular expressions :

RHaus

Caio Chassot

Keith Gaughan

Caleb Clausen

Berger, Daniel

William James

Daniel Martin

RHaus

William James

Gene Tani

Gene Tani

x Login to ForumsZone