[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

Slow ruby regexes

Emmanuel

4/10/2007 5:12:00 PM

Hello i've been reading this article, wich has a few benchmarks
(inclding a ruby related one) and many theorical explanations of regex
matching algorithms:

http://swtch.com/~rsc/regexp/re...

I became a little worried since i'm making hevy use of regexes in a
program of mine that shortly i'll have to run on each of thousands of
text files.

* Do anyone now if what is said relates to ruby 1.8.6
* If so, is it going to change ?

Thank you!

--
Posted via http://www.ruby-....

47 Answers

MenTaLguY

4/10/2007 5:57:00 PM

0

On Wed, 11 Apr 2007 02:12:09 +0900, Emmanuel <emmanuel@lijasdoblea.com> wrote:
> I became a little worried since i'm making hevy use of regexes in a
> program of mine that shortly i'll have to run on each of thousands of
> text files.

Ruby's regexp implementation is a backtracking matcher similar to Perl's -- while the article's criticism is valid, you should be fine if you write your regular expressions to minimize the amount of backtracking required.

(The author's point is that such care is unnecessary with an NFA-based regexp implementation.)

-mental


SonOfLilit

4/10/2007 5:59:00 PM

0

Read wikipedia on Regex. It explains better than I can why one is used
over the other (more power).

For example, a regex was shown in this list lately that tells you if a
number is prime. That needs backtracking.


Aur

On 4/10/07, Emmanuel <emmanuel@lijasdoblea.com> wrote:
> Hello i've been reading this article, wich has a few benchmarks
> (inclding a ruby related one) and many theorical explanations of regex
> matching algorithms:
>
> http://swtch.com/~rsc/regexp/re...
>
> I became a little worried since i'm making hevy use of regexes in a
> program of mine that shortly i'll have to run on each of thousands of
> text files.
>
> * Do anyone now if what is said relates to ruby 1.8.6
> * If so, is it going to change ?
>
> Thank you!
>
> --
> Posted via http://www.ruby-....
>
>

MenTaLguY

4/10/2007 6:04:00 PM

0

On Wed, 11 Apr 2007 02:59:29 +0900, SonOfLilit <sonoflilit@gmail.com> wrote:
> Read wikipedia on Regex. It explains better than I can why one is used
> over the other (more power).

It's unfortunate, though, that an NFA-based approach isn't selectively used for those expressions which don't use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it's not funny.

-mental



Robert Dober

4/10/2007 6:56:00 PM

0

On 4/10/07, MenTaLguY <mental@rydia.net> wrote:
> On Wed, 11 Apr 2007 02:59:29 +0900, SonOfLilit <sonoflilit@gmail.com> wrote:
> > Read wikipedia on Regex. It explains better than I can why one is used
> > over the other (more power).
>
> It's unfortunate, though, that an NFA-based approach isn't selectively used for those expressions which don't use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it's not funny.
>

Well the good news is that Ruby 2 Regexs will be much better
performing, I do not know if they will be NFA, though.
The bad news is that the power of regexs will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

But you are right of course that Ruby's current Regex implementation
is not state of the art, useful, very useful nevertheless.

Cheers
Robert
> -mental
>
>
>
>


--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

MenTaLguY

4/10/2007 7:15:00 PM

0

On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:

> The bad news is that the power of regexs will just not let you escape
> with bad design all the time, so knowing a lot about regexes is a
> necessary thing to use them well.

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.

> But you are right of course that Ruby's current Regex implementation
> is not state of the art, useful, very useful nevertheless.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

-mental


Marcin Raczkowski

4/11/2007 7:44:00 AM

0

On Tuesday 10 April 2007 19:14, MenTaLguY wrote:
> On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com>
wrote:
> > The bad news is that the power of regexs will just not let you escape
> > with bad design all the time, so knowing a lot about regexes is a
> > necessary thing to use them well.
>
> No, the bad news is that the bad design of most "regex" implementations
> will not let you escape with using "regexes" as if they were true regular
> expressions; knowing a lot about the implementation of the regex engine is
> a necessary thing to use them well.
>
> > But you are right of course that Ruby's current Regex implementation
> > is not state of the art, useful, very useful nevertheless.
>
> NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in
> other respects we've actually regressed since then.
>
> I'm not suggesting we abandon backreferences and so forth (which are useful
> sometimes), but rather that NFAs should be used for evaluating the majority
> of "regexes" which don't use non-regular features.
>
> -mental

Well topic is rather interesting, i'll try to check it soon, and i'll try to
write ruby interface for c library that author of article reference, i was
looking for good case to test including c into ruby.
I'm not promising anything tho :)

regexps are great tool for text processing, but serious log analyzers (like
one my company use) are written in c and base on lots of low-level string
manipulation, raw binary access to DB et cecera, not to mention that ruby is
much much slower then c anyway.

Greets
Marcin Raczkowski

Robert Dober

4/11/2007 7:50:00 AM

0

On 4/10/07, MenTaLguY <mental@rydia.net> wrote:
> On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
>
> > The bad news is that the power of regexs will just not let you escape
> > with bad design all the time, so knowing a lot about regexes is a
> > necessary thing to use them well.
>
> No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.
>
> > But you are right of course that Ruby's current Regex implementation
> > is not state of the art, useful, very useful nevertheless.
>
> NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.
>
> I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.
>
> -mental
>
>
>


--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

Robert Dober

4/11/2007 7:53:00 AM

0

oops wrong button here :(

On 4/11/07, Robert Dober <robert.dober@gmail.com> wrote:
> On 4/10/07, MenTaLguY <mental@rydia.net> wrote:
> > On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> >
> > > The bad news is that the power of regexs will just not let you escape
> > > with bad design all the time, so knowing a lot about regexes is a
> > > necessary thing to use them well.
> >
> > No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.
That might be, I am not qualified to discuss this issue.
What I actually meant is that given the need for backtracking
abilities it will just not be possible to "guess" what the user meant
when he/she uses too general an expression and I am talking about NFA
here.
Well just wanted to clear that point up.
> >
> > > But you are right of course that Ruby's current Regex implementation
> > > is not state of the art, useful, very useful nevertheless.
> ><snip>
> >
> >
> >
>
Cheers
--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

MenTaLguY

4/11/2007 3:35:00 PM

0

On Wed, 11 Apr 2007 16:53:26 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> What I actually meant is that given the need for backtracking
> abilities it will just not be possible to "guess" what the user meant
> when he/she uses too general an expression and I am talking about NFA
> here.

I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.

-mental



Robert Dober

4/11/2007 6:27:00 PM

0

On 4/11/07, MenTaLguY <mental@rydia.net> wrote:
> On Wed, 11 Apr 2007 16:53:26 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> > What I actually meant is that given the need for backtracking
> > abilities it will just not be possible to "guess" what the user meant
> > when he/she uses too general an expression and I am talking about NFA
> > here.
>
> I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.
>
> For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.
>
No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match, let me see the NFA should be something like
X start second
a start term
. second second

how could backtracking avoid when no a is found in the <second> state?
The DFA of course could never backtrack but is there a DFA for this regex?

Hope to learn a little bit more about what I seem to have forgotten ;)


Cheers
Robert
%r
> -mental
>
>
>
>


--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw