Asp Forum - Regexp equality

Jamis Buck

10/19/2004 6:07:00 PM

Here's an oddity I recently came across in a unit test. Identical
regexps declared with /.../ will be equivilent to each other, but not to
the same regexps created with %r{...}:

$ ruby -ve 'p( /\/blah/ == /\/blah/ )'
ruby 1.8.2 (2004-07-29) [i686-linux]
true

$ ruby -ve 'p( %r{/blah} == %r{/blah} )'
ruby 1.8.2 (2004-07-29) [i686-linux]
true

$ ruby -ve 'p( /\/blah/ == %r{/blah} )'
ruby 1.8.2 (2004-07-29) [i686-linux]
false

This appears to only be the case when there is a forward slash in the
regexp. In other words:

$ ruby -ve 'p( /blah/ == %r{blah} )'
ruby 1.8.2 (2004-07-29) [i686-linux]
true

Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

- Jamis

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck...

8 Answers

Yukihiro Matsumoto

10/19/2004 6:44:00 PM

Hi,

In message "Re: Regexp equality"
on Wed, 20 Oct 2004 03:07:28 +0900, Jamis Buck <jgb3@email.byu.edu> writes:

|Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

Regexp#== compares literal appearance of regex, e.g.

\/blah vs /blah

I consider this as a feature. I'm open for the discussion (as always).

matz.

Jamis Buck

10/19/2004 6:53:00 PM

Yukihiro Matsumoto wrote:
> Hi,
>
> In message "Re: Regexp equality"
> on Wed, 20 Oct 2004 03:07:28 +0900, Jamis Buck <jgb3@email.byu.edu> writes:
>
> |Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?
>
> Regexp#== compares literal appearance of regex, e.g.
>
> \/blah vs /blah
>
> I consider this as a feature. I'm open for the discussion (as always).

Is there a way to test regexen for equality based on the strings they
would match? Perhaps that's too general, since it depends on the strings
they would be given, but I guess I expected /\/blah/ to be equal to
%r{/blah}.

I suppose /\/blah/.to_s would equal %r{/blah}.to_s? Would that be a
better way to compare regexen?

- Jamis

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck...

James Gray

10/19/2004 7:18:00 PM

On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:

> Is there a way to test regexen for equality based on the strings they
> would match?

Wow. I bet that's a pretty tall order. Pulling from the current Ruby
Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that
way. Maybe some internal representation, but I would but super
impressed to see that.

Not at all saying I don't like the idea, just that it's a tall order.

James Edward Gray II

Lloyd Zusman

10/20/2004 12:06:00 AM

James Edward Gray II <james@grayproductions.net> writes:

> On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:
>
>> Is there a way to test regexen for equality based on the strings they
>> would match?
>
> Wow. I bet that's a pretty tall order. Pulling from the current Ruby
> Quiz, here are several ways to write a regex to match 1..12:
>
> 1|2|3|4|5|6|7|8|9|10|11|12
>
> \d|1[012]
>
> 1?[12]|[03-9]
>
> You're certainly not going to be able to compare like patterns that way.
> Maybe some internal representation, but I would but super impressed to
> see that.
>
> Not at all saying I don't like the idea, just that it's a tall order.

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

If I'm not mistaken, I believe that this can be shown to be a reduction
of the Halting Problem.

In other words, short of feeding every possible character string to both
regex's and comparing the two sets of matches, I don't think that there
is an algorithmic way to measure equivalence of two regex's which don't
have the same external representation.

--
Lloyd Zusman
ljz@asfast.com
God bless you.

Jamis Buck

10/20/2004 2:58:00 PM

Lloyd Zusman wrote:
> James Edward Gray II <james@grayproductions.net> writes:
>
>
>>On Oct 19, 2004, at 1:53 PM, Jamis Buck wrote:
>>
>>
>>>Is there a way to test regexen for equality based on the strings they
>>>would match?
>>
>>Wow. I bet that's a pretty tall order. Pulling from the current Ruby
>>Quiz, here are several ways to write a regex to match 1..12:
>>
>>1|2|3|4|5|6|7|8|9|10|11|12
>>
>>\d|1[012]
>>
>>1?[12]|[03-9]
>>
>>You're certainly not going to be able to compare like patterns that way.
>>Maybe some internal representation, but I would but super impressed to
>>see that.
>>
>>Not at all saying I don't like the idea, just that it's a tall order.
>
>
> Not only is it a tall order to determine whether any two regex's will
> match exactly the same set of strings, but in the general case, I'm
> pretty sure that this question is undecidable.
>
> If I'm not mistaken, I believe that this can be shown to be a reduction
> of the Halting Problem.
>
> In other words, short of feeding every possible character string to both
> regex's and comparing the two sets of matches, I don't think that there
> is an algorithmic way to measure equivalence of two regex's which don't
> have the same external representation.

I realize that particular question is most likely impossible to answer
for the general case. What I was wanting, though, was to be able to know
whether /\/blah/ is equal to %r{/blah}. Surely that's not too hard?

As I said, though, converting them both to strings gives the same
result, so the strings, at least, are comparable with expected results.

- Jamis

--
Jamis Buck
jgb3@email.byu.edu
http://www.jamisbuck...

Lloyd Zusman

10/20/2004 10:46:00 PM

Jamis Buck <jgb3@email.byu.edu> writes:

> [ ... ]
>
> I realize that particular question is most likely impossible to answer
> for the general case. What I was wanting, though, was to be able to know
> whether /\/blah/ is equal to %r{/blah}. Surely that's not too hard?
>
> As I said, though, converting them both to strings gives the same
> result, so the strings, at least, are comparable with expected results.

OK. Now I better understand your question. I agree with you that
something seems kind of counter-intuitive here:

irb(main):001:0> x = /\/blah/
=> /\/blah/
irb(main):002:0> y = %r{/blah}
=> /\/blah/
irb(main):003:0> z = %r{\/blah}
=> /\/blah/
irb(main):004:0> x == y
=> false
irb(main):005:0> x == z
=> true
irb(main):006:0> x.source
=> "\\/blah"
irb(main):007:0> y.source
=> "/blah"
irb(main):008:0> z.source
=> "\\/blah"

In line 006, why is the source of /\/blah/ returned as "\\/blah"?

It seems to me that it's incorrect to return the leading backslash
doubled. Isn't the backslash in the initial definition of x just an
escape character for enabling the second forward slash to be treated as
part of the regexp? Why is that initial escape character treated as a
bona fide part of the regexp source?

Also, consider this:

irb(main):001:0> x = /\/blah/
=> /\/blah/
irb(main):002:0> y = %r{/blah}
=> /\/blah/
irb(main):003:0> z = %r{\/blah}
=> /\/blah/
irb(main):004:0> x.to_s
=> "(?-mix:\\/blah)"
irb(main):005:0> y.to_s
=> "(?-mix:\\/blah)"
irb(main):006:0> z.to_s
=> "(?-mix:\\/blah)"
irb(main):007:0> x.to_s == y.to_s
=> true
irb(main):008:0> x.to_s == z.to_s
=> true

In other words, the string representation of the 3 regexp's (as opposed
to their _source_ representations) are identical.

So this brings up two questions:

1. Why is the backslash in a statement like this /\/blah/ _not_ just
treated as an escape character when determining the source?

2. Why does the `==' operator compare the source and not the internal
representations? It seems to me that the internal representation is
a much more meaningful piece of information for use in a comparison.

--
Lloyd Zusman
ljz@asfast.com
God bless you.

Eivind Eklund

10/21/2004 3:38:00 PM

On Wed, 20 Oct 2004 09:06:10 +0900, Lloyd Zusman <ljz@asfast.com> wrote:
> Not only is it a tall order to determine whether any two regex's will
> match exactly the same set of strings, but in the general case, I'm
> pretty sure that this question is undecidable.

As far as I can tell: If you run Antimirov's algorithm for NFA
construction, you can compare each extracted partial derivative and
see if it is equal. If all extracted partial derivatives are equal,
your regular expressions are identical. If any differ - they're
different.

Eivind.
--
Hazzle free packages for Ruby?
RPA is available from http://www.rubyar...

Eivind Eklund

10/21/2004 4:00:00 PM

On Thu, 21 Oct 2004 15:37:57 +0000, Eivind Eklund <eeklund@gmail.com> wrote:
> On Wed, 20 Oct 2004 09:06:10 +0900, Lloyd Zusman <ljz@asfast.com> wrote:
> > Not only is it a tall order to determine whether any two regex's will
> > match exactly the same set of strings, but in the general case, I'm
> > pretty sure that this question is undecidable.
>
> As far as I can tell: If you run Antimirov's algorithm for NFA
> construction, you can compare each extracted partial derivative and
> see if it is equal. If all extracted partial derivatives are equal,
> your regular expressions are identical. If any differ - they're
> different.

I should probably include some suitable references so somebody can run
off an implement this:

Subtyping with Antimirov's algorithm:
http://www.inf.uni-konstanz.de/dbis/teaching/current/pathfinder/download/hohenade-ausarb...

Partial derivatives of regular expressions and finite automata
http://citeseer.ist.psu.edu/antimirov95pa...

Eivind.
--
Hazzle free packages for Ruby?
RPA is available from http://www.rubyar...

comp.lang.ruby

Regexp equality

Jamis Buck

Yukihiro Matsumoto

Jamis Buck

James Gray

Lloyd Zusman

Jamis Buck

Lloyd Zusman

Eivind Eklund

Eivind Eklund

x Login to ForumsZone