[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

Re: Inverting a regular expression?

Harry Ohlsen

5/2/2005 4:13:00 AM

> From: Eric Mahurin [mailto:eric_mahurin@yahoo.com]
> Sent: Friday, 29 April 2005 22:21
> <snip>
> I don't see any practical application for doing something like
> this. Is this just a academic question?

It was pretty much an academic question, but my colleague actually
wanted to do it, so it was of some real-world interest.

H.


************************************************************************

If you have received this e-mail in error, please delete it and notify the sender as soon as possible. The contents of this e-mail may be confidential and the unauthorized use, copying, or dissemination of it and any attachments to it, is prohibited.

Internet communications are not secure and Hyperion does not, therefore, accept legal responsibility for the contents of this message nor for any damage caused by viruses. The views expressed here do not necessarily represent those of Hyperion.

For more information about Hyperion, please visit our Web site at www.hyperion.com






17 Answers

David Naseby

5/2/2005 9:14:00 PM

0

On 5/2/05, Harry Ohlsen <Harry_Ohlsen@hyperion.com> wrote:
> > From: Eric Mahurin [mailto:eric_mahurin@yahoo.com]
> > Sent: Friday, 29 April 2005 22:21
> > <snip>
> > I don't see any practical application for doing something like
> > this. Is this just a academic question?
>
> It was pretty much an academic question, but my colleague actually
> wanted to do it, so it was of some real-world interest.
>
> H.
>
Erm.. have you thought of the easiest possible inversion
str !~ re

That fits the spec...

--
David Naseby
http://homepages.ihug.com.a...



Nikolai Weibull

5/2/2005 10:11:00 PM

0

David Naseby, May 3:

> Erm.. have you thought of the easiest possible inversion
> str !~ re

That only inverts the sense of the matching, not the actual language
matched itself,
nikolai

--
Nikolai Weibull: now available free of charge at http:/...!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}


Dan Doel

5/3/2005 12:56:00 AM

0

Well, if you want an academic answer... :)

Get a book on automata/computability theory (back when I took such a course,
we used Automata and Computability by Dexter Kozen; it's pretty good).
Therein, you'll find algorithms for converting between DFA (discrete finite
automata) and regular expressions. You can invert a regular expression as
follows:

1) Convert the regex to a DFA
2) Invert the accept states of the DFA, causing it to accept the complement
of the initial language
3) Convert the modified DFA into a regex

Viola.

However, it should be noted that regular expressions aren't really regular
these days. For example, Ruby allows the following regex:

/(a*)b(\1)/

Which accepts the language { (a^n)b(a^n) }, which is context free, not
regular (that is, it cannot be encoded as the language of a finite
automaton---it requires something like a pushdown automaton (FA with a
stack)---or the language of a true regular expression---it requires a
context free grammar). So, if you want to invert an arbitrary
Ruby/Perl/Python "regular expression," you've got a tougher time on your
hands.

-- Dan Doel

Harry Ohlsen wrote:
> It was pretty much an academic question, but my colleague actually
> wanted to do it, so it was of some real-world interest.


Logan Capaldo

5/3/2005 1:43:00 AM

0

On 5/2/05, Dan Doel <dolio@case.edu> wrote:
> Well, if you want an academic answer... :)
>
> Get a book on automata/computability theory (back when I took such a course,
> we used Automata and Computability by Dexter Kozen; it's pretty good).
> Therein, you'll find algorithms for converting between DFA (discrete finite
> automata) and regular expressions. You can invert a regular expression as
> follows:
>
> 1) Convert the regex to a DFA
> 2) Invert the accept states of the DFA, causing it to accept the complement
> of the initial language
> 3) Convert the modified DFA into a regex
>
> Viola.
>
> However, it should be noted that regular expressions aren't really regular
> these days. For example, Ruby allows the following regex:
>
> /(a*)b(\1)/
>
> Which accepts the language { (a^n)b(a^n) }, which is context free, not
> regular (that is, it cannot be encoded as the language of a finite
> automaton---it requires something like a pushdown automaton (FA with a
> stack)---or the language of a true regular expression---it requires a
> context free grammar). So, if you want to invert an arbitrary
> Ruby/Perl/Python "regular expression," you've got a tougher time on your
> hands.
>
> -- Dan Doel
>
> Harry Ohlsen wrote:
> > It was pretty much an academic question, but my colleague actually
> > wanted to do it, so it was of some real-world interest.
>
>

In this case, I would imagine that (Not having taken theory of
computation yet, next semester! woo hoo) Onigurma and the like convert
the regex into a DFA or a PA anyway at some point. Couldn't you go
into the guts and invert the DFA as you describe (not necessarialy
generating the inverted regex) and use the new DFA or PA to match
against the input?



Dan Doel

5/3/2005 3:15:00 AM

0

I've never investigated things like regex engines in real programming
languages. They probably do convert the regex into some 'nicer' form for
actually computing matches.

If it's simply a DFA, then yeah, you can just invert the accept states. That
might even work in the case of a PDA (not too sure; it's been a while)
(also note, it depends on the type of PDA. They can accept based either on
state or empty stack, although you can convert between them. The latter is
probably harder to invert as-is).

I think the situation is even worse, however. Consider the Ruby regex:

/(a*)(b*)(\1)(\2)/

This matches the language { (a^n)(b^m)(a^n)(b^m) }, which I believe isn't
even context-free (someone should check me on this; I've been fumbling
around reading sites on the pumping lemma, but it's been like a year and a
half since I really knew this stuff). So whatever format they convert the
regex into to actually figure out the match, it probably isn't as simple as
even a PDA (you need at least a linear bounded automaton---Turing machine
with a finite tape---to match a context-sensitive language), and it's
probably not as simple as flipping some bits to match the complement.

In fact, now that I think of it, I'm sure people have developed much better
heuristics for matching these kinds of things than building and simulating
a theoretical machine. So even with the guts of the regex, it's probably
not easy to invert.

In addition, although Ruby regexes can match the language above---which I
think is context-sensitive---that doesn't mean it can match all
context-sensitive languages. For instance, I don't think you can match the
classic context-sensitive language { (a^n)(b^n)(c^n) }, because you can't
access the length of previous groups. In fact, I don't think you can even
match { (a^n)(b^n) }, so Ruby regexes can't even match every context-free
language, just a (small?) subset! So, it's entirely possible that the
complement of a Ruby regex isn't expressible as a Ruby regex at all. Then
again, I'm not going to try to prove either way. :)

Anyhow, this is only an issue if you allow references to previously matched
groups. If you eliminate those, Ruby regexes are actually regular
expressions (with some enhancements to make them more concise), and then
you can invert them using the steps I listed earlier.

Apologies for the lengthy post.

-- Dan Doel

Logan Capaldo wrote:
> In this case, I would imagine that (Not having taken theory of
> computation yet, next semester! woo hoo) Onigurma and the like convert
> the regex into a DFA or a PA anyway at some point. Couldn't you go
> into the guts and invert the DFA as you describe (not necessarialy
> generating the inverted regex) and use the new DFA or PA to match
> against the input?

Nikolai Weibull

5/3/2005 7:27:00 AM

0

Dan Doel, May 3:

> Consider the Ruby regex:
>
> /(a*)(b*)(\1)(\2)/
>
> This matches the language { (a^n)(b^m)(a^n)(b^m) }, which I believe
> isn't even context-free â?¦

Backreferencing is an NP-complete problem,
nikolai

--
Nikolai Weibull: now available free of charge at http:/...!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}


Nikolai Weibull

5/3/2005 7:28:00 AM

0

Dan Doel, May 3:

> â?¦ DFA (discrete finite automata) â?¦

Discrete? Don't you mean deterministic?,
nikolai

--
Nikolai Weibull: now available free of charge at http:/...!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}


Nikolai Weibull

5/3/2005 7:30:00 AM

0

Logan Capaldo, May 3:

> In this case, I would imagine that â?¦ Onigurma and the like convert the
> regex into a DFA or a PA anyway at some point.

Oniguruma uses a backtracking NFA simulator,
nikolai

--
Nikolai Weibull: now available free of charge at http:/...!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}


Dan Doel

5/3/2005 8:24:00 AM

0

Nikolai Weibull wrote:
> Discrete? Don't you mean deterministic?,

Oops. Yes, that's what I meant.

Kristof Bastiaensen

5/3/2005 9:31:00 AM

0

On Mon, 02 May 2005 20:55:35 -0400, Dan Doel wrote:
> Well, if you want an academic answer... :)
>
> Get a book on automata/computability theory (back when I took such a course,
> we used Automata and Computability by Dexter Kozen; it's pretty good).
> Therein, you'll find algorithms for converting between DFA (discrete finite
> automata) and regular expressions. You can invert a regular expression as
> follows:
>
> 1) Convert the regex to a DFA
> 2) Invert the accept states of the DFA, causing it to accept the complement
> of the initial language
> 3) Convert the modified DFA into a regex
>
> Viola.
>

I am afraid that wouldn't work. For example /ab/ would become
/[^a][^b]/, which isn't the complement of /ab/. Now the string
"ac" would match neither. (Nor any string with more than two
characters).

KB