Asp Forum - recursive brace matching with Ruby regexp

Jason Sweat

11/5/2004 12:05:00 AM

I wanted to learn Ruby, so I picked a small task of trying to write a
command line script to parse PHP classes and shell out some unit test
cases. I have it working for the most part, but I ran across a
problem trying to use Ruby regexp to find a set of matching curly
braces.

Please forgive the intrusion of this PHP code onto the list, but I
wanted to give you the flavor of what I am attempting to do, that can
be easily done with recursive regular expression available in the Perl
compatiable regexp engine.

<php>
$test = <<<EOS
/* some stuff */
class foo {
public \$var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
EOS;

$re = <<<EOS
~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
EOS;
preg_match($re, $test, $match);
echo "your class matched:\n", $match[1];

</php>

Now it appears the regexp engine in Ruby does not support recursion
(at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
working on, and with what I know how to test), thus the only
workaround I found was very ugly, model the nesting of braces to a
fixed depth, i.e.

open = '\{'
close = '\}'
other = '[^\{\}]*'
l1 = other+open+other+close+other
l2 = other+open+'('+l1 +')+'+other+close+other
l3 = other+open+'('+l2 +')+'+other+close+other
l4 = other+open+'('+l3 +')+'+other+close+other
l5 = other+open+'('+l4 +')+'+other+close+other
re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
'))+'+close, 'ixm')

This code did work, but sometimes timed out on valid real classes.

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Thanks,

Regards,
Jason
--
http://blog.casey...

18 Answers

James Gray

11/5/2004 12:50:00 AM

On Nov 4, 2004, at 6:04 PM, Jason Sweat wrote:

> I expect I am probably missing some facet of Ruby that eaily allows me
> to next regexp inside of the Ruby code in some fasion to achieve the
> result I am looking for, but how to do so eludes me. Can anyone
> provide some insight for me on this situation?

Ruby's regex engine doesn't yet support this, but will in future
versions.

Hope that helps.

James Edward Gray II

Bill Kelly

11/5/2004 1:39:00 AM

From: "James Edward Gray II" <james@grayproductions.net>
> On Nov 4, 2004, at 6:04 PM, Jason Sweat wrote:
>
> > I expect I am probably missing some facet of Ruby that eaily allows me
> > to next regexp inside of the Ruby code in some fasion to achieve the
> > result I am looking for, but how to do so eludes me. Can anyone
> > provide some insight for me on this situation?
>
> Ruby's regex engine doesn't yet support this, but will in future
> versions.

Just for fun,

dat = <<-"END_TEXT"
/* some stuff */
class foo {
public \$var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
END_TEXT

repl = []
true while dat.gsub!(/$[^(){}]*$|\{[^(){}]*\}/) do |str|
repl.push(str)
"\0#{repl.length - 1}\0"
end
dat.scan(/class\s+\w+\s+\0\d+\0/) do
match = $&
true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
puts "Your class matched:", match
end

The above matches arbitrarily deeply nested ( ) and { }
blocks... but all the "classes" it matches have to be
at toplevel, can't be inside a { } or ( ) themselves...

I don't think it would be hard to modify to handle nested
classes, but I haven't thought it through...

In case it's not obvious how it works... It replaces,
from innermost to outermost, () and {} spans with a
token, and saves the original span in an array.

When it finds a class match followed by one of these
tokens, it expands it back out to its original
representation.

Regards,

Bill

Mark Hubbart

11/5/2004 2:04:00 AM

Hi,

On Fri, 5 Nov 2004 09:04:30 +0900, Jason Sweat <jason.sweat@gmail.com> wrote:
> I wanted to learn Ruby, so I picked a small task of trying to write a
> command line script to parse PHP classes and shell out some unit test
> cases. I have it working for the most part, but I ran across a
> problem trying to use Ruby regexp to find a set of matching curly
> braces.
>
> Please forgive the intrusion of this PHP code onto the list, but I
> wanted to give you the flavor of what I am attempting to do, that can
> be easily done with recursive regular expression available in the Perl
> compatiable regexp engine.
>
> <php>
> $test = <<<EOS
> /* some stuff */
> class foo {
> public \$var;
> public function __construct() {}
> public function bar() {
> if (false) {
> }
> }
>
> }
> // some other stuff
> EOS;
>
> $re = <<<EOS
> ~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
> EOS;
> preg_match($re, $test, $match);
> echo "your class matched:\n", $match[1];
>
> </php>
>
> Now it appears the regexp engine in Ruby does not support recursion
> (at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
> working on, and with what I know how to test), thus the only
> workaround I found was very ugly, model the nesting of braces to a
> fixed depth, i.e.
>
> open = '\{'
> close = '\}'
> other = '[^\{\}]*'
> l1 = other+open+other+close+other
> l2 = other+open+'('+l1 +')+'+other+close+other
> l3 = other+open+'('+l2 +')+'+other+close+other
> l4 = other+open+'('+l3 +')+'+other+close+other
> l5 = other+open+'('+l4 +')+'+other+close+other
> re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+')|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:'+other+
> '))+'+close, 'ixm')
>
> This code did work, but sometimes timed out on valid real classes.
>
> I expect I am probably missing some facet of Ruby that eaily allows me
> to next regexp inside of the Ruby code in some fasion to achieve the
> result I am looking for, but how to do so eludes me. Can anyone
> provide some insight for me on this situation?

Regular expressions, by all standard definitions, aren't recursive.
Perl's regexen have been extended to allow it, but it really isn't
considered a standard regex feature. You might try using a simple
tokenizer... here's a quick attempt:

def parse(code)
chunks = []
loop do
chunks << text.slice!(/\A.*?(?=[{}])/m) # match start of string to
before next bracket
bracket = text.slice! 0
chunks << parse(text) if bracket == ?{
return chunks if bracket == ?}
return chunks if text.size == 0
end
end

This returns a recursive array that holds all the text chunks around
the brackets. here's some sample code (can't remember exactly what php
looks like ATM) and sample output:

PHP code:

class Foo {
def bar(){
if true{
do_stuff()
}else{
do_nothing()
}
clean_up()
}
}

Then, after recursively stripping whitespace from strings, the
pretty-printed array:

["class Foo",
["def bar()",
["if true", ["do_stuff()"], "else", ["do_nothing()"], "clean_up()"],
""],
nil]

uh, yeah, I know it drops off the last piece of text (reads it as nil)
but I don't want to figure that out just yet. Dinner calls :)

HTH,
Mark

>
> Thanks,
>
> Regards,
> Jason
> --
> http://blog.casey...
>
>

Jason Sweat

11/5/2004 2:41:00 AM

On Fri, 5 Nov 2004 10:39:11 +0900, Bill Kelly <billk@cts.com> wrote:
> repl = []
> true while dat.gsub!(/$[^(){}]*$|\{[^(){}]*\}/) do |str|
> repl.push(str)
> "\0#{repl.length - 1}\0"
> end
> dat.scan(/class\s+\w+\s+\0\d+\0/) do
> match = $&
> true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
> puts "Your class matched:", match
> end
>
> The above matches arbitrarily deeply nested ( ) and { }
> blocks... but all the "classes" it matches have to be
> at toplevel, can't be inside a { } or ( ) themselves...
>
> In case it's not obvious how it works... It replaces,
> from innermost to outermost, () and {} spans with a
> token, and saves the original span in an array.

Thanks Bill,

I think I see how that works. I will play with it and see if it
solves my problem. I knew there had to be a much cleaner way than
what I was doing :)

Regards,
Jason
--
http://blog.casey...

James Gray

11/5/2004 3:17:00 AM

On Nov 4, 2004, at 8:04 PM, Mark Hubbart wrote:

> Regular expressions, by all standard definitions, aren't recursive.
> Perl's regexen have been extended to allow it, but it really isn't
> considered a standard regex feature.

Of course, you are correct. However, recursive "Regular Expressions"
are becoming fairly common place now. Ruby's next regex engine will
support them as well.

Regular Expression has evolved considerably from the original
definition, with recursive capabilities being just another change in a
long line of added usefulness. Does that really means they cease to be
Regular Expressions? Longhorn will still be Windows, right?

I think of it more in terms of supersets or, for a programming slant,
subclassing. The spirit of Regular Expressions, patterns to
locate/breakdown text, is intact, I believe. They're just even more
handy now.

Just my two cents.

James Edward Gray II

P.S. Proving whether or not Perl 6 changes will still be "Regular
Expression" is left as an exercise for the reader. ;)

Mark Hubbart

11/5/2004 3:59:00 AM

Hi,

On Fri, 5 Nov 2004 12:17:01 +0900, James Edward Gray II
<james@grayproductions.net> wrote:
> On Nov 4, 2004, at 8:04 PM, Mark Hubbart wrote:
>
> > Regular expressions, by all standard definitions, aren't recursive.
> > Perl's regexen have been extended to allow it, but it really isn't
> > considered a standard regex feature.
>
> Of course, you are correct. However, recursive "Regular Expressions"
> are becoming fairly common place now. Ruby's next regex engine will
> support them as well.

Are you saying Oniguruma *will* support it, or *does* support it? It
would be nice if it already does :D

> Regular Expression has evolved considerably from the original
> definition, with recursive capabilities being just another change in a
> long line of added usefulness. Does that really means they cease to be
> Regular Expressions? Longhorn will still be Windows, right?
>
> I think of it more in terms of supersets or, for a programming slant,
> subclassing. The spirit of Regular Expressions, patterns to
> locate/breakdown text, is intact, I believe. They're just even more
> handy now.

My statement wasn't supposed to insinuate that adding recursion is
bad, or turns regular expressions into something else; I was just
pointing out that (last time I checked) you can't *expect* to be able
to use recursion in regexen in a particular language. Perl's regexen
are *more* than regexen; you can even embed perl code in them. Also,
I'm pretty sure the recursive matching wasn't in Perl when I last used
it, a couple years back.

> Just my two cents.
>
> James Edward Gray II
>
> P.S. Proving whether or not Perl 6 changes will still be "Regular
> Expression" is left as an exercise for the reader. ;)

:)

cheers,
Mark

James Gray

11/5/2004 4:07:00 AM

On Nov 4, 2004, at 9:59 PM, Mark Hubbart wrote:

> Are you saying Oniguruma *will* support it, or *does* support it? It
> would be nice if it already does :D

Sure does.

James Edward Gray II

Mark Hubbart

11/5/2004 5:05:00 AM

On Fri, 5 Nov 2004 13:06:40 +0900, James Edward Gray II
<james@grayproductions.net> wrote:
> On Nov 4, 2004, at 9:59 PM, Mark Hubbart wrote:
>
> > Are you saying Oniguruma *will* support it, or *does* support it? It
> > would be nice if it already does :D
>
> Sure does.

Indeed it does! I just tried it out. And named subexpressions work too
:) many new features since I last checked it out:

http://www.geocities.jp/kosako1/oniguruma/...

Thanks for pointing it out.

cheers,
Mark

Jos Backus

11/5/2004 7:21:00 AM

On Fri, Nov 05, 2004 at 02:05:00PM +0900, Mark Hubbart wrote:
> http://www.geocities.jp/kosako1/oniguruma/...

A-1. Syntax depend options

+ ONIG_SYNTAX_RUBY
(?m): dot(.) match newline

+ ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
(?s): dot(.) match newline
(?m): ^ match after newline, $ match before newline

Any idea why Ruby was chosen to behave differently from Perl here?

--
Jos Backus _/ _/_/_/ Sunnyvale, CA
_/ _/ _/
_/ _/_/_/
_/ _/ _/ _/
jos at catnook.com _/_/ _/_/_/ require 'std/disclaimer'

Eivind Eklund

11/5/2004 8:19:00 AM

On Fri, 5 Nov 2004 12:17:01 +0900, James Edward Gray II
<james@grayproductions.net> wrote:
> On Nov 4, 2004, at 8:04 PM, Mark Hubbart wrote:
>
> > Regular expressions, by all standard definitions, aren't recursive.
> > Perl's regexen have been extended to allow it, but it really isn't
> > considered a standard regex feature.
>
> Of course, you are correct. However, recursive "Regular Expressions"
> are becoming fairly common place now. Ruby's next regex engine will
> support them as well.
>
> Regular Expression has evolved considerably from the original
> definition, with recursive capabilities being just another change in a
> long line of added usefulness. Does that really means they cease to be
> Regular Expressions?

Yes.

It very specifically mean that they stop being regular expressions,
because the "regular" actually has a specific meaning (coming from
regular sets/context free language theory), and has the nice property
of compiling to a Deterministic Finite Automation with only linear
increase in size of the DFA compared to the regular expression.

The true regular expressions consists of ^, $, characters, *, and (|)
(alternation). Most of the "original" extensions compile cleanly to
this, and can still be considered regular. When you start using
backrefs in matching or recursion, the expression is no longer regular
(which, among other things, will almost certainly force your poor
regexp engine into Non-deterministic Finite Automation mode - unless
you've got exponential amounts of RAM, of course ;-)

http://en.wikipedia.org/wiki/Regular_... gives a bit more background.

Eivind.
--
Hazzle free packages for Ruby?
RPA is available from http://www.rubyar...

comp.lang.ruby

recursive brace matching with Ruby regexp

Jason Sweat

James Gray

Bill Kelly

Mark Hubbart

Jason Sweat

James Gray

Mark Hubbart

James Gray

Mark Hubbart

Jos Backus

Eivind Eklund

x Login to ForumsZone