Asp Forum - Alternate Regular Expressions?

Ari Brown

8/7/2007 1:12:00 AM

Just randomly curious -

Is there an alternate RegExp "language" to the current one in Ruby
and Perl?

Don't get me wrong, I love the current RegExp in Ruby, but I'm
allowed to be curious...

Also, is Ruby going to jump on the PERL 6 RegExp ship?

^^^^^^^ That's a big one to some people I know.

Thanks,
~ Ari
English is like a pseudo-random number generator - there are a
bajillion rules to it, but nobody cares.

20 Answers

Phlip

8/7/2007 1:41:00 AM

Ari Brown wrote:

> Just randomly curious -
>
> Is there an alternate RegExp "language" to the current one in Ruby and
> Perl?

I don't know. So here's a dissertation on where to start.

The good news is a RegExp is only two things at heart...

- a Domain-Specific Language to program
- a state machine.

The bad news is, back in the day, people used to invent DSLs as long strings
of easily parsed characters. For example, a language called LSYSTEM might
describe turtle graphics like this:

s=[::cc!!!!&&[FFcccZ]^^^^FFcccZ] # upper spikes

The really bad news is RegExp is one of these string-oriented DSLs that
stuck. It will always be useful, so programmers forget how much room it has
for improvement.

The good news is Ruby excels at generating light DSLs. The equivalent
expression for a modern implementation of LSYSTEM might look like this:

upper_spikes = push.twist(2).thinner(2).increase_angle(4)....

etc. Because Ruby gives your programming interfaces extreme notational
flexibility, you can declare the interfaces most convenient for your domain.

So start writing! and research other DSLs as you go. For example, here's a
DSL written with C++ metaprogramming:

http://boost-sandbox.sourceforge.net/libs/xpressive/doc/html/...

Whenever you like, that language slips back to raw RegExp. Your effort
should have a similar shunt.

> English is like a pseudo-random number generator - there are a bajillion
> rules to it, but nobody cares.

Of all the world's languages, English is both the ugliest and the
beautifulest.

--
Phlip
http://www.oreilly.com/catalog/9780...
"Test Driven Ajax (on Rails)"
assert_xpath, assert_javascript, & assert_ajax

Ari Brown

8/7/2007 1:58:00 AM

On Aug 6, 2007, at 9:40 PM, Phlip wrote:

>
> So start writing! and research other DSLs as you go.

Ugh. If I must (which I must). What would you suggest as syntax?

Also, should I completely try to reinvent the wheel, or create a
wrapper for current RegExp?

Man. I need a mentor on this :-|

aRi
--------------------------------------------|
IMO, Arabic has THE most beautiful script.
Poetically, English is extremely beautiful. It's like a language of
RegExp - except there are no rules!
Spoken, the most beautiful language is either French (sorry) or
Esperanto.

Tim Hunter

8/7/2007 2:09:00 AM

Ari Brown wrote:
>
> On Aug 6, 2007, at 9:40 PM, Phlip wrote:
>
>>
>> So start writing! and research other DSLs as you go.
>
> Ugh. If I must (which I must). What would you suggest as syntax?
>
> Also, should I completely try to reinvent the wheel, or create a
> wrapper for current RegExp?
>
> Man. I need a mentor on this :-|
>
This might give you a place to start:
http://en.wikipedia.org/wiki/Parsing_expressi...

--
RMagick OS X Installer [http://rubyforge.org/project...]
RMagick Hints & Tips [http://rubyforge.org/forum/forum.php?for...]
RMagick Installation FAQ [http://rmagick.rubyforge.org/instal...]

Phlip

8/7/2007 2:52:00 AM

Ari Brown wrote:

> Ugh. If I must (which I must).

You missed where I said I didn't know the actual answer.

> What would you suggest as syntax?

Ruby itself, as a DSL; that was the point.

rx = match('foo') | match('bar') # like /(foo|bar)/
assert_equal [['foo'], ['bar']], rx.scan('a foo b bar')

Make match() return an object that overloads the | operator (Ruby's "or"
keyword and || cannot be overloaded), and away you go!
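
A minimal sketch of that idea follows. It uses | rather than "or", because Ruby lets a class define | as a method but not the "or" keyword or ||. Pattern, its scan method and this match() helper are hypothetical names made up for illustration; they are not part of any existing library.

```ruby
# Hypothetical sketch: Pattern and match() are illustrative names only.
class Pattern
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # `|` is an ordinary method in Ruby, so it can be overloaded.
  # The result behaves like /(left|right)/.
  def |(other)
    Pattern.new("(#{source}|#{other.source})")
  end

  def to_regexp
    Regexp.new(source)
  end

  # Delegates to String#scan, which returns one array per match
  # when the regexp contains a capture group.
  def scan(text)
    text.scan(to_regexp)
  end
end

# Builds a pattern matching the given text literally.
def match(literal)
  Pattern.new(Regexp.escape(literal))
end

rx = match('foo') | match('bar')
p rx.scan('a foo b bar')   # prints [["foo"], ["bar"]]
```

Because everything bottoms out in an ordinary Regexp, the DSL can always fall back to raw RegExp where that is more convenient.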

--
Phlip
http://www.oreilly.com/catalog/9780...
"Test Driven Ajax (on Rails)"
assert_xpath, assert_javascript, & assert_ajax

Kenneth McDonald

8/7/2007 2:52:00 AM

Ari,

How serious are you about this? Several years ago I wrote a Python
library that treats Python regular
expressions as semantic, not syntactic, objects, and that has been
incredibly useful to me. I've started
to port it to Ruby, but simply don't have the time. If you do (you're
probably looking at a couple of
weeks of full-time-equivalent hours to do a good job, including decent
documentation), I'm happy to pass
on the Python code, the Ruby code, and give advice and so on.

To help you evaluate this, and also as a potential source of ideas in
case you do something else, I've
appended my (probably out of date) intro text to the library at the
bottom of this reply.

Cheers,
Ken

Ari Brown wrote:
>
> On Aug 6, 2007, at 9:40 PM, Phlip wrote:
>
>>
>> So start writing! and research other DSLs as you go.
>
> Ugh. If I must (which I must). What would you suggest as syntax?
>
> Also, should I completely try to reinvent the wheel, or create a
> wrapper for current RegExp?
>
> Man. I need a mentor on this :-|
>
> aRi
> --------------------------------------------|
> IMO, Arabic has THE most beautiful script.
> Poetically, English is extremely beautiful. It's like a language of
> RegExp - except there are no rules!
> Spoken, the most beautiful language is either French (sorry) or
> Esperanto.
>
>
Text from the _Python_ library (In retrospect, I would do quite a bit
differently):

Overview
========

'rex' provides regular expression and parsing facilities. It uses
(and is intended to functionally
replace) the Python 're' module.

Regular expression functionality is provided through the '_Rexp' and
'MatchResult' classes,
and the CHAR, REP0, REP1, OPT, PAT, and ALT constructs.
These constructs can be used as or provide functions to create
rexps, and also define
attributes for commonly used rexps. (For example, PAT.float provides
a rexp
which matches basic floating-point (no exponent) numbers.)

Pattern-Matching Example
----------------------

If you are familiar with regular expressions, the following will
probably make at least some sense. If you are not, skip this example
for now. In either case, come back to it once you have read the
formal definitions of functions and constructs provided by rex.

COMPLEX= PAT.float['re'] + REP0.whitespace + ALT("+", "-")['op'] + REP0.whitespace + PAT.float['im'] + 'i'

The above example defines a pattern which will match complex
numbers, of the form "-2.718 + 3.14i", for example. It uses the
predefined
match expressions PAT.float and REP0.whitespace to
ease the definition. Applied to the example complex number string,
the result will contain three
named substrings: 're' will map to "-2.718", "op" will map to "+",
and "im" will map to "3.14".

SEQ is an alternative form of joining rexps; the above is equivalent to:

COMPLEX= SEQ(
PAT.float['re'],
REP0.whitespace,
ALT("+", "-")['op'],
REP0.whitespace,
PAT.float['im'],
'i'
)
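
For comparison, roughly the same complex-number pattern can be written with Ruby's built-in Regexp and named groups. This is only a sketch: FLOAT is an assumed stand-in for PAT.float (whose exact definition is not shown in the text), taken here to mean an optional minus sign, digits, and an optional fractional part.

```ruby
# Assumption: FLOAT approximates the "basic floating-point (no exponent)"
# pattern described for PAT.float; the real definition is not shown.
FLOAT = /-?\d+(?:\.\d+)?/

# Interpolating a Regexp embeds it as a self-contained group, so FLOAT
# can be reused twice without its contents interfering with the rest.
COMPLEX = /(?<re>#{FLOAT})\s*(?<op>[+-])\s*(?<im>#{FLOAT})i/

m = COMPLEX.match("-2.718 + 3.14i")
puts m[:re]   # -2.718
puts m[:op]   # +
puts m[:im]   # 3.14
```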

Regular Expressions
---------------

This is an introduction to using the pattern-matching
(regular-expression-related)
part of rex. See documentation associated
with a specific method/function/name for details on that entity.

In the following, we use the abbreviation RE to refer to standard
regular
expressions defined as strings, and the word 'rexp' to refer to rex
objects
which denote regular expressions.

The starting point for building a rexp is either rex.PAT,
which we'll just call PAT, or rex.CHAR, which we'll just call CHAR,
or rex.LIT.
CHAR provides rexps defining a set of characters, and which
will match a single character string if that character is in the given
set. In addition to providing attributes which provide prebuilt
character
sets, the CHAR function may be used to define your own character
sets.

LIT builds rexps which match strings of varying lengths.

REP0 and REP1 denote zero-or-more and one-or-more repetitions.

Also

- PAT._someattribute_ returns (for defined attributes) a
corresponding rexp.
For example, PAT.stringstart returns a rexp matching at the
start of a string.

- CHAR(a1, a2, . . .) returns a rexp matching a single character
from a set of characters defined by its arguments. For example,
CHAR("-", ["0","9"], ".") matches the characters necessary to build
basic floating point numbers. See CHAR docs for details.

- CHAR._someattribute_ returns (for defined attributes) a
corresponding rexp
defining a set of characters.
For example, CHAR.digit returns a rexp matching a single digit.

Now assume that X, Y, Z, ... are rexps. The following Python expressions
(_not_ strings) may be used to build more complex rexps:

- X | Y | Z . . . : returns a rexp which matches a string if any of
the operands match that string. Similar to "X|Y|Z" in normal REs,
except of course you can't use Python code to define a normal RE.

- X + Y + Z ...: returns a rexp which matches a string if all of X,
Y, Z match consecutive substrings of the string in succession. Like
"XYZ" in normal REs.

- X*n : returns a rexp which matches X a number of times as defined by n.
This replaces '?', '+', and '*' as used in normal REs. See docs for
details. 'rex' defines constants which allow you to say X*REP0,
X*REP1, or X*MAYBE, indicating (0 or more matches), (1 or more
matches), or (0 or 1 matches), respectively.

- X**n : Like X*n, but does nongreedy matching.

- +X : positive lookahead assertion: matches if X matches, but doesn't
consume any of the input.

- ~+X : negative lookahead assertion: matches if X _doesn't_ match,
but doesn't consume any of the input.

- -X, ~-X : positive and negative lookback assertions. Like
lookahead assertions, but in the other direction.

- X[name] : name must be a string: any text matched by X can be
referred to by the given name in the match result object. (This is
the equivalent of named groups in the re module.)

- X.group() : X will be in an unnamed group, referable by number.

In addition, a few other operations may be performed:

- Some of the attributes defined in PAT have "natural inverses";
for such
attributes, the inverse may be taken. For example, ~PAT.digit is
a pattern matching any character except a digit.

- Character classes may be inverted: ~CHAR("aeiouAEIOU") returns
a pattern matching any character except a vowel.

- 'ALT' gives a different way to denote alternation: ALT(X, Y,
Z,...) does
the same thing as X | Y | Z | . . ., except that none of the
arguments
to ALT need be rexps; any which are normal strings will be
converted
to a rexp using PAT.

- 'SEQ' can take multiple arguments: PAT(X, Y, Z,...), which
gives the same
result as PAT(X) + PAT(Y) + PAT(Z) + . . . .

Finally, a very convenient shortcut is that only the first object in
a sequence of
operator/method calls needs to be a rexp; all others will be
automatically
converted as if LIT(...) had been called on them. For example, the
sequence X | "hello" is the same as X | LIT("hello")
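
To give a feel for how such an operator-based interface could look in Ruby, here is a small hypothetical sketch. Rexp, lit, REP0, REP1 and MAYBE are illustrative names modelled on the description above; this is not the rex API and covers only a few of the listed operators.

```ruby
# Hypothetical sketch, not the rex library: a few of the operators
# described above, built on top of Ruby's own Regexp.
class Rexp
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # Plain strings are treated as literals, as LIT(...) does in the text.
  def self.coerce(obj)
    obj.is_a?(Rexp) ? obj : Rexp.new(Regexp.escape(obj.to_s))
  end

  def +(other)        # sequence, like "XY"
    Rexp.new(source + Rexp.coerce(other).source)
  end

  def |(other)        # alternation, like "X|Y"
    Rexp.new("(?:#{source}|#{Rexp.coerce(other).source})")
  end

  def *(quantifier)   # repetition: X * REP0, X * REP1, X * MAYBE
    Rexp.new("(?:#{source})#{quantifier}")
  end

  def [](name)        # named group
    Rexp.new("(?<#{name}>#{source})")
  end

  def +@              # unary plus: positive lookahead
    Rexp.new("(?=#{source})")
  end

  def to_regexp
    Regexp.new(source)
  end
end

REP0  = '*'
REP1  = '+'
MAYBE = '?'

def lit(text)
  Rexp.coerce(text)
end

digit  = Rexp.new('\d')
sign   = (lit('+') | '-')['sign']            # "-" is coerced to a literal
number = sign * MAYBE + (digit * REP1)['digits']

m = number.to_regexp.match('x = -42')
puts m[:sign]     # -
puts m[:digits]   # 42
```

Only the first operand has to be a Rexp; the string "-" on the right of | is converted automatically, which is the shortcut the last paragraph describes.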

Ari Brown

8/7/2007 3:10:00 AM

I'm moderately serious. This is going to be one of those projects
that won't see the light of day for maybe 6 months to a year.
This looks largely like what I was hoping to make, although in Ruby I had
envisioned this:

matching email addresses (sample):
a = LeetExp.new(:letters => [[a-z], :insensitive],
:string => "@",
:letters => [[a-z], :insensitive],
:string => ".",
:string => ["com", "net", "org", "edu"]
)

case line
when a
# ...

My idea is to make it logical and human readable. Ruby is a language
for humans and UberBeings, and I think this should reflect Ruby's ideas.

Also, was your library a wrapper for underlying PERL RegExp? or was it
the whole RegExp engine?

Thanks,
Ari

On Aug 6, 2007, at 10:51 PM, Kenneth McDonald wrote:

> Ari,
>
> How serious are you about this? Several years ago I wrote a Python
> library that treats Python regular
> expressions as semantic, not syntactic, objects, and that has been
> incredibly useful to me. I've started
> to port it to Ruby, but simply don't have the time. If you do
> (you're probably looking at a couple of
> weeks of full-time-equivalent hours to do a good job, including
> decent documentation), I'm happy to pass
> on the Python code, the Ruby code, and give advice and so on.
>
> To help you evaluate this, and also as a potential source of ideas
> in case you do something else, I've
> appended my (probably out of date) intro text to the library at the
> bottom of this reply.
>
> Cheers,
> Ken
>

--------------------------------------------|
If you're not living on the edge,
then you're just wasting space.

Kenneth McDonald

8/7/2007 3:30:00 AM

Ari Brown wrote:
> I'm moderately serious. This is going to be one of those projects that
> won't see the light of day for maybe 6 months to a year.
> This looks largely like what I was hoping to make, although in Ruby I had
> envisioned this:
>
> matching email addresses (sample):
> a = LeetExp.new(:letters => [[a-z], :insensitive],
> :string => "@",
> :letters => [[a-z], :insensitive],
> :string => ".",
> :string => ["com", "net", "org", "edu"]
> )
>
> case line
> when a
> # ...
>
> My idea is to make it logical and human readable. Ruby is a language
> for humans and UberBeings, and I think this should reflect Ruby's ideas.
Reflecting on my own experience, I'd suggest a less verbose notation,
and one that uses Ruby idioms more. For example:

letters = CharClass.new('a'..'z').case_insensitive
a = letters + "@" + letters + "." + (Literal.new("com") | "net" | "org" | "edu")

It's not at all difficult to do this with Ruby. Strings can be used for
literals and character classes, and
ranges are perfect for use as char ranges in character classes.

Also, the ability to safely combine regular expressions (as shown above,
where "letters" is used in "a")
is _paramount_ in making this sort of wrapper really useful.
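
One way the classes in that snippet might be fleshed out is sketched below. Pat, CharClass and Literal are hypothetical; the one_or_more helper is an addition not in the snippet above, included so that "letters" matches a run of letters rather than a single character. Safe combination is achieved here by wrapping every fragment in its own non-capturing group.

```ruby
# Hypothetical sketch: Pat, CharClass and Literal are illustrative only.
class Pat
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # Strings are promoted to literals so they can sit on the right of + or |.
  def self.wrap(obj)
    obj.is_a?(Pat) ? obj : Literal.new(obj)
  end

  # Each operand gets its own non-capturing group, so flags and
  # alternations inside one fragment cannot leak into its neighbours.
  def +(other)
    Pat.new("(?:#{source})(?:#{Pat.wrap(other).source})")
  end

  def |(other)
    Pat.new("(?:#{source})|(?:#{Pat.wrap(other).source})")
  end

  def case_insensitive
    Pat.new("(?i:#{source})")
  end

  def one_or_more   # assumed helper, not in the original snippet
    Pat.new("(?:#{source})+")
  end

  def to_regexp
    Regexp.new(source)
  end

  # Lets a pattern be used directly in `case ... when`.
  def ===(text)
    to_regexp === text
  end
end

class CharClass < Pat
  def initialize(range)
    super("[#{Regexp.escape(range.first)}-#{Regexp.escape(range.last)}]")
  end
end

class Literal < Pat
  def initialize(text)
    super(Regexp.escape(text))
  end
end

letters = CharClass.new('a'..'z').case_insensitive.one_or_more
a = letters + "@" + letters + "." + (Literal.new("com") | "net" | "org" | "edu")

case "write to Foo@bar.com"
when a then puts "found an address"
end
```

The case-insensitive flag stays scoped to the letter class, so the "com"/"net"/"org"/"edu" alternatives remain case-sensitive even though they are part of the same final expression.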
>
> Also, was your library a wrapper for underlying PERL RegExp? or was it
> the whole RegExp engine?
>
It was in Python; instances of my 'rex' class simply construct and use
Python patterns, and their associated
functions, internally and invisibly to the user.

Ken
> Thanks,
> Ari
>

Robert Klemme

8/7/2007 5:31:00 AM

On 07.08.2007 05:10, Ari Brown wrote:
> I'm moderately serious. This is going to be one of those projects that
> won't see the light of day for maybe 6 months to a year.
> This looks largely like what I was hoping to make, although in Ruby I had
> envisioned this:
>
> matching email addresses (sample):
> a = LeetExp.new(:letters => [[a-z], :insensitive],
> :string => "@",
> :letters => [[a-z], :insensitive],
> :string => ".",
> :string => ["com", "net", "org", "edu"]
> )

You cannot do this because Hashes are unordered so you lose the
original order. Also [a-z] is only valid if you define local variables
a and z.
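
Ordering aside, a Hash also stores each key only once, so the repeated :letters and :string keys in the LeetExp.new call would overwrite one another. A minimal sketch of the difference, where STEPS is a made-up stand-in for that argument list and an array of pairs is one possible alternative:

```ruby
# STEPS is a hypothetical stand-in for the arguments passed to LeetExp.new.
STEPS = [
  [:letters, "a-z"],
  [:string,  "@"],
  [:letters, "a-z"],
  [:string,  "."]
]

as_hash = STEPS.to_h

# Later pairs overwrite earlier ones with the same key.
puts as_hash.size        # 2
puts as_hash[:string]    # .

# The array keeps every step, in the order written.
puts STEPS.size          # 4
```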

Personally I find regular expressions pretty readable - at least if they
are crafted properly. :-) See also below.

> case line
> when a
> # ...
>
> My idea is to make it logical and human readable. Ruby is a language for
> humans and UberBeings, and I think this should reflect Ruby's ideas.

Do you know the /x modifier? That can go a long way to make a regular
expression readable. For example:

input = <<TEXT
adjasdkajda dadkajd foo@bar.com adklskkdaldjskj
postmaster@root.edu adkjasdjk
blah@org akjsd askdl asd noname@foo.net hello
asdj
TEXT

input.scan %r{
  \b                    # word boundary
  (?i:[a-z]+)           # user name
  @                     # the famous "at" sign
  (?i:[a-z]+)           # host name
  \.                    # a literal dot
  (?:com|net|org|edu)   # only some of the TLDs
  \b                    # word boundary
}x do |match|
  puts "Found email address #{match}"
end

Kind regards

robert

David A. Black

8/7/2007 8:53:00 AM

Wolfgang Nádasi-donner

8/7/2007 11:01:00 AM

Ari Brown wrote:
> Is there an alternate RegExp "language" to the current one in Ruby
> and Perl?

Snobol4 patterns are now available as a Python library. It should be
possible to port it to Ruby. I don't think that the implementation is
complete, because I didn't see the possibility of recursive pattern
definitions, which give Snobol4 its extreme power.
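
To illustrate what a recursive pattern definition buys you, here is a small sketch using the subexpression-call syntax (\g<name>) of Ruby's own regexp engine in Ruby 1.9 and later. It is not Snobol4 and not the Python library mentioned above; it only shows a pattern that refers to itself, here to match balanced parentheses.

```ruby
# \g<paren> re-enters the group named "paren", so the pattern can
# describe arbitrarily deep nesting.
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

puts BALANCED.match?("(a(b)c)")   # true
puts BALANCED.match?("(a(b)c")    # false
```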

Infos

http://permalink.gmane.org/gmane.comp.python.ann... (Snobol4 in
Python)

http://en.wikipedia.org/w... (has some links)

Wolfgang Nádasi-Donner
--
Posted via http://www.ruby-....

comp.lang.ruby
