Asp Forum - Re: #scan with or'd (`|`) subexpressions.

Warren Brown

11/11/2004 3:55:00 PM

T.,

> Does the new Ruby regexp engine do this?
>
> irb(main):001:0> '1234'.scan(/(1)(2)|(3)(4)/)
> => [["1", "2", nil, nil], [nil, nil, "3", "4"]]
> irb(main):002:0>
>
> Why would all the subexpressions be listed when there
> is an `|` (or) used?

For collecting matches, Ruby simply looks at the opening parentheses -
nothing else. The part of the string matched by the portion of the regular
expression delimited by the first opening parenthesis and its matching
closing parenthesis will be the first capture, the second opening
parenthesis and its matching closing parenthesis will define the next
capture, etc. A group that takes no part in the match is reported as nil.
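
A minimal sketch of that numbering rule, run against a current Ruby rather than the interpreter discussed in this thread. The constant names are only illustrative:

```ruby
# Groups are numbered by the position of their opening parenthesis,
# left to right, regardless of which side of `|` they sit on.
ALTERNATION = /(1)(2)|(3)(4)/

# Only the right-hand alternative can match "34", so groups 1 and 2 are nil.
RIGHT_ONLY = ALTERNATION.match('34').captures
p RIGHT_ONLY   # => [nil, nil, "3", "4"]

# Nested groups follow the same rule: the outer group opens first.
NESTED = /((a)(b))|(c)/.match('ab').captures
p NESTED       # => ["ab", "a", "b", nil]
```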

I have not yet had an opportunity to play with Oniguruma, so I can't
say definitively if it behaves the same way. However, I would be very
surprised if it didn't, since virtually every other language behaves
this way.

I hope this helps.

- Warren Brown

4 Answers

T. Onoma

11/11/2004 4:39:00 PM

On Thursday 11 November 2004 10:54 am, Warren Brown wrote:
| T.,
|
| > Does the new Ruby regexp engine do this?
| >
| > irb(main):001:0> '1234'.scan(/(1)(2)|(3)(4)/)
| > => [["1", "2", nil, nil], [nil, nil, "3", "4"]]
| > irb(main):002:0>
| >
| > Why would all the subexpressions be listed when there
| > is an `|` (or) used?
|
| For collecting matches, Ruby simply looks at opening parentheses -
| nothing else. The part of the string matched by the regular expression
| delimited by the first open parenthesis and its matching close
| parenthesis will be the first match, the second opening parenthesis and
| its matching close parenthesis will define the next match, etc.

I see. Perhaps there is good reason for this. But I just don't see it. In
practice it causes me to have to strip out a whole lot of nils from the
results. Honestly, I can't see how it makes any sense. The regexp will match
on the first "or" that succeeds, right? So all the others are by necessity
nil. But perhaps I'm overlooking some possibility.
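
When it really does not matter which alternative matched, the nils can be dropped in one step with Array#compact. A small sketch using the pattern from the original question (the constant name is hypothetical):

```ruby
# Each scan result is an array with one slot per group; compact removes
# the nil slots left by the alternatives that did not match.
PAIRS = '1234'.scan(/(1)(2)|(3)(4)/).map(&:compact)
p PAIRS   # => [["1", "2"], ["3", "4"]]
```

This throws away the information about which alternative matched, so it only fits cases where the captures are interchangeable.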

| I have not yet had an opportunity to play with Oniguruma, so I can't
| say definitively if it behaves the same way. However, I would be very
| surprised if it didn't, since virtually every other language behaves
| this way.

Neither have I. But I do hope Oniguruma is better than "every other". By the
way, have you read about Perl 6's new RE engine? I must say it looks pretty
sweet. These are definitely not your average everyday Regular Expressions.
It now allows you to create your own rules, encapsulate them, and reuse
them -- much more like a grammar parser.

Thanks,
T.

Peter

11/11/2004 4:57:00 PM

T. Onoma

11/12/2004 2:56:00 AM

On Thursday 11 November 2004 11:57 am, Peter wrote:
| > I see. Perhaps there is good reason for this. But I just don't see it. In
| > practice it causes me to have to strip out a whole lot of nils from the
| > results. Honestly, I can't see how it makes any sense. The regexp will
| > match on the first "or" that succeeds, right? So all the others are by
| > necessity nil. But perhaps I'm overlooking some possibility.
|
| If those nils are stripped, you lose information about which "or"
| succeeded. In some cases that is not important in that it does not matter
| where the captures come from as long as they are interchangeable, but that
| is certainly not always so. Also it makes interpretation of the captures
| very difficult when the different "ors" have a different number of
| captures:
|
| /(?:(1)(a)|(-))(?:(2)|(b)(\+))/
|
| With nils stripped, this can return
| ['1','a','2'],['1','a','b','+'],['-','2'] or ['-','b','+']. If each
| capture needs a different treatment, there's no way to relate the correct
| treatment to the index in the array of captures.
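
To make the quoted point concrete, here is a sketch using that pattern, with the `+` escaped so that it compiles. The helper name `slots` is hypothetical:

```ruby
PATTERN = /(?:(1)(a)|(-))(?:(2)|(b)(\+))/

# Hypothetical helper: return the six capture slots for one input string.
def slots(str)
  PATTERN.match(str).captures
end

# With nils kept, each slot always means the same thing.
p slots('1a2')   # => ["1", "a", nil, "2", nil, nil]
p slots('1ab+')  # => ["1", "a", nil, nil, "b", "+"]
p slots('-2')    # => [nil, nil, "-", "2", nil, nil]
p slots('-b+')   # => [nil, nil, "-", nil, "b", "+"]

# With nils stripped, two different shapes collapse to the same length.
p slots('1a2').compact  # => ["1", "a", "2"]
p slots('-b+').compact  # => ["-", "b", "+"]
```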

Hi --

Well, I'm not sure how it helps. What I ended up doing was making sure all my
expressions did have the _same number_ of sub-expressions (in this case 7).
So then I could count the preceding nils and divide by 7 to find out which
match. But that's a hack IMHO.
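
The counting trick described above can be sketched roughly as follows, scaled down to two groups per alternative instead of seven. The helper and constant names are hypothetical, and the sketch assumes the first group of every alternative always takes part in the match; an optional leading group would throw the count off:

```ruby
GROUPS_PER_ALTERNATIVE = 2

# Hypothetical helper: for each match, return [alternative_index, captures].
def tag_alternatives(str, regexp, groups_per_alt)
  str.scan(regexp).map do |caps|
    # The number of leading nils tells us how many alternatives were skipped.
    leading_nils = caps.take_while(&:nil?).size
    alt = leading_nils / groups_per_alt
    [alt, caps[alt * groups_per_alt, groups_per_alt]]
  end
end

TAGGED = tag_alternatives('1234', /(1)(2)|(3)(4)/, GROUPS_PER_ALTERNATIVE)
p TAGGED   # => [[0, ["1", "2"]], [1, ["3", "4"]]]
```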

Matz, yourself, and others in the past have all mentioned being able to
figure out which alternative matched, but how?

Thanks,
T.

Simon Strandgaard

11/12/2004 7:19:00 AM

On Friday 12 November 2004 03:55, trans. (T. Onoma) wrote:
[snip]
> Well, I'm not sure how it helps. What I ended up doing was making sure all
> my expressions did have the _same number_ of sub-expressions (in this case
> 7). So then I could count the preceding nils and divide by 7 to find out
> which match. But that's a hack IMHO.
>
> Matz, yourself, and others past have all mentioned being able to figure out
> which match, but how?

I don't know if this helps..

bash-2.05b$ ruby a.rb
"lab0"          => [["0", nil, nil]]
"version1-beta" => [[nil, "1", nil]]
"go2ruby"       => [[nil, nil, "2"]]
"1 goto 1"      => [[nil, "1", nil], [nil, "1", nil]]
"2 1 0"         => [[nil, nil, "2"], [nil, "1", nil], ["0", nil, nil]]
bash-2.05b$ expand -t2 a.rb
def s(str)
  m = str.scan(/(0)|(1)|(2)/)
  puts "#{str.inspect.ljust(15)} => #{m.inspect}"
end
s "lab0"
s "version1-beta"
s "go2ruby"
s "1 goto 1"
s "2 1 0"
bash-2.05b$
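
Building on the output above, one possible way to turn the nil positions into an answer to "which alternative matched" is to look up the index of the non-nil slot. This is only a sketch; `LABELS` and `which_alternative` are hypothetical names, and it assumes one capture group per alternative as in a.rb:

```ruby
LABELS = %w[zero one two].freeze

# Hypothetical helper: name the alternative behind each match.
def which_alternative(str)
  str.scan(/(0)|(1)|(2)/).map do |caps|
    # Exactly one slot is non-nil; its index identifies the alternative.
    LABELS[caps.index { |c| !c.nil? }]
  end
end

p which_alternative("2 1 0")     # => ["two", "one", "zero"]
p which_alternative("1 goto 1")  # => ["one", "one"]
```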

--
Simon Strandgaard

comp.lang.ruby
