Ross Bamford
1/7/2006 3:37:00 AM
On Sat, 07 Jan 2006 02:56:50 -0000, <dblack@wobblini.net> wrote:
> Hi --
>
> For some reason, lookbehind and alternation seem not to be playing
> together in a little Oniguruma test. This is based on the string
> splitting thread from a little while ago this evening, and uses a CVS
> 1.9.0 Ruby acquired about 1/2 an hour ago.
>
> str = %Q{abc def "ghi jkl" mno}
>
> # Look for "..." but just get the ... part:
> re1 = /(?<=")[^"]+(?=")/
>
> # Test that:
> p str.scan(re1) # => ["ghi jkl"]
>
> # Now, do the same thing *or* \S+. This should, I think,
> # pick up the abc, def, and mno substrings too.
>
> re2 = /((?<=")[^"]+(?="))|(\S+)/
>
> # But it doesn't; the part before the alternation never
> # matches, even though it did before (as shown by the
> # captures):
>
> p str.scan(re2)
> # => [[nil, "abc"], [nil, "def"], [nil, "\"ghi"], [nil, "jkl\""],
> # [nil, "mno"]]
>
> I know that's all a bit cluttered, but the basic thing is that a
> sub-pattern using lookbehind doesn't seem to match any more when
> there's an alternation. Instead, only the second alternative ever
> matches.
>
> Does anyone know why?
I'm not at all sure about this, but this is my take on it. Firstly, is
this the behaviour you expected?
str = %Q{abc def "ghi jkl" mno}
re2 = /(?:(?<=")[^"]+(?="))|\S+/
p str.scan(re2)
# => ["abc", "def", "\"ghi", "jkl\"", "mno"]
?
If so, then I believe the problem has something to do with lookaround
being atomic. When it is combined with capturing groups and alternation
you can run into trouble, because the regexp immediately forgets the
(zero-width, remember) lookaround match, so by the time it gets to that
'or' it no longer has the information to compare against.
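One way to poke at this take is to look at where each alternative can start matching on its own. This is only a sketch: the names quoted, word, re3 and tokens are mine, not from the original test, and re3 is just one possible workaround. It suggests that at the opening quote the lookbehind has nothing behind it to match yet, so \S+ takes the quote and scan resumes after "ghi, past the only spot where the first alternative could have started.

```ruby
str = %Q{abc def "ghi jkl" mno}

quoted = /(?<=")[^"]+(?=")/   # first alternative on its own
word   = /\S+/                # second alternative on its own

# The lookbehind alternative can only start just after a quote.
quoted_start = str.index(quoted)
p quoted_start                      # => 9

# At the opening quote itself only \S+ can match, and it consumes the quote.
m = word.match(str, str.index('"'))
word_start = m.begin(0)
word_end   = m.end(0)
p [word_start, word_end, m[0]]      # => [8, 12, "\"ghi"]

# scan carries on from word_end (12), which is already past quoted_start (9).

# Assumed workaround: consume the quotes and capture what is between them.
re3 = /"([^"]+)"|(\S+)/
tokens = str.scan(re3).map { |in_quotes, bare| in_quotes || bare }
p tokens                            # => ["abc", "def", "ghi jkl", "mno"]
```

If that reading is right, the alternation is behaving normally and it is the order in which scan walks the string that hides the first alternative.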
Generally, regexp engines place restrictions on lookaround (especially
lookbehind) matching. So far my experiments with Oniguruma suggest it's
fairly sophisticated in this respect, supporting things like varying-width
alternations, fixed repetition and optional groups in lookbehind, but of
course still no star or plus.
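A few of those lookbehind cases can be tried directly. This is a rough sketch with made-up patterns (alt, rep and plus_error are my names), and it assumes a Ruby whose regexp engine is Oniguruma or its Onigmo descendant; other builds may behave differently. I have left optional groups out because I have not checked them.

```ruby
# Alternatives of different widths at the top level of a lookbehind.
alt = /(?<=a|bc)d/
p "ad"  =~ alt    # => 1
p "bcd" =~ alt    # => 2

# Fixed repetition inside a lookbehind.
rep = /(?<=a{2})b/
p "aab" =~ rep    # => 2
p "ab"  =~ rep    # => nil

# An unbounded quantifier inside a lookbehind is rejected when the
# pattern is compiled, so build it at runtime and rescue the error.
plus_error = begin
  Regexp.new('(?<=a+)b')
  nil
rescue RegexpError => e
  e.class
end
p plus_error
```

Building the last pattern with Regexp.new rather than a literal keeps the failure at runtime, so the rest of the script still loads.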
Anyway, that's what I think. Hope it helps :)
Cheers,
--
Ross Bamford - rosco@roscopeco.remove.co.uk