crm_114
12/2/2007 12:39:00 AM
Hello all - this was my first post, so thank you James, Jordan,
yermej, Daniel, Raul and Tanaka. I really appreciate the help. I've
learned quite a bit - especially about negative lookahead. Tricky
stuff.
As you might expect, I am working on a script to scan a big file for
multiple cat...dog pairs. Well not cats and dogs, more like time
stamps, etc.
I've tried all the solutions posted here and they work very nicely for
the string I posted. But I botched the original specification in at
least two ways:
1. I should have mentioned that there might be multiple cat...dog
pairs. Then James would probably have given me something looking like
Tanaka's solution, which uses negative lookahead but only consumes one
character at a time.
2. I should have mentioned that the characters between cat and dog
should contain neither 'cat' nor 'dog'. I'm really looking for the
'innermost' cat...dog pair.
So I tried Daniel's solution and that works pretty well. If there is
a way to break it I did not find it.
I also tweaked Tanaka's solution a little bit after a trip back to the
pickaxe book. I added the (?: to keep the group from generating
backreferences, and I added an alternate 'dog' to the negative
assertion. This is confusing but it seems to work:
irb(main):001:0> x = 'cat sheep horse cat tac dog cat cat sheep dog
dog '
=> "cat sheep horse cat tac dog cat cat sheep dog dog "
irb(main):002:0> x.scan(/cat(?:(?!cat|dog).)*dog/)
=> ["cat tac dog", "cat sheep dog"]
Here is my attempt at breaking it down, for anyone who is interested:
/cat(?:(?!cat|dog).)*dog/
1 2 3 4 5 6
1. The engine starts scanning through the string until it matches
'cat'.
2. At the first position after the t in cat, the engine tries to
match this expression (?:(?!cat|dog).), zero or more times. The (?:
is nothing special. It is just a grouping that does not generate
backreferences.
3. The negative lookahead assertion. If the regular expression
following the ! matches the string, starting from the current position
of the engine, then the expression to which the assertion belongs will
fail. But since the negative assertion belongs to a grouping that can
match 0 or more times, the assertion can fail but the full regular
expression can succeed.
The hard part for me is that the negative lookahead assertion does
not consume any characters. It looks ahead, tries to match its
regular expression, and whether it returns 'MATCH' or 'NO MATCH', the
current position of the engine stays the same.
4. The regexp for the negative assertion. The stuff between each
cat...dog pair should contain neither 'cat' nor 'dog'.
5. One character is consumed, the engine moves down the string one
position. Thanks to the negative lookahead, we know that that the
consumed character is not the start of a 'cat' or 'dog' substring.
6. Eventually, the (?:(?!cat|dog).)* fails when the current position
of the engine reaches the beginning of a 'dog' substring. But then
the last part of the regular expression (6) will match.
---
Also, another way of getting the 'innermost' cat...dog pair is to use
non-greedy kleene star(*?). This way, the engine will take the first
'dog' suffix that it finds, rather than the last possible:
irb(main):003:0> x.scan(/cat(?:(?!cat).)*?dog/)
=> ["cat tac dog", "cat sheep dog"]
So there is more than one way to scan a cat! (Sorry!!!)
....In any event, I think I still like Daniel's solution better,
because I can look at it and feel fairly certain that it will do
exactly what it should do.
Thanks--
Josh