Asp Forum - Negate a character sequence in a regular expression?

crm_114

11/29/2007 9:48:00 PM

For the following string:

'cat sheep horse cat tac dog'

I would like to write a regular expression that matches any substring
that is prefixed by the word 'cat', is then followed by any characters
as long as those characters do not comprise the word 'cat', and then
finally suffixed by the string 'dog'. Therefore, this expression
should match the substring 'cat tac dog' in the above string.

Obviously, if I write an expression like:

irb(main):002:0> /cat.*dog/.match('cat sheep horse cat tac dog').to_s
=> "cat sheep horse cat tac dog"

it will match the entire string.

And the non-greedy Kleene doesn't buy me anything either since the
expression matches the first cat found anyway:

irb(main):003:0> /cat.*?dog/.match('cat sheep horse cat tac dog').to_s
=> "cat sheep horse cat tac dog"

What I think I want to do is to negate a sequence of characters,
rather than just a character class, but I have looked around and not
found anything quite right.

Of course, there are ways of hacking this out, e.g. I could reverse
the string first and match 'god' followed by the first instance of
'tac', but I am hoping there is a more elegant way to do this with a
single regular expression.

Thanks--

11 Answers

James Moore

11/29/2007 11:07:00 PM

I think this gets you close to where you want to be:

irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
=> 16
irb(main):046:0> $1
=> " tac "
irb(main):047:0>

The (?! bit is a nonmatching lookahead.

On Nov 29, 2007 2:07 PM, <crm_114@mac.com> wrote:
> For the following string:
>
> 'cat sheep horse cat tac dog'
>
> I would like to write a regular expression that matches any substring
> that is prefixed by the word 'cat', is then followed by any characters
> as long as those characters do not comprise the word 'cat', and then
> finally suffixed by the string 'dog'. Therefore, this expression
> should match the substring 'cat tac dog' in the above string.

I think this gets you closer to where you want to be:

irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
=> 16
irb(main):046:0> $1
=> " tac "
irb(main):047:0>

The (?! bit is a nonmatching lookahead.

--
James Moore | james@restphone.com
Ruby and Ruby on Rails consulting
blog.restphone.com

MonkeeSage

11/30/2007 12:26:00 AM

On Nov 29, 3:47 pm, crm_...@mac.com wrote:
> For the following string:
>
> 'cat sheep horse cat tac dog'
>
> I would like to write a regular expression that matches any substring
> that is prefixed by the word 'cat', is then followed by any characters
> as long as those characters do not comprise the word 'cat', and then
> finally suffixed by the string 'dog'. Therefore, this expression
> should match the substring 'cat tac dog' in the above string.
>
> Obviously, if I write an expression like:
>
> irb(main):002:0> /cat.*dog/.match('cat sheep horse cat tac dog').to_s
> => "cat sheep horse cat tac dog"
>
> it will match the entire string.
>
> And the non-greedy Kleene doesn't buy me anything either since the
> expression matches the first cat found anyway:
>
> irb(main):003:0> /cat.*?dog/.match('cat sheep horse cat tac dog').to_s
> => "cat sheep horse cat tac dog"
>
> What I think I want to do is to negate a sequence of characters,
> rather than just a character class, but I have looked around and not
> found anything quite right.
>
> Of course, there are ways of hacking this out, e.g. I could reverse
> the string first and match 'god' followed by the first instance of
> 'tac', but I am hoping there is a more elegant way to do this with a
> single regular expression.
>
> Thanks--

If you just want the right-most match of the substring prefixed by
'cat' and suffixed by 'dog', this should work: /.*(cat.*dog)/.match(s)
[1].

Regards,
Jordan

yermej

11/30/2007 1:22:00 AM

On Nov 29, 5:06 pm, James Moore <jamesthepi...@gmail.com> wrote:
> On Nov 29, 2007 2:07 PM, <crm_...@mac.com> wrote:
>
> > For the following string:
>
> > 'cat sheep horse cat tac dog'
>
> > I would like to write a regular expression that matches any substring
> > that is prefixed by the word 'cat', is then followed by any characters
> > as long as those characters do not comprise the word 'cat', and then
> > finally suffixed by the string 'dog'. Therefore, this expression
> > should match the substring 'cat tac dog' in the above string.
>
> I think this gets you closer to where you want to be:
>
> irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
> => 16
> irb(main):046:0> $1
> => " tac "
> irb(main):047:0>
>
> The (?! bit is a nonmatching lookahead.

Nonmatching (or negative) lookahead is what you want, and with some
adjustment of the capture you get:

> 'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
=> 16
> $1
=> "cat tac dog"

Daniel Sheppard

11/30/2007 1:24:00 AM

> For the following string:
>=20
> 'cat sheep horse cat tac dog'
>=20
> I would like to write a regular expression that matches any substring
> that is prefixed by the word 'cat', is then followed by any characters
> as long as those characters do not comprise the word 'cat', and then
> finally suffixed by the string 'dog'. Therefore, this expression
> should match the substring 'cat tac dog' in the above string.

Working out negative regular expressions is normally best avoided.

One step is hard. Two steps is not:

x =3D 'cat sheep horse cat tac dog'
/(cat.*?dog)/.match(x) && $1.sub(/.*cat/,'cat')

Or if you want multiple matches:

x =3D 'cat sheep horse cat tac dog cat cat sheep dog'
x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
=3D> ["cat tac dog", "cat sheep dog"]

Dan.

Raul Raul

12/1/2007 8:19:00 AM

yermej wrote:
>
> Nonmatching (or negative) lookahead is what you want, and with some
> adjustment of the capture you get:
>
> 'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
> => 16
> $1
> => "cat tac dog"

Negative lookaheads that contain '.*' are hard to comprehend (at least
for me). It is enough to add a 'cat' at the end and the regexp does not
find any more the 'cat tac dog' that should be matched:

'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
=> nil

Also notice that, when it works, it will always give the last expression
present:
'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
=> 33
p $1 # => "cat dog"

Daniel Sheppard wrote:
> Working out negative regular expressions is normally best avoided.

I would say that it is true when they contain '.*' type expressions;
else they can be extremely useful.

> Or if you want multiple matches:
>
> x = 'cat sheep horse cat tac dog cat cat sheep dog'
> x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
> => ["cat tac dog", "cat sheep dog"]

Very ingenious...

Raul
--
Posted via http://www.ruby-....

yermej

12/1/2007 9:10:00 AM

On Dec 1, 2:19 am, Raul Parolari <raulparol...@gmail.com> wrote:
> yermej wrote:
>
> > Nonmatching (or negative) lookahead is what you want, and with some
> > adjustment of the capture you get:
>
> > 'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
> > => 16
> > $1
> > => "cat tac dog"
>
> Negative lookaheads that contain '.*' are hard to comprehend (at least
> for me). It is enough to add a 'cat' at the end and the regexp does not
> find any more the 'cat tac dog' that should be matched:
>
> 'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
> => nil
>
> Also notice that, when it works, it will always give the last expression
> present:
> 'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
> => 33
> p $1 # => "cat dog"
>
> Daniel Sheppard wrote:
> > Working out negative regular expressions is normally best avoided.
>
> I would say that it is true when they contain '.*' type expressions;
> else they can be extremely useful.
>
> > Or if you want multiple matches:
>
> > x = 'cat sheep horse cat tac dog cat cat sheep dog'
> > x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
> > => ["cat tac dog", "cat sheep dog"]
>
> Very ingenious...
>
> Raul
> --
> Posted viahttp://www.ruby-....

Thanks, Raul, for the clarification on that.

To the original poster, I apologize for the misinformation. I guess
when I'm not completely sure about such things, I should start a new
thread, but attempting to answer questions here (some of which I do
get right) has been a great help to me in my own learning. I think
that starting new threads on all occasions would get to be too much
and I probably wouldn't get many responses. In the future, I may just
keep quiet until I'm either sure I have the correct answer or at least
don't understand why my answer isn't correct.

MonkeeSage

12/1/2007 9:29:00 AM

On Dec 1, 3:10 am, yermej <yer...@gmail.com> wrote:
> On Dec 1, 2:19 am, Raul Parolari <raulparol...@gmail.com> wrote:
>
>
>
> > yermej wrote:
>
> > > Nonmatching (or negative) lookahead is what you want, and with some
> > > adjustment of the capture you get:
>
> > > 'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
> > > => 16
> > > $1
> > > => "cat tac dog"
>
> > Negative lookaheads that contain '.*' are hard to comprehend (at least
> > for me). It is enough to add a 'cat' at the end and the regexp does not
> > find any more the 'cat tac dog' that should be matched:
>
> > 'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
> > => nil
>
> > Also notice that, when it works, it will always give the last expression
> > present:
> > 'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
> > => 33
> > p $1 # => "cat dog"
>
> > Daniel Sheppard wrote:
> > > Working out negative regular expressions is normally best avoided.
>
> > I would say that it is true when they contain '.*' type expressions;
> > else they can be extremely useful.
>
> > > Or if you want multiple matches:
>
> > > x = 'cat sheep horse cat tac dog cat cat sheep dog'
> > > x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
> > > => ["cat tac dog", "cat sheep dog"]
>
> > Very ingenious...
>
> > Raul
> > --
> > Posted viahttp://www.ruby-....
>
> Thanks, Raul, for the clarification on that.
>
> To the original poster, I apologize for the misinformation. I guess
> when I'm not completely sure about such things, I should start a new
> thread, but attempting to answer questions here (some of which I do
> get right) has been a great help to me in my own learning. I think
> that starting new threads on all occasions would get to be too much
> and I probably wouldn't get many responses. In the future, I may just
> keep quiet until I'm either sure I have the correct answer or at least
> don't understand why my answer isn't correct.

Not to blow my own horn, but I think the behavior requested by the OP
is still /.*(cat.*dog)/...("a regular expression that matches any
substring
that is prefixed by the word 'cat', is then followed by any characters
as long as those characters do not comprise the word 'cat', and then
finally suffixed by the string 'dog'").

But don't feel like you have to be 100% correct in order to help out.
My answers are often bass-ackwards (not really a bragging point, heh).
But the point is, like you say, to learn from each other and help
where we can. None of us knows it all; all we can do is our best, and
pray it might help someone here or there. :)

Regards,
Jordan

Tanaka Akira

12/1/2007 9:35:00 AM

In article <238b37f9-2df1-42b8-b821-dbb6c24a9d9e@s8g2000prg.googlegroups.com>,
crm_114@mac.com writes:

> For the following string:
>
> 'cat sheep horse cat tac dog'
>
> I would like to write a regular expression that matches any substring
> that is prefixed by the word 'cat', is then followed by any characters
> as long as those characters do not comprise the word 'cat', and then
> finally suffixed by the string 'dog'. Therefore, this expression
> should match the substring 'cat tac dog' in the above string.

% /usr/bin/ruby -e 'p /cat((?!cat).)*dog/.match("cat sheep horse cat tac dog").to_s'
"cat tac dog"
--
Tanaka Akira

Raul Raul

12/1/2007 8:23:00 PM

yermej wrote:

>> Raul
>> --
>> Posted viahttp://www.ruby-....
>
> Thanks, Raul, for the clarification on that.
>
> To the original poster, I apologize for the misinformation.
> In the future, I may just keep quiet until I'm either sure I have
> the correct answer or at least
> don't understand why my answer isn't correct.

yermej,

I often follow your postings, that I find very interesting. But do not
take hard this; no one can be perfect all of the time.

In this case, it is very easy to fall in the alluring trap of the
negative lookahead with '.*' (see how many people keep posting a variant
of that solution.. ?!).

However, I totally share one's feeling of dismay when we give an advice
that turns out not to be right (or just not completely right); we just
have to accept that we are fallible.

In this case hats off to Daniel Sheppard, who showed us how to do it (I
rewrite it to reinstate it, after all the emails that followed):

x = 'cat sheep horse cat tac dog cat cat sheep dog cat'

x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}

=> ["cat tac dog", "cat sheep dog"]

Absolutely brilliant

Raul

--
Posted via http://www.ruby-....

crm_114

12/2/2007 12:39:00 AM

Hello all - this was my first post, so thank you James, Jordan,
yermej, Daniel, Raul and Tanaka. I really appreciate the help. I've
learned quite a bit - especially about negative lookahead. Tricky
stuff.

As you might expect, I am working on a script to scan a big file for
multiple cat...dog pairs. Well not cats and dogs, more like time
stamps, etc.

I've tried all the solutions posted here and they work very nicely for
the string I posted. But I botched the original specification in at
least two ways:

1. I should have mentioned that there might be multiple cat...dog
pairs. Then James would probably have given me something looking like
Tanaka's solution, which uses negative lookahead but only consumes one
character at a time.

2. I should have mentioned that the characters between cat and dog
should contain neither 'cat' nor 'dog'. I'm really looking for the
'innermost' cat...dog pair.

So I tried Daniel's solution and that works pretty well. If there is
a way to break it I did not find it.

I also tweaked Tanaka's solution a little bit after a trip back to the
pickaxe book. I added the (?: to keep the group from generating
backreferences, and I added an alternate 'dog' to the negative
assertion. This is confusing but it seems to work:

irb(main):001:0> x = 'cat sheep horse cat tac dog cat cat sheep dog
dog '
=> "cat sheep horse cat tac dog cat cat sheep dog dog "
irb(main):002:0> x.scan(/cat(?:(?!cat|dog).)*dog/)
=> ["cat tac dog", "cat sheep dog"]

Here is my attempt at breaking it down, for anyone who is interested:
/cat(?:(?!cat|dog).)*dog/
1 2 3 4 5 6

1. The engine starts scanning through the string until it matches
'cat'.

2. At the first position after the t in cat, the engine tries to
match this expression (?:(?!cat|dog).), zero or more times. The (?:
is nothing special. It is just a grouping that does not generate
backreferences.

3. The negative lookahead assertion. If the regular expression
following the ! matches the string, starting from the current position
of the engine, then the expression to which the assertion belongs will
fail. But since the negative assertion belongs to a grouping that can
match 0 or more times, the assertion can fail but the full regular
expression can succeed.
The hard part for me is that the negative lookahead assertion does
not consume any characters. It looks ahead, tries to match its
regular expression, and whether it returns 'MATCH' or 'NO MATCH', the
current position of the engine stays the same.

4. The regexp for the negative assertion. The stuff between each
cat...dog pair should contain neither 'cat' nor 'dog'.

5. One character is consumed, the engine moves down the string one
position. Thanks to the negative lookahead, we know that that the
consumed character is not the start of a 'cat' or 'dog' substring.

6. Eventually, the (?:(?!cat|dog).)* fails when the current position
of the engine reaches the beginning of a 'dog' substring. But then
the last part of the regular expression (6) will match.

---

Also, another way of getting the 'innermost' cat...dog pair is to use
non-greedy kleene star(*?). This way, the engine will take the first
'dog' suffix that it finds, rather than the last possible:

irb(main):003:0> x.scan(/cat(?:(?!cat).)*?dog/)
=> ["cat tac dog", "cat sheep dog"]

So there is more than one way to scan a cat! (Sorry!!!)

....In any event, I think I still like Daniel's solution better,
because I can look at it and feel fairly certain that it will do
exactly what it should do.

Thanks--
Josh

comp.lang.ruby

Negate a character sequence in a regular expression?

crm_114

James Moore

MonkeeSage

yermej

Daniel Sheppard

Raul Raul

yermej

MonkeeSage

Tanaka Akira

Raul Raul

crm_114

x Login to ForumsZone