Asp Forum - (Maybe) a simple question about regex

Sam Kong

3/24/2005 1:48:00 AM

Hello!

I think that I am missing a very simple concept about regex.

s = '0123456789'
s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]

Now I want to exclude "45".
How can I express it in the regex?
When it's only one character, I can use ^.
But for 2 characters, I don't think I can use it.

What I want is:

s = '0123456789'
s.scan(some_regex) #-> ["01", "23", "67", "89"]

What should some_regex be?

Can somebody help me?

Sam

8 Answers

Assaph Mehr

3/24/2005 2:05:00 AM

> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]

Negative lookahead:
s.scan /(?!4|5)\d\d/
Note the OR sign ('|') between the digits, otherwise it would produce:
["01", "23", "56", "78"]

You need to tune it to your exact domain.

Cheers,
Assaph

Carlos

3/24/2005 2:08:00 AM

[Sam Kong <sam.s.kong@gmail.com>, 2005-03-24 02.49 CET]
> Hello!
>
> I think that I am missing a very simple concept about regex.
>
> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.

You can use a "negative lookahead assertion":

s.scan(/(?!45)\d\d/)

This means: at every point where the regex tries to match, "if the next
two characters aren't '45', match \d\d".
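
To see where that assertion lets the scan start a match, a small sketch can record the offset of each match next to the matched text. The variable name `matches` is only illustrative and not from the thread:

```ruby
s = '0123456789'

# Record [start offset, matched text] for every match.
# Inside the block, $~ is the MatchData of the current match
# and $& is the matched text.
matches = []
s.scan(/(?!45)\d\d/) { matches << [$~.begin(0), $&] }

matches.each { |offset, text| puts "#{offset}: #{text}" }
# matches is [[0, "01"], [2, "23"], [5, "56"], [7, "78"]]
```

The lookahead only rejects a match that would start at the "4". The scan then moves on by one character and can still match from the "5" onwards, so the later pairs start at odd offsets.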

HTH.
--

Jason Sweat

3/24/2005 2:09:00 AM

On Thu, 24 Mar 2005 10:49:49 +0900, Sam Kong <sam.s.kong@gmail.com> wrote:
> Hello!
>
> I think that I am missing a very simple concept about regex.
>
> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]
>
> What should some_regex be?

You can use a negative assertion to say you want to skip "45", but the
scan will then bump forward one character and you will end up with the
last matches being "56" and "78":

>> s.scan(/(?!45)\d\d/)
=> ["01", "23", "56", "78"]

So with a slightly uglier assertion, you can say:

>> s.scan(/(?!45|5)\d\d/)
=> ["01", "23", "67", "89"]

and get what you specified, but though it works for your toy case, I
would be worried that this might not extrapolate out to your real goal
well.

HTH

Regards,
Jason
http://blog.casey...

Patrick Hurley

3/24/2005 2:51:00 AM

What they said, but also, if you can be more precise about your real
problem, we might be able to model a better solution. For example, you
might find it more flexible to match the expression you want first and
then scan the result.

On Thu, 24 Mar 2005 11:09:51 +0900, Assaph Mehr <assaph@gmail.com> wrote:
>
> > s = '0123456789'
> > s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
> >
> > Now I want to exclude "45".
> > How can I express it in the regex?
> > When it's only one character, I can use ^.
> > But for 2 characters, I don't think I can use it.
> >
> > What I want is:
> >
> > s = '0123456789'
> > s.scan(some_regex) #-> ["01", "23", "67", "89"]
>
> Negative lookahead:
> s.scan /(?!4|5)\d\d/
> Note the OR sign ('|') between the digits, otherwise it would produce:
> ["01", "23", "56", "78"]
>
> You need to tune it to your exact domain.
>
> Cheers,
> Assaph
>
>

Robert Klemme

3/24/2005 8:09:00 AM

"Assaph Mehr" <assaph@gmail.com> schrieb im Newsbeitrag
news:1111629894.417238.111830@l41g2000cwc.googlegroups.com...
>
> > s = '0123456789'
> > s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
> >
> > Now I want to exclude "45".
> > How can I express it in the regex?
> > When it's only one character, I can use ^.
> > But for 2 characters, I don't think I can use it.
> >
> > What I want is:
> >
> > s = '0123456789'
> > s.scan(some_regex) #-> ["01", "23", "67", "89"]
>
> Negative lookahead:
> s.scan /(?!4|5)\d\d/
> Note the OR sign ('|') between the digits, otherwise it would produce:
> ["01", "23", "56", "78"]

But:

>> s = '01234567894657'
=> "01234567894657"
>> s.scan /(?!4|5)\d\d/
=> ["01", "23", "67", "89", "65"]
>> s.scan /\d\d/
=> ["01", "23", "45", "67", "89", "46", "57"]

IOW, you lose "46" and "57".

I prefer a non-RE solution in these cases as it's simpler:

>> s.scan(/\d\d/).reject {|x| "45" == x}
=> ["01", "23", "67", "89", "46", "57"]

Otherwise RE becomes really complex if you want to make it right - if it's
possible at all (see other postings).
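
For what it's worth, one regex-only approach that seems to keep the pairs aligned is to let the pattern consume "45" itself and capture only the other pairs. This is a sketch rather than something proposed in the thread, and the name `pairs` is made up for illustration:

```ruby
s = '01234567894657'

# "45" is tried first and matched without a capture group, so it is
# consumed and the scan stays aligned on pairs. Any other pair is captured.
# Because the pattern contains a group, scan returns one array per match,
# holding nil for the group whenever the "45" branch matched.
pairs = s.scan(/45|(\d\d)/).flatten.compact

p pairs
# prints ["01", "23", "67", "89", "46", "57"]
```

The result matches the reject version above for this input, though the reject version is arguably still easier to read.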

Kind regards

robert

Sam Kong

3/24/2005 9:06:00 AM

Thank you and other posters for the answers.
Actually s.scan(/(?!45)\d\d/) suffices for my real problem.

What I was actually trying to do was
extract URLs from an HTML source that includes a list of sites.
They are formatted like <a href="something.html">.
But I wanted to exclude <a href="index.html"> from the list.
So (?!index.html) will do.
Actually my toy case was not well-defined (I realized this later) and
thus it required more complex solutions like your second case -
s.scan(/(?!45|5)\d\d/) .
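
A sketch of how that could look when scanning the HTML directly, assuming the links are written exactly as <a href="..."> with double quotes. The sample `html` string and the name `links` are invented here. Note the escaped dot and the closing quote inside the lookahead, so that only "index.html" itself is skipped:

```ruby
# Hypothetical sample input; real pages will have more varied markup.
html = '<a href="index.html">Home</a> ' \
       '<a href="a.html">A</a> ' \
       '<a href="indexes.html">Indexes</a>'

# The negative lookahead sits right after the opening quote and rejects
# the link only when the whole attribute value is "index.html".
# The capture group makes scan return just the href values.
links = html.scan(/<a href="(?!index\.html")([^"]+)"/).flatten

p links
# prints ["a.html", "indexes.html"]
```

Without the escaped dot and the trailing quote, (?!index.html) would also skip names such as "index_html_notes.html" that merely begin with something resembling "index.html".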

I think a non-RE solution would be better, as Mr. Robert Klemme said.
But I wanted to learn some RE.

Thanks.
Sam

Simon Strandgaard

3/24/2005 11:00:00 AM

On Thu, 24 Mar 2005 18:09:50 +0900, Sam Kong <sam.s.kong@gmail.com> wrote:
> To extract url's from an html source which includes list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.

Does this help?

ary=%w(a.html index.html other.txt evil.html.exe stuff.html)
ary.select{|s| s =~ /\A(?!index).*\.html\z/ } #=> ["a.html", "stuff.html"]
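
One thing to watch with (?!index) is that it rejects every name that starts with "index". If only "index.html" itself should be dropped, a stricter lookahead might look like the sketch below. The extra entry "indexes.html" and the names `loose` and `strict` are added purely for illustration:

```ruby
ary = %w(a.html index.html indexes.html other.txt evil.html.exe stuff.html)

# Rejects anything beginning with "index", including "indexes.html".
loose = ary.select { |s| s =~ /\A(?!index).*\.html\z/ }

# Rejects only the exact name "index.html": the lookahead is anchored
# with \z and the dot is escaped.
strict = ary.select { |s| s =~ /\A(?!index\.html\z).*\.html\z/ }

p loose   # prints ["a.html", "stuff.html"]
p strict  # prints ["a.html", "indexes.html", "stuff.html"]
```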

--
Simon Strandgaard

Csaba Henk

3/25/2005 1:26:00 PM

On 2005-03-24, Sam Kong <sam.s.kong@gmail.com> wrote:
> What I was trying to solve was...
> To extract url's from an html source which includes list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.
> Actually my toy case was not well-defined (I realized this later) and
> thus it required more complex solutions like your second case -
> s.scan(/(?!45|5)\d\d/) .

Why don't you use a dedicated HTML parser? E.g. there's htmltokenizer,
available at RubyForge, quite lightweight and very easy to use, but
there are others, of course.

> I think non-RE solution would be better like Mr. Robert Klemme said.
> But I wanted to learn some RE.

This thread was useful, I admit :)

Csaba

comp.lang.ruby
