Asp Forum - Non-greediness in a regex - need some help verifying syntax

Wes Gamble

8/3/2006 9:31:00 PM

All,

Need some medium-level regex help.

Here's my regex: /~\^LNK:[\t\r\n]+?\^~/m

I'm trying to find all substrings of my big string that lie between the
~^LNK: and ^~ delimiter sequences and that have at least one tab,
carriage return, or newline character between those two delimiters. I
use the multiline option so that I can match on the newlines.

What I'm seeing is that the string consumed by this regex spans many,
many ~^LNK: ... ^~ pairs, so I am removing a bunch of tabs, newlines,
etc. that I don't want to remove.

I understand the concept of greediness in regexes, so I put the ? after
the [\t\r\n] sequence.

Why is the match spanning so many pairs of the delimiter sequences? Why
doesn't the regex engine stop attempting to match when it sees that
first ^~ after the ~^LNK:?

Any help is appreciated.

Thanks,
Wes

--
Posted via http://www.ruby-....

5 Answers

Wes Gamble

8/3/2006 10:10:00 PM

I realized I made an error when I did the original post.

Now my problem is that it won't find any of these occurrences.

So @bigstring.scan(/~\^LNK:[\t\r\n]+?\^~/m) isn't returning anything.
My guess is that this is because there are no occurrences of a tab,
carriage return, or newline character _immediately_ after ~^LNK:.

If I do /~\^LNK:.*?[\t\r\n]+?.*?\^~/m - that should pick up what I want,
correct?

Wes

--
Posted via http://www.ruby-....

Morton Goldberg

8/3/2006 10:51:00 PM

I'm not exactly an expert on regexes, to say the least, but I
think .*? always matches an empty string and is therefore useless. I
would try something like

@bigstring.scan(/~\^LNK:[^\t\r\n]*[\t\r\n]+?[^\t\r\n]*\^~/m)

I have not tested this.

Regards, Morton

Justin Collins

8/3/2006 11:00:00 PM

Morton Goldberg wrote:
> I'm not exactly an expert on regexes, to say the least, but I think .*?
> always matches an empty string and is therefore useless. I would try
> something like
>
> @bigstring.scan(/~\^LNK:[^\t\r\n]*[\t\r\n]+?[^\t\r\n]*\^~/m)
>
> I have not tested this.
>
> Regards, Morton

No, that's not true. .*? will match whatever it needs to in order to
get to the next item:

irb(main):001:0> "asidjoaisdj".match(/.*?j/)[0]
=> "asidj"
irb(main):002:0> "asidjoaisdj".match(/.*?d/)[0]
=> "asid"
irb(main):003:0> "asidjoaisdj".match(/.*?sdj/)[0]
=> "asidjoaisdj"

Therefore, it should work as Wes expects.
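
To make the difference concrete, here is a small sketch comparing the greedy and non-greedy forms on the same sample string used above. The variable names are only for illustration. It also shows the one case where .*? really does match the empty string: when nothing follows it in the pattern.

```ruby
s = "asidjoaisdj"

# Greedy: .* takes as much as it can, then backs off to the LAST "j".
greedy = s[/.*j/]

# Non-greedy: .*? takes as little as it can, so it stops at the FIRST "j".
lazy = s[/.*?j/]

# With nothing after it, .*? is satisfied by matching nothing at all.
alone = s[/.*?/]

puts greedy.inspect  # => "asidjoaisdj"
puts lazy.inspect    # => "asidj"
puts alone.inspect   # => ""
```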

-Justin

Daniel Martin

8/4/2006 2:29:00 PM

Wes Gamble <weyus@att.net> writes:

> If I do /~\^LNK:.*?[\t\r\n]+?.*?\^~/m - that should pick up what I want,
> correct?

Almost.

The problem is that with this text:

a = "~^LNK:foo^~\n\n~^LNK:bar^~"

You get a match of the whole text:

irb(main):009:0> a.scan(/~\^LNK:.*?[\t\r\n]+?.*?\^~/m)
=> ["~^LNK:foo^~\n\n~^LNK:bar^~"]

where you obviously wanted to get no matches, since neither pair has a
tab, carriage return, or newline between its delimiters.

So, here's what I suggest:

/~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m

Read that as:

'~^LNK:' followed by zero or more of:
    Some character that isn't \t, \r, \n, or '^', OR
    A '^' character that isn't followed by a '~'
Then a \t, \r, or \n character.
Then whatever is the minimum other characters necessary to get to ^~.
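
As a quick check of the reading above, this sketch runs the suggested regex against three sample strings. The constant name and the sample strings are made up for illustration; only the regex itself comes from this post.

```ruby
LNK_WITH_WS = /~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m

# No tab/CR/newline inside either pair, only between the pairs.
no_ws = "~^LNK:foo^~\n\n~^LNK:bar^~"

# The second pair contains a tab.
with_ws = "~^LNK:foo^~\n\n~^LNK:ba\tr^~"

# A lone '^' inside the pair must not be mistaken for the end sequence.
caret = "~^LNK:a^b\nc^~ tail"

r1 = no_ws.scan(LNK_WITH_WS)
r2 = with_ws.scan(LNK_WITH_WS)
r3 = caret.scan(LNK_WITH_WS)

puts r1.inspect  # => []
puts r2.inspect  # => ["~^LNK:ba\tr^~"]
puts r3.inspect  # => ["~^LNK:a^b\nc^~"]
```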

For these "containing at least one of" type problems, I often find it
useful to write the regular expression as:

  begin sequence                                      ( ~\^LNK: )
  zero or more characters with none of what we want   ( (?:[^\t\r\n^]|\^(?!~))* )
  one of what we want                                 ( [\t\r\n] )
  .*?                                                 ( .*? )
  end sequence                                        ( \^~ )

For the related "at least n of" problem, (where n > 1), I do this:

  begin sequence
  (?:
    zero or more characters with none of what we want
    one of what we want
  ){n}
  .*?
  end sequence
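
A minimal sketch of that "at least n of" template, applied to the same delimiters. The helper name lnk_with_at_least is hypothetical, and the sample strings are invented for illustration.

```ruby
# Builds a regex for a ~^LNK: ... ^~ pair containing at least n
# tab/CR/newline characters. (Hypothetical helper, not from the thread.)
def lnk_with_at_least(n)
  /~\^LNK:(?:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n]){#{n}}.*?\^~/m
end

one_ws  = "~^LNK:a\tb^~"
two_ws  = "~^LNK:a\tb\nc^~"
spread  = "~^LNK:a\tb^~\n~^LNK:c^~"  # second whitespace is outside the pair

at_least_two = lnk_with_at_least(2)

puts one_ws.scan(at_least_two).inspect           # => []
puts two_ws.scan(at_least_two).inspect           # => ["~^LNK:a\tb\nc^~"]
puts spread.scan(at_least_two).inspect           # => []
puts spread.scan(lnk_with_at_least(1)).inspect   # => ["~^LNK:a\tb^~"]
```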

The only tricky part is inside the "none of what we want" chunk, where
you have to take care that the "none of what we want" chunk can't
swallow up your end sequence. (Depending on what you want and what
your end sequence is, you also need to be careful that the "one of
what we want" part can't swallow part of your end sequence)

Sometimes it's easier to just write a regular expression that gets
more matches than you want, and then throw away excess matches in
code:

lnk_regex = /~\^LNK:.*?\^~/m   # /m so that . also matches newlines
text.scan(lnk_regex) { |m|
  # with no capture groups, scan yields each match as a String
  next unless m =~ /[\t\r\n]/
  ...
}

That can often be more readable too. Depending on your data, however,
it may be much, much slower than using a regular expression that finds
only what you need to begin with.
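
For a runnable version of the match-then-filter idea, something like the following should work. The sample text and variable names are invented for illustration. Note that when the pattern has no capture groups, scan returns each match as a plain String, so the filter tests the string itself.

```ruby
text = "~^LNK:foo^~\n\n~^LNK:ba\nr^~ and ~^LNK:b\taz^~"

# Loose pattern: every delimiter pair, whether or not it contains whitespace.
lnk_regex = /~\^LNK:.*?\^~/m

all_links = text.scan(lnk_regex)

# Keep only the pairs that contain a tab, carriage return, or newline.
with_ws = all_links.select { |m| m =~ /[\t\r\n]/ }

puts all_links.inspect  # => ["~^LNK:foo^~", "~^LNK:ba\nr^~", "~^LNK:b\taz^~"]
puts with_ws.inspect    # => ["~^LNK:ba\nr^~", "~^LNK:b\taz^~"]
```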

Wes Gamble

8/4/2006 3:07:00 PM

Daniel,

Currently, I have this working using the .*? to match everything since I
am just passing the results into a block that then does a gsub on the
offending characters. Slightly inefficient, but as you pointed out,
much more readable.
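
Roughly, the approach looks like the sketch below. The method name and sample input are made up for illustration, and it assumes the offending characters should simply be deleted rather than replaced with something else.

```ruby
# Removes tabs, carriage returns, and newlines, but only inside
# ~^LNK: ... ^~ pairs. (Hypothetical helper name.)
def strip_ws_in_links(text)
  text.gsub(/~\^LNK:.*?\^~/m) { |link| link.gsub(/[\t\r\n]/, "") }
end

input  = "keep\tthis\n~^LNK:fo\no\t^~\nand this\n~^LNK:bar^~"
output = strip_ws_in_links(input)

puts output.inspect
# => "keep\tthis\n~^LNK:foo^~\nand this\n~^LNK:bar^~"
```

Whitespace outside the delimiter pairs is left alone, because the outer gsub only hands the matched pairs to the block.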

Thanks for the thorough regex analysis though.

Wes

--
Posted via http://www.ruby-....

comp.lang.ruby
