Daniel Martin
8/4/2006 2:29:00 PM
Wes Gamble <weyus@att.net> writes:
> If I do /~\^LNK:.*?[\t\r\n]+?.*?\^~/m - that should pick up what I want,
> correct?
Almost.
The problem is that with this text:
a = "~^LNK:foo^~\n\n~^LNK:bar^~"
You get a match of the whole text:
irb(main):009:0> a.scan(/~\^LNK:.*?[\t\r\n]+?.*?\^~/m)
=> ["~^LNK:foo^~\n\n~^LNK:bar^~"]
Where you obviously wanted to get no matches.
So, here's what I suggest:
/~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m
Read that as:
  '~^LNK:' followed by zero or more of:
      Some character that isn't \t, \r, \n, or '^', OR
      A '^' character that isn't followed by a '~'
  Then a \t, \r, or \n character.
  Then the minimum number of other characters needed to get to ^~.
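A quick way to check that reading is to run the pattern against the problem text and against a link that really does contain a newline. This is only a sketch: the string b and the variable names are made up for illustration.

```ruby
# Suggested pattern from above.
lnk = /~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m

# The problem text: two single-line links separated by blank lines.
a = "~^LNK:foo^~\n\n~^LNK:bar^~"
no_matches = a.scan(lnk)

# Hypothetical text: one single-line link, then one containing a newline.
b = "~^LNK:foo^~ and ~^LNK:bar\nbaz^~"
one_match = b.scan(lnk)

p no_matches   # => []
p one_match    # => ["~^LNK:bar\nbaz^~"]
```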
For these "containing at least one of" type problems, I often find it
useful to write the regular expression as:
  begin sequence ( ~\^LNK: )
  zero or more characters with none of what we want
      ( (?:[^\t\r\n^]|\^(?!~))* )
  one of what we want
      ( [\t\r\n] )
  anything, as little as possible ( .*? )
  end sequence ( \^~ )
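If you want that breakdown to live in the code itself, Ruby's /x flag lets you spread the pattern over several lines with comments. The following is a sketch of the same pattern in that layout; the variable names are just for illustration.

```ruby
lnk = /
  ~\^LNK:                      # begin sequence
  (?: [^\t\r\n^] | \^(?!~) )*  # zero or more characters with none of what we want
  [\t\r\n]                     # one of what we want
  .*?                          # anything, as little as possible
  \^~                          # end sequence
/mx

single_line = "~^LNK:foo^~\n\n~^LNK:bar^~"
multi_line  = "~^LNK:bar\nbaz^~"

p single_line.scan(lnk)   # => []
p multi_line.scan(lnk)    # => ["~^LNK:bar\nbaz^~"]
```

Whitespace outside character classes is ignored under /x, so the spacing inside the (?: ... ) group does not change what matches.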
For the related "at least n of" problem (where n > 1), I do this:
  begin sequence
  (?:
      zero or more characters with none of what we want
      one of what we want
  ){n}
  .*?
  end sequence
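As a sketch of that template applied to the LNK markers, the pattern can be built with n interpolated into the repeat count. The helper name lnk_with_at_least and the sample text are hypothetical.

```ruby
# Hypothetical helper: a pattern for links containing at least n
# of the characters \t, \r, or \n.
def lnk_with_at_least(n)
  /~\^LNK:(?:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n]){#{n}}.*?\^~/m
end

two = lnk_with_at_least(2)

# First link has one newline, second link has two.
text = "~^LNK:a\nb^~ ~^LNK:c\nd\ne^~"
found = text.scan(two)

p found   # => ["~^LNK:c\nd\ne^~"]
```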
The only tricky part is inside the "none of what we want" chunk, where
you have to take care that the chunk can't swallow up your end
sequence. (Depending on what you want and what your end sequence is,
you may also need to be careful that the "one of what we want" part
can't swallow part of your end sequence.)
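To see what goes wrong when the "none of what we want" chunk is allowed to swallow the end sequence, compare a careless version with the careful one on the text from earlier. The names careless and careful are only labels for this sketch.

```ruby
text = "~^LNK:foo^~\n\n~^LNK:bar^~"

# Careless: [^\t\r\n]* is free to run over the "^~" that ends the first link.
careless = /~\^LNK:[^\t\r\n]*[\t\r\n].*?\^~/m

# Careful: a "^" is only consumed when it is not the start of "^~".
careful = /~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m

p text.scan(careless)   # => ["~^LNK:foo^~\n\n~^LNK:bar^~"]
p text.scan(careful)    # => []
```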
Sometimes it's easier to just write a regular expression that gets
more matches than you want, and then throw away excess matches in
code:
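# /m lets . match newlines, so a link may span several lines
lnk_regex = /~\^LNK:.*?\^~/m
text.scan(lnk_regex) { |m|
  # m is the whole matched string; skip links with no \t, \r, or \n
  next unless m =~ /[\t\r\n]/
  ...
}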
That can often be more readable too. Depending on your data, however,
it may be much, much slower than using a regular expression that finds
only what you need to begin with.
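Which approach is faster depends on the data, so it is worth measuring on something that looks like yours. Here is a rough sketch using the standard Benchmark module; the sample text is invented, and the timings will vary by machine and input.

```ruby
require 'benchmark'

filter_regex = /~\^LNK:.*?\^~/m
direct_regex = /~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m

# Hypothetical data: many single-line links and a few multi-line ones.
text = ("~^LNK:foo^~ " * 1000 + "~^LNK:bar\nbaz^~ ") * 20

filtered = nil
direct = nil

# Approach 1: match every link, then throw away the ones without \t, \r, or \n.
t_filter = Benchmark.realtime do
  filtered = text.scan(filter_regex).select { |m| m =~ /[\t\r\n]/ }
end

# Approach 2: let the regular expression find only the wanted links.
t_direct = Benchmark.realtime do
  direct = text.scan(direct_regex)
end

puts "filter in code: #{filtered.size} matches in #{t_filter}s"
puts "single regex:   #{direct.size} matches in #{t_direct}s"
```

On this sample both approaches return the same 20 links; only the time taken differs.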