Asp Forum - Restricted capture in Regexp

benjohn

12/13/2006 9:24:00 AM

Is there a regexp feature that lets me require something to be present
in the input string for the regexp to match, but for that to not become
captured as part of the match?

I want this so that I can scan and gsub on a string of code and replace
variables. Matching just variables requires looking at the context
around them, but if I capture this, I replace the context too.

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/

but using that on "exp(x)" will match (and replace) "(x)", which I don't
want at all.
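A quick way to see the problem is to run the pattern as it stands. This is only a minimal sketch; the replacement string "1" is an arbitrary placeholder, not something from the original code:

```ruby
# The context-consuming pattern from the question.
pattern = /(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/

matched  = "exp(x)"[pattern]            # the whole match, context included
replaced = "exp(x)".gsub(pattern, "1")  # the parentheses are replaced too

puts matched   # (x)
puts replaced  # exp1
```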

Cheers,
Benjohn

15 Answers

Paul Lutus

12/13/2006 9:54:00 AM

benjohn@fysh.org wrote:

> Is there a regexp feature that lets me require something to be present
> in the input string for the regexp to match, but for that to not become
> captured as part of the match?

Neither yes nor no, because of how you have worded your question. See below.

> I want this so that I can scan and gsub on a string of code and replace
> variables. Matching just variables requires looking at the context
> around them, but if I capture this, I replace the context too.
>
> Eg, to scan for variables called x or y, I might use:
> /(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/
>
> but using that on "exp(x)" will match (and replace) "(x)", which I don't
> want at all.

There are a number of ways to accomplish this. The simplest is to put the
part you want to preserve in parentheses, and refer to it in the
replacement.

Like this:

data.sub!(%r{(^|[^a-zA-Z])([xy])([^a-zA-Z]|$)},"\\1\\2\\3")

Notice about this example that the [xy] character class is now captured and
used as part of the replacement, so its original value is preserved.

Using this approach, you preserve the parts you don't want to replace, and
replace the parts you do. In the above example, everything is preserved,
but it is just meant to show the pattern.
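To make the pattern above do real work, the middle group can be swapped for a new value while groups 1 and 3 are put back. The sketch below assumes "z" as a stand-in replacement, and it also shows one caveat of consuming the context: two variables separated by a single character cannot both match.

```ruby
pattern = %r{(^|[^a-zA-Z])([xy])([^a-zA-Z]|$)}

# Keep groups 1 and 3 (the context) and replace only group 2.
single = "exp(x)".gsub(pattern, '\1z\3')

# Caveat: the "+" is consumed as the trailing context of x, so it is no
# longer available as the leading context of y, and y is left untouched.
adjacent = "x+y".gsub(pattern, '\1z\3')

puts single    # exp(z)
puts adjacent  # z+y
```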

--
Paul Lutus
http://www.ara...

Bertram Scharpf

12/13/2006 10:11:00 AM

Hi Benjohn,

On Wednesday, 13 Dec 2006, 18:24:08 +0900, benjohn@fysh.org wrote:
> Eg, to scan for variables called x or y, I might use:
> /(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/
>
> but using that on "exp(x)" will match (and replace) "(x)", which I don't
> want at all.

/\b[xy]\b/

The \b pattern (word boundary) is zero-width: it tests the context to the
left without consuming it, just as the ^ pattern does.
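A short sketch of the \b approach follows, again with "z" as an arbitrary replacement. Note that \b treats digits and underscores as word characters, so it is slightly stricter than the [^a-zA-Z] context in the original pattern:

```ruby
word = /\b[xy]\b/

in_call    = "exp(x)".gsub(word, "z")  # the x inside "exp" is left alone
adjacent   = "x+y".gsub(word, "z")     # nothing is consumed, so both match
with_digit = "x1+x".gsub(word, "z")    # "x1" is one word, so its x is skipped

puts in_call     # exp(z)
puts adjacent    # z+z
puts with_digit  # x1+z
```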

I would appreciate a general look-behind construct: the leftward
counterpart of (?=re), which is non-consuming to the right.

Bertram

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-...

benjohn

12/13/2006 10:24:00 AM

> benjohn@fysh.org wrote:
>
>> Is there a regexp feature that lets me require something to be present
>> in the input string for the regexp to match, but for that to not
>> become
>> captured as part of the match?
>
> Neither yes nor no, because of how you have worded your question. See
> below.
>
>> I want this so that I can scan and gsub on a string of code and
>> replace
>> variables. Matching just variables requires looking at the context
>> around them, but if I capture this, I replace the context too.
>>
>> Eg, to scan for variables called x or y, I might use:
>> /(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/
>>
>> but using that on "exp(x)" will match (and replace) "(x)", which I
>> don't
>> want at all.
>
> There are a number of ways to accomplish this. The simplest is to put
> the
> part you want to preserve in parentheses, and refer to it in the
> replacement.
>
> Like this:
>
> data.sub!(%r{(^|[^a-zA-Z])([xy])([^a-zA-Z]|$)},"\\1\\2\\3")
>
> Notice about this example that the [xy] character class is now captured
> and
> used as part of the replacement, so its original value is preserved.
>
> Using this approach, you preserve the parts you don't want to replace,
> and
> replace the parts you do. In the above example, everything is preserved,
> but it is just meant to show the pattern.

Hi Paul,

thanks for the reply. I know I can do this, but it means that the
substitution ("\\1\\2\\3") has to be aware of the composition of the
regular expression. The Regexp is no longer a neat little machine that
only grabs things to replace. It's now grabbing the packaging around the
thing to replace too, so you've got to be aware of this in writing the
substitution.

Cheers,
Benjohn

benjohn

12/13/2006 10:28:00 AM

> Hi Benjohn,
>
> On Wednesday, 13 Dec 2006, 18:24:08 +0900, benjohn@fysh.org wrote:
>> Eg, to scan for variables called x or y, I might use:
>> /(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/
>>
>> but using that on "exp(x)" will match (and replace) "(x)", which I
>> don't
>> want at all.
>
> /\b[xy]\b/
>
> The \b pattern (word boundary) is zero-width: it tests the context to the
> left without consuming it, just as the ^ pattern does.

This seems like the best approach in this case, as it's a good enough
way to find variables. It does break down in the complex case though.

> I would appreciate a general look-behind construct: the leftward
> counterpart of (?=re), which is non-consuming to the right.

The book I'm reading (O'Reilly pocket reference) hints at the lookaround
constructs being:

(?=...) - look ahead.
(?!...) - negated look ahead.
(?<=...) - look behind.
(?<!...) - negated look behind.

So perhaps one of those is what you want?
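As a rough illustration of those constructs, the original pattern can be rewritten so that the context is required but never part of the match. This sketch assumes Ruby 1.9 or later, since look-behind is not available in the 1.8 engine, and uses "z" as an arbitrary replacement:

```ruby
# Negative look-behind and look-ahead: "no letter on either side".
variable = /(?<![a-zA-Z])[xy](?![a-zA-Z])/

whole_match = "exp(x)"[variable]          # only the variable itself
replaced    = "exp(x)".gsub(variable, "z")
adjacent    = "x+y".gsub(variable, "z")   # context is not consumed

puts whole_match  # x
puts replaced     # exp(z)
puts adjacent     # z+z
```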

Paul Lutus

12/13/2006 10:45:00 AM

benjohn@fysh.org wrote:

/ ...

> thanks for the reply. I know I can do this, but it means that the
> substitution ("\\1\\2\\3") has to be aware of the composition of the
> regular expression.

Yes, that is true for all regular expressions.

> The Regexp is no longer a neat little machine that
> only grabs things to replace. It's now grabbing the packaging around the
> thing to replace too, so you've got to be aware of this in writing the
> substitution.

Yes, but this cannot be avoided. You have two choices for examined text that
surrounds the area to be modified -- you can capture it while examining it,
and use the captured text in the replacement, or you can use non-capturing
references:

(?=non-captured text)

But the two alternatives work much the same way -- they examine text that is
preserved as part of the overall regular expression. All that changes
is /how/ the text is preserved.

So, to move ahead, please post a specific example of what you need. Post an
example of the original string and the desired replacement.

It is scarcely possible to describe in prose what one wants from a regular
expression. It /is/ possible to take a first step by posting an example of
original text, and replacement text. Maybe we should try that.

--
Paul Lutus
http://www.ara...

benjohn

12/13/2006 12:18:00 PM

> benjohn@fysh.org wrote:
>
> / ...
>
>> thanks for the reply. I know I can do this, but it means that the
>> substitution ("\\1\\2\\3") has to be aware of the composition of the
>> regular expression.
>
> Yes, that is true for all regular expressions.
>
>> The Regexp is no longer a neat little machine that
>> only grabs things to replace. It's now grabbing the packaging around
>> the
>> thing to replace too, so you've got to be aware of this in writing the
>> substitution.
>
> Yes, but this cannot be avoided. You have two choices for examined text
> that
> surrounds the area to be modified -- you can capture it while examining
> it,
> and use the captured text in the replacement, or you can use
> non-capturing
> references:
>
> (?=non-captured text)

I think this may be what I should use. Also, the suggestion of using word
edge tokens works for the specific case.

>
> But the two alternatives work much the same way -- they examine text
> that is
> preserved as part of the overall regular expression. All that changes
> is /how/ the text is preserved.
>
> So, to move ahead, please post a specific example of what you need. Post
> an
> example of the original string and the desired replacement.

:) Well, I have a solution for the specific case. That's not what I'm
getting at though. I'm trying to find out if regexp allow me to do
something more general. I want to do this (sorry, I don't have a ruby to
hand):

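class CodeFragment
  attr_accessor :code_fragment

  # Pattern for the variable names this fragment may contain.
  def variables_regexp
    /\b[xyz]\b/
  end

  # Sorted, de-duplicated list of the variables actually used.
  def utilised_variables
    code_fragment.scan(variables_regexp).uniq.sort
  end

  # Replace each variable with its value from the substitutes hash.
  def output_substitution(substitutes)
    code_fragment.gsub(variables_regexp) do |v|
      # v is the whole matched string, so it is the hash key itself
      substitutes[v].to_s
    end
  end
end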

cf = CodeFragment.new
cf.code_fragment = "sin(x+y)"
puts cf.output_substitution({'x'=>1, 'y'=>2})

should give "sin(1+2)"

What I want is for the thing that provides the regular expression not to
need to know about the function that is using it, and for the functions
that use the regular expression not to need to know about the expression
provided.

> regular
> expression. It /is/ possible to take a first step by posting an example
> of
> original text, and replacement text. Maybe we should try that.

Thank you for your help here.

I'm not trying to solve a single problem though, I'm trying to
understand what kinds of problem I can solve.

I want something that acts as an abstract machine for finding things in
a string (in this case variables, but the rules could be more complex).
One should be able to use this machine without knowing what it finds, or
how it finds. All I should need to know is that it finds things. I'm
trying to understand if regexps are able to do this - to provide this
separation. Perhaps they don't, which is fine. I'd just like to know if
they do or not, or if they do a bit, how much.
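One way to picture the separation being asked for is a small wrapper that only ever hands whole matches to its caller. This is a sketch under assumptions: the Finder class and its method names are hypothetical, and the look-behind pattern needs Ruby 1.9 or later.

```ruby
# Hypothetical wrapper; the class and method names are illustrative only.
class Finder
  def initialize(pattern)
    @pattern = pattern
  end

  # Every distinct thing found, as whole-match strings.
  def find_all(text)
    found = []
    text.scan(@pattern) { found << $~[0] }  # $~[0] ignores any groups
    found.uniq.sort
  end

  # Replace each thing found with whatever the caller's block returns.
  def replace_all(text)
    text.gsub(@pattern) { |whole| yield(whole).to_s }
  end
end

variables = Finder.new(/(?<![a-zA-Z])[xyz](?![a-zA-Z])/)
values = { 'x' => 1, 'y' => 2 }

used = variables.find_all("sin(x+y)")
substituted = variables.replace_all("sin(x+y)") { |name| values[name] }

p used            # ["x", "y"]
puts substituted  # sin(1+2)
```

The caller never sees capture groups or context, so the pattern can change without touching the code that uses it, provided the pattern keeps its context in zero-width assertions.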

Thanks,
Benjohn

Simon Strandgaard

12/13/2006 2:37:00 PM

On 12/13/06, benjohn@fysh.org <benjohn@fysh.org> wrote:
[snip]
> cf = CodeFragment.new
> cf.code_fragment = "sin(x+y)"
> puts cf.output_substitution({'x'=>1, 'y'=>2})
>
> should give "sin(1+2)"
[snip]

prompt> cat a.rb
s = "sin(x+y)"
h = {
  'x' => '1',
  'y' => '2',
}
h.each do |pattern, replacement|
  r = Regexp.new('\b' + Regexp.escape(pattern) + '\b')
  s.gsub!(r) { replacement }
end
p s

prompt> ruby a.rb
"sin(1+2)"
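One thing to watch with a gsub per key is that a later pass can rewrite the output of an earlier one. The sketch below uses a made-up swap of x and y to show the difference between that and a single pass built with Regexp.union:

```ruby
h = { 'x' => 'y', 'y' => 'x' }   # swap the two variables

# One gsub per key: the second pass rewrites the output of the first.
sequential = "sin(x+y)"
h.each do |name, replacement|
  sequential = sequential.gsub(/\b#{Regexp.escape(name)}\b/) { replacement }
end

# One pass over the string: every match is looked up exactly once.
any_name = /\b(?:#{Regexp.union(h.keys).source})\b/
single_pass = "sin(x+y)".gsub(any_name) { |name| h[name] }

puts sequential   # sin(x+x)
puts single_pass  # sin(y+x)
```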

--
Simon Strandgaard
http://opc...

Simon Strandgaard

12/13/2006 2:45:00 PM

On 12/13/06, benjohn@fysh.org <benjohn@fysh.org> wrote:
[snip]
> I want something that acts as an abstract machine for finding things in
> a string (in this case variables, but the rules could be more complex).
> One should be able to use this machine without knowing what it finds, or
> how it finds. All I should need to know is that it finds things. I'm
> trying to understand if regexps are able to do this - to provide this
> separation. Perhaps they don't, which is fine. I'd just like to know if
> they do or not, or if they do a bit, how much.

In a language like Ruby, it's not possible to distinguish between
a variable name and a method name by just looking at the name.
A regexp just looks at the name.

If you want to replace a variable-name then you need to
parse the code.
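A small illustration of the limitation, using a made-up one-line source string. The second half uses Ripper.lex from the standard library (Ruby 1.9 and later) purely as a sketch: a lexer separates identifiers from string contents, but it still cannot tell a local variable from a method call, which takes a full parse.

```ruby
require 'ripper'

source = 'puts "x" + x'

# A regexp sees only characters, so the x inside the string literal
# is rewritten along with the variable.
by_regexp = source.gsub(/\bx\b/, "1")

# A lexer at least separates identifiers from string contents.
by_tokens = Ripper.lex(source).map { |_pos, type, text|
  type == :on_ident && text == "x" ? "1" : text
}.join

puts by_regexp  # puts "1" + 1
puts by_tokens  # puts "x" + 1
```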

--
Simon Strandgaard
http://opc...

Paul Lutus

12/13/2006 5:34:00 PM

benjohn@fysh.org wrote:

>> benjohn@fysh.org wrote:
>>
>> / ...
>>
>>> thanks for the reply. I know I can do this, but it means that the
>>> substitution ("\\1\\2\\3") has to be aware of the composition of the
>>> regular expression.
>>
>> Yes, that is true for all regular expressions.
>>
>>> The Regexp is no longer a neat little machine that
>>> only grabs things to replace. It's now grabbing the packaging around
>>> the
>>> thing to replace too, so you've got to be aware of this in writing the
>>> substitution.
>>
>> Yes, but this cannot be avoided. You have two choices for examined text
>> that
>> surrounds the area to be modified -- you can capture it while examining
>> it,
>> and use the captured text in the replacement, or you can use
>> non-capturing
>> references:
>>
>> (?=non-captured text)
>
> I think this may be what I should use. Also, the suggestion of using word
> edge tokens works for the specific case.
>
>>
>> But the two alternatives work much the same way -- they examine text
>> that is
>> preserved as part of the overall regular expression. All that changes
>> is /how/ the text is preserved.
>>
>> So, to move ahead, please post a specific example of what you need. Post
>> an
>> example of the original string and the desired replacement.
>
> :) Well, I have a solution for the specific case. That's not what I'm
> getting at though. I'm trying to find out if regexp allow me to do
> something more general. I want to do this (sorry, I don't have a ruby to
> hand):
>
> class CodeFragment
> attr_accessor :code_fragment
>
> def variables_regexp
> /\b[xyz]\b/
> end
>
> def utilised_variables
> code_fragment.scan(variables_regexp).uniq.sort
> end
>
> def output_substitution(substitutes)
> code_fragment.gsub(variables_regexp) do |v|
> substitutes[v[0]]
> end
> end
> end
>
> cf = CodeFragment.new
> cf.code_fragment = "sin(x+y)"
> puts cf.output_substitution({'x'=>1, 'y'=>2})
>
> should give "sin(1+2)"
>
> What I want is for the thing that provides the regular expression to not
> need to know about the function that is using it; and for the functions
> that uses the regular expression to not know about the expression
> provided.
>
>> regular
>> expression. It /is/ possible to take a first step by posting an example
>> of
>> original text, and replacement text. Maybe we should try that.
>
> Thank you for your help here.
>
> I'm not trying to solve a single problem though, I'm trying to
> understand what kinds of problem I can solve.
>
> I want something that acts as an abstract machine for finding things in
> a string (in this case variables, but the rules could be more complex).
> One should be able to use this machine without knowing what it finds, or
> how it finds. All I should need to know is that it finds things. I'm
> trying to understand if regexps are able to do this - to provide this
> separation. Perhaps they don't, which is fine. I'd just like to know if
> they do or not, or if they do a bit, how much.

Again, your prose description is not precise enough for a reader to know
exactly what you want, which is why we have such things as computer
languages and mathematics. But one can offer educated guesses.

Here is a function that doesn't know in advance what will be sought, it
simply and blindly carries out a certain kind of filtering based on
caller-provided strings:

def get_text_between_tags(data, tag)
  return data.scan(%r{<#{tag}>(.*?)</#{tag}>})
end

If I call this function with a set of HTML data in "data" (containing an
HTML page) and a tag string like "td", this function will return an array
with one entry for each pair of <td> ... </td> tags in the data string.
Because the pattern has a capture group, each entry is itself a
one-element array holding the text between the tags.

Note that this function will accept any data string whatsoever, and it will
also accept any search tag whatsoever.
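A usage sketch of the function above, with two small adjustments that are assumptions of this example and not part of the original: the tag is passed through Regexp.escape so that it is matched literally, and the result is flattened into a plain array of strings. The /m flag lets the text span several lines.

```ruby
def get_text_between_tags(data, tag)
  t = Regexp.escape(tag)  # treat the tag name as literal text
  data.scan(%r{<#{t}>(.*?)</#{t}>}m).flatten
end

html = "<tr><td>a</td><td>b</td></tr>"
cells = get_text_between_tags(html, "td")

p cells  # ["a", "b"]
```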

Is this what you mean? Can you extrapolate this way of approaching the
problem to solve your own?

--
Paul Lutus
http://www.ara...

David Vallner

12/13/2006 8:49:00 PM

benjohn@fysh.org wrote:
> The book I'm reading (O'Reilly pocket reference) hints at the lookaround
> constructs being:
>
> (?=...) - look ahead.
> (?!...) - negated look ahead.

The following two aren't supported in the current Ruby regexp engine;
they are supported in the one that Ruby 1.9 and later will use.

> (?<=...) - look behind.
> (?<!...) - negated look behind.
>
> So perhaps one of those is what you want?
>

Either way, it's possible to emulate positive lookbehinds by capturing
what would be the pre-match and putting it into the replacement:

string.sub(/(some lookbehind pattern)(what you're looking for)/) {
  $1 + replacement_of($2)
}

instead of:

string.sub(/(?<=some lookbehind pattern)what you're looking for/) {
  replacement_of($~.to_s)
}

and kludge negative lookbehinds by instead enumerating all the patterns
that would match in a positive one. Native lookbehinds just make the
pattern (sometimes much) more elegant in most cases.
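A concrete version of the two forms above, as a sketch with a made-up input string: the task is to double a number that follows a dollar sign. The native form assumes Ruby 1.9 or later.

```ruby
text = 'price: $5'

# Emulated: capture the would-be look-behind and put it back.
emulated = text.sub(/(\$)(\d+)/) { $1 + ($2.to_i * 2).to_s }

# Native: the "$" is required but never part of the match.
native = text.sub(/(?<=\$)\d+/) { ($~[0].to_i * 2).to_s }

puts emulated  # price: $10
puts native    # price: $10
```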

David Vallner

comp.lang.ruby
