Asp Forum - gsub bug? - comp.lang.ruby

Shea Martin

5/21/2006 3:08:00 PM

I want to do this:
puts str

and have it print to the screen
C:\hardcoded\path, C:\\double\\escape\\path

To start with, I have
double_escape_path = "C:\\\\double\\\\escape\\\\path"
str = "C:\\hardcoded\\path, DOUBLE_ESCAPE_HERE"

Now do a gsub!
str.gsub!( 'DOUBLE_ESCAPE_HERE', double_escape_path )

The problem is that gsub also replaces my \\\\ with \\
for some reason.

Is this a bug?

Thanks.

10 Answers

Shea Martin

5/21/2006 3:10:00 PM

Here is a simplified version:

de_dir = "c:\\\\some\\\\dir"
str = "c:\\some\\path, PATTERN"
puts str.gsub( 'PATTERN', de_dir )

It should output
c:\some\path, c:\\some\\dir
but actually outputs
c:\some\path, c:\some\dir

Thanks,

~S

ts

5/21/2006 3:17:00 PM

>>>>> "S" == Shea Martin <shea08@eastlink.ca> writes:

Write it like this

S> puts str.gsub( 'PATTERN', de_dir )

puts str.gsub( 'PATTERN' ) { de_dir }
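
A minimal sketch of the difference between the two forms, reusing the strings from the simplified example. The names with_string and with_block are only there for illustration:

```ruby
# Each \\ in a double-quoted literal becomes one backslash, so de_dir
# holds c:\\some\\dir (two backslashes per separator).
de_dir = "c:\\\\some\\\\dir"
str    = "c:\\some\\path, PATTERN"

# String replacement: gsub scans the replacement for backslash sequences,
# so every remaining \\ collapses once more.
with_string = str.gsub('PATTERN', de_dir)

# Block form: the value returned by the block is inserted verbatim.
with_block = str.gsub('PATTERN') { de_dir }

puts with_string   # c:\some\path, c:\some\dir
puts with_block    # c:\some\path, c:\\some\\dir
```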

--

Guy Decoux

Shea Martin

5/21/2006 11:35:00 PM

ts wrote:
>>>>>> "S" == Shea Martin <shea08@eastlink.ca> writes:
>
> Write it like this
>
> S> puts str.gsub( 'PATTERN', de_dir )
>
> puts str.gsub( 'PATTERN' ) { de_dir }
>
>
Thanks. I thought I had tried that, but I guess not.
BTW, is that a bug or designed behavior?

Thanks,
~S

Robert Klemme

5/22/2006 8:37:00 AM

Shea Martin wrote:
> ts wrote:
>>>>>>> "S" == Shea Martin <shea08@eastlink.ca> writes:
>>
>> Write it like this
>>
>> S> puts str.gsub( 'PATTERN', de_dir )
>>
>> puts str.gsub( 'PATTERN' ) { de_dir }

I recommend to rather double the number of backslashes in de_dir or use
Regexp.escape() because the block form is for dynamic replacements; it's
also slower for static replacements.

>> de_dir = "c:\\\\some\\\\dir"
=> "c:\\\\some\\\\dir"
>> str = "c:\\some\\path, PATTERN"
=> "c:\\some\\path, PATTERN"
>> puts str.gsub( 'PATTERN', de_dir )
c:\some\path, c:\some\dir
=> nil
>> puts str.gsub( 'PATTERN', Regexp.escape( de_dir ) )
c:\some\path, c:\\some\\dir

>> de_dir.gsub! /\\/, '\\\\\\\\'
=> "c:\\\\\\\\some\\\\\\\\dir"
>> puts str.gsub( 'PATTERN', de_dir )
c:\some\path, c:\\some\\dir
=> nil
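
If you need the doubling in more than one place, it could be wrapped in a small helper. This is only a sketch: literal_replacement is a made-up name, not something Ruby provides, and it assumes backslashes are the only characters that need protecting in the replacement.

```ruby
# Hypothetical helper (not part of Ruby): doubles every backslash so that
# the replacement-string pass of sub/gsub turns it back into the original.
def literal_replacement(text)
  # The block form is used for this inner call so its result is taken
  # verbatim and we do not have to count backslashes a second time.
  text.gsub('\\') { '\\\\' }
end

de_dir = 'c:\\\\some\\\\dir'      # holds c:\\some\\dir
str    = 'c:\some\path, PATTERN'

result = str.gsub('PATTERN', literal_replacement(de_dir))
puts result                        # c:\some\path, c:\\some\\dir
```

Unlike Regexp.escape, this touches nothing but backslashes.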

> Thanks. I thought I had tried that, but I guess not.
> BTW, is that a bug or designed behavior?

It's not a bug. People frequently stumble on this. There are simply
several layers of escaping: first there is the escaping for strings,
i.e. you need to escape a backslash in order to get it into a single
quoted or double quoted string. Then the RX engine uses backslash as
escape as well in substitution patterns. The reason why so many people
stumble here is that you can actually have \1 in a string which does not
result in 1 but in \1 which is a bit inconsistent (because \\ results in
\) but convenient for the usual case:

>> puts '\\'
\
=> nil
>> puts '\1'
\1
=> nil

So you can do

>> "foo".gsub /(.)/, '<\1>'
=> "<f><o><o>"

while you should IMHO be doing

>> "foo".gsub /(.)/, '<\\1>'
=> "<f><o><o>"
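
The two layers can be checked separately. A small sketch, with variable names chosen only for this example:

```ruby
# Layer 1: the string literal. In single quotes only \\ and \' are escapes,
# so '\1' and '\\1' are the same two-character string.
one_backslash = '\\'
short_form    = '\1'
long_form     = '\\1'

puts one_backslash.length       # 1
puts short_form.length          # 2
puts short_form == long_form    # true

# Layer 2: the replacement string. There \1 means "first group" and \\
# means one literal backslash.
group_ref   = 'foo'.gsub(/(.)/, '<\1>')
literal_ref = 'foo'.gsub(/(.)/, '<\\\\1>')

puts group_ref                  # <f><o><o>
puts literal_ref                # <\1><\1><\1>
```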

Cheers

robert

Shea Martin

5/22/2006 1:31:00 PM

Robert Klemme wrote:
> Shea Martin wrote:
>> ts wrote:
>>>>>>>> "S" == Shea Martin <shea08@eastlink.ca> writes:
>>>
>>> Write it like this
>>>
>>> S> puts str.gsub( 'PATTERN', de_dir )
>>>
>>> puts str.gsub( 'PATTERN' ) { de_dir }
>
> I recommend to rather double the number of backslashes in de_dir or use
> Regexp.escape() because the block form is for dynamic replacements; it's
> also slower for static replacements.
>

If I do this
Regexp.escape("C:\path\to\file.txt")
It also escapes the '.' dot char.
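
Roughly what I am seeing, as a sketch. I am assuming the path is written in single quotes here so that the backslashes survive the literal:

```ruby
path    = 'C:\path\to\file.txt'   # single quotes keep each backslash as is
escaped = Regexp.escape(path)

puts escaped                       # C:\\path\\to\\file\.txt

# Used as a replacement string, \\ collapses back to \ but the sequence \.
# has no special meaning there and is left alone, so a stray backslash
# ends up in front of the dot.
result = 'PATTERN'.gsub('PATTERN', escaped)
puts result                        # C:\path\to\file\.txt
```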

~S

Mike

5/22/2006 2:56:00 PM

Shea Martin wrote:
> ts wrote:
> >>>>>> "S" == Shea Martin <shea08@eastlink.ca> writes:
> >
> > Write it like this
> >
> > S> puts str.gsub( 'PATTERN', de_dir )
> >
> > puts str.gsub( 'PATTERN' ) { de_dir }
> >
> >
> Thanks. I thought I had tried that, but I guess not.
> BTW, is that a bug or designed behavior?
>
> Thanks,
> ~S

This is covered in the Pickaxe book in the section titled "Backslash
Sequences in the Substitution". According to that discussion, when
using the function call form of gsub, the regular expression engine
makes two passes; using the block form, it only makes one.

Each pass performs '\\' -> '\', so to end up with '\\', you need
to start with '\\\\\\\\' in the function call form.
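
Counting the backslashes at each stage seems to bear this out. A quick sketch (typed, output and block_output are just names picked for the example):

```ruby
typed  = '\\\\\\\\'               # 8 backslashes typed in the source
output = 'X'.sub('X', typed)      # function call form

puts typed.length                 # 4 remain once the literal is parsed
puts output.length                # 2 remain after the replacement pass
puts output                       # \\

# Block form: the block's return value is not parsed a second time.
block_output = 'X'.sub('X') { typed }
puts block_output.length          # 4
```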

I don't know if this is a bug, a feature, or a language curiosity.

ts

5/22/2006 3:07:00 PM

>>>>> "M" == Mike <mike@clove.com> writes:

M> I don't know if this is a bug, a feature, or a language curiosity.

Well, it gives you the possibility to write

moulon% ruby -e "p 'PATTERN'.sub(/PAT(TER)N/, 'the first group is : \1')"
"the first group is : TER"
moulon%

This is why Ruby needs to parse the string to find \ sequences.
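
A few more of the sequences the replacement string understands, as a sketch. The named group mid is made up for the example, and the \k<name> form assumes a Ruby version with named groups (1.9 or later):

```ruby
# \1 refers to the first group.
first = 'PATTERN'.sub(/PAT(TER)N/, 'the first group is : \1')

# \0 refers to the whole match.
whole = 'PATTERN'.sub(/PAT(TER)N/, '[\0]')

# \k<name> refers to a named group.
named = 'PATTERN'.sub(/PAT(?<mid>TER)N/, 'named group: \k<mid>')

puts first   # the first group is : TER
puts whole   # [PATTERN]
puts named   # named group: TER
```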

--

Guy Decoux

Robert Klemme

5/22/2006 8:59:00 PM

Shea Martin <shea08@eastlink.ca> wrote:
> Robert Klemme wrote:
>> Shea Martin wrote:
>>> ts wrote:
>>>>>>>>> "S" == Shea Martin <shea08@eastlink.ca> writes:
>>>>
>>>> Write it like this
>>>>
>>>>> puts str.gsub( 'PATTERN', de_dir )
>>>>
>>>> puts str.gsub( 'PATTERN' ) { de_dir }
>>
>> I recommend to rather double the number of backslashes in de_dir or
>> use Regexp.escape() because the block form is for dynamic
>> replacements; it's also slower for static replacements.
>>
>
>
> If I do this
> Regexp.escape("C:\path\to\file.txt")
> It also escapes the '.' dot char.

Then use the other approach.

robert

Mike

5/23/2006 1:05:00 PM

Just to expand on this so I understand it:

What is happening is that 'ruby' parses the string when it compiles the
code: this strips one set of \\'s.

After that, if you use the gsub(pattern, replacement) form, the regular
expression engine parses the 'replacement' text: this strips the second
set. It has to do this in order to do regular expression substitutions
in the replacement text.

On the other hand, if you use the block form - gsub(pattern) { block }
- then the regular expression engine does NOT parse the content of the
block. As a result, the block CANNOT contain group substitutions, such
as \1, above.

To confirm, I ran:
ruby -e "p 'PATTERN'.sub(/PAT(TER)N/) { 'the first group is: \1' }"
and it dutifully printed out: "the first group is: \\1" - which is the
appropriate quoting for 'the first group is: \1' when using double
quotes (")

ruby -e "p 'PATTERN'.sub(/PAT(TER)N/) { \"the first group is: \1\" }"
prints out:
"the first group is: \001" - which is how the string was parsed in the
syntax analysis.

So the reason this happens must be that the block is not in the scope
of the sub() function and the regular expression engine never sees it.

This seems like a fairly important characteristic of the language and
can lead to some fairly subtle program errors.

Robert Klemme

5/23/2006 1:23:00 PM

Mike wrote:
> Just to expand on this so I understand it:
>
> What is happening is that 'ruby' parses the string when it compiles the
> code: this strips one set of \\'s.
>
> After that, if you use the gsub(pattern, replacement) form, the regular
> expression engine parses the 'replacement' text: this strips the second
> set. It has to do this in order to do regular expression substitutions
> in the replacement text.

Correct.

> On the other hand, if you use the block form - gsub(pattern) { block }
> - then the regular expression engine does NOT parse the content of the
> block. As a result, the block CANNOT contain group substitutions, such
> as \1, above.

This is wrong (see below).

> To confirm, I ran:
> ruby -e "p 'PATTERN'.sub(/PAT(TER)N/) { 'the first group is: \1' }"
> and it dutifully printed out: "the first group is: \\1" - which is the
> appropriate quoting for 'the first group is: \1' when using double
> quotes (")
>
> ruby -e "p 'PATTERN'.sub(/PAT(TER)N/) { \"the first group is: \1\" }"
> prints out:
> "the first group is: \001" - which is how the string was parsed in the
> syntax analysis.
>
> So the reason this happens must be that the block is not in the scope
> of the sub() function and the regular expression engine never sees it.

No, that's wrong. Scope is not a topic here. You have to think of a
block as an anonymous function. The method you provide the block to can
invoke that function any number of times. With gsub it's the number of
matches, with sub it's at most once.
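
A small sketch that counts the invocations (the counter names are only for illustration):

```ruby
# gsub calls the block once per match.
gsub_calls  = 0
gsub_result = 'a-b-c'.gsub('-') { gsub_calls += 1; '+' }

# sub calls it once, for the first match only.
sub_calls  = 0
sub_result = 'a-b-c'.sub('-') { sub_calls += 1; '+' }

# With no match the block is never called.
miss_calls = 0
'abc'.sub('-') { miss_calls += 1; '+' }

puts "#{gsub_result} (#{gsub_calls} calls)"   # a+b+c (2 calls)
puts "#{sub_result} (#{sub_calls} calls)"     # a+b-c (1 calls)
puts miss_calls                               # 0
```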

> This seems like a fairly important characteristic of the language and
> can lead to some fairly subtle program errors.

You're almost there but not quite. The block is executed (i.e. called)
for every match. The result of the block is used for substitution. The
regexp engine never does any changes to the result of the block before
it does the replacement.

15:19:08 [~]: ruby -e 'c=0; puts "abcd".gsub(/(.)./) {|m| "<#{$1}-#{c+=1}>"}'
<a-1><c-2>

As you can see the variable $1 holds the first group. Same for $2 etc.
When interpolating global variables into a string you can actually
use a shortcut:

15:19:20 [~]: ruby -e 'c=0; puts "abcd".gsub(/(.)./) {|m| "<#$1-#{c+=1}>"}'
<a-1><c-2>
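
To put the two behaviours side by side, here is a short sketch. The input strings are invented for the example:

```ruby
# Groups are available inside the block through $1, $2, ...
swapped = 'john smith'.sub(/(\w+) (\w+)/) { "#{$2}, #{$1}" }

# Backslash sequences in the block's result are NOT interpreted.
literal = 'john smith'.sub(/(\w+) (\w+)/) { '\2, \1' }

# Regexp.last_match gives the same information without global variables.
via_match = 'abcd'.gsub(/(.)./) { Regexp.last_match(1).upcase }

puts swapped     # smith, john
puts literal     # \2, \1
puts via_match   # AC
```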

Hope it's clearer now.

Regards

robert
