Asp Forum - cannot remove multiple spaces

Tom Cloyd

2/7/2009 1:31:00 PM

I'm baffled by this strange outcome - I cannot reduce multiple spaces
from a text file. This isn't just a regex problem, somehow. I'm failing
to grasp something essential, but don't know what it is. All help
appreciated, as usual!

Here is a demo of my problem, in which I try two different ways, and
both fail:

=== code ===
# h2t.rb

def main
  # conversion table spec
  conv = [
    [ '<h1>', 'h1. ' ], [ '<h2>', 'h2. ' ], [ '<h3>', 'h3. ' ],
    [ '<h4>', 'h4. ' ], [ '<h5>', 'h5. ' ], [ '<h6>', 'h6. ' ],
    [ /<\/h\d>/, '' ],
    [ " +", ' ' ]] # <= this last array element should do the trick, but doesn't

  data = open( 'h2t-in2.txt', 'r' ) { |f| ( f.readlines( data )).to_s }

  conv.each do |i|
    data.gsub!( i[0], i[1] )
  end
  data.squeeze(' ') # <= putting this here was sheer desperation, but even THIS fails

  open( "h2t-out.txt", "w" ) { |f| f.write( data ) }

end

%w(rubygems ruby-debug readline strscan logger fileutils).each{ |lib| require lib }

main

=== input file ===

<h1>Library catalog listing </h1>x

<h3>Library catalog listing </h3>x

<h2>Library catalog listing </h2>x

p(subtitle). A complete listing of all material in the Library

=== output file ===

h1. Library catalog listing x

h3. Library catalog listing x

h2. Library catalog listing x

p(subtitle). A complete listing of all material in the Library

==============

The "x"s in the input file are to show that while the end tags are being
removed the space before them is NOT.

t.

--

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tom Cloyd, MS MA, LMHC - Private practice Psychotherapist
Bellingham, Washington, U.S.A: (360) 920-1226
<< tc@tomcloyd.com >> (email)
<< TomCloyd.com >> (website)
<< sleightmind.wordpress.com >> (mental health weblog)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

15 Answers

Tom Cloyd

2/7/2009 1:42:00 PM

Tom Cloyd wrote:
> conv = [
> [ '<h1>', 'h1. ' ], [ '<h2>', 'h2. ' ], [ '<h3>', 'h3. ' ],
> [ '<h4>', 'h4. ' ], [ '<h5>', 'h5. ' ], [ '<h6>', 'h6. ' ], [
> /<\/h\d>/, '' ],
> [ " +", ' ' ]] # <= this last array element should do the trick, but
> doesn't
Ouch. THIS - [ / +/, ' ' ], substituted for [ " +", ' ' ] above fixes
it. I'm going blind, obviously.
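
A minimal sketch of why that one-character change matters, assuming a made-up sample string (not the real input file): a String pattern passed to gsub is matched literally, while a Regexp pattern is interpreted.

```ruby
# Hypothetical sample text with runs of spaces and one literal " +".
text = "Library  catalog   listing +x"

# " +" as a String matches only a literal space followed by a plus sign.
literal = text.gsub(" +", " ")

# / +/ as a Regexp matches any run of one or more spaces.
collapsed = text.gsub(/ +/, " ")

puts literal    # only the " +" before the x is replaced
puts collapsed  # every run of spaces becomes a single space
```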

t.

David A. Black

2/7/2009 1:49:00 PM

Hi --

On Sat, 7 Feb 2009, Tom Cloyd wrote:

> Tom Cloyd wrote:
> Ouch. THIS - [ / +/, ' ' ], substituted for [ " +", ' ' ] above fixes it. I'm
> going blind, obviously.

Just for fun, here's another way to write the method:

def main
  data = File.read("tom.txt")
  data.gsub!(/<(h[1-6])>/, "\\1. ")
  data.gsub!(/<\/h\d>/, "")
  data.squeeze!(' ')

  open("tom.out", "w") {|f| f.write(data) }
end

I think that does the same thing. Tweak to taste :-)
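
For trying the same three steps without touching any files, here is a rough sketch that works on an in-memory string. The method name headings_to_textile and the sample text are assumptions made up for illustration; they are not part of the original script.

```ruby
# Hypothetical helper: the same three substitutions, applied to a string.
def headings_to_textile(text)
  text.gsub(/<(h[1-6])>/, "\\1. ")  # <h1> becomes "h1. "
      .gsub(/<\/h\d>/, "")          # drop closing heading tags
      .squeeze(" ")                 # collapse runs of spaces
end

# Made-up sample with doubled and tripled spaces.
sample = "<h1>Library  catalog listing </h1>x\n\n<h3>Library catalog   listing</h3>\n"
result = headings_to_textile(sample)
puts result
```

Because squeeze(" ") is given only a space, the blank lines between headings are left alone.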

David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.r...
Coming in 2009: The Well-Grounded Rubyist (http://manning....)

http://www.wis... => Independent, social wishlist management!

Craig Demyanovich

2/7/2009 2:21:00 PM

See comments below.

On Sat, Feb 7, 2009 at 8:31 AM, Tom Cloyd <tomcloyd@comcast.net> wrote:

> def main
> # conversion table spec
> conv = [
> [ '<h1>', 'h1. ' ], [ '<h2>', 'h2. ' ], [ '<h3>', 'h3. ' ],
> [ '<h4>', 'h4. ' ], [ '<h5>', 'h5. ' ], [ '<h6>', 'h6. ' ], [ /<\/h\d>/,
> '' ],
> [ " +", ' ' ]] # <= this last array element should do the trick, but
> doesn't

The last element means replace each occurrence of a space followed by a
literal plus sign with a single space, because a String pattern is
matched literally. I assume that you were trying to write a regular
expression, which would make your last array [/ +/, ' '].

> data = open( 'h2t-in2.txt', 'r' ) { |f| ( f.readlines( data )).to_s }
>
> conv.each do |i|
> data.gsub!( i[0], i[1] )
> end
> data.squeeze(' ') # <= putting this here was sheer desperations, but even
> THIS fails

This does nothing because String#squeeze returns a new string that you don't
capture. Instead of using the array above, you could do

data = data.squeeze(' ')
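
A small sketch of the difference, using a made-up string rather than the real file contents. The in-place variant String#squeeze! is shown as an alternative; note that it returns nil when nothing changed, so its return value should not be chained.

```ruby
data = "h1. Library  catalog   listing".dup  # .dup gives a mutable copy

data.squeeze(" ")                 # returns a new string, which is thrown away
after_plain_call = data.dup       # data still contains the runs of spaces

squeezed = data.squeeze(" ")      # option 1: capture the returned string
data.squeeze!(" ")                # option 2: modify data in place

puts after_plain_call
puts squeezed
puts data
```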

> open( "h2t-out.txt", "w" ) { |f| f.write( data ) }
>
> end
>
> %w(rubygems ruby-debug readline strscan logger fileutils).each{ |lib|
> require lib }
>
> main

Hope that helps.

Regards,
Craig

Tom Cloyd

2/7/2009 8:50:00 PM

David A. Black wrote:
> Just for fun, here's another way to write the method:
>
> def main
> data = File.read("tom.txt")
> data.gsub!(/<(h[1-6])>/, "\\1. ")
> data.gsub!(/<\/h\d>/, "")
> data.squeeze!(' ')
>
> open("tom.out", "w") {|f| f.write(data) }
>
> end
>
> I think that does the same thing. Tweak to taste :-)
>
>
> David
>
That's beautifully economical, and reveals a far better grasp of regex
than I was able to attain last night. However, I'm having trouble with
this line:

data.gsub!(/<(h[1-6])>/, "\\1. ")

It certainly works, but I don't grasp the "\\1. " part. I haven't yet
found anything that might shed light on this magic. How does it retain
the 'h' and whatever digit follows it? It looks somehow like "\\" ==
retain matched alpha, and the "1" does the same for matched digits, but
I really haven't a clue. Can you elucidate just a bit?

Thanks!

Tom


David A. Black

2/7/2009 8:54:00 PM

Hi --

On Sun, 8 Feb 2009, Tom Cloyd wrote:

> That's beautifully economical, and reveals a far better grasp of regex than I
> was able to attain last night. However, I'm having trouble with this line:
>
> data.gsub!(/<(h[1-6])>/, "\\1. ")
>
> It certainly works, but I don't grasp the "\\1. " part. I haven't yet found
> anything that might shed light on this magic. How does it retain the 'h' and
> whatever digit follows it? It looks somehow like "\\" == retain matched
> alpha, and the "1" does the same for matched digits, but I really haven't a
> clue. Can you elucidate just a bit?

The \\1, \\2, etc. in the replacement string are pegged to the
parenthetical captures. "\\1. " means: the first capture (which is h
plus a digit), a period, and a space.

They work in single-quoted strings too, but there they're just \1, \2,
etc. There's some explanation in the ri docs for String#gsub.
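
A short sketch comparing the two quoting styles on a made-up one-line input. The block form using $1 is an extra alternative added here for comparison, not something from the code above.

```ruby
html = "<h1>Library catalog listing</h1>"

# Double quotes: the backslash itself must be escaped, hence "\\1".
double = html.gsub(/<(h[1-6])>/, "\\1. ")

# Single quotes: \1 is not an escape sequence, so it reaches gsub as-is.
single = html.gsub(/<(h[1-6])>/, '\1. ')

# Block form: $1 holds the first capture of the current match.
block = html.gsub(/<(h[1-6])>/) { "#{$1}. " }

puts double
puts single
puts block

# An unescaped "\1" in double quotes is an octal escape (the character
# with code 1), which is why it would not work as a backreference.
puts "\1" == "\x01"
```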

David


Jan-Erik R.

2/7/2009 8:54:00 PM

Tom Cloyd wrote:
> That's beautifully economical, and reveals a far better grasp of regex
> than I was able to attain last night. However, I'm having trouble with
> this line:
>
> data.gsub!(/<(h[1-6])>/, "\\1. ")
>
> It certainly works, but I don't grasp the "\\1. " part. I haven't yet
> found anything that might shed light on this magic. How does it retain
> the 'h' and whatever digit follows it? It looks somehow like "\\" ==
> retain matched alpha, and the "1" does the same for matched digits, but
> I really haven't a clue. Can you elucidate just a bit?
>
> Thanks!
>
> Tom
>
ah...regex! it's easy if you know them =D
the (...) in the regex defines a group.
this group matches the 'h' followed by one of the digits 1, 2, 3, 4, 5
or 6.
in the second parameter, \1 (double backslash because of
double-quote escaping ;) refers to the text matched by the group (h[1-6]).
that's it, nothing magic anymore ;)

Robert Klemme

2/7/2009 8:56:00 PM

On 07.02.2009 21:50, Tom Cloyd wrote:
> That's beautifully economical, and reveals a far better grasp of regex
> than I was able to attain last night. However, I'm having trouble with
> this line:
>
> data.gsub!(/<(h[1-6])>/, "\\1. ")
>
> It certainly works, but I don't grasp the "\\1. " part. I haven't yet
> found anything that might shed light on this magic. How does it retain
> the 'h' and whatever digit follows it? It looks somehow like "\\" ==
> retain matched alpha, and the "1" does the same for matched digits, but
> I really haven't a clue. Can you elucidate just a bit?

The keyword is "capturing groups". Parentheses in the regexp denote groups
of characters which can be referenced later via their numeric index, as
you have seen. You can even use them within the pattern itself to match
repetitions:

/(fo+)\1/ =~ s # will match "fofo", "foofoo", "fooofooo" etc.
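
A quick way to check this, sketched with a few made-up test strings. The \A and \z anchors are an addition here so that the whole string has to be a repetition; without them the pattern also matches inside longer strings such as "fofoo".

```ruby
# \1 inside the pattern must match exactly what group 1 just matched.
pattern = /\A(fo+)\1\z/

results = {}
%w[fofo foofoo fooofooo fofoo foo].each do |s|
  results[s] = pattern.match?(s)
  puts "#{s.inspect}: #{results[s]}"
end
```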

Cheers

robert

William James

2/7/2009 9:02:00 PM

Tom Cloyd wrote:

> I'm baffled by this strange outcome - I cannot reduce multiple spaces
> from a text file. This isn't just a regex problem, somehow. I'm
> failing to grasp something essential, but don't know what it is. All
> help appreciated, as usual!

puts IO.readlines("data2").map { |line|
  line.sub( /<(h\d)>/, '\1. ' ).sub( /<\/h\d>/, "" ).squeeze( " " )
}

--- output ---

h1. Library catalog listing x

h3. Library catalog listing x

h2. Library catalog listing x

p(subtitle). A complete listing of all material in the Library

William James

2/7/2009 9:06:00 PM

William James wrote:

> puts IO.readlines("data2").map{|line|
> line.sub( /<(h\d)>/, '\1. ' ).sub( /<\/h\d>/, "").
> squeeze " " }
>
> --- output ---
>
> h1. Library catalog listing x
>
> h3. Library catalog listing x
>
> h2. Library catalog listing x
>
> p(subtitle). A complete listing of all material in the Library

puts IO.read("data2").gsub( /<(h\d)>/, '\1. ' ).gsub( /<\/h\d>/, "").
squeeze " "

Tom Cloyd

2/7/2009 9:48:00 PM

David, badboy, Robert - thanks to you all for the very clear
explanations. I really couldn't find info about this (yet). It IS
clear, once the explanation's in hand. I have to say that regex is
becoming rather fun, now that I'm getting a little control of it.

I continue to be amazed at the generosity of this list in helping the
real amateurs here move things along. We get that AND we get to listen
in on all sorts of amazing and mysterious discussions of higher order
magic. Pretty cool.

t.


comp.lang.ruby
