Asp Forum - String.scan - catching overlapping patterns with lookahead

Ardanwen

12/16/2004 1:12:00 PM

Hi all.

I wrote a small script to count the number of occurrences of one string
in another string (including somewhat overlapping occurrences). Then I
found out about the .scan method, which speeded up some things, but
unfortunately introduced the 'banana problem' -> ("banana".scan("ana")
returns only one "ana")

(This was the script I had before i found out about scan:)
------
$proteasome.each {|x|
i = VIRUS #virus length
begin
i = virus[0,i + 4].rindex(x)
count_prot += 1 if i != nil
end until i == nil
}
------

proteasome is an array filled with 12 strings ("00010" , "00100" ,
"10110" etc,
virus is a long string of 1's and 0's (~3000 in total).

Initially, I changed the following, which speeded up the whole program
about 25% but suffered from the banana problem (most of the program's
energy goes into the above algorithm)

------
$proteasome.each {|x|
virus.scan(x) {
count_prot += 1
}
}
------

So I changed it into the following, and then optimized a little bit
(first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
doing now)

------
$proteasome.each {|x|
virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
count_prot += 1
}
}
------

The sad thing is, the above is comparable in speed with the script I
had before I found out about the scan method. Did I miss any obvious
optimizations in the scanning/regexp method?

Thanks!,
Boris

6 Answers

Robert Klemme

12/16/2004 1:38:00 PM

"Ardanwen" <bvschmid@gmail.com> schrieb im Newsbeitrag
news:1103202699.721826.143490@f14g2000cwb.googlegroups.com...

> So I changed it into the following, and then optimized a little bit
> (first did virus.scan(/(?=#{x})/ which is a little slower then what I'm
> doing now)
>
> ------
> $proteasome.each {|x|
> virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
> count_prot += 1
> }
> }
> ------
>
> The sad thing is, the above is comparable in speed with the script I
> had before I found out about the scan method. Did I miss any obvious
> optimizations in the scanning/regexp method?

Do you have any metacharacters in the strings contained in $proteasome?
In that case you should use Regexp.escape.

Although it might be a small optimization, you could split your search
string at another position than the first. For example, if you search for
"foofii" then you can split after the second "o" without loosing anything.
In fact you could even skip the lookahead for "foofii" completely, because
if it matches the earliest next match is after the complete string.
However there's a certain overhead involved in finding this position so it
might not be worth it if the strings you search through are short. If you
search through huge piles of data then it might be worthwile.

See also http://en.wikipedia.org/wiki/String_searching...

Kind regards

robert

dblack

12/16/2004 1:55:00 PM

Robert Klemme

12/16/2004 2:39:00 PM

"David A. Black" <dblack@wobblini.net> schrieb im Newsbeitrag
news:Pine.LNX.4.61.0412160543050.23221@wobblini...
> On Thu, 16 Dec 2004, Ardanwen wrote:
>
> > Hi all.
> >
> > I wrote a small script to count the number of occurrences of one
string
> > in another string (including somewhat overlapping occurrences). Then I
> > found out about the .scan method, which speeded up some things, but
> > unfortunately introduced the 'banana problem' -> ("banana".scan("ana")
> > returns only one "ana")
> >
> >
> > (This was the script I had before i found out about scan:)
> > ------
> > $proteasome.each {|x|
> > i = VIRUS #virus length
> > begin
> > i = virus[0,i + 4].rindex(x)
> > count_prot += 1 if i != nil
> > end until i == nil
> > }
> > ------
> >
> > proteasome is an array filled with 12 strings ("00010" , "00100" ,
> > "10110" etc,
> > virus is a long string of 1's and 0's (~3000 in total).
> >
> >
> > Initially, I changed the following, which speeded up the whole program
> > about 25% but suffered from the banana problem (most of the program's
> > energy goes into the above algorithm)
> >
> > ------
> > $proteasome.each {|x|
> > virus.scan(x) {
> > count_prot += 1
> > }
> > }
> > ------
> >
> >
> > So I changed it into the following, and then optimized a little bit
> > (first did virus.scan(/(?=#{x})/ which is a little slower then what
I'm
> > doing now)
> >
> > ------
> > $proteasome.each {|x|
> > virus.scan(/#{x[0].chr}(?=#{x[1,4]})/) {
> > count_prot += 1
> > }
> > }
> > ------
> >
> > The sad thing is, the above is comparable in speed with the script I
> > had before I found out about the scan method. Did I miss any obvious
> > optimizations in the scanning/regexp method?
>
> Are you sure about the benchmarks? I got results that suggest that
> the plain scan version is much faster, assuming I've adapted your
> original code correctly:
>
> require 'benchmark'
> include Benchmark
>
> str = "bananaaananaaana" * 100
> n = 5000
>
> def boris(str)
> c = 0
> i = str.size
> begin
> i = str[0,i + 2].rindex("ana")
> c += 1 if i != nil
> end until i == nil
> c
> end
>
> def david(str)
> str.scan(/(?=ana)/).size
> end
>
> bm do |x|
> x.report { n.times { boris(str) } }
> x.report { n.times { david(str) } }
> end
>
> Results:
>
> $ ruby ban.rb
> user system total real
> 27.460000 0.490000 27.950000 ( 28.048449)
> 8.790000 0.020000 8.810000 ( 8.800876)
>
>
> David

As far as I remember he didn't do the size trick your used. That might be
one point.

robert

Ardanwen

12/16/2004 3:34:00 PM

First time trying to use ruby's benchmark module. Doing something
wrong,
----------
boris:falcon:~/SPEC:ruby bench-search.rb
user system total real
../benchmark.rb:148:in `times': undefined method `times' for
Process:Module (NameError)
----------
so I switched back to linux's "time bench-search.rb".

Re-rewrote your program to something that resembles mine more again,
and then
they're similarly slow again (n=50000).

boris:falcon:~/SPEC:time bench-search.rb 7.430u 0.060s 0:07.99 93.7%
0+0k 0+0io 359pf+0w (david)
boris:falcon:~/SPEC:time bench-search.rb 7.480u 0.010s 0:07.54 99.3%
0+0k 0+0io 348pf+0w (boris)

Not sure why you find something so different. Gutfeeling says it might
be because of your search size (3 chars instead of 5 chars, or the
specific search- and searched-string you use. Doing some testing right
now (n = 50000 though).

Now with a 5char nonchanging pattern, changing searched-string
[syntax: program pattern randomseed]
boris:falcon:~/SPEC:time boris.rb 11111 1 8.210u 0.140s 0:08.43 99.0%
boris:falcon:~/SPEC:time boris.rb 11111 1 8.320u 0.100s 0:08.57 98.2%
boris:falcon:~/SPEC:time boris.rb 11111 2 7.920u 0.030s 0:07.98 99.6%
boris:falcon:~/SPEC:time boris.rb 11111 2 7.980u 0.020s 0:07.99 100.1%
boris:falcon:~/SPEC:time boris.rb 11111 3 4.760u 0.010s 0:04.77 100.0%
boris:falcon:~/SPEC:time boris.rb 11111 3 4.770u 0.020s 0:04.82 99.3%
boris:falcon:~/SPEC:time david.rb 11111 1 7.780u 0.000s 0:07.79 99.8%
boris:falcon:~/SPEC:time david.rb 11111 1 7.840u 0.000s 0:07.91 99.1%
boris:falcon:~/SPEC:time david.rb 11111 2 7.530u 0.000s 0:07.58 99.3%
boris:falcon:~/SPEC:time david.rb 11111 2 7.440u 0.000s 0:07.51 99.0%
boris:falcon:~/SPEC:time david.rb 11111 3 5.450u 0.000s 0:05.42 100.5%
boris:falcon:~/SPEC:time david.rb 11111 3 5.510u 0.000s 0:05.60 98.3%

Seems like searched-string pattern matters. Now look at search-string.
Specific search string you use matters quite a lot.
boris:falcon:~/SPEC:time boris.rb 10101 1 9.520u 0.000s 0:09.52 100.0%
boris:falcon:~/SPEC:time boris.rb 10101 2 9.290u 0.010s 0:09.24 100.6%
boris:falcon:~/SPEC:time boris.rb 10101 3 8.620u 0.030s 0:08.70 99.4%
boris:falcon:~/SPEC:time david.rb 10101 1 9.420u 0.000s 0:09.51 99.0%
boris:falcon:~/SPEC:time david.rb 10101 2 8.710u 0.000s 0:08.78 99.2%
boris:falcon:~/SPEC:time david.rb 10101 3 8.540u 0.010s 0:08.79 97.2%

So not sure if that's enough to explain the difference between 1's and
0's and banana's. Another factor is the size of the search string, but
I grow a bit weary of testing :).

Anyway, thanks for testing. Time to look at that search-string link.

bench-search.rb
----------------------------------------
#! /usr/bin/env ruby
FIVE =
["00000","00001","00010","00011","00100","00101","00110","00111","01000","01001","01010","01011","01100","01101","01110","01111","10000","10001","10010","10011","10100","10101","10110","10111","11000","11001","11010","11011","11100","11101","11110","11111"]

str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

srand(1)

n = 50000

def boris(str)
pattern = FIVE[rand(32)]
c = 0
i = 1000
begin
i = str[0,i + 4].rindex(pattern)
c += 1 if i != nil
end until i == nil
c
end

def david(str)
pattern = FIVE[rand(32)]
str.scan(/(?=#{pattern})/).size
end

n.times { boris(str) }
#n.times { david(str) }
-----------------------------------------

boris.rb
------------------------
#! /usr/bin/env ruby

# This bit makes a string length 1000 filled with random 0's and 1's
srand(ARGV[1].to_i)
str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

# This is the method.
def boris(str)
c = 0
i = 1000
begin
i = str[0,i + 4].rindex(ARGV[0])
c += 1 if i != nil
end until i == nil
c
end

# Here we run it.
50000.times { boris(str) }
--------------------------

david.rb
------------
#! /usr/bin/env ruby

# This bit makes a string length 1000 filled with random 0's and 1's
srand(ARGV[1].to_i)
str = ""
begin
str << "#{rand(2)}"
end until str.length == 1000

# This is the method.
def david(str)
str.scan(/(?=#{ARGV[0]})/).size
end
# Here we run it.
50000.times { david(str) }
---------------------

Ardanwen

12/16/2004 3:54:00 PM

Ugh. I need to learn how to preserve indentation in google groups :P.
New to newsgroups in general, I'll go look for a faq. For google groups
users, the original (but sadly still unindented code) is under 'show
options' 'show original'. For some reason, a part changes and becomes
blue in the normal google view.

erimess

5/12/2007 10:47:00 PM

On Mon, 07 May 2007 06:27:56 GMT, someone@somewhere.com (Dalboz
Dragon) wrote:

>On 2 May 2007 12:40:34 -0700, Dalboz Dragon <beerstud948@yahoo.com>
>earned a wedgie by saying:
>
>
>After I told my supervisor, I think I might have had an anxiety
>attack. Suddenly, it all seems so real, that I'm leaving a fairly
>high-paying, albeit frustrating job to go back to school to do who
>knows what. And by announcing it, I pretty much passed the point of no
>return. But I still keep telling myself that I have not been happy
>where I'm at, so rather than wait for the next thing to come along,
>I'm actively trying to make a change. Hopefully, it will work out.

Well, supposedly the average person changes careers like 3 times in
their life, or some such. (I've already *kind* of changed twice,
though they are somewhat related.) Going into the unknown can be
spooky, but staying on an obvious wrong path is not good either. The
unknown can also be exciting.

I too don't make tons of money, but am much less stressed having work
with decent bosses (or in the case of my online work, a boss I've
never even met), and flexible hours and doing stuff I more enjoy
doing, usually. It's just not worth it to me to make a ton of money
but hate what I'm doing and be unhappy and stressed all the time.
Fortunately for me, I'm good at living cheap, managing money and
investing. :-) Sure, I have other concerns because of it, but what's
the point of living to a ripe old age if I hate it? I'm most
definitely happier this way.

One brother is a computer programmer, the other a circuit board
designer. Both do contract work, both are usually working for
engineers, both are on "projects." And both complain about just what
you have -- always thinking something can be done in half the time,
not allowing for changes and problems, etc. And one always
complaining that the marketing people create impossible expectations
in the mind of the clients. :-) (And usually expected to do tons of
overtime.) I could just never stand being in that type of situation.
In fact, one thing (though not really the main thing) that turns me
off about working at a CPAs is called tax season. :-)

In fact, my dad was an engineer (family full of geeks) and I know he
was really unhappy with work, though he never talked about it to me.
Very private person. I just know he didn't do tons of overtime cause
he was always in time for dinner. But he got told by a doctor to
retire early before he had a heart attack, cause he was so stressed by
work. I already have his high blood pressure -- I *don't* want to end
up like him. (Although he lived to 84 - pretty good.)
--

Erimess Dragon
-==(UDIC)==-

d++e+NT++Om UK!1!2!3!A!L!
U+uCuFuG+++uLB+uA+ nC+nH+nP+nS++nT-xa5

Never compare yourself to the best others can do,
but rather to the best you can do.

comp.lang.ruby

String.scan - catching overlapping patterns with lookahead

Ardanwen

Robert Klemme

dblack

Robert Klemme

Ardanwen

Ardanwen

erimess

x Login to ForumsZone