Asp Forum - brute force string search

Martin Pirker

1/10/2005 12:50:00 AM

Hi...

given: String of several Mb
problem: find the lines in String containing "xyz"

Idea 1:
String.scan(/.*xyz.*/) -> ~10s runtime

Idea 2:
String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)

Idea 3:
loop String.index("xyz",lastmatch+3)
loop results array and grep in match area for line
-> 0,5s

C extension the only faster option left? :-)

Martin

10 Answers

Bertram Scharpf

1/10/2005 3:32:00 AM

Hi,

Am Montag, 10. Jan 2005, 09:51:22 +0900 schrieb Martin Pirker:
> given: String of several Mb
> problem: find the lines in String containing "xyz"
>
>
> Idea 1:
> String.scan(/.*xyz.*/) -> ~10s runtime
>
> Idea 2:
> String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)
>
> Idea 3:
> loop String.index("xyz",lastmatch+3)
> loop results array and grep in match area for line
> -> 0,5s
>
> C extension the only faster option left? :-)

What about this:

str = File.new( 'longfile').read
str.scan /.*/ do |x| puts x if x =~ /xyz/ end

Bertram

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-...

Bill Atkins

1/10/2005 3:38:00 AM

On Mon, 10 Jan 2005 12:32:03 +0900, Bertram Scharpf
<lists@bertram-scharpf.de> wrote:
> What about this:
>
> str = File.new( 'longfile').read
> str.scan /.*/ do |x| puts x if x =~ /xyz/ end

How is that any different from Idea 1?

To the OP, 0.5 seconds seems pretty fast to me. I doubt a C extension
would make much difference since index is already written in C and
probably already highly optimized.

Bill

--
$stdout.sync = true
"Just another Ruby hacker.".each_byte do |b|
('a'..'z').step do|c|print c+"\b";sleep 0.007 end;print b.chr
end; print "\n"

Charles Mills

1/10/2005 6:12:00 AM

Martin Pirker wrote:
> Hi...
>
> given: String of several Mb
> problem: find the lines in String containing "xyz"
>
>
> Idea 1:
> String.scan(/.*xyz.*/) -> ~10s runtime
>
> Idea 2:
> String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)
>
> Idea 3:
> loop String.index("xyz",lastmatch+3)
> loop results array and grep in match area for line
> -> 0,5s
>
Try the StringScanner library. It is very fast.
###
require 'strscan'

scanner = StringScanner.new(str, false)
match = scanner.scan(/.*xyz.*/)
###
you will have to change your regexp in order to skip lines not
containing 'xyz' since it always matches from the current position in
the string.
Docs: http://i.loveruby.net/en/prog/st...
If you just want to count the number of lines you could use the regexp
look ahead and look behind instead of '.*'.

>
> C extension the only faster option left? :-)
>

It is an option, but it might not be much faster.

-Charlie

Robert Klemme

1/10/2005 8:19:00 AM

"Martin Pirker" <crf@sbox.tu-graz.ac.at> schrieb im Newsbeitrag
news:41e1d123$0$11610$3b214f66@aconews.univie.ac.at...
> Hi...
>
> given: String of several Mb
> problem: find the lines in String containing "xyz"
>
>
> Idea 1:
> String.scan(/.*xyz.*/) -> ~10s runtime

Did you try the anchored version? Did it make a difference?

>> s = "a\nbxyz\ncxyz\nd"
=> "a\nbxyz\ncxyz\nd"
>> s.scan(/^.*xyz.*$/)
=> ["bxyz", "cxyz"]

Did you try the block form, i.e.,

s.scan(/^.*xyz.*$/) {|m| # work with m}

> Idea 2:
> String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)
>
> Idea 3:
> loop String.index("xyz",lastmatch+3)
> loop results array and grep in match area for line
> -> 0,5s
>
>
> C extension the only faster option left? :-)

Im not sure whether you will gain much, because certain things will have
to be done anyway: allocation of the result array, of the match strings
and expansion of the result array.

Kind regards

robert

Robert Klemme

1/10/2005 8:45:00 AM

"Robert Klemme" <bob.news@gmx.net> schrieb im Newsbeitrag
news:34es3pF48qpjkU1@individual.net...
>
> "Martin Pirker" <crf@sbox.tu-graz.ac.at> schrieb im Newsbeitrag
> news:41e1d123$0$11610$3b214f66@aconews.univie.ac.at...
> > Hi...
> >
> > given: String of several Mb
> > problem: find the lines in String containing "xyz"
> >
> >
> > Idea 1:
> > String.scan(/.*xyz.*/) -> ~10s runtime
>
> Did you try the anchored version? Did it make a difference?
>
> >> s = "a\nbxyz\ncxyz\nd"
> => "a\nbxyz\ncxyz\nd"
> >> s.scan(/^.*xyz.*$/)
> => ["bxyz", "cxyz"]
>
> Did you try the block form, i.e.,
>
> s.scan(/^.*xyz.*$/) {|m| # work with m}
>
> > Idea 2:
> > String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)
> >
> > Idea 3:
> > loop String.index("xyz",lastmatch+3)
> > loop results array and grep in match area for line
> > -> 0,5s
> >
> >
> > C extension the only faster option left? :-)
>
> Im not sure whether you will gain much, because certain things will have
> to be done anyway: allocation of the result array, of the match strings
> and expansion of the result array.
>
> Kind regards
>
> robert
>

Here are other options:

>> s.inject([]){|ar,x| (x.chomp! ; ar << x) if /xyz/ =~ x; ar}
=> ["bxyz", "cxyz"]
>> s.inject([]){|ar,x| (x.chomp! ; ar << x) if x.include? "xyz"; ar}
=> ["bxyz", "cxyz"]

Cheers

robert

Michael C. Libby

1/10/2005 12:55:00 PM

On Mon, 2005-01-10 at 09:51 +0900, Martin Pirker wrote:
> Hi...
>
> given: String of several Mb
> problem: find the lines in String containing "xyz"
>
>
> Idea 1:
> String.scan(/.*xyz.*/) -> ~10s runtime
>
> Idea 2:
> String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)
>
> Idea 3:
> loop String.index("xyz",lastmatch+3)
> loop results array and grep in match area for line
> -> 0,5s

Did you try something like this?

irb(main):003:0> big_string.each do |x| puts x if x=~/xyz/ end
eeexyzfff
qqqqxyzqqq
=> "abc\neeexyzfff\naadfwe\nqqqqxyzqqq\n"
irb(main):004:0> big_string.split.grep(/xyz/)
=> ["eeexyzfff", "qqqqxyzqqq"]
irb(main):005:0>

Intuitively it seems to me that wildcards (especially unanchored) in REs
will slow them down.

-Michael

Stefan Lang

1/10/2005 4:50:00 PM

Michael C. Libby wrote:

> On Mon, 2005-01-10 at 09:51 +0900, Martin Pirker wrote:
>> Hi...
>>
>> given: String of several Mb
>> problem: find the lines in String containing "xyz"
>>
>>
>> Idea 1:
>> String.scan(/.*xyz.*/) -> ~10s runtime
>>
>> Idea 2:
>> String.grep(/.*xyz.*/) -> ~3s (but gives the \n too)
>>
>> Idea 3:
>> loop String.index("xyz",lastmatch+3)
>> loop results array and grep in match area for line
>> -> 0,5s
>
> Did you try something like this?
>
> irb(main):003:0> big_string.each do |x| puts x if x=~/xyz/ end

Perhaps faster with: ... if x.include? "xyz"

> eeexyzfff
> qqqqxyzqqq
> => "abc\neeexyzfff\naadfwe\nqqqqxyzqqq\n"
> irb(main):004:0> big_string.split.grep(/xyz/)
> => ["eeexyzfff", "qqqqxyzqqq"]
> irb(main):005:0>
>
> Intuitively it seems to me that wildcards (especially unanchored) in REs
> will slow them down.
>
> -Michael

--
Stefan

georgesawyer

1/12/2005 6:01:00 AM

Martin Pirker <crf@sbox.tu-graz.ac.at> Jan 10, 2005 at 12:49 AM wrote:
>given: String of several Mb. Problem: find the lines in String containing
"xyz".

This seems to be a real-world problem, and it seems the 'String of several
Mb' is relatively fixed. If so, here are some ideas.

The balance of resource consumption for different search preparations can
vary, depending on method.

One efficient preparation is sectioning the long string. Choose a size of
"bit array" to indicate some presence in each section of a given character.
To conserve resources, fold the characters together: for example, lower-
and upper-case together, all punctuation and whitespace together, and all
the rest together, and digits make 38 folded characters (nothing special
about that number). Fold characters also when searching. If you choose 8 K
bytes, the product is 304 Kb, which may be a reasonable companion for a 4
Mb string. Said string uses 22 bits to indicate a location; subtracting 13
bits distinguishes among 64 K sections (1 for each bit in 8 K bytes)
leaving 9 bits and a 512-byte section size. A 3-character search string
will span a boundary less than four-fifths of 1% of instances on average.

'AND'-together the section-indicating bit-arrays for each character of the
short, search string, and search only in the resulting sections.

The spanning section boundaries problem is solved mostly by a character
frequency count of the 'String of several Mb', which only uses 38
integers. The rarest character of a short, search string will have a
presence in some number of sections. Its frequency count may exceed that:
call it F. Along the way, lacking F hits (or the least of any character,
similarly), the remaining few instances each will span a boundary, putting
the last character into the next section, or the first into the previous
section. Finding this is another bitwise-'AND' operation.

The bit-arrays can be stored in strings and loaded four bytes at a time
with String#unpack 'V'.

After finding all the hits, eliminate the ones spuriously generated by
character folding.

For 'xyz', this goes through 2 K loads of three words, 'AND'-ed together,
mostly zero. Then it searches the indicated sections using the fastest
system routine; contiguous sections can be combined. This would seem
fast. Only a (Ruby) line or two of the last bit-pulling loop could be
written appropriately in C.

Martin Pirker

1/12/2005 10:28:00 PM

georgesawyer <georgesawyer2@hotmail.ok.nospam.com> wrote:
> Martin Pirker <crf@sbox.tu-graz.ac.at> Jan 10, 2005 at 12:49 AM wrote:
>>given: String of several Mb. Problem: find the lines in String containing
> "xyz".
>
> This seems to be a real-world problem, and it seems the 'String of several
> Mb' is relatively fixed. If so, here are some ideas.
[fascinating read]

This is why even when one finds a "good enough" solution it's still a good
idea to bounce the solution to other people - one never really knows
what useful comments come back :-)

Slicing the data up gives an "overlap" problem.

After some thinking however it's possible to do some presorting of the
data, so a full search must be done at the beginning, but next findings
should be located in area around of last search hit.
e.g. Cutting the search from 100% of a String to specific 1% gives 100
fold improvement -> good :-)

How to do that?
String.index accepts a offset to start from
String.rindex accepts a offset to end with.

So if e.g. I want to search 1Mb in the middle of a 100Mb String i need
a String.yetanotherindex("xyz",startoffset,endoffset), otherwise
performance sucks again.

Looking at string.c of Ruby, .index and .rindex appear to be
quite simple brute-force (that's why they are so fast)

I RFC adding an .index function to String which accepts 3 parameters:
(searchitem, startoffset, endoffset)
IMHO this would make an useful addition to stdlib shipped with Ruby.

Martin

georgesawyer

1/13/2005 5:27:00 AM

Martin Pirker <crf@sbox.tu-graz.ac.at> Jan 12, 2005 at 10:28 PM wrote:
>what useful comments come back :-)

Thanks!

>possible to do some presorting of the data ... Cutting the search from
100% of a String to specific 1% ...

64 K (number of) sections gave 0.0015 %, but if it cannot be done, too bad
...

>How to do that? ... search 1Mb in the middle of a 100Mb String i need a
String#yetanotherindex( "xyz" ,startoffset ,endoffset), otherwise
performance sucks again. I RFC adding an index function to String which
accepts 3 parameters: (searchitem, startoffset, endoffset)

One can make in advance an array (even if not 64 thousand element-) of
512-byte strings, sidestepping the need this time.

With warm regard,

comp.lang.ruby

brute force string search

Martin Pirker

Bertram Scharpf

Bill Atkins

Charles Mills

Robert Klemme

Robert Klemme

Michael C. Libby

Stefan Lang

georgesawyer

Martin Pirker

georgesawyer

x Login to ForumsZone