georgesawyer
1/12/2005 6:01:00 AM
Martin Pirker <crf@sbox.tu-graz.ac.at> Jan 10, 2005 at 12:49 AM wrote:
>given: String of several Mb. Problem: find the lines in String containing
"xyz".
This seems to be a real-world problem, and it seems the 'String of several
Mb' is relatively fixed. If so, here are some ideas.
The balance of resource consumption for different search preparations can
vary, depending on method.
One efficient preparation is sectioning the long string. Choose a size of
"bit array" to indicate some presence in each section of a given character.
To conserve resources, fold the characters together: for example, lower-
and upper-case together, all punctuation and whitespace together, and all
the rest together, and digits make 38 folded characters (nothing special
about that number). Fold characters also when searching. If you choose 8 K
bytes, the product is 304 Kb, which may be a reasonable companion for a 4
Mb string. Said string uses 22 bits to indicate a location; subtracting 13
bits distinguishes among 64 K sections (1 for each bit in 8 K bytes)
leaving 9 bits and a 512-byte section size. A 3-character search string
will span a boundary less than four-fifths of 1% of instances on average.
'AND'-together the section-indicating bit-arrays for each character of the
short, search string, and search only in the resulting sections.
The spanning section boundaries problem is solved mostly by a character
frequency count of the 'String of several Mb', which only uses 38
integers. The rarest character of a short, search string will have a
presence in some number of sections. Its frequency count may exceed that:
call it F. Along the way, lacking F hits (or the least of any character,
similarly), the remaining few instances each will span a boundary, putting
the last character into the next section, or the first into the previous
section. Finding this is another bitwise-'AND' operation.
The bit-arrays can be stored in strings and loaded four bytes at a time
with String#unpack 'V'.
After finding all the hits, eliminate the ones spuriously generated by
character folding.
For 'xyz', this goes through 2 K loads of three words, 'AND'-ed together,
mostly zero. Then it searches the indicated sections using the fastest
system routine; contiguous sections can be combined. This would seem
fast. Only a (Ruby) line or two of the last bit-pulling loop could be
written appropriately in C.