Caleb Clausen
4/16/2009 4:04:00 AM
On 4/15/09, Don Wood <dwood@biped.us> wrote:
> I have a large file that I need to tokenize. The method I am using now
> is fast, but eats up a ton of memory by reading in the entire file first
> as a String. I would also like to reuse existing tokens for duplicates.
> (I have no control over the file format, but this Regex works well for
> what I need.)
>
> Here is what I am doing today.
>
> tokens = File.read(filename).scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/)
>
> And here is what I would like to do.
>
> tokens = []
> File.open(filename) do |fh|
>   fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) do |token|
>     tokens << ((i = tokens.index(token)) ? tokens[i] : token)
>   end
> end
>
> So what I would like to have is a scan method for File objects that
> yields the tokens when called with a block, instead of returning an
> array. (It would be nice if String#scan could do this as well.) This
> isn't a big issue, it just causes my machine to overflow to the swap
> file periodically. I could easily fix that with a couple DIMMs, but I
> can't help thinking that there should be a better way.
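An editor's aside on the deduplication in the snippet above: `tokens.index(token)` is a linear scan per token, so the whole pass is O(n^2). A Hash does the same reuse in roughly constant time per token. A minimal sketch, with a stand-in string in place of the file contents:

```ruby
# Sketch (editor's addition): reuse duplicate token Strings via a Hash.
# Array#index scans the whole array per lookup; Hash lookup is ~O(1).
text = "a b a c b"   # stand-in for the file contents
seen = {}
tokens = []
text.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) do |token|
  tokens << (seen[token] ||= token)   # reuse the first String instance seen
end
tokens   # duplicates now share one String object
```

`seen[token] ||= token` stores the first instance of each distinct token and returns that same object for every later duplicate, so repeated tokens don't each hold their own copy.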
The sequence gem permits scanning a file directly with a regexp.
Something like this should work:
require 'rubygems'
require 'sequence'
require 'sequence/file'
tokens = []
fh = Sequence::File.new(open(filename))
until fh.eof?
  tokens << fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) # or yield the token up to the caller...
  fh.scan "\n"
end
fh.close
As I don't know your data format, I'm not sure if this is right. I'm
assuming that your tokens are separated by newlines, but if it's more
complicated than that, you will have to fiddle with the argument to
the 2nd scan. (As Sequence doesn't have String#scan's bump-along
behavior, you have to match the text between scanned patterns
yourself.)
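To illustrate the bump-along difference with stdlib tools (this uses StringScanner, not Sequence, but the cursor-style behavior is analogous):

```ruby
require 'strscan'

# String#scan "bumps along": it silently skips text between matches.
"a, b, c".scan(/[abc]/)   # returns ["a", "b", "c"]

# A cursor-style scanner stops at the first non-match, so you must
# consume the separators between patterns yourself.
s = StringScanner.new("a, b, c")
tokens = []
until s.eos?
  tokens << s.scan(/[abc]/)   # match a token at the cursor
  s.skip(/,\s*/)              # explicitly match what lies between
end
tokens   # same ["a", "b", "c"], but only because we skipped the commas
```

Drop the `s.skip` line and the scanner stalls at the first comma, which is exactly the "fiddle with the 2nd scan" point above.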
Note that Sequence::File#scan will match patterns only up to a certain
size (4k bytes, I think). This is an inevitable consequence of using a
Regexp against a file; you wouldn't want arbitrary amounts of
backtracking in a 1GB+ file. Java had this restriction as well, last
time I knew (several years ago).
On the other hand, if you really do have one token per line, it will
be simpler and probably faster to use #readline to get tokens one by
one; no special library is needed.
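A minimal sketch of that per-line approach (StringIO stands in for the real file here, and the dedup Hash is this editor's addition, not part of the original suggestion):

```ruby
require 'stringio'

# Sketch: one token per line, read lazily so memory use stays flat.
io = StringIO.new("foo\nbar\nfoo\n")   # stand-in for File.open(filename)
seen = {}
tokens = []
io.each_line do |line|
  token = line.chomp                   # strip the trailing newline
  tokens << (seen[token] ||= token)    # reuse duplicate token Strings
end
tokens
```

Because `each_line` yields one line at a time, only the token table and the current line are resident, rather than the whole file.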
Joel: I think the original Ruby implementation of strscan was replaced
by a C extension long ago.