Asp Forum - Tokenizing a large file

Don Wood

4/15/2009 4:48:00 PM

I have a large file that I need to tokenize. The method I am using now
is fast, but eats up a ton of memory by reading in the entire file first
as a String. I would also like to reuse existing tokens for duplicates.
(I have no control over the file format, but this Regex works well for
what I need.)

Here is what I am doing today.

tokens= File.read(filename).scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/)

And here is what I would like to do.

tokens= []
File.open(filename) do |fh|
fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) do |token|
tokens << i=tokens.index(token) ? tokens[i] : token
end
end

So what I would like to have is a scan method for File objects that
yields the tokens when called with a block, instead of returning an
array. (It would be nice if String#scan could do this as well.) This
isnâ??t a big issue, it just causes my machine to overflow to the swap
file periodically. I could easily fix that with a couple DIMMs, but I
canâ??t help thinking that there should be a better way.
--
Posted via http://www.ruby-....

8 Answers

Eric Hodel

4/15/2009 9:18:00 PM

On Apr 15, 2009, at 09:48, Don Wood wrote:
> I have a large file that I need to tokenize. The method I am using =20=

> now
> is fast, but eats up a ton of memory by reading in the entire file =20
> first
> as a String. I would also like to reuse existing tokens for =20
> duplicates.
> (I have no control over the file format, but this Regex works well for
> what I need.)
>
> Here is what I am doing today.
>
> tokens=3D File.read(filename).scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/)
>
> And here is what I would like to do.
>
> tokens=3D []
> File.open(filename) do |fh|
> fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) do |token|
> tokens << i=3Dtokens.index(token) ? tokens[i] : token
> end
> end
>
> So what I would like to have is a scan method for File objects that
> yields the tokens when called with a block, instead of returning an
> array. (It would be nice if String#scan could do this as well.) This
> isn=92t a big issue, it just causes my machine to overflow to the swap
> file periodically. I could easily fix that with a couple DIMMs, but I
> can=92t help thinking that there should be a better way.

You should look at StringScanner in strscan.rb, it'll allow you to =20
intern your tokens like you want.=

Joel VanderWerf

4/15/2009 9:44:00 PM

Eric Hodel wrote:
> On Apr 15, 2009, at 09:48, Don Wood wrote:
>> I have a large file that I need to tokenize. The method I am using now
>> is fast, but eats up a ton of memory by reading in the entire file first
>> as a String. I would also like to reuse existing tokens for duplicates.
>> (I have no control over the file format, but this Regex works well for
>> what I need.)
>>
>> Here is what I am doing today.
>>
>> tokens= File.read(filename).scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/)
>>
>> And here is what I would like to do.
>>
>> tokens= []
>> File.open(filename) do |fh|
>> fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) do |token|
>> tokens << i=tokens.index(token) ? tokens[i] : token
>> end
>> end
>>
>> So what I would like to have is a scan method for File objects that
>> yields the tokens when called with a block, instead of returning an
>> array. (It would be nice if String#scan could do this as well.) This
>> isn?t a big issue, it just causes my machine to overflow to the swap
>> file periodically. I could easily fix that with a couple DIMMs, but I
>> can?t help thinking that there should be a better way.
>
> You should look at StringScanner in strscan.rb, it'll allow you to
> intern your tokens like you want.

I was going to suggest that, but:

$ irb -r strscan
irb(main):001:0> StringScanner.new(File.open('tmp/t'))
TypeError: can't convert File into String
from (irb):1:in `initialize'
from (irb):1:in `new'
from (irb):1

Is there some way to use StringScanner with an open file?

(also, my ruby 1.8.6 only comes with ext/strscan, not lib/strscan.rb...
maybe we're talking about different things)

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Caleb Clausen

4/16/2009 4:04:00 AM

On 4/15/09, Don Wood <dwood@biped.us> wrote:
> I have a large file that I need to tokenize. The method I am using now
> is fast, but eats up a ton of memory by reading in the entire file first
> as a String. I would also like to reuse existing tokens for duplicates.
> (I have no control over the file format, but this Regex works well for
> what I need.)
>
> Here is what I am doing today.
>
> tokens=3D File.read(filename).scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/)
>
> And here is what I would like to do.
>
> tokens=3D []
> File.open(filename) do |fh|
> fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) do |token|
> tokens << i=3Dtokens.index(token) ? tokens[i] : token
> end
> end
>
> So what I would like to have is a scan method for File objects that
> yields the tokens when called with a block, instead of returning an
> array. (It would be nice if String#scan could do this as well.) This
> isn=92t a big issue, it just causes my machine to overflow to the swap
> file periodically. I could easily fix that with a couple DIMMs, but I
> can=92t help thinking that there should be a better way.

The sequence gem permits scanning a file directly with a regexp.
Something like this should work:

require 'rubygems'
require 'sequence'
require 'sequence/file'
tokens=3D []
fh=3DSequence::File.new(open(filename))
until fh.eof?
tokens<<fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) #or yield token
up to the caller...
fh.scan "\n"
end
fh.close

As I don't know your data format, I'm not sure if this is right. I'm
assuming that your tokens are separated by newlines, but if it's more
complicated than that, you will have to fiddle with the argument to
the 2nd scan. (As Sequence doesn't have String#scan's bump-a-long
behavior, you have to explicitly match the things between scanned
patterns yourself.)

Note that Sequence::File#scan will match patterns only up to a certain
size (4k bytes, I think). This is an inevitable consequence of using a
Regexp against a file; you wouldn't want arbitrary amounts of
backtracking in a 1GB+ file. Java had this restriction as well, last
time I knew (several years ago).

On the other hand, if you really do have one token per line, it will
be simpler and probably faster to use #readline to get tokens one by
one and no special library is needed.

Joel: I think the original ruby implementation of strscan was replaced
by a c extension long ago.

Robert Klemme

4/16/2009 9:28:00 AM

2009/4/15 Don Wood <dwood@biped.us>:
> I have a large file that I need to tokenize. =A0The method I am using now
> is fast, but eats up a ton of memory by reading in the entire file first
> as a String. =A0I would also like to reuse existing tokens for duplicates=

Ryan Davis

4/16/2009 9:34:00 AM

On Apr 16, 2009, at 02:27 , Robert Klemme wrote:

> Converted to the block form:
>
> def my_tokenize file
> tokens = Hash.new {|h,k| h[k.freeze] = k}

FYI:

% irb
>> h = {}
=> {}
>> h["key"] = 42
=> 42
>> h.keys.map { |k| k.frozen? }
=> [true]

hashes dupe and freeze string keys to prevent them from being mutated
while hash keys.

Robert Klemme

4/16/2009 9:46:00 AM

2009/4/16 Ryan Davis <ryand-ruby@zenspider.com>:
>
> On Apr 16, 2009, at 02:27 , Robert Klemme wrote:
>
>> Converted to the block form:
>>
>> def my_tokenize file
>> =A0tokens =3D Hash.new {|h,k| h[k.freeze] =3D k}
>
> FYI:
>
> % irb
>>> h =3D {}
> =3D> {}
>>> h["key"] =3D 42
> =3D> 42
>>> h.keys.map { |k| k.frozen? }
> =3D> [true]
>
> hashes dupe and freeze string keys to prevent them from being mutated whi=
le
> hash keys.

Only if they are not frozen yet.

irb(main):001:0> h =3D {}
=3D> {}
irb(main):002:0> s =3D "abc"
=3D> "abc"
irb(main):003:0> h[s] =3D s
=3D> "abc"
irb(main):004:0> s =3D "bar".freeze
=3D> "bar"
irb(main):005:0> h[s] =3D s
=3D> "bar"
irb(main):006:0> h
=3D> {"abc"=3D>"abc", "bar"=3D>"bar"}
irb(main):007:0> h.each {|kv| p kv.map {|x| x.object_id}}
[134954550, 134972840]
[134951170, 134951170]
=3D> {"abc"=3D>"abc", "bar"=3D>"bar"}

Do you now know why I did it the way I did?

Cheers

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestprac...

Don Wood

4/16/2009 5:32:00 PM

Caleb Clausen wrote:
> On 4/15/09, Don Wood <dwood@biped.us> wrote:
>> And here is what I would like to do.
>> array. (It would be nice if String#scan could do this as well.) This
>> isnï¿½t a big issue, it just causes my machine to overflow to the swap
>> file periodically. I could easily fix that with a couple DIMMs, but I
>> canï¿½t help thinking that there should be a better way.
>
> The sequence gem permits scanning a file directly with a regexp.
> Something like this should work:
>
> require 'rubygems'
> require 'sequence'
> require 'sequence/file'
> tokens= []
> fh=Sequence::File.new(open(filename))
> until fh.eof?
> tokens<<fh.scan(/'[^']*'|"[^"]*"|[(:)]|[^(:)\s]+/) #or yield token
> up to the caller...
> fh.scan "\n"
> end
> fh.close
>
> As I don't know your data format, I'm not sure if this is right. I'm
> assuming that your tokens are separated by newlines, but if it's more
> complicated than that, you will have to fiddle with the argument to
> the 2nd scan. (As Sequence doesn't have String#scan's bump-a-long
> behavior, you have to explicitly match the things between scanned
> patterns yourself.)
>
> Note that Sequence::File#scan will match patterns only up to a certain
> size (4k bytes, I think). This is an inevitable consequence of using a
> Regexp against a file; you wouldn't want arbitrary amounts of
> backtracking in a 1GB+ file. Java had this restriction as well, last
> time I knew (several years ago).

Thanks Caleb,

This looks like exactly what I needed. I'm not sure I understand the
point of the second scan though. The first scan should already ignore
unquoted whitespace, including "\n". (At least that is how it currently
works when I scan a string.) I don't think that I will get anywhere
near the per-token 4k limit.
--
Posted via http://www.ruby-....

Don Wood

4/16/2009 5:40:00 PM

Robert Klemme wrote:
> 2009/4/16 Ryan Davis <ryand-ruby@zenspider.com>:
>> % irb
>>>> h = {}
>> => {}
>>>> h["key"] = 42
>> => 42
>>>> h.keys.map { |k| k.frozen? }
>> => [true]
>>
>> hashes dupe and freeze string keys to prevent them from being mutated while
>> hash keys.
>
> Only if they are not frozen yet.
>
> irb(main):001:0> h = {}
> => {}
> irb(main):002:0> s = "abc"
> => "abc"
> irb(main):003:0> h[s] = s
> => "abc"
> irb(main):004:0> s = "bar".freeze
> => "bar"
> irb(main):005:0> h[s] = s
> => "bar"
> irb(main):006:0> h
> => {"abc"=>"abc", "bar"=>"bar"}
> irb(main):007:0> h.each {|kv| p kv.map {|x| x.object_id}}
> [134954550, 134972840]
> [134951170, 134951170]
> => {"abc"=>"abc", "bar"=>"bar"}
>
> Do you now know why I did it the way I did?
>
> Cheers
>
> robert

Thanks Robert,

I see what you did there. This looks like the perfect solution for
finding duplicate strings quickly. I don't want to assume that tokens
don't span lines but, combining this with Caleb's suggestion of using
the sequence gem, I should have all I need to drastically cut my memory
footprint.
--
Posted via http://www.ruby-....

comp.lang.ruby

Tokenizing a large file

Don Wood

Eric Hodel

Joel VanderWerf

Caleb Clausen

Robert Klemme

Ryan Davis

Robert Klemme

Don Wood

Don Wood

x Login to ForumsZone