Markus
10/6/2004 4:00:00 AM
On Tue, 2004-10-05 at 14:12, Robert Feldt wrote:
> Hi,
>
> I need a simple way to have a potentially very large hash mapping
> Strings to Ruby objects. It needs to scale to on the order of 1 million
> entries with up to 1kb of data per entry (even though it might end up
> with maybe 2-300 bytes per entry for starters). Access time is not
> critical but low-to-medium mem usage is. Deletion of (and insert of new)
> keys/values will be fairly frequent so should be supported.
>
> I've been thinking of doing it with some simple files/dir structure and
> a "front end" object which hides the details (of marshal/load of the
> objects and mapping of keys to the marshaled data etc) but maybe there
> is something simpler that might be of use?
Following the road less traveled a short way to see where it
leads... You can fairly easily make something that looks something like
a hash but has different internal structure:
class Hash_of_hashes
def initialize(pc=Hash,cc=Hash)
@parent_class,@child_class = uc,lc
@storage = @parent_class.new
@fanout = 100
end
def sub_level(k)
@storage[k.hash.modulo(@fanout)] ||= @child_class.new
end
def [](k)
sub_level(k)[k]
end
def []=(k,v)
sub_level(k)[k] = v
end
end
It also seems like it would be fairly easy to make something that
looked sort of like a hash but used a directory as its storage.
class Hash_on_disk
def initialize(path='.')
@path=path
File.mkdir(@path) unles File.directory? @path
@fanout = 100
end
def sub_key(k)
k.hash.modulo(@fanout)
end
def [](k)
File.open(@path+"/"+sub_key(k),'r') { |f| #read v from f }
end
def []=(k,v)
File.open(@path+"/"+sub_key(k),'w') { |f| #write v to f }
end
end
These are just sketches (you'd want to be able to delete keys/value
pairs, for example), but they are pretty darned simple. Thus it seems
like you should (in about 50 lines of ruby or so) be able to implement
something that looks like a hash but stores the data in a directory tree
that "terminates" in files based on some criterion. You might, for
example, want to store each value in its own file, or you might want to
store the bottom-most hash in a file (or you may want to decide
dynamically). (The reason for going with the tree structure is to avoid
killing your OS.)
The only gotcha's I can see are 1) intra-value references and 2)
hash-key degeneracy. Both should be reasonably easy to deal with if you
are on the lookout for them.
-- Markus