comp.lang.ruby

Large disc-based "hash"?

Robert Feldt

10/5/2004 9:12:00 PM

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1 kB of data per entry (though it might end up at
maybe 200-300 bytes per entry for starters). Access time is not
critical but low-to-medium memory usage is. Deletion of (and insertion
of new) keys/values will be fairly frequent, so that should be supported.

I've been thinking of doing it with a simple file/dir structure and
a "front end" object which hides the details (the marshal/load of the
objects, the mapping of keys to the marshaled data, etc.), but maybe
there is something simpler that might be of use?
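
(A rough sketch of what such a front end could look like -- assuming one
file per key, named by a digest of the key, with values run through
Marshal; with ~1e6 entries the files would need fanning out into
subdirectories rather than one flat dir:)

require 'digest/md5'

# Hypothetical sketch: hash-like front end storing one Marshal'ed value
# per file, the file named by an MD5 digest of the key.
class FileHash
  def initialize(dir)
    @dir = dir
    Dir.mkdir(@dir) unless File.directory?(@dir)
  end

  def path_for(key)
    File.join(@dir, Digest::MD5.hexdigest(key))
  end

  def [](key)
    file = path_for(key)
    File.exist?(file) ? File.open(file, 'rb') { |f| Marshal.load(f) } : nil
  end

  def []=(key, value)
    File.open(path_for(key), 'wb') { |f| Marshal.dump(value, f) }
  end

  def delete(key)
    file = path_for(key)
    File.delete(file) if File.exist?(file)
  end
end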

SDBM does not seem to cut it; in a simple test inserting random
strings, the file size grows non-linearly and thus it does not scale.
Could sqlite or something similar handle this better?
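
(The kind of test meant here, reconstructed as a sketch rather than the
original script -- insert short random-ish values into SDBM and watch the
on-disk size grow:)

require 'sdbm'

SDBM.open('sdbm_test', 0666) do |db|
  100_000.times do |i|
    db[i.to_s] = Array.new(200) { rand(10) }.join   # ~200-byte value
    if (i + 1) % 10_000 == 0
      # sum the sizes of the files SDBM creates for this database
      bytes = Dir.glob('sdbm_test*').inject(0) { |sum, f| sum + File.size(f) }
      puts "#{i + 1} entries\t#{bytes} bytes on disk"
    end
  end
end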

However, I would rather go for something simple than having to bundle a
"full" db.

If it simplifies things you can assume the keys and/or values are of
fixed sizes, since that is very likely, but I'm not sure it will really
be possible to keep that invariant, so more flexible solutions might be
needed. Solution ideas which support caching are nice but not required
(mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose
this are of interest.

Thanks,

Robert



12 Answers

Richard Kilmer

10/5/2004 9:22:00 PM


Berkeley DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

http://moulon.inra.fr/ruby/b...

Ruby library:

http://moulon.inra.fr/rub...

-rich






Richard Kilmer

10/5/2004 9:24:00 PM





On 10/5/04 5:22 PM, "Richard Kilmer" <rich@infoether.com> wrote:

> Berkeley DB is a good solution, and Guy Decoux has a great Ruby wrapper for
> accessing it.
>
> Bdb link:

Oops... http://www.slee...





Ara.T.Howard

10/5/2004 10:03:00 PM


Robert Feldt

10/5/2004 10:44:00 PM


Ara.T.Howard@noaa.gov wrote:

> if 'fairly frequent' is once per week look at CDB too - it's blindingly
> fast and updates (atomic replacements actually) are surprisingly fast.
> reads are absolutely insanely fast.
>
Fairly frequent is more like "1-5% of keys per hour", but thanks for
the pointer, very interesting. It seems to be Unix-only though, and I
need Windows as well, so maybe not.

> bdb is also great.
>
Yes, seems it can handle this load. I tried with a BTree and the
db-file-size-to-added-key+value-bytes ratio seems to stay at 2.5 at
least up to 1e5 entries. Running a test with 1e6 added entries now;
insertion time does not seem to scale linearly but we'll see about file
size.

Thanks to both Rich and you for the pointers.

The downside of going with something like bdb is ease of install/support
on multiple platforms (i.e. Windows, since everything works on Linux... :)),
but I guess bdb should be well supported on Windows too, so maybe that's
what it'll be.

> sqlite with adequate indexing may also work - but probably not.
>
Ok, thanks for that info.

/R




rcoder@gmail.com

10/6/2004 12:02:00 AM


If you guys are talking about the same CDB format Bernstein created,
then I don't know that it would be a good match for this kind of task.

CDB has one major weakness: you can make updates in memory quickly, but
persisting them to disk requires time proportional to the total dataset
size. Note that's dataset size, not number of keys -- CDB requires you
to copy the entire old DB to a new file to persist changes.

It *is* ridiculously fast for lookups of mostly constant data (hence
the initial 'C'), but probably isn't a good match for any sort of
regularly-updated dataset.

I'd second the recommendation for BDB -- it's reasonably fast, very
scalable, and gets hammered on by all kinds of applications day in and
day out.

Gavin Sinclair

10/6/2004 1:59:00 AM


OT: Can someone explain to me the difference between "Berkeley DB" and
("DBM", "GDBM", "SDBM")?

Gavin







rcoder@gmail.com

10/6/2004 2:15:00 AM


Berkeley DB is an embedded database system developed and supported by
Sleepycat Software (http://www.slee...). It is designed for
applications which need transactional data storage, but don't need (or
want) the additional structure that relational databases provide.

It allows simple numeric or string-valued keys to be associated with a
chunk of data, much like a persistent version of Ruby's native hash
tables. However, unlike a hash, only a small amount of the data has to
be in RAM at once -- Berkeley DB databases can be scaled into the
many-GB range easily. Like most relational databases, it also supports
atomic transactions and data recovery after crashes, as well as
concurrent (more than one process or thread) access to a database.

DBM, GDBM, and SDBM are all database formats (and libraries to access
them) which, like Berkeley DB, allow simple key/value mappings to be
saved to disk, and updated incrementally. However, each has various
limitations as to data size per key, total database size, or
performance on large datasets that Berkeley DB does not share. None of them
natively offer transaction support, or handle concurrent access to the
same database gracefully.

The advantage of the *DBM libraries is their simplicity; if your
application uses less than 100,000 total keys, and doesn't need to
store much data in each one, they may offer plenty of functionality,
without requiring you to link in a large, complex library.

More specifically to Ruby, it is worth noting that the *DBM libraries
all have bindings included in the standard Ruby 1.8 distribution, while
Berkeley DB requires you to install BDB, then compile and install the
bindings. This is fine for some applications, but for quick
"download-and-go" tools, it may be more dependencies than you want to
introduce.
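
(To illustrate the stdlib route -- a minimal sketch, not a drop-in library:
wrap one of the bundled *DBM bindings and Marshal the values, since these
stores only accept Strings. Note that SDBM in particular has a small
per-entry limit, roughly 1 kB for key plus value together if memory serves,
so GDBM or Berkeley DB is a better fit for values near the 1 kB mark.)

require 'sdbm'

# Sketch: persistent String => Ruby-object hash on top of SDBM.
class PersistentHash
  def initialize(file)
    @db = SDBM.open(file, 0666)
  end

  def [](key)
    data = @db[key]
    data && Marshal.load(data)    # nil if the key is absent
  end

  def []=(key, value)
    @db[key] = Marshal.dump(value)
  end

  def delete(key)
    @db.delete(key)
  end

  def close
    @db.close
  end
end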

majere

10/6/2004 2:25:00 AM


Robert Feldt <feldt@ce.chalmers.se> writes:

>> [...bdb...]

> Yes, seems it can handle this load. I tried with a BTree and
> the db-file-size-to-added-key+value-bytes ratio seems to stay
> at 2.5 at least up to 1e5 entries. Running a test with 1e6
> added entries now; insertion time does not seem to scale
> linearly but we'll see about file size.

My experience with bdb has been very good. Tests with >500
million entries scaled linearly with regards to file-size and
essentially linearly with regards to insert time. I don't have
too much direct experience with remove time, but update is good,
especially if the entries are not expanding.

The key for excellent performance with bdb is selecting fast
persistent storage for the log files.

~r

Markus

10/6/2004 4:00:00 AM



Following the road less traveled a short way to see where it
leads... You can fairly easily make something that looks something like
a hash but has different internal structure:

class Hash_of_hashes
  def initialize(pc=Hash, cc=Hash)
    @parent_class, @child_class = pc, cc
    @storage = @parent_class.new
    @fanout = 100                    # number of sub-hash buckets
  end
  def sub_level(k)
    @storage[k.hash.modulo(@fanout)] ||= @child_class.new
  end
  def [](k)
    sub_level(k)[k]
  end
  def []=(k, v)
    sub_level(k)[k] = v
  end
end

It also seems like it would be fairly easy to make something that
looked sort of like a hash but used a directory as its storage.

class Hash_on_disk
  def initialize(path='.')
    @path = path
    Dir.mkdir(@path) unless File.directory? @path
    @fanout = 100                    # number of bucket files
  end
  def sub_key(k)
    k.hash.modulo(@fanout).to_s
  end
  def [](k)
    File.open(@path + "/" + sub_key(k), 'rb') { |f| Marshal.load(f) }    # read v from f
  end
  def []=(k, v)
    File.open(@path + "/" + sub_key(k), 'wb') { |f| Marshal.dump(v, f) } # write v to f
  end
end
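
(Hypothetical usage of the sketch above; as written, two keys whose hashes
land in the same bucket will overwrite each other -- the "hash-key
degeneracy" gotcha noted below:)

h = Hash_on_disk.new('/tmp/bighash')
h['some-key'] = { 'payload' => 'x' * 300 }
p h['some-key']   # => {"payload"=>"xxx..."}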

These are just sketches (you'd want to be able to delete key/value
pairs, for example), but they are pretty darned simple. Thus it seems
like you should (in about 50 lines of ruby or so) be able to implement
something that looks like a hash but stores the data in a directory tree
that "terminates" in files based on some criterion. You might, for
example, want to store each value in its own file, or you might want to
store the bottom-most hash in a file (or you may want to decide
dynamically). (The reason for going with the tree structure is to avoid
killing your OS.)
The only gotchas I can see are 1) intra-value references and 2)
hash-key degeneracy. Both should be reasonably easy to deal with if you
are on the lookout for them.

-- Markus






Ara.T.Howard

10/6/2004 4:12:00 AM
