
comp.lang.python

Scalable python dict {'key_is_a_string': [count, some_val]}

krishna

2/20/2010 6:36:00 AM

I have to manage a couple of dicts holding a huge dataset (larger
than fits in the memory on my system). Each key is a string (actually
a tuple converted to a string) and each value is a two-item list, one
element of which is a count related to the key. At the end I have to
sort this dictionary by the count.
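
In memory the structure and the final sort would look like this (the
add() helper is just for illustration):

counts = {}  # {'key_is_a_string': [count, some_val]}

def add(key, some_val):
    entry = counts.setdefault(key, [0, some_val])
    entry[0] += 1  # bump the count; this keeps the first some_val seen

# at the end: sort by count, largest first
ranked = sorted(counts.items(), key=lambda kv: kv[1][0], reverse=True)

That works until the dict no longer fits in memory.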

The platform is Linux. I am planning to implement it by setting a
threshold beyond which I write the data out to files (three columns:
'key count some_val'), sort the individual files by the key column,
and later merge them: I would walk through the files with one pointer
per file and add up the counts when entries from two files match by
key. The sorting would be done with the 'sort' command, so that is
the bottleneck.
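
The merge step I have in mind is roughly this (assuming tab-separated
lines; keeping the first some_val seen for a key is an arbitrary
choice, and merge_runs() is just an illustrative name):

import heapq
from itertools import groupby
from operator import itemgetter

def iter_records(path):
    # yield (key, count, some_val) from a run already sorted by key
    with open(path) as f:
        for line in f:
            key, count, some_val = line.rstrip("\n").split("\t")
            yield key, int(count), some_val

def merge_runs(paths, out_path):
    # k-way merge with one pointer per file; heapq.merge holds only one
    # record per input file in memory at a time
    merged = heapq.merge(*[iter_records(p) for p in paths])
    with open(out_path, "w") as out:
        for key, group in groupby(merged, key=itemgetter(0)):
            rows = list(group)
            total = sum(r[1] for r in rows)  # add up counts for matching keys
            out.write("%s\t%d\t%s\n" % (key, total, rows[0][2]))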

Any suggestions, comments?

By the way, is there a Linux command that does the merging part?

Thanks,
Krishna


4 Answers

Paul Rubin

2/20/2010 6:48:00 AM

krishna <krishna.k.0001@gmail.com> writes:
> The sorting would be done with the 'sort' command, so that is
> the bottleneck.

That is a good approach. The sort command is highly optimized and will
beat any Python program that does something comparable. Set LC_ALL=C if
the file is all ASCII, since that bypasses a lot of slow locale-aware
collation and makes sorting go even faster.

> By the way, is there a linux command that does the merging part?

sort -m

Note that the sort command already does external sorting, so if you
can just write out one large file and sort it, instead of sorting and
then merging a bunch of smaller files, that may simplify your task.
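
For concreteness, a sketch of driving it from Python (the file names
and the tab-separated, key-in-column-1 layout are assumptions):

import os
import subprocess

env = dict(os.environ, LC_ALL="C")  # byte-wise collation for ASCII data

# external sort of one big file by the key column; sort spills to
# temporary files on its own when the input exceeds memory
subprocess.check_call(
    ["sort", "-t", "\t", "-k", "1,1", "-o", "by_key.txt", "big.txt"],
    env=env)

# or merge runs that are each already sorted by key
subprocess.check_call(
    ["sort", "-m", "-t", "\t", "-k", "1,1", "-o", "merged.txt",
     "run1.txt", "run2.txt"], env=env)

# final ordering by the count column, numeric, largest first
subprocess.check_call(
    ["sort", "-t", "\t", "-k", "2,2nr", "-o", "by_count.txt",
     "merged.txt"], env=env)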

Jonathan Gardner

2/20/2010 7:28:00 AM

On Fri, Feb 19, 2010 at 10:36 PM, krishna <krishna.k.0001@gmail.com> wrote:
> I have to manage a couple of dicts holding a huge dataset (larger
> than fits in the memory on my system). Each key is a string (actually
> a tuple converted to a string) and each value is a two-item list, one
> element of which is a count related to the key. At the end I have to
> sort this dictionary by the count.
>
> The platform is Linux. I am planning to implement it by setting a
> threshold beyond which I write the data out to files (three columns:
> 'key count some_val'), sort the individual files by the key column,
> and later merge them: I would walk through the files with one pointer
> per file and add up the counts when entries from two files match by
> key. The sorting would be done with the 'sort' command, so that is
> the bottleneck.
>
> Any suggestions, comments?
>

You should be using BDBs (Berkeley DBs) or even something like
PostgreSQL. The indexes there will give you the scalability you need.
I doubt you will be able to write anything that selects, updates,
inserts, or deletes data better than BDB or PostgreSQL can.
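
As a self-contained stand-in for those, here is the idea with sqlite3
from the standard library (the schema and file name are assumptions;
BDB or PostgreSQL would follow the same pattern):

import sqlite3

conn = sqlite3.connect("counts.db")
conn.execute("CREATE TABLE IF NOT EXISTS counts"
             " (key TEXT PRIMARY KEY, count INTEGER, some_val TEXT)")

def add(key, delta, some_val):
    # accumulate the count on disk instead of in a dict
    cur = conn.execute(
        "UPDATE counts SET count = count + ? WHERE key = ?", (delta, key))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO counts VALUES (?, ?, ?)",
                     (key, delta, some_val))

add("('a', 'b')", 1, "x")
conn.commit()

# the primary-key index makes updates cheap; the final sort is one query
for key, count, some_val in conn.execute(
        "SELECT key, count, some_val FROM counts ORDER BY count DESC"):
    print("%s\t%d\t%s" % (key, count, some_val))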

--
Jonathan Gardner
jgardner@jonathangardner.net

Arnaud Delobelle

2/20/2010 11:04:00 AM

On 20 Feb, 06:36, krishna <krishna.k.0...@gmail.com> wrote:
> I have to manage a couple of dicts holding a huge dataset (larger
> than fits in the memory on my system). Each key is a string (actually
> a tuple converted to a string) and each value is a two-item list, one
> element of which is a count related to the key. At the end I have to
> sort this dictionary by the count.
>
> The platform is Linux. I am planning to implement it by setting a
> threshold beyond which I write the data out to files (three columns:
> 'key count some_val'), sort the individual files by the key column,
> and later merge them: I would walk through the files with one pointer
> per file and add up the counts when entries from two files match by
> key. The sorting would be done with the 'sort' command, so that is
> the bottleneck.
>
> Any suggestions, comments?
>
> By the way, is there a Linux command that does the merging part?

Have you looked here? http://docs.python.org/library/persis...
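
For example, shelve from that page gives a persistent dict with the
same interface (the caveat is the final sorted() call, which assumes
the item list fits back in memory):

import shelve

db = shelve.open("counts.shelf")  # dict-like, backed by a dbm file on disk

def add(key, some_val):
    count, val = db.get(key, (0, some_val))
    db[key] = (count + 1, val)  # shelve requires reassigning whole values

add("('a', 'b')", "x")

# caveat: this pulls every item back into memory for the final sort
ranked = sorted(db.items(), key=lambda kv: kv[1][0], reverse=True)
db.close()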

--
Arnaud

geremy condra

3/10/2010 6:50:00 PM

On Wed, Mar 10, 2010 at 11:47 AM, Krishna K <krishna.k.0001@gmail.com> wrote:
>
> On Fri, Feb 19, 2010 at 11:27 PM, Jonathan Gardner
> <jgardner@jonathangardner.net> wrote:
>>
>> You should be using BDBs (Berkeley DBs) or even something like
>> PostgreSQL. The indexes there will give you the scalability you
>> need. I doubt you will be able to write anything that selects,
>> updates, inserts, or deletes data better than BDB or PostgreSQL can.
>
> Thank you. I tried BDB, it seems to get very very slow as you scale.
>
> Thank you,
> Krishna

Have you tried any of the big key-value store systems, like CouchDB, etc.?

Geremy Condra