Asp Forum - Sorting Large File (Code/Performance

Ira.Kovac

1/24/2008 7:19:00 PM

Hello all,

I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
to sort based on first two characters.

I'd greatly appreciate if someone can post sample code that can help
me do this.

Also, any ideas on approximately how long is the sort process going to
take (XP, Dual Core 2.0GHz w/2GB RAM).

Cheers,

Ira

23 Answers

Paul Rubin

1/24/2008 7:42:00 PM

Ira.Kovac@gmail.com writes:
> I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
> to sort based on first two characters.
>
> I'd greatly appreciate if someone can post sample code that can help
> me do this.

Use the unix sort command:

sort inputfile -o outputfile

I think there is a cygwin port.

> Also, any ideas on approximately how long is the sort process going to
> take (XP, Dual Core 2.0GHz w/2GB RAM).

Eh, unix sort would probably take a while, somewhere between 15
minutes and an hour. If you only have to do it once it's not worth
writing special purpose code. If you have to do it a lot, get some
more ram for that box, suck the file into memory and do a radix sort.

John Nagle

1/24/2008 7:57:00 PM

Ira.Kovac@gmail.com wrote:
> Hello all,
>
> I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
> to sort based on first two characters.

Given those numbers, the average number of characters per line is
less than 2. Please check.

John Nagle

John Machin

1/24/2008 8:44:00 PM

On Jan 25, 6:18 am, Ira.Ko...@gmail.com wrote:
> Hello all,
>
> I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
> to sort based on first two characters.

If you mean 1.6 American billion i.e. 1.6 * 1000 ** 3 lines, and 2 *
1024 ** 3 bytes of data, that's 1.34 bytes per line. If you mean other
definitions of "billion" and/or "GB", the result is even fewer bytes
per line.

What is a "Unicode text file"? How is it encoded: utf8, utf16,
utf16le, utf16be, ??? If you don't know, do this:

print repr(open('the_file', 'rb').read(100))

and show us the results.

What does "based on [the] first two characters" mean? Do you mean raw
order based on the ordinal of each character i.e. no fancy language-
specific collating sequence? Do the first two characters always belong
to the ASCII subset?

You'd like to sort a large file? Why? Sorting a file is just a means
to an end, and often another means is more appropriate. What are you
going to do with it after it's sorted?

> I'd greatly appreciate if someone can post sample code that can help
> me do this.

I'm sure you would. However it would benefit you even more if instead
of sitting on the beach next to the big arrow pointing to the drop
zone, you were to read the manual and work out how to do it yourself.
Here's a start: http://docs.python.org/lib/typesseq-mu...

> Also, any ideas on approximately how long is the sort process going to
> take (XP, Dual Core 2.0GHz w/2GB RAM).

If you really have a 2GB file and only 2GB of RAM, I suggest that you
don't hold your breath.

Instead of writing Python code, you are probably better off doing an
external sort. You might consider looking for a Windows port of a
Unicode-capable Unix sort utility. Google "GnuWin32" and see if their
sort does what you want.

Ira.Kovac

1/24/2008 9:27:00 PM

Thanks to all who replied. It's very appreciated.

Yes, I had to doublecheck line counts and the number of lines is ~16
million (insetead of stated 1.6B).

Also:

>What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:
The file is UTF-8

> Do the first two characters always belong to the ASCII subset?
Yes, first two always belong to ASCII subset

> What are you going to do with it after it's sorted?
I need to isolate all lines that start with two characters (zz to be
particular)

> Here's a start: http://docs.python.org/lib/typesseq-mu...
> Google "GnuWin32" and see if their sort does what you want.
Will do, thanks for the tip.

> If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.
I am limited with resources. Unfortunately.

Cheers,

Ira

Stefan Behnel

1/24/2008 9:36:00 PM

Ira.Kovac@gmail.com wrote:
>> What are you going to do with it after it's sorted?
> I need to isolate all lines that start with two characters (zz to be
> particular)

"Isolate" as in "extract"? Remove the rest?

Then why don't you extract the lines first, without sorting the file? (or sort
it afterwards if you still need to). That would heavily cut down your memory
footprint.

Stefan

Stefan Behnel

1/24/2008 9:40:00 PM

Stefan Behnel wrote:
> Ira.Kovac@gmail.com wrote:
>>> What are you going to do with it after it's sorted?
>> I need to isolate all lines that start with two characters (zz to be
>> particular)
>
> "Isolate" as in "extract"? Remove the rest?
>
> Then why don't you extract the lines first, without sorting the file? (or sort
> it afterwards if you still need to). That would heavily cut down your memory
> footprint.

Just for fun, this is what I meant:

for utf8_line in open(filename, 'rb'):
if utf8_line.startswith('zz'):
print utf8_line

Stefan

John Machin

1/24/2008 9:54:00 PM

On Jan 25, 8:26 am, Ira.Ko...@gmail.com wrote:

> I need to isolate all lines that start with two characters (zz to be
> particular)

What does "isolate" mean to you? What does this have to do with
sorting? What do you actually want to do with (a) the lines starting
with "zz" (b) the other lines? What percentage of the lines start with
"zz"?

When looking at the GnuWin32 collection:
(a) "Unicode" is not relevant to your sort problem.
(b) grab yourself a wc and a grep while you're there -- they will help
with "how many lines" and "what percentage of lines" questions.

Martin Marcher

1/24/2008 10:50:00 PM

On Thursday 24 January 2008 20:56 John Nagle wrote:

> Ira.Kovac@gmail.com wrote:
>> Hello all,
>>
>> I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
>> to sort based on first two characters.
>
> Given those numbers, the average number of characters per line is
> less than 2. Please check.

which would be true if 1.599.999.999 had 2 chars and the rest of the lines
just one :)

(but yes that would be an interesting question how to sort a 1 character
line based on the first 2 of that line)

martin

--
http://noneisyours.ma...
http://feeds.feedburner.com/N...

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

John Nagle

1/25/2008 12:14:00 AM

Ira.Kovac@gmail.com wrote:
> Thanks to all who replied. It's very appreciated.
>
> Yes, I had to double check line counts and the number of lines is ~16
> million (instead of stated 1.6B).

OK, that's not bad at all.

You have a few options:

- Get enough memory to do the sort with an in-memory sort, like UNIX "sort"
or Python's "sort" function.
- Thrash; in-memory sorts do very badly with virtual memory, but eventually
they finish. Might take many hours.
- Get a serious disk-to-disk sort program. (See "http://www.ordinal.....
There's a free 30-day trial. It can probably sort your data
in about a minute.)
- Load the data into a database like MySQL and let it do the work.
This is slow if done wrong, but OK if done right.
- Write a distribution sort yourself. Fan out the incoming file into
one file for each first letter, sort each subfile, merge the
results.

With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.

John Nagle

Paul Rubin

1/25/2008 12:15:00 AM

John Nagle <nagle@animats.com> writes:
> - Get enough memory to do the sort with an in-memory sort, like
> UNIX "sort" or Python's "sort" function.

Unix sort does external sorting when needed.

comp.lang.python

Sorting Large File (Code/Performance

Ira.Kovac

Paul Rubin

John Nagle

John Machin

Ira.Kovac

Stefan Behnel

Stefan Behnel

John Machin

Martin Marcher

John Nagle

Paul Rubin

x Login to ForumsZone