comp.lang.python

Too many open files

AMD

2/4/2008 12:58:00 PM

Hello,

I need to split a very big file (10 gigabytes) into several thousand
smaller files according to a hash algorithm; I do this one line at a
time. The problem I have is that opening a file in append mode, writing
the line and closing the file is very time consuming. I'd rather have
the files all open for the duration, do all the writes and then close them
all at the end.
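
Here is a simplified sketch of what I am doing now (hash() and the
bucket file naming are just stand-ins for my real code):

def split_file(path, nbuckets=3000):
    # Current slow approach: reopen and close the output for every line.
    src = open(path)
    for line in src:
        bucket = hash(line) % nbuckets               # stand-in for the real hash
        out = open('bucket_%04d.txt' % bucket, 'a')  # open for append...
        out.write(line)
        out.close()                                  # ...and close, every line
    src.close()
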
The problem I have under Windows is that as soon as I get to 500 files I
get the "Too many open files" message. I tried the same thing in Delphi
and I can get to 3000 files. How can I increase the number of open files
in Python?

Thanks in advance for any answers!

Andre M. Descombes
9 Answers

Jeff

2/4/2008 1:07:00 PM

Why don't you start around 50 threads at a time to do the file
writes? Threads are effective for IO. You open the source file,
start a queue, and start sending data sets to be written to the
queue. Your source file processing can go on while the writes are
done in other threads.
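
Something along these lines (an untested sketch; the thread count,
queue size and batching are arbitrary):

import threading, Queue    # the Queue module is named queue on Python 3

work = Queue.Queue(maxsize=1000)

def writer():
    while True:
        item = work.get()
        if item is None:              # sentinel: this worker is done
            break
        fname, lines = item
        f = open(fname, 'a')
        f.writelines(lines)
        f.close()

workers = [threading.Thread(target=writer) for _ in range(50)]
for t in workers:
    t.start()
# ... the main loop reads the source file and calls work.put((fname, batch)) ...
for t in workers:
    work.put(None)                    # one sentinel per worker
for t in workers:
    t.join()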

Christian Heimes

2/4/2008 2:50:00 PM

Jeff wrote:
> Why don't you start around 50 threads at a time to do the file
> writes? Threads are effective for IO. You open the source file,
> start a queue, and start sending data sets to be written to the
> queue. Your source file processing can go on while the writes are
> done in other threads.

I'm sorry, but you are totally wrong. Threads are a very bad idea for IO
bound operations. Asynchronous event IO is the best answer for any IO
bound problem. That is select, poll, epoll, kqueue or IOCP.

Christian

Steven D'Aprano

2/4/2008 2:55:00 PM

On Mon, 04 Feb 2008 13:57:39 +0100, AMD wrote:

> The problem I have under Windows is that as soon as I get to 500 files I
> get the "Too many open files" message. I tried the same thing in Delphi
> and I can get to 3000 files. How can I increase the number of open files
> in Python?

Windows XP has a limit of 512 files opened by any process, including
stdin, stdout and stderr, so your code is probably failing after file
number 509.

http://forums.devx.com/archive/index.php/t-1...

It's almost certainly not a Python problem, because under Linux I can
open 1000+ files without blinking.
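
You can check the effective limit on your own box with a quick
throwaway loop like this (untested sketch, Python 2 syntax):

import tempfile

handles = []
try:
    while True:                       # keep opening until the OS says no
        handles.append(tempfile.TemporaryFile())
except (IOError, OSError), e:
    print 'gave up after', len(handles), 'files:', e
for f in handles:
    f.close()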

I don't know how Delphi works around that issue. Perhaps one of the
Windows gurus can advise whether there's a way to increase that limit from 512?



--
Steven

Larry Bates

2/4/2008 3:30:00 PM

AMD wrote:
> Hello,
>
> I need to split a very big file (10 gigabytes) into several thousand
> smaller files according to a hash algorithm; I do this one line at a
> time. The problem I have is that opening a file in append mode, writing
> the line and closing the file is very time consuming. I'd rather have
> the files all open for the duration, do all the writes and then close them
> all at the end.
> The problem I have under Windows is that as soon as I get to 500 files I
> get the "Too many open files" message. I tried the same thing in Delphi
> and I can get to 3000 files. How can I increase the number of open files
> in Python?
>
> Thanks in advance for any answers!
>
> Andre M. Descombes

Not quite sure what you mean by "a hash algorithm", but if you sort the file
(with an external sort program) on the key you want to split on, then you only
need to have one file open at a time.
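
For example (a rough sketch, assuming each line has already been prefixed
with its bucket key and a tab, and that a command-line sort is available):

import subprocess

# External sort groups equal keys together; GNU sort spills to disk as needed.
subprocess.check_call(['sort', '-t', '\t', '-k1,1', 'keyed.txt', '-o', 'sorted.txt'])

current_key = None
out = None
for line in open('sorted.txt'):
    key, payload = line.split('\t', 1)
    if key != current_key:            # key changed, so switch output files
        if out:
            out.close()
        out = open('bucket_%s.txt' % key, 'w')
        current_key = key
    out.write(payload)
if out:
    out.close()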

-Larry

Duncan Booth

2/4/2008 3:57:00 PM

Steven D'Aprano <steve@REMOVE-THIS-cybersource.com.au> wrote:

> On Mon, 04 Feb 2008 13:57:39 +0100, AMD wrote:
>
>> The problem I have under Windows is that as soon as I get to 500 files I
>> get the "Too many open files" message. I tried the same thing in Delphi
>> and I can get to 3000 files. How can I increase the number of open files
>> in Python?


> Windows XP has a limit of 512 files opened by any process, including
> stdin, stdout and stderr, so your code is probably failing after file
> number 509.

No, the C runtime has a limit of 512 files; the OS limit is actually 2048.
See http://msdn2.microsoft.com/en-us/librar...(VS.71).aspx

> I don't know how Delphi works around that issue. Perhaps one of the
> Windows gurus can advise if there's a way to increase that limit from
> 512?
>

Call the C runtime function _setmaxstdio(n) to set the maximum number
of open files to n, up to 2048. Alternatively, os.open() and os.write()
should bypass the C runtime limit.
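
For instance (untested sketch; _setmaxstdio is the real CRT entry point,
everything else here is illustrative):

import ctypes

# Raise the C runtime's FILE* limit; returns the new limit, or -1 on failure.
new_limit = ctypes.cdll.msvcrt._setmaxstdio(2048)

# Or sidestep stdio with OS-level file descriptors:
import os
fd = os.open('bucket_0001.txt', os.O_WRONLY | os.O_APPEND | os.O_CREAT)
os.write(fd, 'one line\n')
os.close(fd)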

It would probably be better, though, to implement some sort of caching
scheme in memory and avoid having to mess with the limits at all. Or do it
in two passes: create 100 files on the first pass, then split each of those
in a second pass.
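
A sketch of the two-pass idea (the bucket counts are illustrative and
hash() stands in for the real function):

NB1, NB2 = 100, 100                  # 100 * 100 = up to 10,000 final buckets

def bucket(line):
    return hash(line) % (NB1 * NB2)

# Pass 1: coarse split, only 100 output files open at once.
coarse = [open('coarse_%02d.tmp' % i, 'w') for i in range(NB1)]
for line in open('big.txt'):
    coarse[bucket(line) // NB2].write(line)
for f in coarse:
    f.close()

# Pass 2: refine each coarse file, again at most ~100 files open.
for i in range(NB1):
    fine = [open('bucket_%04d.txt' % (i * NB2 + j), 'w') for j in range(NB2)]
    for line in open('coarse_%02d.tmp' % i):
        fine[bucket(line) % NB2].write(line)
    for f in fine:
        f.close()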

Gary Herron

2/4/2008 4:27:00 PM

AMD wrote:
> Hello,
>
> I need to split a very big file (10 gigabytes) into several thousand
> smaller files according to a hash algorithm; I do this one line at a
> time. The problem I have is that opening a file in append mode, writing
> the line and closing the file is very time consuming. I'd rather have
> the files all open for the duration, do all the writes and then close them
> all at the end.
> The problem I have under Windows is that as soon as I get to 500 files I
> get the "Too many open files" message. I tried the same thing in Delphi
> and I can get to 3000 files. How can I increase the number of open files
> in Python?
>
> Thanks in advance for any answers!
>
> Andre M. Descombes
>
Try something like this:

Instead of opening several thousand files:

* Create several thousand lists.

* Open the input file and process each line, dropping it into the
correct list.

* Whenever a single list passes some size threshold, open its file,
write the batch, and immediately close the file.

* Similarly at the end (or when the total of all lists passes some size
threshold), loop through the several thousand lists, opening, writing,
and closing.

This will keep the open/write/close operations to a minimum, and you'll
never have more than two files open at a time. Both of those are wins for
you.
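
In outline (a rough sketch; the threshold, bucket count and file naming
are made up):

THRESHOLD = 10000                    # buffered lines per bucket before a flush
NBUCKETS = 3000

buffers = [[] for _ in range(NBUCKETS)]

def flush(i):
    f = open('bucket_%04d.txt' % i, 'a')   # open for append...
    f.writelines(buffers[i])
    f.close()                              # ...and close right away
    buffers[i] = []                        # reset the list

for line in open('big.txt'):
    i = hash(line) % NBUCKETS              # hash() stands in for the real one
    buffers[i].append(line)
    if len(buffers[i]) >= THRESHOLD:
        flush(i)

for i in range(NBUCKETS):                  # final sweep for the leftovers
    if buffers[i]:
        flush(i)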

Gary Herron

Gabriel Genellina

2/4/2008 5:10:00 PM

On Mon, 04 Feb 2008 12:50:15 -0200, Christian Heimes <lists@cheimes.de>
wrote:

> Jeff wrote:
>> Why don't you start around 50 threads at a time to do the file
>> writes? Threads are effective for IO. You open the source file,
>> start a queue, and start sending data sets to be written to the
>> queue. Your source file processing can go on while the writes are
>> done in other threads.
>
> I'm sorry, but you are totally wrong. Threads are a very bad idea for IO
> bound operations. Asynchronous event IO is the best answer for any IO
> bound problem. That is select, poll, epoll, kqueue or IOCP.

The OP said that he has this problem on Windows. The available methods
that I am aware of are:
- using synchronous (blocking) I/O with multiple threads
- asynchronous I/O using OVERLAPPED and wait functions
- asynchronous I/O using IO completion ports

Python does not (natively) support any of the latter, only the first.
I haven't seen any evidence that threads are as bad an idea as you claim;
although I wouldn't use 50 threads as suggested above, just a few more than
the number of CPU cores.
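
For example (a small sketch; NUMBER_OF_PROCESSORS is the Windows
convention for exposing the core count, and the +2 padding is arbitrary):

import os

# On Windows the processor count is exposed via an environment variable.
ncpus = int(os.environ.get('NUMBER_OF_PROCESSORS', '1'))
nthreads = ncpus + 2     # a few more threads than cores, as suggested above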

--
Gabriel Genellina

Dennis Lee Bieber

2/4/2008 7:01:00 PM

On Mon, 04 Feb 2008 08:27:08 -0800, Gary Herron
<gherron@islandtraining.com> declaimed the following in
comp.lang.python:

> * Whenever a single list passes some size threshold, open its file,
      for append,
> write the batch, and immediately close the file.
      ... and reset the list to empty

<G> Might as well be explicit on the requirements...
--
Wulfraed Dennis Lee Bieber KD6MOG
wlfraed@ix.netcom.com wulfraed@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: web-asst@bestiaria.com)
HTTP://www.bestiaria.com/

AMD

2/5/2008 8:18:00 AM

Thank you, everyone,

I ended up using a solution similar to what Gary Herron suggested:
caching the output to a list of lists, one per file, and only doing the
IO when a list reaches a certain threshold.
After playing around with the threshold I ended up with faster
execution times than originally, while having a maximum of two files
open at a time! It's only a matter of trading memory for open files.
Using this strategy with asynchronous IO or threads might yield even
faster times, but I haven't tested it.
Again, thanks for all your suggestions; much appreciated.

Andre M. Descombes
