comp.lang.python

os.walk restart

Keir Vaughan-taylor

3/17/2010 10:04:00 PM

I am traversing a large set of directories using

for root, dirs, files in os.walk(basedir):
    run program

Because the directory set is huge, a full traversal takes days.
Sometimes the run crashes because of a programming error.
As each directory is processed, its name is written to a file.
I want to be able to restart the walk from the directory where it
crashed.

Is this possible?
7 Answers

Steven D'Aprano

3/17/2010 10:43:00 PM


On Wed, 17 Mar 2010 15:04:14 -0700, Keir Vaughan-taylor wrote:

> I am traversing a large set of directories using
>
> for root, dirs, files in os.walk(basedir):
> run program
>
> Being a huge directory set the traversal is taking days to do a
> traversal.
> Sometimes it is the case there is a crash because of a programming
> error.
> As each directory is processed the name of the directory is written to a
> file

What, a proper, honest-to-goodness core dump?

Or do you mean an exception?


> I want to be able to restart the walk from the directory where it
> crashed.
>
> Is this possible?


Quick and dirty with no error-checking:


# Untested
last_visited = open("last_visited.txt", 'r').read()
for root, dirs, files in os.walk(last_visited or basedir):
    open("last_visited.txt", 'w').write(root)
    run program




--
Steven

Gabriel Genellina

3/17/2010 11:09:00 PM


On Wed, 17 Mar 2010 19:04:14 -0300, Keir Vaughan-taylor <keirvt@gmail.com>
wrote:

> I am traversing a large set of directories using [...]
> I want to be able to restart the walk from the directory where it
> crashed.
>
> Is this possible?

If the 'dirs' list were guaranteed to be sorted, you could remove at each
level all previous directories already traversed. But it's not :(

Perhaps a better approach: first collect all the directories to be
processed and write them to a text file -- these are the pending
directories. Then read from the pending file and process each directory
in it. If the process aborts for any reason, manually delete the lines
already processed and restart.

If you use a database instead of a text file, and mark entries as "done"
after processing, you can avoid that last manual step and the whole
process may be kept running automatically. In some cases you may want to
choose the starting point at random.
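A rough sketch of the pending-file scheme (the file names "pending.txt"
and "done.txt" are made up for illustration; the "done" log replaces the
manual delete-and-restart step):

```python
import os

basedir = "."                     # hypothetical root; adjust to taste
pending = "pending.txt"           # directories still to process
done = "done.txt"                 # directories finished so far

# Step 1 (run once): collect every directory into the pending file.
if not os.path.exists(pending):
    with open(pending, "w") as f:
        for root, dirs, files in os.walk(basedir):
            f.write(root + "\n")

# Step 2 (restartable): skip anything already marked done, do the rest.
finished = set()
if os.path.exists(done):
    with open(done) as f:
        finished = set(f.read().splitlines())

with open(done, "a") as log:
    with open(pending) as f:
        for line in f:
            directory = line.rstrip("\n")
            if directory in finished:
                continue
            # ... run the real per-directory work here ...
            log.write(directory + "\n")
            log.flush()           # a crash now loses at most one entry
```

Rerunning the script picks up wherever the "done" log left off.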

--
Gabriel Genellina

alex23

3/18/2010 3:09:00 AM


Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> wrote:
> # Untested
> last_visited = open("last_visited.txt", 'r').read()
> for root, dirs, files in os.walk(last_visited or basedir):
>      open("last_visited.txt", 'w').write(root)
>      run program

Wouldn't this only walk the directory the exception occurred in, and
not the remaining unwalked dirs from basedir?

Something like this should work:

import os
basedir = '.'

walked = open('walked.txt','r').read().split()
unwalked = ((r,d,f) for r,d,f in os.walk(basedir) if r not in walked)

for root, dirs, files in unwalked:
    # do something
    print root
    walked.append(root)

open('walked.txt','w').write('\n'.join(walked))

MRAB

3/18/2010 3:35:00 AM


Keir Vaughan-taylor wrote:
> I am traversing a large set of directories using [...]
> I want to be able to restart the walk from the directory where it
> crashed.
>
> Is this possible?

I would write my own walker which sorts the directory entries it finds
before walking them, and which can skip entries until it reaches the
desired starting path. For example, to start at "/foo/bar", skip the
entries in the root directory whose names sort before "foo", and the
entries in the subdirectory "/foo" whose names sort before "bar".
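Something along these lines, untested beyond the basics -- the name
`sorted_walk` and its `resume` argument are my own invention:

```python
import os

def sorted_walk(top, resume=None):
    # Walk `top` depth-first with directory entries in sorted order.
    # If `resume` is a path inside `top`, skip every entry that sorts
    # before the corresponding component of `resume` at each level.
    parts = [] if resume is None else os.path.relpath(resume, top).split(os.sep)

    def walk(path, parts):
        entries = sorted(os.listdir(path))
        dirs = [e for e in entries
                if os.path.isdir(os.path.join(path, e))]
        if parts:
            dirs = [d for d in dirs if d >= parts[0]]
        yield path, dirs
        for d in dirs:
            # Only the resume branch itself keeps consuming components;
            # everything sorting after it is walked in full.
            sub = parts[1:] if parts and d == parts[0] else []
            for item in walk(os.path.join(path, d), sub):
                yield item

    return walk(top, parts)
```

So `sorted_walk(basedir, resume=last_logged_dir)` would start the
traversal again at the directory recorded in the log file.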

Steve Howell

3/18/2010 3:50:00 AM


On Mar 17, 3:04 pm, Keir Vaughan-taylor <kei...@gmail.com> wrote:
> I am traversing a large set of directories using [...]
> I want to be able to restart the walk from the directory where it
> crashed.
>
> Is this possible?

I assume it's the operation that you are doing on each file that is
expensive, not the walk itself.

If that's the case, then you might be able to get away with just
leaving some kind of breadcrumbs whenever you've successfully
processed a directory or a file, so you can quickly short-circuit
entire directories or files on the next run, without having to
implement any kind of complicated start-where-I-left-off algorithm.

The breadcrumbs could be hidden files in the file system, an easily
indexed list of files that you persist, etc.
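A sketch of the hidden-file variant (the ".processed" marker name is
invented here, and of course it only works where the tree is writable):

```python
import os

MARKER = ".processed"             # hypothetical breadcrumb file name

def process_tree(basedir, process):
    for root, dirs, files in os.walk(basedir):
        marker = os.path.join(root, MARKER)
        if os.path.exists(marker):
            continue              # finished on a previous run, skip it
        # Hide the breadcrumb itself from the per-directory work.
        process(root, [f for f in files if f != MARKER])
        # Drop the breadcrumb only once the directory has succeeded.
        open(marker, "w").close()
```

On a rerun, directories already carrying the marker are skipped
outright, so the restart costs only the walk itself.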

What are you doing that takes so long?

Also, I can understand why the operations on the files themselves
might crash, but can't you catch an exception and keep on chugging?

Another option, if you do not do some kind of pruning on the fly, is
to persist the list of files that you need to process up front to a
file, or a database, and persist the index of the last successfully
processed file, so that you can restart as needed from where you left
off.

Steven D'Aprano

3/18/2010 4:05:00 AM


On Wed, 17 Mar 2010 20:08:48 -0700, alex23 wrote:

> Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> wrote:
>> # Untested
>> last_visited = open("last_visited.txt", 'r').read()
>> for root, dirs, files in os.walk(last_visited or basedir):
>>      open("last_visited.txt", 'w').write(root)
>>      run program
>
> Wouldn't this only walk the directory the exception occurred in and not
> the remaining unwalked dirs from basedir?

Only if you have some sort of branching file hierarchy with multiple sub-
directories in the one directory, instead of a nice simple linear chain
of directories a/b/c/d/.../z as nature intended.

*wink*

Yes, good catch. I said it was untested. You might be able to save the
parent of the current directory, and restart from there instead.


--
Steven

Tim Chase

3/18/2010 10:25:00 AM


Steve Howell wrote:
> If that's the case, then you might be able to get away with just
> leaving some kind of breadcrumbs whenever you've successfully
> processed a directory or a file,

Unless you're indexing a read-only device (whether hardware
read-only like a CD, or permission-wise read-only like a network
share or a non-priv user walking system directories)...

> Also, I can understand why the operations on the files themselves
> might crash, but can't you catch an exception and keep on chugging?

I also wondered this one, perhaps logging the directory in which
the exception happened to later revisit. :)
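Something like this, say -- the `failed_dirs.txt` log name is made up:

```python
import os
import traceback

def robust_walk(basedir, process, errlog="failed_dirs.txt"):
    # Keep chugging: on an exception, log the offending directory for
    # a later revisit instead of letting the whole traversal die.
    with open(errlog, "a") as log:
        for root, dirs, files in os.walk(basedir):
            try:
                process(root, files)
            except Exception:
                log.write(root + "\n")
                traceback.print_exc()
```

Afterwards the error log is a ready-made worklist of directories to
re-run once the programming error is fixed.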

-tkc