comp.lang.ruby

Re: Counting the files in a directory....

Robert Klemme

1/11/2008 4:51:00 PM

On 11.01.2008 16:19, Kyle Schmitt wrote:
> I'm writing some scripts to help manage a mail scanner used at my
> work. Being a mail scanner, it's got huuuuUUUge quarantine
> directories.
>
> Now, I know I can do something along the lines of:
>
> Dir.open("/foo").collect.length-2 #if you're wondering, the -2 is to
> ignore . and ..

You could as well do

count = Dir.entries("/foo").size - 2
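[A small variant of the above, not from the original exchange: filtering "." and ".." explicitly instead of hard-coding the -2 reads a little clearer and stays correct even if the entry list ever lacks the dot entries.]

```ruby
require 'tmpdir'  # only needed for the demo below

# Count the entries of a directory, excluding "." and "..".
def file_count(dir)
  Dir.entries(dir).reject { |e| e == "." || e == ".." }.size
end

# Demo in a throwaway directory:
Dir.mktmpdir do |d|
  3.times { |i| File.write(File.join(d, "f#{i}"), "") }
  puts file_count(d)  # => 3
end
```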

> to get a count of what's in a directory, but the problem there is,
> it's rather slow when you run that in a directory with a few thousand
> files on a server under a severe (4.5>average_load>2) load.
>
> After perusing the Dir, Find and Stat classes, I haven't seen a better way.
> I thought that perhaps there was some sort of system call, at least in
> Real OSes™ (Linux, *BSD, Unix, etc), that would return the number of
> files inside of a directory. Something that would hopefully return in
> a 1/4th or 1/8th a second, rather than in 4 or 8 (or 20...) seconds.
>
> Any clues?

The major cost will be IO, and I guess that cannot be changed. You could
however do some form of caching: read the size and the last mod date of
each dir you are interested in and store that in a Hash (and write that
via Marshal to disk between invocations if your process terminates in
between). Then you only need to check whether the mod date has changed,
and only read the directory if it has. The disadvantage is that you need
one more IO - albeit that will pull just one block, so it might pay off.
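[A minimal sketch of the caching scheme Robert describes, not code from the thread; the cache-file name is hypothetical.]

```ruby
CACHE_FILE = "dir_counts.cache"  # hypothetical location, pick what suits you

def load_cache
  File.exist?(CACHE_FILE) ? Marshal.load(File.binread(CACHE_FILE)) : {}
end

def save_cache(cache)
  File.binwrite(CACHE_FILE, Marshal.dump(cache))
end

# Re-read a directory only when its mtime has changed since the last look.
# The cache maps dir path => [mtime, count].
def cached_count(dir, cache)
  mtime = File.mtime(dir)
  cached = cache[dir]
  return cached[1] if cached && cached[0] == mtime  # unchanged: skip the read
  count = Dir.entries(dir).size - 2                 # ignore . and ..
  cache[dir] = [mtime, count]
  count
end
```

Between runs you would call `load_cache` at startup and `save_cache` before exiting, so only directories whose mtime changed get re-read.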

Kind regards

robert
6 Answers

Kyle Schmitt

1/11/2008 6:14:00 PM

Entries seems to be fairly identical to collect, and it does look nicer...
but yea still slow.

The problem with caching is that we only keep quarantine directories
around for 10 days, due to their size and the relative rarity of us
needing to pull something out of it. One reason for writing this as a
script is that we recover rarely enough that whoever is doing it
forgot how to recover. Still, it's often enough that we want to be
able to do it easily.

In other cases than a mail system, caching would be a very good idea though.

I'll try and read more of the C stuff for handling files/directories
in unix. I can hold out hope for awhile.

Thanks,
Kyle

Mike Fletcher

1/11/2008 7:06:00 PM

Kyle Schmitt wrote:
> Entries seems to be fairly identical to collect, and it does look
> nicer...
> but yea still slow.
>
> The problem with caching is that we only keep quarantine directories
> around for 10 days, due to their size and the relative rarity of us
> needing to pull something out of it. One reason for writing this as a
> script is that we recover rarely enough that whoever is doing it
> forgot how to recover. Still, it's often enough that we want to be
> able to do it easily.

If there's a large number of files in these directories that's probably
the source of the slowness, not the method used to get the list of
entries.


Many filesystems (some less than others) don't behave as well when you
get a "large" number of files in one directory. I think the rule of
thumb I've used for ext2 filesystems is you'll start to notice a delay
when you get a few hundred entries, and you'll start to feel it when you
have thousands.


One way around this (short of installing / upgrading to a new underlying
filesystem that handles these cases better (xfs, for example)) is to
split files out into a directory tree based either on the filename
directly or a hash made from the real filename (say an MD5 hex string of
the filename and you make two levels based on the first 4 hex digits,
00/00, 00/01, ..., ff/fe, ff/ff; 00/00 contains all files for which the
hashed filename begins "0000...", etc.). The downside of this is that
you either have to walk the entire tree to see the contents, or keep an
external index of the contents (which would eliminate your needing to do
what you're trying to do and the justification for splitting things up,
but . . . :).
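[The two-level bucketing scheme above can be sketched in a few lines of Ruby; this is an illustration of the idea, not code from the thread.]

```ruby
require 'digest/md5'

# Map a filename to a two-level bucket path derived from the first four
# hex digits of the MD5 of the name: root/00/00 .. root/ff/ff.
def bucketed_path(root, filename)
  hex = Digest::MD5.hexdigest(filename)
  File.join(root, hex[0, 2], hex[2, 2], filename)
end

bucketed_path("/quarantine", "some_message.eml")  # same name, same bucket, every time
```

With at most a few hundred entries per bucket, each individual directory stays fast to scan.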


--
Posted via http://www.ruby-....

Robert Klemme

1/11/2008 10:21:00 PM

On 11.01.2008 19:14, Kyle Schmitt wrote:
> Entries seems to be fairly identical to collect, and it does look nicer...
> but yea still slow.

As I said: it's the IO for crowded directories (see also Mike's reply).

> The problem with caching is that we only keep quarantine directories
> around for 10 days, due to their size and the relative rarity of us
> needing to pull something out of it. One reason for writing this as a
> script is that we recover rarely enough that whoever is doing it
> forgot how to recover. Still, it's often enough that we want to be
> able to do it easily.
>
> In other cases than a mail system, caching would be a very good idea though.

I am not sure I understand why you think it is a bad idea. If you only
cache the number of files per directory where is the issue? Or is this
script not invoked regularly? Probably I am missing a bit of your use case.

> I'll try and read more of the C stuff for handling files/directories
> in unix. I can hold out hope for awhile.

Won't help. It's really the size of the directory. Maybe you give a
little more detail about your script and when it's used so we can come
up with better suggestions.

Cheers

robert

Kyle Schmitt

1/14/2008 5:10:00 PM

Robert,
The script itself won't be run as routinely as the
directories are rotated. The directories have a daily rotation so
there are only the most recent 10 days available at once, but the
script itself may only be invoked once or twice in a month, at most.

I understand that the size of the directory itself is the problem, but I
was hoping that somehow there was a way to get a simple, more
efficient count. I know that b-tree based file systems are somewhat
new in unix & unix-like systems, so I was just hoping there was some
more efficient way :)

The script itself (as it stands now, albeit slower than I would have
liked) does the following:
With no arguments, lists the number of quarantined and spam messages
being held, for each day.
With a date, lists the file names of the quarantined messages, as well
as their recipients.
With a date and the file name of a quarantined message, warns the
user, asks them if they want to continue, then moves the message back
into the appropriate queue to be delivered.


Thanks

--Kyle

Kyle Schmitt

1/14/2008 5:18:00 PM

On Jan 11, 2008 1:06 PM, Mike Fletcher <lemurific+rforum@gmail.com> wrote:
> Kyle Schmitt wrote:
> > Entries seems to be fairly identical to collect, and it does look
> > nicer...
> > but yea still slow.
> >
> > The problem with caching is that we only keep quarantine directories
> > around for 10 days, due to their size and the relative rarity of us
> > needing to pull something out of it. One reason for writing this as a
> > script is that we recover rarely enough that whoever is doing it
> > forgot how to recover. Still, it's often enough that we want to be
> > able to do it easily.
>
> If there's a large number of files in these directories that's probably
> the source of the slowness, not the method used to get the list of
> entries.
>
>
> Many filesystems (some less than others) don't behave as well when you
> get a "large" number of files in one directory. I think the rule of
> thumb I've used for ext2 filesystems is you'll start to notice a delay
> when you get a few hundred entries, and you'll start to feel it when you
> have thousands.
>
>
> One way around this (short of installing / upgrading to a new underlying
> filesystem that handles these cases better (xfs, for example)) is to
> split files out into a directory tree based either on the filename
> directly or a hash made from the real filename (say an MD5 hex string of
> the filename and you make two levels based on the first 4 hex digits,
> 00/00, 00/01, ..., ff/fe, ff/ff; 00/00 contains all files for which the
> hashed filename begins "0000...", etc.). The downside of this is that
> you either have to walk the entire tree to see the contents, or keep an
> external index of the contents (which would eliminate your needing to do
> what you're trying to do and the justification for splitting things up,
> but . . . :).
>
>
> --
> Posted via http://www.ruby-....
>
>

Mike,
I've been an advocate of using the right file system for the
job for ages now, but the sad truth is, this is running on a rather
old version of RedHat, which doesn't support anything real other than
ext2 & 3. As for our possible upgrade paths for this box, it would
still be RedHat, or a clone (CentOS). From what I can see, they still
don't support modern file systems by default. Admittedly I'm tempted
to add the support myself (it's not hard), but then it'll bring up the
"it's a production system" argument here.

*sigh*
--Kyle

Reid Thompson

1/15/2008 2:57:00 PM


Kyle Schmitt wrote:
>
>
> I'll try and read more of the C stuff for handling files/directories
> in unix. I can hold out hope for awhile.
>
> Thanks,
> Kyle
>

You may have already gotten here....
What kind of times does this give? (The first run will include the
initial compilation time.)
You can modify it to meet your needs (if you have questions, just post
back) -- see man scandir.
You can set up a filter function to return counts only for specific file
matches.
As is, it returns a count of all entries, visible and hidden.

for rubyinline see:
http://www.zenspider.com/ZSS/Products/R...

https://rubyforge.org/projects/...

-----------snip dircount.rb--------------------------------
require 'inline'

class DirCount
  inline do |builder|
    builder.include '<dirent.h>'
    builder.include '<stdio.h>'
    builder.include '<stdlib.h>'  # for free()
    builder.c "
      int count() {
        struct dirent **namelist;
        int n;
        int count;

        /* scandir with no filter and no sort: counts every entry,
           including . and .. */
        count = n = scandir(\".\", &namelist, 0, 0);
        if (n < 0)
          perror(\"scandir\");
        else {
          while (n--)
            free(namelist[n]);
          free(namelist);
        }

        return count;
      }"
  end
end

dc = DirCount.new
puts dc.count
-----------snip--------------------

--
View this message in context: http://www.nabble.com/Counting-the-files-in-a-directory....-tp14758608p148...
Sent from the ruby-talk mailing list archive at Nabble.com.