Asp Forum - Moving files matching Regex

Mark Woodward

11/25/2007 10:00:00 AM

Hi all,

A question relating to the code below.

What I'm trying to do is:
Files named CAL221107.xls, NCPH141202.xls or GOH333333.xls for example
in the current directory should be appended to an array (just the file
name, not the full path) and moved to ./sent/

It seems to be working, but is there a more Rubyish way of doing it?

-------------------------------------------------------------------------
require 'find'
require 'fileutils'

a=[]
fglob = Regexp.new('(CAL|NCPH|GOH)\d{6}\.xls')

Find.find('./') do |path|
curr_file = File.basename(path)
if curr_file =~ /#{fglob}/
a << curr_file
new_file = "./sent/#{curr_file}"
FileUtils.mv(curr_file, new_file) if not File.exists?(new_file)
end
end

puts a

-------------------------------------------------------------------------

thanks,

--
Mark

15 Answers

Stefano Crocco

11/25/2007 10:30:00 AM

Alle domenica 25 novembre 2007, Mark Woodward ha scritto:
> Hi all,
>
> A question relating to the code below.
>
> What I'm trying to do is:
> Files named CAL221107.xls, NCPH141202.xls or GOH333333.xls for example
> in the current directory should be appended to an array (just the file
> name, not the full path) and moved to ./sent/
>
> It seems to be working, but is there a more Rubyish way of doing it?
>
> -------------------------------------------------------------------------
> require 'find'
> require 'fileutils'
>
> a=[]
> fglob = Regexp.new('(CAL|NCPH|GOH)\d{6}\.xls')
>
> Find.find('./') do |path|
> curr_file = File.basename(path)
> if curr_file =~ /#{fglob}/
> a << curr_file
> new_file = "./sent/#{curr_file}"
> FileUtils.mv(curr_file, new_file) if not File.exists?(new_file)
> end
> end
>
> puts a
>
> -------------------------------------------------------------------------
>
> thanks,

Not very different:

require 'find'
require 'fileutils'

a=[]
fglob = /(CAL|NCPH|GOH)\d{6}\.xls/

Find.find('.') do |f|
name = File.basename f
if name=~ fglob
a << name
FileUtils.mv f, "sent/#{name}" unless File.exist? "sent/#{name}"
end
end

A couple of notes:
* You can create a regexp using the // syntax, instead of using Regexp.new
* when you call String#=~ you don't need to use /#{...}/, if you already have
a regexp
* I think you need to pass path, and not curr_file, as the first argument of
FileUtils.mv, otherwise it won't work if the file is in a subdirectory.

Stefano

Mark Woodward

11/25/2007 11:06:00 AM

Hi Stefano,

On Sun, 25 Nov 2007 05:30:10 -0500
Stefano Crocco <stefano.crocco@alice.it> wrote:

>
> Not very different:
>
> require 'find'
> require 'fileutils'
>
> a=[]
> fglob = /(CAL|NCPH|GOH)\d{6}\.xls/
>
> Find.find('.') do |f|
> name = File.basename f
> if name=~ fglob
> a << name
> FileUtils.mv f, "sent/#{name}" unless File.exist? "sent/#{name}"
> end
> end
>
> A couple of notes:
> * You can create a regexp using the // syntax, instead of using
> Regexp.new

that's better, thanks

> * when you call String#=~ you don't need to use /#{...}/, if you
> already have a regexp

another good thing to know.

> * I think you need to pass path, and not curr_file, as the first
> argument of FileUtils.mv, otherwise it won't work if the file is in a
> subdirectory.

There won't be any sub directories except for /sent.
I've discovered from the cookbook I can use prune to ignore it.
Below's where I'm up to.

>
> Stefano
>

appreciate your help,
cheers,

--
Mark

---------------------------------------------------------------------------
require 'find'
require 'fileutils'

FILE_REGEX = /(CAL|NCPH|GOH)\d{6}\.xls/
SENT_DIR = "./sent/"

def get_file_names
fn=[]
Find.find('./') do |path|
Find.prune if File.basename(path) == 'sent'
curr_file = File.basename(path)
if curr_file =~ FILE_REGEX
fn << curr_file
end
end
fn
end

def move_files
@a.each do |fn|
new_loc = SENT_DIR + fn
FileUtils.mv(fn, new_loc) if not File.exists?(new_loc)
end
end

@a = get_file_names
move_files
puts @a

Raul Raul

11/25/2007 11:49:00 AM

Mark Woodward wrote:
> There won't be any sub directories except for /sent.

I add this to the advice from Stefano:

1) if you just want to find the files in the current directory, then we
do not need to use Find (which, as Stefano pointed out, could find files
in subdirectories that you may not foresee now..); and we can collect
the files in one shot with grep.

2) for precaution, it is better to match the complete file name (just in
case there may be other files with that name + an extension, eg like
saved, etc).

3) we could place the method that moves the files into class Array,
rather than leaving it in the toplevel. (Consider a param for the
destination dir, unless you are sure that it will be always a Constant).

class Array
def move_to_subdir
self.each do |fn|
new_loc = SENT_DIR + fn
FileUtils.mv(fn, new_loc) unless File.exists?(new_loc)
end
end
end

# the main is now 4 lines:

fglob= %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x

all_files = Dir.glob("*")
my_files = all_files.grep(fglob)

my_files.move_to_subdir

Regards,

Raul
--
Posted via http://www.ruby-....

Robert Klemme

11/25/2007 4:22:00 PM

On 25.11.2007 12:48, Raul Parolari wrote:
> 3) we could place the method that moves the files into class Array,
> rather than leaving it in the toplevel. (Consider a param for the
> destination dir, unless you are sure that it will be always a Constant).

I would not do that because it is not really an Array operation. Having
a top level method (aka function) is better IMHO.

Kind regards

robert

Raul Raul

11/25/2007 10:46:00 PM

Robert Klemme wrote:

> I would not do that because it is not really an Array operation.

You are right; it does not belong in class Array.

> Having a top level method (aka function) is better IMHO.

Here I use the occasion to share a perplexity I have, when I look at
Ruby programs, and I sometimes see a myriad of top-level methods. If I
ask the reason, the answer is a vague "they do not really fit well
anywhere", which leaves me a bit disconcerted (especially when the
script is part of a project that may grow, not something written to be
used once).

In the case discussed, I would place this method in a module (that would
just need to be required):

module MyUtils
def self.mv_files_to_subdir(files, subdir)
files.each do |curr_file|
..
end
end
end

or, as this is in fact a file operation, I would consider to reopen
FileUtils and add the method:

#--
module FileUtils
def self.mv_files_to_subdir(files, subdir)
files.each do |curr_file|
new_file = "./#{subdir}/#{curr_file}"
FileUtils.mv(curr_file, new_file) unless File.exists?(new_file)
end
end
end

fglob= %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
all_files = Dir.glob("*")
my_files = all_files.grep(fglob)

FileUtils.mv_files_to_subdir(my_files, 'sent')
#-------

It is a small point (some may say 'but what's the difference with a
top-level method?'), but let me know your opinion, if you have a chance

Regards,

Raul
--
Posted via http://www.ruby-....

Robert Klemme

11/26/2007 7:35:00 AM

2007/11/25, Raul Parolari <raulparolari@gmail.com>:
> Robert Klemme wrote:
>
> > I would not do that because it is not really an Array operation.
>
> You are right; it does not belong in class Array.
>
> > Having a top level method (aka function) is better IMHO.
>
> Here I use the occasion to share a perplexity I have, when I look at
> Ruby programs, and I sometimes see a myriad of top-level methods. If I
> ask the reason, the answer is a vague "they do not really fit well
> anywhere", which leaves me a bit disconcerted (especially when the
> script is part of a project that may grow, not something written to be
> used once).
>
> In the case discussed, I would place this method in a module (that would
> just need to be required):
>
> module MyUtils
> def self.mv_files_to_subdir(files, subdir)
> files.each do |curr_file|
> ..
> end
> end
> end
>
> or, as this is in fact a file operation, I would consider to reopen
> FileUtils and add the method:
>
> #--
> module FileUtils
> def self.mv_files_to_subdir(files, subdir)
> files.each do |curr_file|
> new_file = "./#{subdir}/#{curr_file}"
> FileUtils.mv(curr_file, new_file) unless File.exists?(new_file)
> end
> end
> end
>
> fglob= %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
> all_files = Dir.glob("*")
> my_files = all_files.grep(fglob)
>
> FileUtils.mv_files_to_subdir(my_files, 'sent')
> #-------
>
> It is a small point (some may say 'but what's the difference with a
> top-level method?'), but let me know your opinion, if you have a chance

Definitively the right way to go if there is a chance that the code is
reused and might create namespace collisions in a larger project. If
if is just used in a simple script I would stick with the top level
method defined inside the script because distributing the code would
be unnecessarily complex in that case. One can refactor later if
needed.

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Mark Woodward

11/26/2007 10:42:00 AM

On Sun, 25 Nov 2007 20:59:46 +1100
Mark Woodward <markonlinux@internode.on.net> wrote:

thanks to all replies. Very informative.
I've benchmarked this and 'get_file_names2' is considerably quicker!
(if I'm benchmarking correctly?)

Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?

def get_file_names
fn=[]
Find.find('./') do |path|
Find.prune if File.basename(path) == 'sent'
curr_file = File.basename(path)
if curr_file =~ FILE_REGEX
fn << curr_file
end
end
fn
end

def get_file_names2
fglob= %r{^(CAL|NCPH|GOH)\d{6}\.xls$}x
all_files = Dir.glob("*")
my_files = all_files.grep(fglob)
end

Benchmark.bm(5) do |timer|
timer.report('get_file_names') {10000.times {get_file_names}}
timer.report('get_file_names2') {10000.times {get_file_names2}}
end

Results:

user system total real
get_file_names 3.040000 0.700000 3.740000 (5.440394)
get_file_names2 0.570000 0.160000 0.730000 (0.997355)

cheers,

--
Mark

Robert Klemme

11/26/2007 12:36:00 PM

2007/11/26, Mark Woodward <markonlinux@internode.on.net>:
> On Sun, 25 Nov 2007 20:59:46 +1100
> Mark Woodward <markonlinux@internode.on.net> wrote:
>
> thanks to all replies. Very informative.
> I've benchmarked this and 'get_file_names2' is considerably quicker!
> (if I'm benchmarking correctly?)
>
> Could someone explain why? ie what is it about get_file_names2 that
> makes it so much quicker? What is it about get_file_names that slows it
> down?

In _2 you do a dir glob in *one* directory only while find recursively
descends a *directory tree*. Even though you prune in some cases it
comes as no surprise that this is slower.

> def get_file_names
> fn=[]
> Find.find('./') do |path|
> Find.prune if File.basename(path) == 'sent'
> curr_file = File.basename(path)
> if curr_file =~ FILE_REGEX
> fn << curr_file
> end
> end
> fn
> end
>
>
> def get_file_names2
> fglob= %r{^(CAL|NCPH|GOH)\d{6}\.xls$}x
> all_files = Dir.glob("*")
> my_files = all_files.grep(fglob)
> end
>
>
>
> Benchmark.bm(5) do |timer|
> timer.report('get_file_names') {10000.times {get_file_names}}
> timer.report('get_file_names2') {10000.times {get_file_names2}}
> end
>
>
> Results:
>
> user system total real
> get_file_names 3.040000 0.700000 3.740000 (5.440394)
> get_file_names2 0.570000 0.160000 0.730000 (0.997355)
>

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Raul Raul

11/26/2007 7:39:00 PM

Mark Woodward wrote:
> I've benchmarked this and 'get_file_names2' is considerably quicker!
>
> Could someone explain why? ie what is it about get_file_names2 that
> makes it so much quicker? What is it about get_file_names that slows it
> down?

Robert Klemme wrote:
> In _2 you do a dir glob in *one* directory only while find recursively
> descends a *directory tree*. Even though you prune in some cases it
> comes as no surprise that this is slower.

Actually, I think that most of the gain in the final solution comes from
this:
a) in the original solution, the result array is built inserting
manually each file that matches the pattern in the array..
b) in the final solution (get_file_names2) not only Find is replaced by
Dir.glob, but the result array is built by grep.

This may sound surprising, but map (collect) and grep are highly
optimized C mechanisms; moreover they can do smart guesses on the size
of the result array (in particular, grep can inmediately conclude that
the size of the result array will be less than the array given to him).
In the manual solution instead, every item added to the array probably
requires some memory reallocation.

Let's verify if this is true, benchmarking 3 variants:

0) get_file_names : Find + manual array build
1) get_file_names1: Dir.glob + manual array build
2) get_file_names2: Dir.glob + grep

# uses Find + manual array build
def get_file_names
fn=[]
Find.find('./') do |path|
Find.prune if File.basename(path) == 'sent'
curr_file = File.basename(path)
if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
fn << curr_file
end
end
fn
end

# uses Dir.glob + manual array build
def get_file_names1
all_files = Dir.glob("*")
my_files.each do |path|
curr_file = File.basename(path)
if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
fn << curr_file
end
end
fn
end

# uses Dir.glob + grep
def get_file_names2
all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
end

# with 10 files matching the pattern
Benchmark.bm(5) do |timer|
timer.report('get_file_names') {10000.times {get_file_names}}
timer.report('get_file_names1') {10000.times {get_file_names}}
timer.report('get_file_names2') {10000.times {get_file_names2}}
end

user system total real
get_file_names 4.680000 3.870000 8.550000 ( 8.567400)
get_file_names1 4.670000 3.870000 8.540000 ( 8.564753)
get_file_names2 0.690000 0.780000 1.470000 ( 1.470924)

The 580% improvement in the final solution comes almost exclusively from
the grep.

---
I add this for Mark:
it is best in any case to avoid the unnecessary 'Find', for several
reasons; the first is that what you need is the simple Dir.glob.

But let me cite a more insidious problem in that code: the idea of
pruning explicitly 'sent'. Even if you think that the only subdirectory
present will be 'sent', what happens if 8 months from now another person
(or yourself, after some more thousands lines of code!) needs to add
another subdir with files matching that pattern (perhaps just to save a
version of them)? will he remember the assumption done months before?

Moreover, as Stefano pointed out at the beginning of the thread, this
would cause subtle file corruptions, as the code is not set up to deal
with the path correctly (a testing nightmare would follow, until
somebody remembers the 'assumption'; some people have seen this hundreds
of times..:-).

So, the best is not to code 'assumptions' in the code; and of course to
use the right tools for the problem at hand: in this case, Dir.glob and
the beautiful grep

Regards

Raul

--
Posted via http://www.ruby-....

Robert Klemme

11/26/2007 8:35:00 PM

On 26.11.2007 20:39, Raul Parolari wrote:
> Mark Woodward wrote:
>> I've benchmarked this and 'get_file_names2' is considerably quicker!
>>
>> Could someone explain why? ie what is it about get_file_names2 that
>> makes it so much quicker? What is it about get_file_names that slows it
>> down?
>
> Robert Klemme wrote:
>> In _2 you do a dir glob in *one* directory only while find recursively
>> descends a *directory tree*. Even though you prune in some cases it
>> comes as no surprise that this is slower.
>
> Actually, I think that most of the gain in the final solution comes from
> this:
> a) in the original solution, the result array is built inserting
> manually each file that matches the pattern in the array..
> b) in the final solution (get_file_names2) not only Find is replaced by
> Dir.glob, but the result array is built by grep.
>
> This may sound surprising, but map (collect) and grep are highly
> optimized C mechanisms; moreover they can do smart guesses on the size
> of the result array (in particular, grep can inmediately conclude that
> the size of the result array will be less than the array given to him).
> In the manual solution instead, every item added to the array probably
> requires some memory reallocation.
>
> Let's verify if this is true, benchmarking 3 variants:
>
> 0) get_file_names : Find + manual array build
> 1) get_file_names1: Dir.glob + manual array build
> 2) get_file_names2: Dir.glob + grep
>
> # uses Find + manual array build
> def get_file_names
> fn=[]
> Find.find('./') do |path|
> Find.prune if File.basename(path) == 'sent'
> curr_file = File.basename(path)
> if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
> fn << curr_file
> end
> end
> fn
> end
>
> # uses Dir.glob + manual array build
> def get_file_names1
> all_files = Dir.glob("*")
> my_files.each do |path|
> curr_file = File.basename(path)
> if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
> fn << curr_file
> end
> end
> fn
> end
>
> # uses Dir.glob + grep
> def get_file_names2
> all_files = Dir.glob("*")
> my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
> end

What about this one?

def get_file_names_3
Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

> # with 10 files matching the pattern
> Benchmark.bm(5) do |timer|
> timer.report('get_file_names') {10000.times {get_file_names}}
> timer.report('get_file_names1') {10000.times {get_file_names}}
> timer.report('get_file_names2') {10000.times {get_file_names2}}
> end
>
> user system total real
> get_file_names 4.680000 3.870000 8.550000 ( 8.567400)
> get_file_names1 4.670000 3.870000 8.540000 ( 8.564753)
> get_file_names2 0.690000 0.780000 1.470000 ( 1.470924)
>
> The 580% improvement in the final solution comes almost exclusively from
> the grep.
>
> ---
> I add this for Mark:
> it is best in any case to avoid the unnecessary 'Find', for several
> reasons; the first is that what you need is the simple Dir.glob.
>
> But let me cite a more insidious problem in that code: the idea of
> pruning explicitly 'sent'. Even if you think that the only subdirectory
> present will be 'sent', what happens if 8 months from now another person
> (or yourself, after some more thousands lines of code!) needs to add
> another subdir with files matching that pattern (perhaps just to save a
> version of them)? will he remember the assumption done months before?
>
> Moreover, as Stefano pointed out at the beginning of the thread, this
> would cause subtle file corruptions, as the code is not set up to deal
> with the path correctly (a testing nightmare would follow, until
> somebody remembers the 'assumption'; some people have seen this hundreds
> of times..:-).
>
> So, the best is not to code 'assumptions' in the code; and of course to
> use the right tools for the problem at hand: in this case, Dir.glob and
> the beautiful grep

Good point! That's basically the same reason why it's better to
explicitely open ports in firewall configs than to close "dangerous" ports.

Kind regards

robert

comp.lang.ruby

Moving files matching Regex

Mark Woodward

Stefano Crocco

Mark Woodward

Raul Raul

Robert Klemme

Raul Raul

Robert Klemme

Mark Woodward

Robert Klemme

Raul Raul

Robert Klemme

x Login to ForumsZone