Robert Klemme
11/26/2007 8:35:00 PM
On 26.11.2007 20:39, Raul Parolari wrote:
> Mark Woodward wrote:
>> I've benchmarked this and 'get_file_names2' is considerably quicker!
>>
>> Could someone explain why? ie what is it about get_file_names2 that
>> makes it so much quicker? What is it about get_file_names that slows it
>> down?
>
> Robert Klemme wrote:
>> In _2 you do a dir glob in *one* directory only while find recursively
>> descends a *directory tree*. Even though you prune in some cases it
>> comes as no surprise that this is slower.
>
> Actually, I think that most of the gain in the final solution comes from
> this:
> a) in the original solution, the result array is built inserting
> manually each file that matches the pattern in the array..
> b) in the final solution (get_file_names2) not only Find is replaced by
> Dir.glob, but the result array is built by grep.
>
> This may sound surprising, but map (collect) and grep are highly
> optimized C mechanisms; moreover they can do smart guesses on the size
> of the result array (in particular, grep can inmediately conclude that
> the size of the result array will be less than the array given to him).
> In the manual solution instead, every item added to the array probably
> requires some memory reallocation.
>
> Let's verify if this is true, benchmarking 3 variants:
>
> 0) get_file_names : Find + manual array build
> 1) get_file_names1: Dir.glob + manual array build
> 2) get_file_names2: Dir.glob + grep
>
> # uses Find + manual array build
> def get_file_names
> fn=[]
> Find.find('./') do |path|
> Find.prune if File.basename(path) == 'sent'
> curr_file = File.basename(path)
> if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
> fn << curr_file
> end
> end
> fn
> end
>
> # uses Dir.glob + manual array build
> def get_file_names1
> all_files = Dir.glob("*")
> my_files.each do |path|
> curr_file = File.basename(path)
> if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
> fn << curr_file
> end
> end
> fn
> end
>
> # uses Dir.glob + grep
> def get_file_names2
> all_files = Dir.glob("*")
> my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
> end
What about this one?
def get_file_names_3
Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end
> # with 10 files matching the pattern
> Benchmark.bm(5) do |timer|
> timer.report('get_file_names') {10000.times {get_file_names}}
> timer.report('get_file_names1') {10000.times {get_file_names}}
> timer.report('get_file_names2') {10000.times {get_file_names2}}
> end
>
> user system total real
> get_file_names 4.680000 3.870000 8.550000 ( 8.567400)
> get_file_names1 4.670000 3.870000 8.540000 ( 8.564753)
> get_file_names2 0.690000 0.780000 1.470000 ( 1.470924)
>
> The 580% improvement in the final solution comes almost exclusively from
> the grep.
>
> ---
> I add this for Mark:
> it is best in any case to avoid the unnecessary 'Find', for several
> reasons; the first is that what you need is the simple Dir.glob.
>
> But let me cite a more insidious problem in that code: the idea of
> pruning explicitly 'sent'. Even if you think that the only subdirectory
> present will be 'sent', what happens if 8 months from now another person
> (or yourself, after some more thousands lines of code!) needs to add
> another subdir with files matching that pattern (perhaps just to save a
> version of them)? will he remember the assumption done months before?
>
> Moreover, as Stefano pointed out at the beginning of the thread, this
> would cause subtle file corruptions, as the code is not set up to deal
> with the path correctly (a testing nightmare would follow, until
> somebody remembers the 'assumption'; some people have seen this hundreds
> of times..:-).
>
> So, the best is not to code 'assumptions' in the code; and of course to
> use the right tools for the problem at hand: in this case, Dir.glob and
> the beautiful grep
Good point! That's basically the same reason why it's better to
explicitely open ports in firewall configs than to close "dangerous" ports.
Kind regards
robert