Chuck Remes
2/24/2006 12:36:00 AM
I'm very new to Ruby (as in, just started yesterday). As a learning
exercise I decided to write a short program that would traverse a
directory tree and take note of all duplicate files in that tree.
I'm using a hash of arrays to track all references. The key is the
filename, and I push each path, wrapped in its own array, onto the
array stored under that key. If I run
across a filename for which a key already exists in the hash, I do a
deeper equality check to see if they are really the same file or if
they are different. The "deeper check" is comparing file sizes.
If they are the same, I push this new path into my array of arrays so
I can check against it if I find yet another file with that name. If
they're different, I push this new path onto a SECOND hash of arrays.
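The bookkeeping described above can be sketched roughly as follows. This is only an illustration under assumptions, not the program from this post: the helper name same_name_and_size and the demo file names are hypothetical, and it treats "same basename and same size" as the duplicate test, which is how the code further down behaves.

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: maps each basename to the paths that share both
# that name and a file size with at least one other file under root.
def same_name_and_size(root)
  # First hash of arrays: basename => [[path, size], ...]
  by_name = Hash.new { |hash, key| hash[key] = [] }
  Find.find(root) do |path|
    next if File.directory?(path)
    by_name[File.basename(path)] << [path, File.size(path)]
  end

  # Second hash of arrays: basename => [path, ...] for same-size groups
  dupes = Hash.new { |hash, key| hash[key] = [] }
  by_name.each do |name, entries|
    entries.group_by { |_, size| size }.each_value do |group|
      dupes[name].concat(group.map(&:first)) if group.size > 1
    end
  end
  dupes
end

# Small demo in a throwaway directory (file names are made up).
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "a"))
  FileUtils.mkdir_p(File.join(root, "b"))
  File.write(File.join(root, "a", "notes.txt"), "hello")
  File.write(File.join(root, "b", "notes.txt"), "world") # same name, same size
  File.write(File.join(root, "b", "other.txt"), "x")     # unique name
  p same_name_and_size(root)
end
```

Comparing sizes alone can report files that merely happen to be the same length; a content checksum would be a stricter test, but that goes beyond what is described here.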
This is where I have trouble. As soon as I push any value into this
second hash, it takes on the identity of the first hash. I don't
understand why because I am not doing any explicit operation to make
hash1 = hash2. Maybe it's a side effect of some other operation.
Anyway, enough talk... the code is below.
I appreciate any and all insight.
cr
--- code here ---
#!/usr/bin/env ruby
require 'find'
h = Hash.new { |h,k| h[k] = [] }
duplicates = Hash.new { |h,k| h[k] = [] }
working_path = ARGV[0] || ENV["PWD"]
Find.find(working_path) do |path|
  # if it's a dir, skip to the next path
  if File.directory?(path)
    next
  end
  file = File.basename(path)
  # if this key doesn't exist in the hash, add it
  if h.has_key?(file) == false
    h[file].push([path])
  else # key already exists in hash
    # add file size to hash unless it was already grabbed
    h[file].push([path])
    h[file].each do |subarray|
      subarray[1] = File.size(subarray[0]) unless subarray[1]
    end
    # now compare the current file's size to the prior check
    h[file].each do |subarray|
      puts "subarray[0] = #{subarray[0]} and path = #{path}"
      if subarray[0].eql?(path) == false && subarray[1] == File.size(path)
        # add to dupe hash
        puts "DUPLICATE DUPLICATE DUPLICATE DUPLICATE"
        puts "DUP BEFORE h.id = #{h.object_id} and duplicates.id = #{duplicates.object_id}"
        # at this point "h" and "duplicates" refer to the same object!
        duplicates[file].push([path])
        puts "DUP AFTER h.id = #{h.object_id} and duplicates.id = #{duplicates.object_id}"
      end
    end
  end
end
puts "\n\nThe duplicates are..."
duplicates.each do |key, value|
  puts "key = #{key}"
  value.each do |a|
    print "#{a[0]} #{a[1]} "
  end
  print "\n"
end
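
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: in Ruby 1.8 (the series current when this was posted), a block parameter that shares its name with an existing local variable assigns to that outer variable instead of creating a new one. Both default blocks above take |h,k|, so the first time the default block of "duplicates" fires, the outer "h" would be rebound to the "duplicates" object, which matches the object_id output described. From Ruby 1.9 onward block parameters are local to the block, so the effect would not appear there. A minimal sketch that avoids the collision on any version, assuming nothing beyond the two hashes above (the parameter names "hash" and "key" and the sample paths are arbitrary):

```ruby
# Block parameters are named so they cannot collide with the outer
# locals "h" and "duplicates".
h          = Hash.new { |hash, key| hash[key] = [] }
duplicates = Hash.new { |hash, key| hash[key] = [] }

h["a.txt"].push(["/tmp/x/a.txt"])
# This lookup triggers the default block of "duplicates".
duplicates["a.txt"].push(["/tmp/y/a.txt"])

# The two variables should still refer to different objects.
puts "same object? #{h.equal?(duplicates)}"
puts "h          = #{h.inspect}"
puts "duplicates = #{duplicates.inspect}"
```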