Finding duplicate files
on 02.04.2008 23:30:27 by chrisv
Hello,
I have a very large directory structure which I need to copy to a
Windows server. Unfortunately there are several directories which have
multiple files which have the same name but different case which is
obviously not going to be tolerated by Windows. I need to make a list
of all of these files so that the user can determine whether the
duplicates need to be moved, renamed, or deleted. I've searched around
but I can't find a script that will do this and I'm not very good with
regular expressions or recursion =)
The directory structure in question resides on a Fedora Core 4 server.
Any help would be greatly appreciated.
Re: Finding duplicate files
on 03.04.2008 01:23:21 by Rikishi 42
On 2008-04-02, ChrisV wrote:
> I have a very large directory structure which I need to copy to a
> Windows server. Unfortunately there are several directories which have
> multiple files which have the same name but different case which is
> obviously not going to be tolerated by Windows. I need to make a list
> of all of these files so that the user can determine whether the
> duplicates need to be moved, renamed, or deleted. I've searched around
> but I can't find a script that will do this and I'm not very good with
> regular expressions or recursion =)
>
> The directory structure in question resides on a Fedora Core 4 server.
>
> Any help would be greatly appreciated.
I have a ready-made python script, called xdoubles. Feel free to pick it up
on http://www.rikishi42.net/SkunkWorks/Junk/.
If python is not available, you can still use the md5 key approach; how
about:
find /start_dir/ -type f -exec md5sum '{}' ';' > md5_list.txt
Just sort the list, then run it through uniq, comparing only the first 32
characters (the hash). Whole-line comparison would never match, because the
file name after the hash differs on each line:
sort md5_list.txt | uniq -w32 -D
.... should do it. (-w is a GNU extension.)
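A small self-contained demo of that md5 step, using invented files under
/tmp (GNU md5sum and uniq -w are assumed):

```shell
# Create three demo files; two have identical content.
mkdir -p /tmp/md5demo
echo same  > /tmp/md5demo/one
echo same  > /tmp/md5demo/two
echo other > /tmp/md5demo/three

find /tmp/md5demo -type f -exec md5sum '{}' ';' > /tmp/md5_list.txt

# Compare only the 32-character hash, not the file name that follows it;
# -D prints every member of each duplicate group.
sort /tmp/md5_list.txt | uniq -w32 -D
```

Only the lines for the two identical files should be printed.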
PS: If your only problem is a difference in case, and you're sure the only
difference between the files is the case (not the content), then don't worry.
Just copy the whole bunch. As Windows doesn't really care about case, the
second file will just overwrite the first. Just enable overwriting in the
copy.
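If you want to list the clashing names before letting the copy overwrite
anything, here is a sketch (paths invented for the demo): sort -f groups
case-insensitive duplicates on adjacent lines, and awk prints each group.

```shell
# Demo files: two names that differ only in case, plus one unique name.
mkdir -p /tmp/casedemo
touch /tmp/casedemo/Data.txt /tmp/casedemo/data.TXT /tmp/casedemo/unique

# sort -f folds case, so clashing names end up next to each other;
# awk prints every line whose lowercased form matches its neighbour.
find /tmp/casedemo -type f | sort -f | awk '
  tolower($0) == prev { if (!shown) print saved; print; shown = 1 }
  tolower($0) != prev { shown = 0 }
  { prev = tolower($0); saved = $0 }'
```

Only Data.txt and data.TXT should be printed; unique is not.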
--
There is an art, it says, or rather, a knack to flying.
The knack lies in learning how to throw yourself at the ground and miss.
Douglas Adams
Re: Finding duplicate files
on 03.04.2008 03:45:06 by Janis Papanagnou
ChrisV wrote:
> Hello,
>
> I have a very large directory structure which I need to copy to a
> Windows server. Unfortunately there are several directories which have
> multiple files which have the same name but different case which is
> obviously not going to be tolerated by Windows. I need to make a list
> of all of these files so that the user can determine whether the
> duplicates need to be moved, renamed, or deleted. I've searched around
> but I can't find a script that will do this and I'm not very good with
> regular expressions or recursion =)
>
> The directory structure in question resides on a Fedora Core 4 server.
>
> Any help would be greatly appreciated.
I cannot tell about the find command you have on WinDOS, but in case
you have Cygwin (or MKS) installed on your WinDOS box and make sure
the respective find command is listed first in PATH (i.e. before the
WinDOS find), then you can get the recursive directory listing from
any current working directory with
find . -type f | sort >all_win_files
(The sorting may also be a separate step performed on all_win_files
after transferring it to the Unix box.)
The same on the Unix box, but written to a differently named file
find . -type f | sort >all_unix_files
And finally compare with case ignored to see the differences
diff -i all_unix_files all_win_files >differences
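A self-contained demo of that case-insensitive diff step, with invented
list contents (GNU diff's -i is assumed):

```shell
# Two sorted file lists: one entry exists only on the Unix side.
printf '%s\n' ./extra ./report.TXT > /tmp/list_unix
printf '%s\n' ./REPORT.txt         > /tmp/list_win

# With -i, ./report.TXT and ./REPORT.txt compare equal, so only
# ./extra is reported as present on one side.
diff -i /tmp/list_unix /tmp/list_win
```

diff exits non-zero when it finds differences, so guard it with || true
inside scripts that run under set -e.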
You can also use comm instead of diff (see "man comm" for details)
and suppress the duplicate files by specifying options -1, -2, or -3.
My comm program doesn't seem to support case-insensitive comparison,
so it may be necessary to use a tr command to convert case
find . -type f | tr 'A-Z' 'a-z' | sort >file_list
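The comm-based comparison of two lowercased, sorted lists could look like
this sketch (list contents are invented for the demo):

```shell
# Lowercase and sort both lists; comm needs sorted input.
printf '%s\n' ./Makefile ./notes ./ONLY_UNIX |
    tr '[:upper:]' '[:lower:]' | sort > /tmp/unix_low
printf '%s\n' ./makefile ./NOTES |
    tr '[:upper:]' '[:lower:]' | sort > /tmp/win_low

# -3 suppresses lines common to both files, leaving only the
# entries unique to one side.
comm -3 /tmp/unix_low /tmp/win_low
```

Only ./only_unix should remain; note that lines unique to the second file
would be printed with a leading tab.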
Once you have the file listing you can use that information to build
a tar archive or zip file from the files to be copied.
In case you have a modern shell that's what you can do...
On WinDOS:
find . -type f >winfiles # then transfer file winfiles to Unix
On Unix:
comm -3 <( find . -type f | tr 'A-Z' 'a-z' | sort ) \
<( cat winfiles | tr 'A-Z' 'a-z' | sort ) |
xargs tar -cvf tarfile.tar # then transfer to WinDOS
Note: cat is generally to be avoided in that context, but I've done it
that way for clarity.
Note: In case the file list is large, xargs may run tar more than once;
use tar's append function (option -r with GNU tar) instead of -c, so that
later invocations don't overwrite the archive.
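A small demo of the xargs/tar step on its own, with invented file names:

```shell
# Demo tree: two files we want archived, one we don't.
mkdir -p /tmp/tardemo
cd /tmp/tardemo
touch keep1 keep2 skip
printf '%s\n' keep1 keep2 > /tmp/filelist

# xargs turns the list into arguments for tar, so only the listed
# files end up in the archive.
xargs tar -cvf /tmp/picked.tar < /tmp/filelist
tar -tf /tmp/picked.tar
```

The final tar -tf listing should show keep1 and keep2 but not skip.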
(All programs untested.)
Janis
Re: Finding duplicate files
on 03.04.2008 20:17:28 by PK
ChrisV wrote:
> I have a very large directory structure which I need to copy to a
> Windows server. Unfortunately there are several directories which have
> multiple files which have the same name but different case which is
> obviously not going to be tolerated by Windows. I need to make a list
> of all of these files so that the user can determine whether the
> duplicates need to be moved, renamed, or deleted. I've searched around
> but I can't find a script that will do this and I'm not very good with
> regular expressions or recursion =)
>
> The directory structure in question resides on a Fedora Core 4 server.
If renaming the files beforehand is acceptable, you could scan your tree and
rename files that would clash on Windows with a distinctive, visually
conspicuous suffix, so that users will immediately see where the problems
are. The following script produces a shell script which, when run on the
Linux server, renames the files as described above.
Here's an example; it assumes no file name contains a single quote ('), and
it checks only for duplicate files (not directories).
$ ls
AAA AAa Aaa CCC aAA aaA aaa bbb ccc ddd
$ find . -type f | awk -F '/' -v OFS="/" -v sq="'" '{
    if (tolower($0) in a) {
      o=$0;
      $NF=$NF sprintf("-CHECK_THIS_ONE-%03d",++i[tolower($0)]);
      print "mv "sq o sq" "sq $0 sq
    } else {
      a[tolower($0)]
    }
  }'
mv './AAa' './AAa-CHECK_THIS_ONE-001'
mv './Aaa' './Aaa-CHECK_THIS_ONE-002'
mv './aAA' './aAA-CHECK_THIS_ONE-003'
mv './aaA' './aaA-CHECK_THIS_ONE-004'
mv './aaa' './aaa-CHECK_THIS_ONE-005'
mv './ccc' './ccc-CHECK_THIS_ONE-001'
After you run the generated script, you can safely copy everything to
Windows, and instruct users to look for files with "CHECK_THIS_ONE" (or any
other string you choose, for that matter) in the name.
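An end-to-end run of this approach might look like the following sketch
(directory and file names are invented; which of the clashing spellings
gets renamed depends on the order find emits them):

```shell
# Demo tree: two names that clash case-insensitively, plus one that doesn't.
mkdir -p /tmp/renamedemo
touch /tmp/renamedemo/Readme /tmp/renamedemo/README /tmp/renamedemo/notes

# Generate the rename script: the first spelling seen is kept, every
# later clash gets a numbered -CHECK_THIS_ONE- suffix.
find /tmp/renamedemo -type f | awk -F '/' -v OFS='/' -v sq="'" '{
  low = tolower($0)
  if (low in a) {
    o = $0
    $NF = $NF sprintf("-CHECK_THIS_ONE-%03d", ++i[low])
    print "mv " sq o sq " " sq $0 sq
  } else a[low]   # referencing the element records the first spelling
}' > /tmp/rename.sh

sh /tmp/rename.sh    # apply the renames
ls /tmp/renamedemo   # one of Readme/README now carries the suffix
```

After this, exactly one of the two clashing files is renamed and notes is
left alone, so the tree copies to Windows without a collision.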
--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.