Finding duplicate files

On 02.04.2008 23:30:27 by chrisv

Hello,

I have a very large directory structure which I need to copy to a
Windows server. Unfortunately there are several directories which have
multiple files which have the same name but different case which is
obviously not going to be tolerated by Windows. I need to make a list
of all of these files so that the user can determine whether the
duplicates need to be moved, renamed, or deleted. I've searched around
but I can't find a script that will do this and I'm not very good with
regular expressions or recursion =)

The directory structure in question resides on a Fedora Core 4 server.

Any help would be greatly appreciated.

Re: Finding duplicate files

On 03.04.2008 01:23:21 by Rikishi 42

On 2008-04-02, ChrisV wrote:

> I have a very large directory structure which I need to copy to a
> Windows server. Unfortunately there are several directories which have
> multiple files which have the same name but different case which is
> obviously not going to be tolerated by Windows. I need to make a list
> of all of these files so that the user can determine whether the
> duplicates need to be moved, renamed, or deleted. I've searched around
> but I can't find a script that will do this and I'm not very good with
> regular expressions or recursion =)
>
> The directory structure in question resides on a Fedora Core 4 server.
>
> Any help would be greatly appreciated.

I have a ready-made python script, called xdoubles. Feel free to pick it up
on http://www.rikishi42.net/SkunkWorks/Junk/.



If python is not available, you can still use the md5 checksum approach; how
about:

find /start_dir/ -type f -exec md5sum '{}' ';' > md5_list.txt

Just sort the list and run it through uniq to display only the doubles.
Since each line also contains the filename, restrict the comparison to the
32-character checksum field (the -w option is GNU uniq):

sort md5_list.txt | uniq -D -w32

... should do it.


PS: If your only problem is a difference in case, and you're sure the only
difference between the files is the case (not the content), then don't worry.
Just copy the whole bunch. As Windows doesn't really care about case, the
second file will just overwrite the first. Just enable overwrite, in the
copy.
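For the original problem (names that differ only in case), GNU sort and uniq
can list the clashes directly. A minimal sketch, assuming GNU coreutils
(sort -f folds case so clashing names become adjacent; uniq -i -D prints
every line that repeats under case-insensitive comparison); the demo tree
here is hypothetical, substitute your real start directory:

```shell
# Build a small demo tree standing in for the real directory structure.
tmp=$(mktemp -d)
touch "$tmp/Report.txt" "$tmp/report.TXT" "$tmp/unique.txt"

# sort -f folds case so case-clashing names end up adjacent;
# uniq -i -D prints all lines that repeat when case is ignored.
clashes=$(find "$tmp" -type f | sort -f | uniq -i -D)
printf '%s\n' "$clashes"

rm -rf "$tmp"
```

Run the find over the real tree instead and hand the resulting list to the
user, who can then decide what to move, rename, or delete.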


--
There is an art, it says, or rather, a knack to flying.
The knack lies in learning how to throw yourself at the ground and miss.
Douglas Adams

Re: Finding duplicate files

On 03.04.2008 03:45:06 by Janis Papanagnou

ChrisV wrote:
> Hello,
>
> I have a very large directory structure which I need to copy to a
> Windows server. Unfortunately there are several directories which have
> multiple files which have the same name but different case which is
> obviously not going to be tolerated by Windows. I need to make a list
> of all of these files so that the user can determine whether the
> duplicates need to be moved, renamed, or deleted. I've searched around
> but I can't find a script that will do this and I'm not very good with
> regular expressions or recursion =)
>
> The directory structure in question resides on a Fedora Core 4 server.
>
> Any help would be greatly appreciated.

I cannot tell about the find command you have on WinDOS, but in case
you've got Cygwin (or MKS) installed on your WinDOS box, and make sure
the respective find command is listed first in PATH (i.e. before the
WinDOS find), then you can just get the recursive directory listing
from any current working directory by

find . -type f | sort >all_win_files

(The sorting may also be a separate step performed on all_win_files
after transferring it to the Unix box.)

The same on the Unix box, under a different name

find . -type f | sort >all_unix_files

And finally compare with case ignored to see the differences

diff -i all_win_files all_unix_files >differences

You can also use comm instead of diff (see "man comm" for details)
and suppress the duplicate files by specifying options -1, -2, or -3.
My comm program doesn't seem to support case-insensitive comparison,
so it may be necessary to use a tr command to convert case

find . -type f | tr 'A-Z' 'a-z' | sort >file_list

Once you have the file listing you can use that information to build
a tar archive or zip file from the files to be copied.

In case you have a modern shell (one with process substitution, such as
bash or ksh93), here's what you can do...

On WinDOS:

find . -type f >winfiles # then transfer file winfiles to Unix

On Unix:

comm -3 <( find . -type f | tr 'A-Z' 'a-z' | sort ) \
        <( cat winfiles | tr 'A-Z' 'a-z' | sort ) |
xargs tar -cvf tarfile.tar # then transfer to WinDOS

Note: cat is generally to be avoided in that context, but for clarity I've
done it that way.
Note: comm emits the lowercased names, which may no longer match the
actual files on the case-sensitive Unix side; map them back to the
original spellings before handing them to tar.
Note: In case the file list is large you may want to use tar's append
function (option -A with GNU tar) instead of -c.

(All programs untested.)
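To sidestep the lowercasing problem just mentioned, here is a sketch of an
awk-based variant of the comparison that matches names case-insensitively
but prints the original spellings, so tar gets names that actually exist.
It assumes file names contain no newlines; the winfiles listing and the
demo tree here are hypothetical stand-ins for the transferred list and the
real directory:

```shell
# Demo tree standing in for the real directory; winfiles plays the role
# of the listing transferred from the Windows box.
tmp=$(mktemp -d)
touch "$tmp/Keep.txt" "$tmp/Shared.txt"
printf './shared.txt\n' > "$tmp/winfiles"

# awk reads winfiles first (NR==FNR), storing the lowercased names;
# it then prints every found file whose lowercased name is absent,
# keeping the original case intact.
missing=$(cd "$tmp" &&
  find . -type f ! -name winfiles |
  awk 'NR==FNR { w[tolower($0)]; next }
       !(tolower($0) in w)' winfiles -)
printf '%s\n' "$missing"

rm -rf "$tmp"
```

The surviving names keep their original case, so piping them on to
xargs tar -cvf tarfile.tar archives the right files.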

Janis

Re: Finding duplicate files

On 03.04.2008 20:17:28 by PK

ChrisV wrote:

> I have a very large directory structure which I need to copy to a
> Windows server. Unfortunately there are several directories which have
> multiple files which have the same name but different case which is
> obviously not going to be tolerated by Windows. I need to make a list
> of all of these files so that the user can determine whether the
> duplicates need to be moved, renamed, or deleted. I've searched around
> but I can't find a script that will do this and I'm not very good with
> regular expressions or recursion =)
>
> The directory structure in question resides on a Fedora Core 4 server.

If renaming the files beforehand is acceptable, you could scan your tree and
rename files that would clash on Windows with some significant and visually
outstanding suffix, so that users will immediately see where the problems
are. The following command produces a shell script which, when run on the
Linux server, renames the files as described above.

Here's an example; it assumes no file name contains a single quote, and it
checks only for duplicate files (not directories).

$ ls
AAA AAa Aaa CCC aAA aaA aaa bbb ccc ddd
$ find . -type f | awk -F '/' -v OFS="/" -v sq="'" '{
if (tolower($0) in a) {
o=$0;
$NF=$NF sprintf("-CHECK_THIS_ONE-%03d",++i[tolower($0)]);
print "mv "sq o sq" "sq $0 sq
} else {
a[tolower($0)]
}
}'
mv './AAa' './AAa-CHECK_THIS_ONE-001'
mv './Aaa' './Aaa-CHECK_THIS_ONE-002'
mv './aAA' './aAA-CHECK_THIS_ONE-003'
mv './aaA' './aaA-CHECK_THIS_ONE-004'
mv './aaa' './aaa-CHECK_THIS_ONE-005'
mv './ccc' './ccc-CHECK_THIS_ONE-001'

After you run the generated script, you can safely copy everything to
Windows, and instruct users to look for files with "CHECK_THIS_ONE" (or any
other string you choose, for that matter) in the name.

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.