Best way to compare two files in Perl

on 23.09.2010 21:29:26 by fzarabozo

Hello All,


I have thousands of files that I need to analyze with Perl and discard
any duplicates. I also need to implement a way to *not* save on disk
any file that a visitor uploads to the website in case it's a file we
already have on disk.

So, I need to compare files and have some kind of identifier in a
database that can help me quickly detect when a duplicate file is
received (comparing each upload in full against every file on the
server is not really an option, since it could take forever). I've
heard a little about CRC and checksums (how you can obtain a small
identifier/result that can be stored in the DB), but I'm not really
sure how to use them in Perl for file comparison, or whether that's
the best way to do this.

Someone told me that a CRC can sometimes make you believe a file is a
duplicate when it's not (it can give you the same result for two
different files), and I need to be 100% certain that a file is not a
duplicate of another already on the server.

Can you guys please give me some advice on how to do this and maybe point me
to the right modules?

Thanks a lot! :-)

Francisco


RE: Best way to compare two files in Perl

on 23.09.2010 21:41:04 by Jeff Saxton

This has nothing to do with Perl.

There is no practical way to "be 100% certain that a file is not a
duplicate of another already on the server".

Use a strong checksum like SHA-256 and you'll be fine.
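
A minimal sketch of that approach with the core Digest::SHA module (the
file name below is just a placeholder):

    use strict;
    use warnings;
    use Digest::SHA;

    # Hex-encoded SHA-256 of a file, read in binary mode ("b").
    sub sha256_of {
        my ($path) = @_;
        return Digest::SHA->new(256)->addfile($path, "b")->hexdigest;
    }

    print sha256_of("upload.tmp"), "\n";

Store the digest in your database, index it, and look it up on each
upload instead of comparing file contents.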


RE: Best way to compare two files in Perl

on 23.09.2010 21:45:03 by Ken Cornetet

Your requirements are impossible to fulfill.

Think about this for a minute. There is an infinite number of possible input files, but only a finite number of digests or checksums of any given fixed length, so some distinct files must inevitably share a digest. Hence, there is no way to make this work with digests alone.

That said, in practical terms, if you store the length of each existing file, its MD5 digest, and its SHA-1 digest, you can be pretty sure you'll never reject a non-duplicate file. Pretty sure, but not 100% positive.
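
A sketch of that composite key using the core Digest::MD5 and
Digest::SHA modules (the key format is only an assumption, and the
database lookup is left out):

    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    # Build a "LENGTH:MD5:SHA1" lookup key for one file.
    sub file_key {
        my ($path) = @_;
        open my $fh, '<:raw', $path or die "Can't open $path: $!";
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        seek $fh, 0, 0 or die "Can't rewind $path: $!";
        my $sha1 = Digest::SHA->new(1)->addfile($fh)->hexdigest;
        close $fh;
        return join ':', -s $path, $md5, $sha1;
    }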



RE: Best way to compare two files in Perl

on 23.09.2010 21:46:28 by eroode

> I have thousands of files that I need to analyze with Perl and discard
> any duplicates. I also need to implement a way to *not* save on disk
> any file that a visitor uploads to the website in case it's a file we
> already have on disk.
>
> So, I need to compare files and have some kind of identifier in a
> database that can help me quickly detect when a duplicate file is
> received [...]

See the Digest module (http://search.cpan.org/perldoc?Digest).
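
For example, a sketch using the generic Digest front end (it assumes a
backend such as Digest::SHA is installed for the algorithm name, and
the file name is a placeholder):

    use strict;
    use warnings;
    use Digest;

    my $ctx = Digest->new("SHA-256");   # loads the backend module
    open my $fh, '<:raw', "some.dat" or die "Can't open some.dat: $!";
    $ctx->addfile($fh);
    print $ctx->hexdigest, "\n";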


> Someone told me that a CRC can sometimes make you believe a file is a
> duplicate when it's not (it can give you the same result for two
> different files), and I need to be 100% certain that a file is not a
> duplicate of another already on the server.

It's true; but with a 256-bit digest (like SHA-256), the chance that
any given pair of different files shares a digest is about 1 in 2^256.
It's probably a chance you can live with.

-- Eric



RE: Best way to compare two files in Perl

on 23.09.2010 21:55:34 by marms

I want to second this recommendation. I wrote a script that recursively descends and writes out the MD5, SHA1, file length, and file path. Using those first three parameters *in combination* is darn close to 100% for determining file uniqueness. I have never come across two files that differ but still have the same

$MD5 . $SHA1 . $LENGTH

(had to throw in some Perl :-)
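
A sketch of that kind of scan with the core File::Find, Digest::MD5,
and Digest::SHA modules (the root directory is a placeholder):

    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;
    use Digest::SHA;

    # Walk the tree and print "MD5 SHA1 LENGTH PATH" per plain file.
    find(sub {
        return unless -f $_;
        my $path = $File::Find::name;
        open my $fh, '<:raw', $_
            or do { warn "Can't open $path: $!"; return };
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        seek $fh, 0, 0
            or do { warn "Can't rewind $path: $!"; return };
        my $sha1 = Digest::SHA->new(1)->addfile($fh)->hexdigest;
        print join(' ', $md5, $sha1, -s $fh, $path), "\n";
        close $fh;
    }, '/some/root');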

--
Mike Arms



RE: Best way to compare two files in Perl

on 23.09.2010 22:16:45 by Jan Dubois

On Thu, 23 Sep 2010, Arms, Mike wrote:
>
> I want to second this recommendation. I wrote a script that
> recursively descends and writes out the MD5, SHA1, file length, and
> file path. Using those first three parameters *in combination* is darn
> close to 100% for determining file uniqueness. I have never come
> across two files that differ but still have the same
>
> $MD5 . $SHA1 . $LENGTH
>
> (had to throw in some Perl :-)

I do wonder why you needed to combine all three. An MD5 collision by
itself is extremely unlikely unless someone has intentionally
constructed a file with the same MD5 as another one (that is a known
MD5 vulnerability; you should switch to one of the SHA algorithms if
you have to worry about it).

But for random files it would be highly unlikely; statistically it would
take you on average 100 years to find a collision if you checked several
billion files per second continuously.

Concatenating multiple digests will just make your database searches
slower because the index fields are longer, without providing you much
actual benefit.

So I would be really surprised if you had two different files with the
same MD5 on your disk. If you did, how many files did you have in total?

Cheers,
-Jan


RE: Best way to compare two files in Perl

on 23.09.2010 22:26:00 by marms

I do this in order to have pretty high confidence of catching even a deliberate attempt to "pass" one file off as another (the purposeful use of a digest vulnerability). The odds of being able to exploit both MD5 and one of the SHA digest algorithms at the same time AND keep the same file length are vanishingly small.

--
Mike Arms




Re: Best way to compare two files in Perl

on 24.09.2010 00:33:14 by jwkenne

Once you have used a digest, /if/ you have an apparent duplicate (or more than one), just do a direct check to see whether they're truly equal. Then you can be absolutely certain.
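
In Perl that final check is a single call to the core File::Compare
module (file names below are placeholders):

    use strict;
    use warnings;
    use File::Compare;

    # compare() returns 0 when the contents are identical,
    # 1 when they differ, and -1 on error.
    my $rc = compare('upload.tmp', 'existing.dat');
    if    ($rc == 0) { print "true duplicate\n" }
    elsif ($rc == 1) { print "same digest, different contents\n" }
    else             { die "comparison failed: $!" }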

--
John W Kennedy
"Information is light. Information, in itself, about anything, is light."
-- Tom Stoppard. "Night and Day"




Re: Best way to compare two files in Perl

on 24.09.2010 02:36:30 by Bill Luebkert

On 9/23/2010 3:33 PM, John W Kennedy wrote:
> Once you have used a digest, /if/ you have an apparent duplicate (or more than one), just do a direct check to see whether they're truly equal. Then you can be absolutely certain.

I agree - if the file isn't excessively large, appears to be the same
length as an existing file, and matches whatever digest/cksum checks
you make, and such matches aren't frequent, you can always do an
actual file comparison as a last resort to prove duplication or not
and be 100% sure.

The number of times you get through all the checks and actually have
to compare the files should be minuscule.

Re: Best way to compare two files in Perl

on 25.09.2010 08:38:04 by Jenda Krynicky

Because it's all backwards!
Why is that?
Because it's hard to read.
Why?
Please do not top post!


From: Francisco Zarabozo
> I have thousands of files that I need to analyze with Perl and discard
> any duplicates. I also need to implement a way to *not* save on disk
> any file that a visitor uploads to the website in case it's a file we
> already have on disk.
>
> So, I need to compare files and have some kind of identifier in a
> database that can help me quickly detect when a duplicate file is
> received (comparing each upload in full against every file on the
> server is not really an option, since it could take forever). I've
> heard a little about CRC and checksums (how you can obtain a small
> identifier/result that can be stored in the DB), but I'm not really
> sure how to use them in Perl for file comparison, or whether that's
> the best way to do this.
>
> Someone told me that a CRC can sometimes make you believe a file is a
> duplicate when it's not (it can give you the same result for two
> different files), and I need to be 100% certain that a file is not a
> duplicate of another already on the server.

From: Ken Cornetet
> Your requirements are impossible to fulfill.
>
> Think about this for a minute. There is an infinite number of possible
> input files, but only a finite number of digests or checksums of any
> given fixed length, so some distinct files must inevitably share a
> digest. Hence, there is no way to make this work with digests alone.

But of course his requirements can be fulfilled. Think about this for
a minute! He's got the CRCs (or MD5/SHA/... hashes) of the old files.
He computes the CRC/hash of the new file. From time to time he gets a
positive match. All he has to do at that moment is compare the new
file with a single old one. I would not call that a huge deal.
Comparing with all of them would be too expensive; comparing with one
is not. OK, in the extremely unlikely case that he gets a positive
match on files whose contents are not equal, he'll then have to
compare with two files instead of one. Still no huge deal.
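
A sketch of that whole flow, with an in-memory hash standing in for
the database (the table layout, paths, and SHA-256 choice are all
assumptions):

    use strict;
    use warnings;
    use Digest::SHA;
    use File::Compare;

    my %paths_for;   # digest => [paths]; stand-in for the real DB

    # Return the path of an existing identical file, or undef after
    # recording the new file as unique.
    sub find_duplicate {
        my ($new) = @_;
        my $digest = Digest::SHA->new(256)->addfile($new, 'b')->hexdigest;
        for my $old (@{ $paths_for{$digest} || [] }) {
            # Digest matched: the rare, cheap byte-for-byte check.
            return $old if compare($new, $old) == 0;
        }
        push @{ $paths_for{$digest} }, $new;
        return undef;
    }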

Jenda
===== Jenda@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery

_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs