Looking for a CRC algorithm
on 26.04.2006 03:09:30 by mulangi
Hello,
I am looking for a CRC or checksum algorithm with the property that a
small change in the content of the file results in a small change in
the checksum. Typically, most of the checksum computing algorithms such
as MD5, SHA, etc go to the other extreme i.e. a small change in the
file contents results in a large change in the checksum. Does anybody
know of any algorithm with the reverse property?
Thanks in advance.
Re: Looking for a CRC algorithm
on 26.04.2006 03:55:07 by Dik.Winter
In article <1146013770.154721.182210@i40g2000cwc.googlegroups.com> mulangi@gmail.com writes:
> I am looking for a CRC or checksum algorithm with the property that a
> small change in the content of the file results in a small change in
> the checksum. Typically, most of the checksum computing algorithms such
> as MD5, SHA, etc go to the other extreme i.e. a small change in the
> file contents results in a large change in the checksum. Does anybody
> know of any algorithm with the reverse property?
What is the motivation? The intent of such algorithms is to determine
that there has been an error in transmission. As it is much more likely
that there are small changes, it is reasonable to amplify them.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Re: Looking for a CRC algorithm
on 26.04.2006 04:07:07 by cr88192
wrote in message
news:1146013770.154721.182210@i40g2000cwc.googlegroups.com...
> Hello,
>
> I am looking for a CRC or checksum algorithm with the property that a
> small change in the content of the file results in a small change in
> the checksum. Typically, most of the checksum computing algorithms such
> as MD5, SHA, etc go to the other extreme i.e. a small change in the
> file contents results in a large change in the checksum. Does anybody
> know of any algorithm with the reverse property?
>
I am not an expert on this, but I think my answer is ok...
Usually the large difference is desired, since it makes it much more likely
that different files will have different checksums; put another way, a higher
probability of uniqueness tends to come with more apparent randomness in the
checksums...
One of the simpler algorithms that gives similar outputs for similar files is
just adding up all the byte values in the file (overflows are usually either
ignored or wrapped around into the low bits).
The next step up is keeping two accumulators: one adds up the byte values, and
the other adds up the intermediate values of the first accumulator.
Note that in the first case it is often possible to alter the data in such a
way that the checksum remains the same (changing data in one place, then
changing data somewhere else so that the results add up to the same value).
Likewise, some changes (e.g. completely reordering the data) will still
produce the same checksum.
The second style is somewhat more resistant to this, but also tends to vary
more between files...
or such...
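For concreteness, here is a rough C sketch of the two styles just described
(illustrative only; the function names are made up, and the two-accumulator
version is only loosely in the Fletcher/Adler family, without the usual
modulo step):

#include <stddef.h>
#include <stdint.h>

/* Style 1: plain sum of all byte values; overflow simply wraps in the
   32-bit accumulator. */
uint32_t sum_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Style 2: two accumulators; the second sums the running values of the
   first, which makes the result sensitive to byte order as well. */
uint32_t two_sum_checksum(const uint8_t *buf, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];   /* running sum of the bytes */
        b += a;        /* running sum of the intermediate sums */
    }
    return (b << 16) ^ a;   /* pack both accumulators into one value */
}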
> Thanks in advance.
>
Re: Looking for a CRC algorithm
on 26.04.2006 04:16:07 by Bit Twister
On 25 Apr 2006 18:09:30 -0700, mulangi@gmail.com wrote:
> Hello,
>
> I am looking for a CRC or checksum algorithm with the property that a
> small change in the content of the file results in a small change in
> the checksum. Typically, most of the checksum computing algorithms such
> as MD5, SHA, etc go to the other extreme i.e. a small change in the
> file contents results in a large change in the checksum.
That is a bit of a security feature. In the past, a simple checksum
could be spoofed by adding enough data to the end of the file to make the
checksum come back out the same as the original.
The cracker would add malware to the end of the file, re-run the checksum,
and compute what extra data to add to bring the checksum back to the
original value.
It can still be done with the newer checks, but the computational
power/time required is much steeper. :)
What does it matter? All you need to know is that the data is corrupt if
the MD5/SHA is incorrect.
Re: Looking for a CRC algorithm
on 26.04.2006 04:55:11 by DFS
wrote in message
news:1146013770.154721.182210@i40g2000cwc.googlegroups.com...
> Hello,
>
> I am looking for a CRC or checksum algorithm with the property that a
> small change in the content of the file results in a small change in
> the checksum. Typically, most of the checksum computing algorithms such
> as MD5, SHA, etc go to the other extreme i.e. a small change in the
> file contents results in a large change in the checksum. Does anybody
> know of any algorithm with the reverse property?
>
> Thanks in advance.
>
Easy. Each unique checksum represents thousands of good words; all you need
to know is the set of words that generate the same checksum. That is a
computer search. Easy.
Re: Looking for a CRC algorithm
on 26.04.2006 06:07:14 by mulangi
The purpose is to detect small changes in a document. Also the original
document will not be available when the information that the document
has changed is required. And I need to answer the question as to
whether the changes are small or large which precludes using a regular
hash or CRC.
Re: Looking for a CRC algorithm
on 26.04.2006 06:09:39 by mulangi
I didn't understand this. Can you elaborate please?
Re: Looking for a CRC algorithm
on 26.04.2006 06:22:07 by DFS
wrote in message
news:1146024434.497153.15510@e56g2000cwe.googlegroups.com...
> The purpose is to detect small changes in a document. Also the original
> document will not be available when the information that the document
> has changed is required. And I need to answer the question as to
> whether the changes are small or large which precludes using a regular
> hash or CRC.
>
interspersed crc
Re: Looking for a CRC algorithm
on 26.04.2006 06:28:14 by DFS
wrote in message
news:1146024579.013532.117100@v46g2000cwv.googlegroups.com...
>I didn't understand this. Can you elaborate please?
>
A 16-bit CRC has 2^16 distinct values.
Your word set may be 2,000,000 bits, which allows 2^2,000,000 distinct contents.
So there are far more words than CRCs, which means the same CRC is shared by
many different words.
In this case roughly 2^2,000,000 / 2^16 = 2^1,999,984 distinct contents will
generate the same CRC.
All you have to do is find another member of the set that generates the same
CRC.
Re: Looking for a CRC algorithm
on 26.04.2006 07:57:42 by Jyrki Lahtonen
mulangi@gmail.com wrote:
> Hello,
>
> I am looking for a CRC or checksum algorithm with the property that a
> small change in the content of the file results in a small change in
> the checksum. Typically, most of the checksum computing algorithms such
> as MD5, SHA, etc go to the other extreme i.e. a small change in the
> file contents results in a large change in the checksum. Does anybody
> know of any algorithm with the reverse property?
>
> Thanks in advance.
>
Others have explained why this is a desirable thing for a CRC.
Anyway, have you considered using an error/erasure-correcting code?
When you compute the "syndrome" (a bunch of checksums, really)
then you can compute how many symbols (e.g. characters in a body
of text) were corrupted. Counting the number of errors only works
up to a certain prescribed bound, though. Caution: this is totally
unsafe against malicious alteration of data. It simply protects
against accidentally corrupted bits/bytes.
Buzzwords: error-correcting code, FEC
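As a toy illustration of the syndrome idea (not from Jyrki's post), here is a
Hamming(7,4) example in C: the 3-bit syndrome is zero when the codeword is
intact and otherwise names the position of a single flipped bit. Real FEC
codes work on much larger blocks, but the principle is the same.

#include <stdint.h>
#include <stdio.h>

/* Encode 4 data bits into a 7-bit codeword; parity bits sit at (1-based)
   positions 1, 2 and 4, data bits at positions 3, 5, 6 and 7. */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d3 = (data >> 3) & 1, d5 = (data >> 2) & 1;
    uint8_t d6 = (data >> 1) & 1, d7 = data & 1;
    uint8_t p1 = d3 ^ d5 ^ d7;   /* checks positions 1,3,5,7 */
    uint8_t p2 = d3 ^ d6 ^ d7;   /* checks positions 2,3,6,7 */
    uint8_t p4 = d5 ^ d6 ^ d7;   /* checks positions 4,5,6,7 */
    return (uint8_t)((p1 << 6) | (p2 << 5) | (d3 << 4) |
                     (p4 << 3) | (d5 << 2) | (d6 << 1) | d7);
}

/* Syndrome of a received word: 0 means "looks intact", otherwise it is the
   (1-based) position of a single flipped bit. */
static uint8_t hamming74_syndrome(uint8_t r)
{
    uint8_t v[8];
    for (int pos = 1; pos <= 7; pos++)
        v[pos] = (r >> (7 - pos)) & 1;   /* bit at 1-based position pos */
    uint8_t s1 = v[1] ^ v[3] ^ v[5] ^ v[7];
    uint8_t s2 = v[2] ^ v[3] ^ v[6] ^ v[7];
    uint8_t s4 = v[4] ^ v[5] ^ v[6] ^ v[7];
    return (uint8_t)(s1 | (s2 << 1) | (s4 << 2));
}

int main(void)
{
    uint8_t cw = hamming74_encode(0xB);                 /* data bits 1011 */
    printf("intact:    syndrome %u\n", (unsigned)hamming74_syndrome(cw));
    printf("bit 5 hit: syndrome %u\n", (unsigned)hamming74_syndrome(cw ^ (1 << 2)));
    return 0;
}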
Cheers,
Jyrki
Re: Looking for a CRC algorithm
on 26.04.2006 11:21:00 by Jasen Betts
On 2006-04-26, mulangi@gmail.com wrote:
> Hello,
>
> I am looking for a CRC or checksum algorithm with the property that a
> small change in the content of the file results in a small change in
> the checksum. Typically, most of the checksum computing algorithms such
> as MD5, SHA, etc go to the other extreme i.e. a small change in the
> file contents results in a large change in the checksum. Does anybody
> know of any algorithm with the reverse property?
Count the number of 1 bits in the file.
Like all checksum algorithms it won't detect all possible changes, but it
should have the property you want.
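A minimal C sketch of this bit-counting checksum (a portable loop; many
compilers also offer a popcount builtin):

#include <stddef.h>
#include <stdint.h>

uint64_t count_one_bits(const uint8_t *buf, size_t len)
{
    uint64_t ones = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = buf[i];
        while (b) {
            b &= (uint8_t)(b - 1);   /* clear the lowest set bit */
            ones++;
        }
    }
    return ones;
}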
Bye.
Jasen
Re: Looking for a CRC algorithm
on 26.04.2006 16:05:38 by Mark Adler
mulangi@gmail.com wrote:
> I am looking for a CRC or checksum algorithm with the property that a
> small change in the content of the file results in a small change in
> the checksum.
Then use a checksum. That is, simply sum the bytes of the input file
into a large integer. If the integer is large enough to not wrap when
summing, then you will have an unambiguous measure of small changes.
If the sum can wrap, then you can have a very large change possibly
masquerading as a small one.
As an example, an 8-byte checksum would not wrap for files whose
lengths fit in 7 bytes, or 64 petabytes (mega-gigabytes). That should
work for a while.
Note that one byte changing by 50 will look just like 50 bytes changing
by one. But you'll still be able to tell the difference between
wholesale changes and small ones, including small insertions and
deletions in the stream.
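A rough C sketch of that wide-sum idea (my names, not Mark's code): the sum
cannot wrap for any file shorter than about 2^56 bytes, so the absolute
difference of two check values bounds the total movement of the byte values.

#include <stddef.h>
#include <stdint.h>

/* Sum of all byte values in a 64-bit accumulator. */
uint64_t wide_sum(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Crude "how much changed" estimate from the two check values alone. */
uint64_t change_estimate(uint64_t old_sum, uint64_t new_sum)
{
    return old_sum > new_sum ? old_sum - new_sum : new_sum - old_sum;
}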
mark
Re: Looking for a CRC algorithm
on 26.04.2006 19:00:01 by p.black
On Wednesday 26 April 2006 00:07, mulangi@gmail.com wrote:
> The purpose is to detect small changes in a document. Also the original
> document will not be available when the information that the document
> has changed is required. And I need to answer the question as to
> whether the changes are small or large ...
Find the difference between the original document and new document.
Use the size of the difference, that is, the number of bytes for which
they differ.
There are a number of algorithms to find a small set of changes
between documents (e.g., Unix "diff" program).
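For the simplest case, equal-length documents with in-place changes only,
counting the differing byte positions is a one-loop job (a sketch; real diff
algorithms also handle insertions and deletions):

#include <stddef.h>
#include <stdint.h>

size_t count_differing_bytes(const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t diff = 0;
    for (size_t i = 0; i < len; i++)
        if (a[i] != b[i])
            diff++;   /* one more position where the documents differ */
    return diff;
}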
-paul-
--
Paul E. Black (p.black@acm.org)
Re: Looking for a CRC algorithm
on 26.04.2006 20:24:31 by David R Tribble
mulangi wrote:
>> I am looking for a CRC or checksum algorithm with the property that a
>> small change in the content of the file results in a small change in
>> the checksum.
>
Mark Adler wrote:
> Then use a checksum. That is, simply sum the bytes of the input file
> into a large integer. If the integer is large enough to not wrap when
> summing, then you will have an unambiguous measure of small changes.
> If the sum can wrap, then you can have a very large change possibly
> masquerading as a small one.
My two cents' worth...
I have used a simple 32-bit hash, where each byte in the stream is
added to the checksum after rotating the previous hash value by n bits,
where n is fairly small (1 or 3 bits). (I believe this hashing scheme
originated at Bell Labs in the early days of C and Unix.)
Presumably using a wider word size (64 or 128 bits) would spread
any differences out across more bits, so that more similar texts
would have correspondingly more similar checksums.
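A small C sketch of that rotate-and-add hash, assuming a 32-bit word and a
rotation of n = 3 bits (the exact rotation count and width are free
parameters):

#include <stddef.h>
#include <stdint.h>

uint32_t rotate_add_hash(const uint8_t *buf, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h = (h << 3) | (h >> 29);   /* rotate previous value left by n = 3 */
        h += buf[i];                /* then add the next byte */
    }
    return h;
}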
But even this simple scheme might not give you what you want.
I assume you're trying to detect the case of two almost-alike texts
having a difference in checksums that is smaller than the checksum
difference between two radically different texts.
Re: Looking for a CRC algorithm
on 27.04.2006 01:30:43 by Dik.Winter
In article <1146024434.497153.15510@e56g2000cwe.googlegroups.com> mulangi@gmail.com writes:
> The purpose is to detect small changes in a document. Also the original
> document will not be available when the information that the document
> has changed is required. And I need to answer the question as to
> whether the changes are small or large which precludes using a regular
> hash or CRC.
I do not think this is really possible. With a standard method you will
detect a small number of changes, but a large number of changes may remain
undetected. You can not have short information that allows you to detect
all changes, whether they are small or large. Whatever the method, there
will be documents that are (possibly) completely unrelated that will
generate the same checksum, unless the checksum you generate is at least
as large as the document itself.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Re: Looking for a CRC algorithm
on 27.04.2006 01:36:36 by unruh
"Paul E. Black"
writes:
>On Wednesday 26 April 2006 00:07, mulangi@gmail.com wrote:
>> The purpose is to detect small changes in a document. Also the original
>> document will not be available when the information that the document
>> has changed is required. And I need to answer the question as to
>> whether the changes are small or large ...
Essentially impossible without having the two documents to compare.
You could do running hashes -- i.e. hash subsections of the document, to make
one long hash which contains the hashes of the subsections together with the
hash of the whole thing.
But that of course does not tell you whether the change is small. A single bit
change should completely alter the hash. I.e., a hash simply says "it
changed", not how it changed.
There are of course various error-correcting codes that could be used.
>Find the difference between the original document and new document.
>Use the size of the difference, that is, the number of bytes for which
>they differ.
>There are a number of algorithms to find a small set of changes
>between documents (e.g., Unix "diff" program).
>-paul-
>--
>Paul E. Black (p.black@acm.org)
Re: Looking for a CRC algorithm
on 27.04.2006 08:52:28 by Terje Mathisen
Dik T. Winter wrote:
> In article <1146024434.497153.15510@e56g2000cwe.googlegroups.com> mulangi@gmail.com writes:
> > The purpose is to detect small changes in a document. Also the original
> > document will not be available when the information that the document
> > has changed is required. And I need to answer the question as to
> > whether the changes are small or large which precludes using a regular
> > hash or CRC.
>
> I do not think this is really possible. With a standard method you will
> detect a small number of changes, but a large number of changes may remain
> undetected. You can not have short information that allows you to detect
> all changes, whether they are small or large. Whatever the method, there
> will be documents that are (possibly) completely unrelated that will
> generate the same checksum, unless the checksum you generate is at least
> as large as the document itself.
The best (only possible?) approach here would seem to be a combination
of a very strong hash, i.e. something like MD-5 or better (maybe AES-256
used in chained mode?), and a very naive short checksum, like a 32-bit
sum of all bytes (or 16-bit/32-bit words?) in the file.
The crypto hash can give you arbitrarily strong guarantees that the file
is indeed unchanged, while the checksum can give a sort of indication if
the change is "big" or "small".
Terje
--
-
"almost all programming can be viewed as an exercise in caching"
Re: Looking for a CRC algorithm
on 27.04.2006 16:56:51 by Ben Rudiak-Gould
Mark Adler wrote:
> Note that one byte changing by 50 will look just like 50 bytes changing
> by one. But you'll still be able to tell the difference between
> wholesale changes and small ones, including small insertions and
> deletions in the stream.
This seems rather unreliable to me. The sum of the bytes of a random file of
length N is very likely to be close to N*127.5, so the chance of a checksum
collision between random files of the same length is rather large. A quick
simulation suggests that there's about one chance in 400 that two sums of
100,000 random bytes will differ by no more than 100. If the bytes are
limited to the ASCII range (0-127) then the probability is about 1 in 200.
-- Ben
Re: Looking for a CRC algorithm
on 27.04.2006 17:39:22 by nmm1
In article ,
Terje Mathisen writes:
|>
|> The best (only possible?) approach here would seem to be a combination
|> of a very strong hash, i.e. something like MD-5 or better (maybe AES-256
|> used in chained mode?), and a very naive short checksum, like a 32-bit
|> sum of all bytes (or 16-bit/32-bit words?) in the file.
|>
|> The crypto hash can give you arbitrarily strong guarantees that the file
|> is indeed unchanged, while the checksum can give a sort of indication if
|> the change is "big" or "small".
Well, vaguely. That assumes that the definition of 'small' corresponds
in some loose sense to something that will lead to a small difference
in the sum. Some big changes (such as randomising the order of the
bytes in the file) will give a change of zero, but some 'small' changes
(such as replacing every LF by CR-LF) will make it very different.
Without a mathematically useful definition of 'small', I don't see that
this is worth attempting.
An interesting measure, that would apply to a fair number of definitions
of 'small', would be:
fred() {
x=`bzip2 < $1 | wc | awk '{print $3;}'`
y=`bzip2 < $2 | wc | awk '{print $3;}'`
z=`cat $1 $2 | bzip2 | wc | awk '{print $3;}'`
echo $x $y $z | awk '{printf "%.2f\n", $3/($1+$2);}'
}
Scale the result to taste.
Regards,
Nick Maclaren.
Re: Looking for a CRC algorithm
on 27.04.2006 17:51:46 by Matt Mahoney
Nick Maclaren wrote:
> An interesting measure, that would apply to a fair number of definitions
> of 'small', would be:
>
> fred() {
> x=`bzip2 < $1 | wc | awk '{print $3;}'`
> y=`bzip2 < $2 | wc | awk '{print $3;}'`
> z=`cat $1 $2 | bzip2 | wc | awk '{print $3;}'`
> echo $x $y $z | awk '{printf "%.2f\n", $3/($1+$2);}'
> }
a.k.a. Keogh's compression dissimilarity measure.
-- Matt Mahoney
Re: Looking for a CRC algorithm
on 27.04.2006 18:49:21 by nmm1
In article <1146153106.763576.7790@j33g2000cwa.googlegroups.com>,
"Matt Mahoney" writes:
|>
|> a.k.a. Keogh's compression dissimilarity measure.
Hmm. Considering its obviousness (if you know of Kolmogorov complexity),
and the fact that it is at least 30 years old, he has done well to get
it named after him.
Regards,
Nick Maclaren.
Re: Looking for a CRC algorithm
on 27.04.2006 20:57:40 by Bryan Olson
Terje Mathisen wrote:
> Dik T. Winter wrote:
>
>> In article <1146024434.497153.15510@e56g2000cwe.googlegroups.com>
>> mulangi@gmail.com writes:
>> > The purpose is to detect small changes in a document. Also the original
>> > document will not be available when the information that the document
>> > has changed is required. And I need to answer the question as to
>> > whether the changes are small or large which precludes using a regular
>> > hash or CRC.
>>
>> I do not think this is really possible. With a standard method you will
>> detect a small number of changes, but a large number of changes may remain
>> undetected. You can not have short information that allows you to detect
>> all changes, whether they are small or large. Whatever the method, there
>> will be documents that are (possibly) completely unrelated that will
>> generate the same checksum, unless the checksum you generate is at least
>> as large as the document itself.
>
>
> The best (only possible?) approach here would seem to be a combination
> of a very strong hash, i.e. something like MD-5 or better (maybe AES-256
> used in chained mode?), and a very naive short checksum, like a 32-bit
> sum of all bytes (or 16-bit/32-bit words?) in the file.
Read Jyrki Lahtonen's solution. The error-correction code (ECC)
methods used in practice will fix all errors up to a certain
number of bits, and with high probability detect random errors
beyond that limit. If we need more reliable detection, we can
add a hash or signature inside the ECC.
> The crypto hash can give you arbitrarily strong guarantees that the file
> is indeed unchanged, while the checksum can give a sort of indication if
> the change is "big" or "small".
No, the checksum doesn't do that; large errors can create small
differences. The ECC + hash method works: upon receipt, we try
to correct the errors and then see if the hash matches the
corrected message.
A couple notes: if you need to defend against deliberate modification,
use a "digital signature" or "message authentication code". Simply
sending a hash with the message doesn't work -- the attacker can
change both the message and the hash.
Order of operations is significant. Whatever of the following we do, we
should apply them in this order before sending:
compress, encrypt, sign|MAC|hash|CRC|checksum, ECC
Obviously, the receiver decodes in the reverse order.
--
--Bryan
Re: Looking for a CRC algorithm
on 28.04.2006 01:21:40 by Mark Adler
Ben Rudiak-Gould wrote:
> This seems rather unreliable to me. The sum of the bytes of a random file of
> length N is very likely to be close to N*127.5, so the chance of a checksum
> collision between random files of the same length is rather large.
Yes, I agree with your analysis. Fortunately, most files are not the
same length. In any case, the original query was not to find a way to
identify which files are close to each other among a large set by
looking at their check values, but rather, knowing the check value the
file used to have, to be able to answer whether it has changed a lot or
a little based on the check value for that file now. Your analysis of
the checksum does affect the case of no insertions or deletions, where
the alteration vector (the arithmetic difference between the files)
tends to be clustered around zero, even for large random changes.
A simple checksum is one way to try to get a handle on whether it's
changed a little or a lot. However, for several reasons, including the
change cancellations in a sum that you point out, as well as the
arithmetic value of a change having a wide range, it's not a very good
way. But it's a lot better than the usual signatures, which, by design,
will amplify even one small change in the file into a large change in
the signature.
So does someone know a better way? In case this problem doesn't have a
name, I will now attempt to name it for future reference: "The
approximately proportional signature problem". I.e. come up with a
signature so that by comparing two signatures of a changed file, you
can tell approximately how much of the file has changed. The problem
could be made more explicit by defining a measure of difference for
alterations, insertions, and deletions, and defining "approximately",
but it's probably best to leave it open to interpretation for now to
see what kind of answers pop up.
mark
Re: Looking for a CRC algorithm
on 28.04.2006 03:32:58 by Matt Mahoney
Nick Maclaren wrote:
> In article <1146153106.763576.7790@j33g2000cwa.googlegroups.com>,
> "Matt Mahoney" writes:
> |>
> |> a.k.a. Keogh's compression dissimilarity measure.
>
> Hmm. Considering its obviousness (if you know of Kolmogorov complexity),
> and the fact that it is at least 30 years old, he has done well to get
> it named after him.
>
>
> Regards,
> Nick Maclaren.
Yeah, he does cite earlier applications going back to the 1970s:
http://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf
I had the same idea when doing some postdoc research in time series
anomaly detection but then I found the above paper and discovered my
idea was not so new after all.
-- Matt Mahoney
Re: Looking for a CRC algorithm
on 28.04.2006 09:25:12 by nmm1
In article <1146187977.925141.285260@i40g2000cwc.googlegroups.com>,
"Matt Mahoney" writes:
|>
|> I had the same idea when doing some postdoc research in time series
|> anomaly detection but then I found the above paper and discovered my
|> idea was not so new after all.
Of the few thousand techniques I have invented, a short search found
that 90% were already known. Later (well-hidden) data showed that a
further 9% had been. I suspect that 0.9% of the remainder have been,
but I haven't found the references yet ....
As this is normal, it shows why software patents (as currently proposed)
are such a bad idea.
Regards,
Nick Maclaren.
Re: Looking for a CRC algorithm
on 28.04.2006 09:40:51 by glen herrmannsfeldt
Paul E. Black wrote:
(snip relating checksums, CRC's, hash algorithms and document differences.)
> Find the difference between the original document and new document.
> Use the size of the difference, that is, the number of bytes for which
> they differ.
> There are a number of algorithms to find a small set of changes
> between documents (e.g., Unix "diff" program).
The dynamic programming algorithms used by diff were originally
developed for biology, for comparing protein sequences.
-- glen
Re: Looking for a CRC algorithm
on 28.04.2006 09:50:53 by glen herrmannsfeldt
Dik T. Winter wrote:
(snip on detecting small or large changes in a document with a hash,
checksum, or CRC.)
> I do not think this is really possible. With a standard method you will
> detect a small number of changes, but a large number of changes may remain
> undetected. You can not have short information that allows you to detect
> all changes, whether they are small or large. Whatever the method, there
> will be documents that are (possibly) completely unrelated that will
> generate the same checksum, unless the checksum you generate is at least
> as large as the document itself.
That is true, but that doesn't mean you can't find one that has a high
probability of detecting small or large changes in a document.
My thought for a text document is to first apply a hash function
to each line. That will reduce each line to a constant small number
of bits, maybe 16 or 32. If the result is still too big, maybe some
function applied to those hash values.
How about this: first hash each line to a 16-bit value, such as with a
16-bit CRC function. Next, take a 65536-bit string and set the bit
corresponding to each hash value. Most small changes will only
change a small number of bits, and most large changes a large number
of bits, for some definition of small and large.
For larger files, it is likely that 16-bit hashes are
too small, but the idea can be extended. If 65536 bits is too
many, just add up the number of ones; again, it should not change
much for small changes, though then it is also more likely not to change
much for large changes.
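A rough C sketch of that bitmap idea (my code, not glen's; the 16-bit line
hash below is a crude stand-in for a real CRC-16): hash each line, set the
corresponding bit in a 65536-bit map, and compare two maps by counting the
bits in which they differ.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAP_BYTES (65536 / 8)

/* Crude 16-bit line hash (stand-in for a CRC-16). */
static uint16_t line_hash16(const char *line)
{
    uint32_t h = 0;
    for (; *line && *line != '\n'; line++)
        h = h * 31 + (uint8_t)*line;
    return (uint16_t)(h ^ (h >> 16));
}

/* Build the 65536-bit map for one file: one bit per distinct line hash. */
void file_bitmap(FILE *fp, uint8_t map[MAP_BYTES])
{
    char line[4096];
    memset(map, 0, MAP_BYTES);
    while (fgets(line, sizeof line, fp)) {
        uint16_t h = line_hash16(line);
        map[h / 8] |= (uint8_t)(1u << (h % 8));
    }
}

/* Number of bit positions in which the two maps differ. */
unsigned bitmap_distance(const uint8_t a[MAP_BYTES], const uint8_t b[MAP_BYTES])
{
    unsigned diff = 0;
    for (int i = 0; i < MAP_BYTES; i++) {
        uint8_t x = (uint8_t)(a[i] ^ b[i]);
        while (x) {
            x &= (uint8_t)(x - 1);   /* clear the lowest set bit */
            diff++;
        }
    }
    return diff;
}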
-- glen
Re: Looking for a CRC algorithm
on 28.04.2006 10:30:18 by Jasen Betts
On 2006-04-26, mulangi@gmail.com wrote:
> The purpose is to detect small changes in a document. Also the original
> document will not be available when the information that the document
> has changed is required. And I need to answer the question as to
> whether the changes are small or large which precludes using a regular
> hash or CRC.
That's impossible. Checksums (that are smaller than the file) can't even
tell you for certain that a file hasn't been altered.
Bye.
Jasen
Re: Looking for a CRC algorithm
on 29.04.2006 04:30:47 by Scott A Crosby
On 27 Apr 2006 16:21:40 -0700, madler@alumni.caltech.edu writes:
> So does someone know a better way? In case this problem doesn't have a
> name, I will now attempt to name it for future reference: "The
> approximately proportional signature problem". I.e. come up with a
> signature so that by comparing two signatures of a changed file, you
> can tell approximately how much of the file has changed. The problem
> could be made more explicit by defining a measure of difference for
> alterations, insertions, and deletions, and defining "approximately",
> but it's probably best to leave it open to interpretation for now to
> see what kind of answers pop up.
Variable length checksum or fixed length checksum?
If you can accept variable-length checksums, break the file into blocks
of, say, 1 KB, and CRC each of them into 32 bits. The two sequences of
CRCs can then be compared.
If you want a fixed-length checksum for files of length X, you could
take the CRCs and put them into a Bloom filter, then compare the
Bloom filters against each other.
Both of these techniques will handle replacements but won't handle
insertions/deletions. To handle insertions/deletions, use an idea from
the rsync algorithm, where the block boundaries are chosen based on
nearby context. For instance, choose a given location as a block
boundary when hash(32 preceding bytes) % 1024 == 0, which occurs every
1024 bytes on average. If the minimum block size is set at 32, this
will create an output sequence of CRCs where each localized change to
the file will provably change at most 2 CRCs.
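A sketch of that content-defined chunking in C (my illustration, not the
actual rsync code; the rolling hash and the per-chunk sum are simple
stand-ins for rsync's rolling checksum and a real CRC):

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

#define WIN  32     /* rolling-hash window, also the minimum block size */
#define MASK 1023   /* boundary when (hash % 1024) == 0 */

int main(void)
{
    const uint32_t B = 257;
    uint32_t bpow = 1;                /* B^WIN, for removing the oldest byte */
    for (int i = 0; i < WIN; i++)
        bpow *= B;

    uint8_t win[WIN] = {0};
    uint32_t hash = 0, chunk_sum = 0;  /* chunk_sum stands in for a CRC */
    size_t pos = 0, chunk_start = 0;
    int c;
    while ((c = getchar()) != EOF) {
        uint8_t old = win[pos % WIN];
        win[pos % WIN] = (uint8_t)c;
        hash = hash * B + (uint8_t)c - old * bpow;   /* roll the window hash */
        chunk_sum += (uint8_t)c;
        pos++;
        if (pos - chunk_start >= WIN && (hash & MASK) == 0) {
            printf("chunk %zu..%zu  check=%08x\n",
                   chunk_start, pos, (unsigned)chunk_sum);
            chunk_start = pos;          /* boundary chosen by content */
            chunk_sum = 0;
        }
    }
    if (pos > chunk_start)
        printf("chunk %zu..%zu  check=%08x\n",
               chunk_start, pos, (unsigned)chunk_sum);
    return 0;
}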
//
If you want a fixed, small hash value and you know that only a small
number of changes ($k$ bytes) will be made to the file, then you have
another choice, inspired by the 'sum of the bytes' idea above but greatly
generalized. Assume the output of the hash is 4*k+3 bytes.
Let $X$ be a sequence of bytes x_0 ... x_{n-1}. Let $f,g$ be functions
that have at most 8 bits set in their output. Then consider a hash
function with the construction:
hash1(X) = XOR_{i=0}^{n-1} f(i,x_i)
hash1 has the property that any j-byte modification to its input will
modify at most 8*j bits of the output. We could define f(i,x_i) as:
Compute b = MD5({0,i,x_i}) % (8*(4k+3)) and set the b'th bit of the output to 1
Compute b = MD5({1,i,x_i}) % (8*(4k+3)) and set the b'th bit of the output to 1
...
Compute b = MD5({7,i,x_i}) % (8*(4k+3)) and set the b'th bit of the output to 1
hash2(X) = XOR_{i=0}^{n-1} g(x_i)
hash2 has the property that any j-byte modification/insertion/deletion
will modify at most 8*j bits in the output. However, it is a lot less
robust against other changes, such as transpositions. An example of
such a g is: g(x_i) = f(-1,x_i)
By assuming the output size is 4*k+3 bytes, if we detect fewer than 8*k bit
changes in the output, we know with high probability that only k byte
changes were made. Otherwise, we don't know how many changes were made.
These two mechanisms could also be combined creating a system with
provable properties. Anyone know any related literature on the
subject?
Scott
Re: Looking for a CRC algorithm
on 29.04.2006 20:59:17 by Mark Adler
Scott A Crosby wrote:
> Variable length checksum or fixed length checksum?
Yes, that's an important question. It should be fixed length.
Variable length, i.e. proportional to the length of the file would be
too easy. The most efficient approach for variable length, I think,
would be a Reed-Solomon code on blocks, from which you could tell
exactly how many bytes were changed, up to a limit, or else that more
than that many were changed in the block. (It would even tell you where
the changed bytes were and how to change the file back to the original,
which hints that there should be something more efficient if you just
want to count the changes.) This approach would not do well with
insertions and deletions.
The harder problem is how to make a fixed-length check value that
somehow gives a proportional measure of the changes in the file by
comparing only the check values.
I like the idea of rsync-ish markers to help with insertion/deletion
detection.
mark
Re: Looking for a CRC algorithm
on 30.04.2006 01:58:51 by Toby
Jasen Betts wrote:
> On 2006-04-26, mulangi@gmail.com wrote:
> > Hello,
> >
> > I am looking for a CRC or checksum algorithm with the property that a
> > small change in the content of the file results in a small change in
> > the checksum. Typically, most of the checksum computing algorithms such
> > as MD5, SHA, etc go to the other extreme i.e. a small change in the
> > file contents results in a large change in the checksum. Does anybody
> > know of any algorithm with the reverse property?
>
> Count the number of 1 bits in the file.
>
> Like all checksum algorithms it won't detect all possible changes, but it
> should have the property you want.
Only superficially. Take text and image files for example - the number
of 1 bits will be roughly proportional to the length of the file. If
that length changes dramatically, the count will too. But you could
replace every byte of a text file with different text (it has
consistent statistical properties) and the count would not deviate far.
Ditto replacing one JPEG with another of similar byte count. Or any
compressed archive with a completely different compressed archive of
similar size. The contents are completely different but the count would
be very similar. In sum, when content is replaced with data of similar
statistical properties, the count correlates with the length of the file
rather than with the content.
>
> Bye.
> Jasen
Re: Looking for a CRC algorithm
on 01.05.2006 10:00:21 by gcomp.20.RasmusMoller
This may be too elaborate, but anyway, it is (long) fixed length and
does incorporate a "distance" measure:
Use a combination of
one normal CRC for the whole file, to detect whether it changed at all, and
256 counts N(c) (wrapped around) of the occurrences of each ASCII value c.
As a measure of distance, you view the counts as a vector, so the distance
would be
the square root of ( (delta(N(0)))^2 + ... + (delta(N(255)))^2 )
This relative distance measure works fine for insertions/deletions, but it
does not measure mere permutations (their distance will be zero); however,
mere permutations ARE caught by the CRC in the first place.
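A short C sketch of the histogram part of this scheme (my code, not
Rasmus's): build the 256 counts for each version and take the Euclidean
distance between the two count vectors.

#include <stddef.h>
#include <stdint.h>
#include <math.h>

/* Count occurrences of each byte value; these counts play the role of N(c). */
void byte_histogram(const uint8_t *buf, size_t len, uint32_t hist[256])
{
    for (int i = 0; i < 256; i++)
        hist[i] = 0;
    for (size_t i = 0; i < len; i++)
        hist[buf[i]]++;
}

/* Euclidean distance between two histograms:
   sqrt(sum over c of (delta N(c))^2). */
double histogram_distance(const uint32_t a[256], const uint32_t b[256])
{
    double sum = 0.0;
    for (int i = 0; i < 256; i++) {
        double d = (double)a[i] - (double)b[i];
        sum += d * d;
    }
    return sqrt(sum);
}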
Rasmus Møller
Re: Looking for a CRC algorithm
on 01.05.2006 19:49:53 by ArkGunSlinger
Are you writing an application for a specific platform, or does this
algorithm need to be portable?
Are you watching files on a file system or are you transmitting them?
Re: Looking for a CRC algorithm
on 01.05.2006 20:50:12 by Scott A Crosby
On 29 Apr 2006 11:59:17 -0700, madler@alumni.caltech.edu writes:
> Yes, that's an important question. It should be fixed length.
> Variable length, i.e. proportional to the length of the file would be
> too easy.
> The harder problem is how to make a fixed-length check value that
> somehow gives a proportional measure of the changes in the file by
> comparing only the check values.
I guess a property you need is:
The length of the code should be proportional to the number of changes
(or the number of localized changes) you want to detect.
This could mean either a fixed-length checksum that will count up to a
fixed number of localized changes for you, or a variable-length checksum
that will count, say, 1% of changes for you.
This seems intuitive, given information theory. A simple counting argument
shows that a 64-bit checksum must create huge equivalence sets for
files of 1024 bits: 'most' files with more than 64 bits of modification
would effectively be mapped to checksums that are meaningless as far
as counting differences is concerned.
Scott