Redundant Attachment Detection

on 22.02.2006 14:46:36 by iAgent

Hi,

I have an idea, and I don't know whether it is already implemented in any
mail server/client technology.


Here is the idea:


Quite often the same attachment is forwarded and/or mass-mailed to
millions of users all over the world. If, say, a 1 MB attachment gets
chain-forwarded to even 50,000 users, it wastes around 50 GB of space.


What if the mail technology were such that whenever the same attachment
is forwarded within the same mail domain (Gmail, Yahoo, MSN, etc.), it
is not given separate space but is stored as a link to the location where
that attachment already lives? The file would not be deleted until the
number of owners of that file drops to zero.


In this way, a lot of storage space can be saved.
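
To make the idea concrete, here is a minimal sketch of such a store. It is only an illustration under my own assumptions: the class name AttachmentStore, its methods, and the choice of SHA-256 as the key are all made up for this example, and everything is kept in memory rather than on disk.

```python
import hashlib


class AttachmentStore:
    """Toy in-memory store: one copy per distinct attachment plus an owner count.

    Hypothetical illustration only; not taken from any real mail server.
    """

    def __init__(self):
        self._blobs = {}   # digest -> attachment bytes
        self._owners = {}  # digest -> number of mailboxes that link to it

    def add(self, data: bytes) -> str:
        """Register one more owner and return the link (the digest)."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = data
            self._owners[digest] = 0
        self._owners[digest] += 1
        return digest

    def release(self, digest: str) -> None:
        """Drop one owner; delete the data once nobody owns it any more."""
        self._owners[digest] -= 1
        if self._owners[digest] == 0:
            del self._owners[digest]
            del self._blobs[digest]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = AttachmentStore()
attachment = b"x" * 1024  # stand-in for a forwarded file
links = [store.add(attachment) for _ in range(1000)]

naive_bytes = len(attachment) * len(links)  # one copy per recipient
shared_bytes = store.stored_bytes()         # a single shared copy

# The data survives as long as at least one owner is left.
for link in links[:-1]:
    store.release(link)
bytes_with_one_owner = store.stored_bytes()
store.release(links[-1])
bytes_with_no_owner = store.stored_bytes()
```

With 1000 recipients of the same 1024-byte file, the naive approach stores 1000 copies while the sketch stores one, and that copy disappears only when the last owner releases it.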


Now there are 2-3 points to be considered about the practicality of the
above idea.


1. How can we detect duplicate attachments?


Catch them young!!
When a user forwards an email, keep track of whether the attachment
file was forwarded unchanged. The client will have to keep track of
this.


Use some checksum or other hashing mechanism.
In theory, these are not 100% reliable, since two different files can
share the same hash.
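
One way around the reliability worry, sketched below, is to treat the hash only as a fast filter and confirm a match by comparing the actual bytes. The function name is_duplicate and the choice of SHA-256 are my own assumptions for illustration.

```python
import hashlib


def is_duplicate(candidate: bytes, stored: bytes) -> bool:
    """Return True only if the two attachments are byte-for-byte identical."""
    # Cheap checks first: different lengths or digests mean different files.
    if len(candidate) != len(stored):
        return False
    if hashlib.sha256(candidate).digest() != hashlib.sha256(stored).digest():
        return False
    # Equal digests do not strictly prove equality, so compare the bytes too.
    return candidate == stored
```

In a real store the digest of each stored file would be computed once and kept in an index, so the full comparison is needed only for the rare candidate whose digest matches an existing entry.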


2. What proportion of forwards travel within the same domain, and how
large are the attachments that are usually mass-mailed and
mass-forwarded? (This is just to get an approximation of how much space
can be saved, and of the utility of the idea itself.)


3. This extends the point above. Private mail servers maintained within
corporations could be made less space-hungry and more cost-effective.
In such organisations, mass-mailing is an everyday phenomenon (notices,
announcements, documents, CCs for communication within teams,
departments, etc.).


I searched for related material on Google and came across something
called file-aware differencing technology, used for reducing bandwidth
misuse. But my main concern is whether it is being used in the storage
technology of mail servers.


Please shed some light on this.

Re: Redundant Attachment Detection

on 23.02.2006 03:16:05 by DFS

rahulgupta83@gmail.com wrote:

> I have got one idea which I don't know is really implemented in any
> mail server/client technology.

> What if the mail technology is such that whenever the same attachment
> is forwarded within the same mail domain like gmail, yahoo, msn etc it
> is not given separate space but a link to a location where that
> attachment is stored. The file will not be deleted until the number of
> owners of that file become zero.

MIMEDefang (http://www.mimedefang.org) has been able to do that for
years, with its "action_replace_with_url" function (though it expires
stored attachments based on time.)

I believe the Cyrus IMAP server can also detect duplicate messages
and make hard-links rather than copies.
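
As a rough illustration of the hard-link approach in general (this is not Cyrus code; the deliver helper and the directory layout are made up for the example), the filesystem itself can do the reference counting:

```python
import os
import tempfile


def deliver(source_path: str, mailbox_dir: str, name: str) -> str:
    """Place a message in a mailbox as a hard link to an existing file."""
    target = os.path.join(mailbox_dir, name)
    os.link(source_path, target)  # new directory entry, same data on disk
    return target


with tempfile.TemporaryDirectory() as root:
    spool = os.path.join(root, "spool")
    alice = os.path.join(root, "alice")
    bob = os.path.join(root, "bob")
    for directory in (spool, alice, bob):
        os.mkdir(directory)

    original = os.path.join(spool, "msg1")
    with open(original, "wb") as f:
        f.write(b"announcement with a large attachment")

    a = deliver(original, alice, "1.")
    b = deliver(original, bob, "1.")

    link_count = os.stat(original).st_nlink  # names pointing at the same data
    same_file = os.path.samefile(a, b)

    # Removing one name leaves the data reachable through the others.
    os.remove(a)
    still_readable = os.path.exists(b)
```

The data is freed only when the last name is removed, which is the "number of owners becomes zero" rule. Hard links require all mailboxes to live on the same filesystem.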

> 1. How can we detect duplicate attachments?

> Catch them young!!
> When a user forwards an email, keep track of whether he has
> forwarded the attachment file as it is or not. Client will have to keep
> track of this.

I don't get it.

> Some checksum or other hashing mechanism
> They are not 100 % reliable theoretically.

SHA1 is good enough for me.

--
David.

Re: Redundant Attachment Detection

on 24.02.2006 05:28:49 by iAgent

Thanks for that enlightening reply. :)