Redundant Attachment Detection
on 22.02.2006 14:46:36 by iAgent
Hi,
I have an idea, and I don't know whether it is already implemented in
any mail server or client technology.
Here is the idea:
The same attachment is often forwarded and/or mass-mailed to millions
of users all over the world. If a 1 MB attachment is chain-forwarded
to just 50,000 users, the copies take up around 50 GB of space, almost
all of it redundant.
What if the mail system worked like this: whenever the same attachment
is forwarded within one mail domain (gmail, yahoo, msn etc.), the
recipient does not get a separate copy but a link to the location where
that attachment is already stored. The file is not deleted until its
number of owners drops to zero.
In this way, a lot of storage space can be saved.
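To make the idea concrete, here is a minimal sketch of such a store in Python. It is only an illustration under my own assumptions, not how any real mail server does it: the class name AttachmentStore, its methods, and the choice of a SHA-256 digest as the "link" are all hypothetical.

```python
import hashlib


class AttachmentStore:
    """Hypothetical single-copy store: one stored blob per unique attachment."""

    def __init__(self):
        self.blobs = {}   # digest -> attachment bytes (stored once)
        self.owners = {}  # digest -> set of mailboxes that hold a link

    def add(self, mailbox, data):
        """Record that `mailbox` owns `data`; return the link (digest)."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data  # first copy: actually store it
        self.owners.setdefault(digest, set()).add(mailbox)
        return digest

    def remove(self, mailbox, digest):
        """Drop one owner; delete the blob when no owners remain."""
        holders = self.owners.get(digest, set())
        holders.discard(mailbox)
        if not holders:
            self.blobs.pop(digest, None)
            self.owners.pop(digest, None)

    def stored_bytes(self):
        """Total bytes actually kept on disk (here: in memory)."""
        return sum(len(blob) for blob in self.blobs.values())


store = AttachmentStore()
attachment = b"x" * 1024
link = store.add("alice", attachment)
store.add("bob", attachment)       # same bytes: no extra space used
print(store.stored_bytes())        # 1024, not 2048
store.remove("alice", link)        # bob still owns it, so it stays
store.remove("bob", link)          # last owner gone, blob is deleted
print(store.stored_bytes())        # 0
```

Tracking owners as a set rather than a plain counter means that adding the same mailbox twice does not inflate the count.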
Now there are 2-3 points to be considered about the practicality of the
above idea.
1. How can we detect duplicate attachments?
Catch them young!!
When a user forwards an email, keep track of whether the attachment
was forwarded unchanged. The client would have to keep track of this.
Alternatively, use a checksum or some other hashing mechanism.
In theory these are not 100% reliable, since two different files can
produce the same hash.
2. What proportion of forwards stays within the same domain, and how
large are the attachments that are usually mass-mailed and
mass-forwarded? (This is just to approximate how much space could be
saved and how useful the idea itself is.)
3. This extends the point above. Private mail servers maintained
within companies could be made less space-hungry and more
cost-effective. In such organisations, mass-mailing is an everyday
occurrence (notices, announcements, documents, CCs for communication
within teams, departments etc.).
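On point 1, one possible way around the "not 100% reliable" problem is to use the hash only to find candidates and then compare the bytes before treating two files as the same. The sketch below is just an assumption of how that could look; the function name store_once and the in-memory index are made up for illustration.

```python
import hashlib


def store_once(data, index):
    """Store `data` unless an identical copy already exists.

    `index` maps a SHA-256 hex digest to a list of stored byte strings.
    Returns (stored_copy, was_new).
    """
    digest = hashlib.sha256(data).hexdigest()
    for candidate in index.get(digest, []):
        # The hash match only nominates a candidate; the byte-for-byte
        # comparison is what confirms it is really the same file.
        if candidate == data:
            return candidate, False
    # No identical copy (or a hash collision with different bytes):
    # keep this one as a separate entry under the same digest.
    index.setdefault(digest, []).append(data)
    return data, True


index = {}
_, first_is_new = store_once(b"quarterly report", index)
_, second_is_new = store_once(b"quarterly report", index)
print(first_is_new, second_is_new)  # True False
```

The full comparison only runs when the digests already match, so its cost is paid only for likely duplicates.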
I searched on Google for related work and came across something called
file-aware differencing technology, which is used to reduce wasted
bandwidth. My main question is whether anything like it is used in the
storage layer of mail servers.
Please shed some light on this.