E-mail Service Interrupted For 4,000 Users

on 30.03.2007 22:34:40 by spamhotmail

http://www-tech.mit.edu/V127/N10/email.html




E-mail Service Interrupted For 4,000 Users

By Nick Semenkovich

Staff Reporter

Over 4,000 community members lost e-mail access early Wednesday
morning in an outage that still affects some users.


One of MIT's five e-mail servers, po14, crashed sometime before 8 a.m.
on Wednesday, March 7, said Jeffrey I. Schiller '79, MIT Network
Manager for Information Services and Technology.

Schiller said the problem arose when po14 experienced a kernel panic
(similar to Windows's blue screen of death), triggering an automatic
restart of the mail server. Upon restart, the server detected file
system corruption that required manual repair by IS&T technicians.

By Thursday afternoon, over 3,000 of the 4,000 users had e-mail
restored, though there was a large backlog of incoming messages.

As of Thursday evening, roughly 500 users on po14 were still without
e-mail; Schiller estimated the service would be restored for everyone by
9:30 a.m. Friday.

Because of MIT's redundancy and backups, Schiller said he was "not too
worried about data loss."

On Wednesday, he estimated a maximum of 10 messages across the whole
system would be corrupted, a number he revised to three on Thursday.
Those three messages had likely been saved in regular backups, he
said.

Users on po14 who forward their e-mail to external servers, such as
Gmail, were unaffected by the outage.

While the root cause of the outage is unclear, IS&T's 3DOWN Service
Status page characterized the outage as extremely rare. According to
Schiller, IS&T simply "didn't [fore]see this happening."

IS&T has localized the error to the file system on po14. MIT maintains
a RAID file system on e-mail servers, so that mail messages are
stored redundantly across multiple hard drives to guard against drive
failure.
Unfortunately, something caused a small amount of data corruption on
the RAID system and eventually triggered the kernel panic that caused
po14 to restart, said Schiller.

On reboot, po14 ran the application 'fsck,' which is designed to check
and repair corrupted files. While operating, fsck reads a small amount
of data from every single file on a system. Fsck ran for nearly 24
hours, trying to repair the nearly 27 million files on po14, said
Schiller. "It [was] mind-numbing," he said.

MIT experienced a similar e-mail outage in the first week of May 2003.
During that incident, a bug in the operating system of the mail server
po11 caused file corruption and triggered a file consistency check.

"In that outage, fsck took four hours to run," said Schiller, a fact
he attributed to a smaller quota size of 100 megabytes. The current
mail quota is 1 gigabyte.

Because fsck was taking too long, IS&T halted the program Thursday
morning and switched to "plan B," copying the files from po14 to a
duplicate file system. According to Schiller, po14 mail files are
split into four partitions, each with roughly 1,000 users. Three of
the partitions were intact, allowing roughly 3,000 users to regain
e-mail; one of the partitions was corrupt, requiring manual repairs.

In an e-mail, Jerrold M. Grochow '68, vice president for IS&T,
described the outage as "an unacceptable length of time for e-mail to
be unavailable to 20 percent of our community." Grochow also outlined
a project to provide "completely redundant" mail service that began in
early 2006 with the goal of completion in Summer 2007.

Schiller said plans to upgrade e-mail would be finalized in the next
week, but considered complete redundancy extremely difficult to
attain. One option under consideration is to break apart MIT's five
large mail servers into 40 or more servers, so that an outage would
impact fewer users and could be repaired more quickly.
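
The arithmetic behind that option can be sketched roughly as follows. The
4,000-user and 20-percent figures come from the article; the assumption
that mailboxes are spread evenly across servers, and the helper name
users_per_server, are illustrative only and not IS&T's actual plan.

```python
# Rough sketch: how many users a single-server outage would affect.
# Figures are from the article; the even split across servers is an assumption.
AFFECTED_USERS = 4000    # users on po14 during the outage
AFFECTED_SHARE = 0.20    # "20 percent of our community"
CURRENT_SERVERS = 5      # "five large mail servers"
PROPOSED_SERVERS = 40    # "40 or more servers"


def users_per_server(total_users: int, servers: int) -> float:
    """Users hit by one server failing, assuming an even split (hypothetical)."""
    return total_users / servers


# Total mail users implied by the two quoted figures.
total_users = round(AFFECTED_USERS / AFFECTED_SHARE)

for servers in (CURRENT_SERVERS, PROPOSED_SERVERS):
    per_server = users_per_server(total_users, servers)
    share = per_server / total_users
    print(f"{servers} servers: ~{per_server:.0f} users per outage ({share:.1%})")
```

Under these assumptions, one failed server out of 40 would affect about
500 users, an eighth of the 4,000 affected on po14.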

Outside services such as Google have offered to run MIT's e-mail, but
Schiller is wary of the security and privacy of such services. "Whose
mail is it anyway?" he asked.

Michael McGraw-Herdeg contributed to the reporting of this article.

------------------------------------------------------------------------

This story was published on 2007-03-09.
Volume 127, Number 10