June 23, 2008

After Four-Day Streak, Unforeseen University E-mail Outage Ends

A known computer “bug” in Cornell’s e-mail servers triggered an unexpected and widespread outage last week that left many users of the University’s e-mail services unable to send or receive messages. It caused irreversible damage to about 3,800 e-mail accounts, according to CIT.

After nearly a week of problems, the University’s e-mail infrastructure was, for the most part, back to an operational state, according to Rick MacDonald ’71, director of Systems and Operations for CIT.

The University believes the failure was caused by a bug in its e-mail servers that spontaneously reboots the disk arrays after 994 days of continuous operation.

Cornell and Sun Microsystems, the vendor of its servers, were fully aware of the bug prior to the outage, MacDonald said. CIT performed maintenance earlier this month on the disk arrays in accordance with directions from Sun Microsystems in an attempt to fix the bug. The University believed at the time that it had averted the bug’s potential effects.

Around noon on June 15, which was the 994th day of continuous uptime for the servers, eight Sun 6120 disk arrays underwent what MacDonald referred to as a “severe hardware crash.”

“On Monday evening, we were successful in bringing most of the postoffices back up, only to have them begin crashing due to operating system panics,” MacDonald said.

Over the next few days, CIT continued the time-consuming process of checking for and repairing inconsistencies on the hard disks in order to bring them back into operation, he said.

According to MacDonald, the process took several days because of the size of the files involved, which totaled about eight to 10 terabytes of data.

“The disk arrays were affected differently by these events, which is why some postoffices were available sooner than others,” MacDonald explained. “The array hosting Postoffice 7 sustained so much damage that it had to be replaced by a different array.”

The University will notify approximately 3,800 users of Postoffices 7 and 8 that some of their messages were irretrievably lost as a result of the failure, according to an update posted Friday on CIT’s website. The lost messages are those received between the backup taken at approximately 8 p.m. on Saturday, June 14, and approximately 12:30 p.m. on Sunday, June 15.

“It is clear that we cannot tolerate the loss of what has become our main communication channel,” Polley Ann McClure, vice president for Information Technologies, said in a statement last week about the outage.

She also said that the results of the after-action reviews of CIT and Sun Microsystems would be made available to the Cornell community when they are completed.

MacDonald said that the problems caused by the Sun Microsystems servers would not affect future interactions with the company.

“We’ve had a long and constructive relationship with Sun [Microsystems],” MacDonald said. “I don’t expect any material changes in regards to our relationship with Sun.”

MacDonald said that the University already had several projects underway that would change some of its e-mail services and that those projects would continue as planned, independent of last week’s events.

First, the University is embarking on a project to move staff e-mail accounts on the Cyrus Postoffice to a Microsoft Exchange server.

He also said Cornell hopes to offer students the option of using either its own e-mail server or a third-party server by this coming spring semester. The University is in negotiations with Microsoft and Google to set up those student accounts, he said.

MacDonald also said that Sun Microsystems is performing a root-cause analysis to definitively determine the cause of last week’s failure. In the meantime, CIT is offering a technical explanation of the e-mail outage online to members of the Cornell community with a valid NetID.