September 4, 2008

Response to E-mail Outage Inadequate, Report Finds


The University’s response to a massive e-mail outage in June was inadequate and caused significant interruption to the business of the University, according to a report released last month.
The unexpected and widespread e-mail failure prevented some members of the Cornell community from sending or receiving e-mail for as long as five days and caused irreversible damage to about 3,800 e-mail accounts.
The report summarizes the findings of a group of Cornell officials from across campus who performed an “after action review” of the incident.
Members of the group sharply criticized the manner in which the University communicated with the Cornell community and the public during the outage, calling it ineffective.
The University used the CIT “Network Status” web page and broadcast messages across its internal telephone system to provide information during the outage. The report found these methods too slow and not comprehensive enough.
“Communication has to be accurate,” the report stated. “CIT posted messages indicating problems were fixed when people on Postoffice 7 were still not able to receive mail.”
The report urged the University to adopt more expansive alert systems and, during similar incidents in the future, to provide timely information written in language that all members of the community, not only those with computer backgrounds, will understand.
The group also said that the incident showed a “lack of clear metrics on when you’ve gone from an incident to an emergency.”
In addition to making recommendations, the group sought to gauge the extent of the outage’s impact on the University. The report concluded that the inability to send or receive e-mail prompted fears that grant and administrative information could be lost.
The outage also delayed an admissions communication to transfer students, interrupted some job searches and “critically impacted” planning for a major research event, the report stated.
A technical report released in conjunction with the “after action review” report confirmed the University’s explanation of what initially went wrong.
As officials explained after the incident, the failure was caused by a known computer “bug” in Cornell’s e-mail servers that made the servers reboot spontaneously on their 994th day of continuous operation, June 15. University technology specialists worked around the clock over the next few days to restore approximately eight terabytes of e-mail data.
Looking ahead to potential future failures, members of the after action review group also expressed concern about a “lack of redundancy” built into the University’s e-mail system.
“CIT advanced requests for funding to create redundancy at least three times in the last five years. Each time it was turned down,” the report stated.
In case redundancy is not immediately available, the report raised a number of other possibilities, including whether Cornell should consider outsourcing its e-mail services.