CalMail crashes last multiple days

UC Berkeley’s campuswide email system crashed Friday and was inaccessible for over 50 hours, leaving tens of thousands of students, faculty and staff e-stranded as problems persisted into the week.

The system, CalMail, went down Friday at 9:53 a.m. when a piece of system hardware failed, causing the entire server to crash. When staff attempted to fix the problem, the database that holds all account information became corrupted and had to be rebuilt from scratch, according to Shelton Waggener, campus associate vice chancellor for information technology and chief information officer.

According to Waggener, CalMail had already been running more slowly than usual over the past few months due to an unprecedented increase in devices and users attempting to connect to the server.

Staff worked around the clock over the weekend to fix the service, which was brought back to life Sunday at 4:55 p.m., according to Waggener.

But another part of the system failed Monday at 12:45 a.m. Waggener said such failures are relatively common and would normally have been easily dealt with, but with the spike in users, the failure once again shut down the system.

It was at that point, Waggener said, that his staff made the “difficult decision” to cut off CalMail access for most mobile devices, such as Apple iPhones, and regular email clients, such as Microsoft Outlook, which constantly check for new emails and therefore add traffic to the server.

“(It’s like) a huge crowd is outside and everybody’s trying to get through the door,” he said. “What effectively happens is that nobody gets in. It (becomes) so busy that it’s consistently providing error messages.”

As of Wednesday, the system was back up for Web clients. Waggener noted that the campus has been planning to move the roughly 70,000 current users off CalMail in 2012 as part of the Productivity Suite project of the campus’s cost-cutting Operational Excellence initiative.

“I made the decision not to spend the million dollars to upgrade CalMail software for only 12 months of use,” he said. “We were trying to be prudent given the budget situation, (but) now in retrospective it would have been good to have invested money in the storage so we would have avoided this crisis.”

He also noted that the crash was unrelated to faculty from various departments recently joining CalMail.

“The system was sized for this number of users, but the additional devices and connections created a spike of load that pushed the environment past its ability to keep up,” he said. “Then when the hardware failed we went over that line.”

Junior Steven Johnson was working on a marketing project with an associate from UCLA for UC Berkeley’s Residential and Student Services Programs when CalMail went down.

“I was working … purely through email correspondence,” he said. “I wasn’t able to access it until Monday … and I couldn’t move forward because I was waiting for his response. It was at a standstill.”

Freshman Erik Chen, who learned of the crash via Facebook, said he was frustrated with the volatility of the system.

“I felt that it is ridiculous that our university can’t make sure that something fundamental to day to day operations is up and running,” Chen said in an email.