On Aug. 31, students were thrust into the shoes of those who experienced some of the most harrowing periods in American history—World War I, the Great Depression, the 1980s—when they too were forced to live without wireless internet access.
Students streamed into their first classes that Wednesday, but that was the only streaming going on. Beginning at 10 a.m., the River Campus’ Wi-Fi service was down for six long hours, forcing many students to take notes on paper and many professors to rapidly adjust opening-day lesson plans.
The culprit, according to the University’s Associate Chief Information Officer Robert Evangelista, was a bug in the code of the campus’ Cisco-branded wireless controllers that caused the hardware to reset.
Evangelista explained that the first controller likely failed when heavy Wi-Fi use concentrated in the campus’ academic buildings triggered the bug. Once it went down, all of the wireless access points on that controller attempted to pass their load to other controllers. This is because the access points are strung together in a redundancy mesh meant to prevent catastrophic failures.
This time, however, that redundancy transmitted the bug to every wireless controller.
“This bug actually started resetting the hardware that manages the access points,” Evangelista said. “[…] They all failed over to the other one, caused all these issues, and it started a cycle of resetting every controller.”
By 11 a.m., Cisco had confirmed the controllers were experiencing the bug, which had been public since July. There are thousands of access points on campus, and a bit of patch code had to be uploaded to them in batches of a couple hundred at a time, prolonging the outage until around 4 p.m.
The experience has prompted some changes in the IT department. Julie Myers, the University’s vice president for information technology and chief information officer, said Cisco will be coming in for a network-wide resiliency test a few weeks before classes start to help identify potential problems and to keep the University’s engineers updated on bug fixes.
“We are planning a meeting with senior leaders from Cisco, and we’re going to work together to determine the robustness of all of this,” Myers said. “We all know software has bugs, so we need to continuously balance the resiliency of what the University is expecting from services with making sure we’re not unintentionally introducing bugs into the system either.”