A detailed look at last week's system outage

Much of the university experienced a major system outage between March 4 and 8, causing the Office of Information Technology to suspend some less-critical services, like streaming media, until systems were fully restored. Klara Jelinkova, Duke's senior director of shared services and infrastructure, answers questions about what happened during the outage.

What caused the outage?

Duke generates large amounts of data -- data collected in research and captured in lectures, photographs, web archives, personal files and so on -- all of which must be instantly accessible. Increasingly, institutions handle data like this through shared storage. Instead of giving each service a disk of its own (for instance, a server for Duke Today, another server for one department's shared files), we use high-capacity storage arrays to hold all the data and recall it for users on demand. These disk arrays are designed to provide the right amount of storage for each service as its needs change, and to provide redundancy by copying the data to multiple disks.
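To make the idea concrete, here is a highly simplified sketch of a shared pool that carves out space per service and mirrors writes to more than one disk. The class and service names are hypothetical, and this illustrates the concept only -- it is not how the EVA firmware actually manages storage.

```python
# Simplified sketch of a shared storage pool: one set of physical disks
# backs many services, each getting the capacity it needs, and every
# write is mirrored to more than one disk for redundancy.
# (Illustrative only -- not how the HP EVA actually works.)

class SharedStoragePool:
    def __init__(self, disks):
        self.disks = disks         # e.g. {"disk1": 10_000, ...} in gigabytes
        self.volumes = {}          # service name -> allocated size in GB

    def allocate(self, service, size_gb):
        """Carve out a right-sized volume for a service from the shared pool."""
        usable = sum(self.disks.values()) // 2   # half the raw space: data is mirrored
        used = sum(self.volumes.values())
        if used + size_gb > usable:
            raise RuntimeError("pool is full")
        self.volumes[service] = size_gb

    def write(self, service, block):
        """Mirror each block to two different disks so one disk failure loses nothing."""
        targets = list(self.disks)[:2]           # simplistic placement for illustration
        return {disk: block for disk in targets}


# Hypothetical services sharing one pool instead of each owning a dedicated server.
pool = SharedStoragePool({"disk1": 10_000, "disk2": 10_000, "disk3": 10_000})
pool.allocate("duke-today-web", 200)
pool.allocate("department-shares", 2_000)
```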

The outage was caused by the failure of one of these storage arrays, a Hewlett Packard EVA 8000, holding 27 terabytes (27 trillion bytes) of university data.

And what caused the failure?

These HP machines are generally considered to be very reliable. Even when one disk array controller (the component that moves data to and from the disks) fails, a second controller takes over, allowing the failed part to be replaced without an interruption in service. HP thought one of the two controllers on the array had failed, but replacing the controller didn't solve the problem. Eventually, we had to bring in a completely new array, something unheard of with this type of equipment. At this point, we still don't know the root cause of the problem with the array. HP is reviewing the logs and replicating our environment in their labs, and I expect them to come back to us with a root cause soon.
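For readers unfamiliar with dual-controller designs, the failover behavior works roughly like the toy model below: requests normally flow through one controller, and if it fails, the second takes over so the failed part can be swapped while service continues. This is a schematic sketch, not HP's actual controller logic; it simply shows why a single controller failure normally does not interrupt service.

```python
# Toy model of dual-controller failover. Illustrative only.

class Controller:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def read(self, address):
        if not self.healthy:
            raise IOError(f"{self.name} is down")
        return f"data@{address}"


class DualControllerArray:
    def __init__(self):
        self.active = Controller("controller-A")
        self.standby = Controller("controller-B")

    def read(self, address):
        try:
            return self.active.read(address)
        except IOError:
            # Fail over to the standby; service continues while the
            # failed controller is replaced.
            self.active, self.standby = self.standby, self.active
            return self.active.read(address)


array = DualControllerArray()
array.active.healthy = False          # simulate a controller failure
print(array.read(0x2F))               # still answered, via the standby controller
```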

What effect did this failure have on users?

Without access to the data, a large number of services were completely unavailable. While the servers themselves and the underlying file system remained operational, the fact that those systems couldn't connect to the data meant they were effectively unusable. This meant WebFiles, duke.edu (including Duke Today), dCal, Duke Wiki, streaming media and other services hosted by OIT -- including many school and departmental Web sites and many test and development servers -- were unavailable or limited during at least part of the outage. As the outage continued, we worked with service owners to refine our priorities and bring some services (Duke Today and Duke Wiki, for instance) back more quickly than others.

But other services stayed up. Why was that?

Two reasons. First, we operate several storage arrays, and only the data stored on this one HP array was impacted. Second, we prioritized the order in which we returned services to operation. Essential systems -- such as Time and Attendance, which records work hours for many bi-weekly staff -- were brought back online within just a few hours. Other less crucial services were deferred to later in the restoration process.

There were two outages really, right?

Yes. One controller on the array in question started to reboot on the evening of Monday, March 2. Controller reboots are significant, but not usually fatal. This controller rebooted multiple times, and then the entire array crashed later Monday night. Even after we restarted the array, it was unstable, so we immediately began moving critical services to other available disk space, and we told HP that we needed a new array. Users -- students, mostly, at that time of day -- experienced this outage as a loss of wireless service across campus because the inaccessible storage array contained the list of computers authorized to use the Duke network. Eventually, we replaced a controller, and HP gave us the "all clear" at 6 a.m. Tuesday.
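Schematically, that dependency looked something like the sketch below: each connection attempt checked the device against a registration list, and with that list stored on the unreachable array, the check failed and access was refused. The file path and function here are hypothetical stand-ins, not Duke's actual network registration code.

```python
# Schematic of the dependency: wireless authorization consults a list of
# registered devices; if the storage holding that list is unreachable,
# every lookup fails and new connections are denied.
# Hypothetical names -- not Duke's actual network registration system.

REGISTERED_DEVICES_PATH = "/srv/netreg/registered_devices.txt"  # lived on the failed array

def is_authorized(mac_address):
    try:
        with open(REGISTERED_DEVICES_PATH) as f:
            registered = {line.strip() for line in f}
    except OSError:
        # Storage unreachable: the lookup fails, so the device is refused.
        return False
    return mac_address in registered


print(is_authorized("00:16:cb:aa:bb:cc"))   # False while the array is down
```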

And the second outage was caused by the same storage array?

We restored all services, but we knew we hadn't really solved the problem without understanding the root cause, so we continued to work with HP to get replacement equipment installed. Our feeling was confirmed when controller reboots started again late on the morning of Wednesday, March 4.

At that point, we focused on:

• Working with constituents throughout Duke to identify critical services and ensure that those services were running from alternative storage space;

• Keeping the problem array running so we could transfer the data it held to the new array;

• Installing the new array and moving the data onto it.

We had the new storage array in place on Friday, March 6, but moving the 27 terabytes of data onto the new equipment, and testing to ensure all services were operating correctly, took until midday Sunday, March 8.
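As a rough back-of-the-envelope figure (assuming the copy ran more or less continuously for roughly 60 hours, from Friday into midday Sunday), moving 27 terabytes in that window works out to a sustained transfer rate on the order of 125 megabytes per second:

```python
# Rough back-of-the-envelope check of the migration rate, assuming the
# copy ran continuously for about 60 hours (Friday to midday Sunday).
data_bytes = 27e12                      # 27 TB, i.e. 27 trillion bytes
window_seconds = 60 * 3600              # ~60 hours

rate_mb_per_s = data_bytes / window_seconds / 1e6
print(f"~{rate_mb_per_s:.0f} MB/s sustained")   # ~125 MB/s
```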

What have you learned from the experience?

We are assessing our overall performance now, but I am very proud of the way technical staff performed. I was also pleased at the broad-based community support we received. In an outage of this scale, it was a great help to be able to call on colleagues from the Health System's technical team, DHTS, and on other university IT staff outside OIT. Many of these individuals played important roles alongside OIT personnel, both in planning and in the work to restore service.

People worked around the clock, sometimes for days at a time, to keep essential systems in operation and to maintain the tight coordination needed to migrate all the data to the new equipment as quickly as possible. They were cooperative and collaborative, and they gave up a lot of time with their families and friends to put us back together quickly.

Our staff was also adaptive. In the Monday night/Tuesday morning outage, we lost wireless access in many places, which affected students a great deal. During the second outage, we made sure wireless stayed up, and also made real-time adjustments to the order in which services were brought back online, in response to input from school IT personnel and university business offices.

Even so, we are analyzing how we can improve our performance in any future outage -- better internal functioning, better communication with users, and so on. We need to better document the relationship between a particular storage array or piece of equipment and its impact on end users, so we can communicate the effects of an outage quickly and precisely.

We also need to regularly review which services we classify as critical, to ensure the prioritization stays consistent with the current needs of the campus. If services previously deemed important but not essential have become essential, we can adjust their priority accordingly.

This was a significant outage, and we owe the university community a full accounting of the reasons for it, and the lessons we've learned. You can expect that report soon.