An interruption to the service of social media network Facebook on 4 October was the result of maintenance-related configuration changes that triggered a large-scale disruption to communication between data centres.
In a blog post, Facebook’s vice president of infrastructure Santosh Janardhan detailed the events that led to Facebook – and its family of apps including Instagram, WhatsApp and Messenger – going offline for more than five hours. During maintenance of the network’s ‘backbone’, a command was issued to assess how much capacity was available. However, the command failed, and an audit tool designed to stop mistaken commands did not identify the error.
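Facebook has not published how its audit tool works, but the idea of a pre-execution check that vets risky commands can be illustrated with a minimal sketch. Everything below – the topology, command syntax and rule – is hypothetical and for illustration only:

```python
# Illustrative sketch only: a pre-execution "audit" gate that tries to catch
# commands whose effect would disconnect every backbone link. All names and
# rules here are hypothetical; Facebook has not published its tool's design.

BACKBONE_LINKS = ["dc1-dc2", "dc1-dc3", "dc2-dc3"]  # hypothetical topology

def links_remaining(command: str, active_links: list[str]) -> list[str]:
    """Predict which backbone links stay up if `command` runs (toy model)."""
    if command.startswith("disable "):
        target = command.removeprefix("disable ")
        if target == "all":
            return []
        return [link for link in active_links if link != target]
    return active_links

def audit(command: str) -> bool:
    """Return True only if at least one backbone link would survive.

    A bug in a check like this one would let a dangerous command through,
    which is the kind of failure Janardhan describes.
    """
    return len(links_remaining(command, BACKBONE_LINKS)) > 0

print(audit("disable dc1-dc2"))  # one link down, others survive: True
print(audit("disable all"))      # every link severed: False
```

The point of such a gate is that it runs before the command does, so a correct prediction of the command’s effect can block it; in Facebook’s case, the check itself contained a bug and the command went through.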
This single fault quickly led to the disconnection of links between data centres and the internet. Matters were further complicated when Facebook’s engineers were initially unable to restore access, because its data centres are heavily protected and employees could not gain immediate entry.
Janardhan said: “Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one. After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway.
“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a trade-off like this is worth it – greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.”