Social media platforms are woven into the lives of billions of people across the globe, and when even one of them becomes unavailable, daily routines can grind to a halt. That is what billions of users of Facebook and its products, including WhatsApp, Messenger, and Instagram, experienced on Monday, October 4, 2021, when their pages displayed an error message. Facebook’s systems were down, and all of its services and apps were unavailable for more than five hours.
The cause of the outage was not immediately clear, and with cyberattacks now commonplace, speculation ran high that an attack had disrupted the services.
Competing platforms such as Twitter, Snapchat, and Telegram saw traffic surge as people sought clarification, poked fun, and shared updates on the outage.
Facebook itself had to resort to tweeting to reach its user base and provide updates on the outage.
We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience.
— Facebook (@Facebook) October 4, 2021
The Technical Fallout
Facebook soon issued an apology and an update on the technical reason behind the outage. The company said the problem was caused by faulty configuration changes made to its backbone routers, which coordinate network traffic between its data centers. With the routers unable to communicate, the services came to a halt. In technical terms, the failure involved the Border Gateway Protocol (BGP).
What is BGP?
Border Gateway Protocol is a standardized exterior gateway protocol designed to exchange routing and reachability information among autonomous systems (AS) on the Internet. BGP is classified as a path-vector routing protocol, and it makes routing decisions based on paths, network policies, or rulesets configured by a network administrator.
In plain English, BGP routes information between networks across the Internet. It lets each network announce which addresses it can reach, which is how separate networks connect to one another and to the rest of the Internet.
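The path-vector idea can be illustrated with a toy model. The sketch below is not real BGP: the class name, the AS numbers, and the prefix are illustrative placeholders, and the only selection rule is "prefer the shorter AS path", whereas real routers also apply the policies mentioned above. It shows how an announcement spreads from one autonomous system to the next, with each one prepending its own number to the path.

```python
class AutonomousSystem:
    """Toy autonomous system that only prefers the shortest AS path."""

    def __init__(self, asn):
        self.asn = asn
        self.neighbors = []
        self.routes = {}  # prefix -> AS path (list of AS numbers) to the origin

    def originate(self, prefix):
        """Announce a prefix that this AS owns."""
        self.routes[prefix] = []  # locally originated: empty path
        self._advertise(prefix)

    def _advertise(self, prefix):
        # Prepend our own AS number before passing the route on.
        path = [self.asn] + self.routes[prefix]
        for neighbor in self.neighbors:
            neighbor.receive(prefix, path)

    def receive(self, prefix, path):
        if self.asn in path:
            return  # loop prevention: the route already passed through us
        current = self.routes.get(prefix)
        if current is not None and len(current) <= len(path):
            return  # keep the existing path, which is at least as short
        self.routes[prefix] = path
        self._advertise(prefix)


def link(a, b):
    """Create a two-way peering between two autonomous systems."""
    a.neighbors.append(b)
    b.neighbors.append(a)


# Hypothetical topology: A - B - C, using documentation-range values.
as_a, as_b, as_c = (AutonomousSystem(n) for n in (64500, 64501, 64502))
link(as_a, as_b)
link(as_b, as_c)
as_a.originate("192.0.2.0/24")
print(as_c.routes)  # {'192.0.2.0/24': [64501, 64500]}
```

Here C learns that the prefix is reachable through B and then A. If A stopped announcing the prefix, B and C would eventually have no path to it at all.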
Santosh Janardhan, VP Infrastructure, Facebook, shared on his page, “Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear that there was no malicious activity behind this outage — its root cause was a faulty configuration change on our end. We also have no evidence that user data was compromised as a result of this downtime.
“We’ve been working as hard as we can to restore access, and our systems are now back up and running. The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.”
Industry experts and other sources described it as a DNS failure caused by the disappearance of BGP routes, the maps that tell other networks how to reach Facebook.
Cloudflare, an American web infrastructure and website security company, described it in a blog post as a BGP problem.
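The two descriptions fit together: a DNS server can be perfectly healthy and still be unreachable if no route leads to it. The following is a rough sketch of that dependency, not a model of Facebook's or Cloudflare's systems. The function names, the addresses (taken from documentation ranges), and the domain are all assumptions made for illustration.

```python
import ipaddress


def best_route(routing_table, address):
    """Return the most specific prefix covering the address, or None."""
    ip = ipaddress.ip_address(address)
    matches = [prefix for prefix in routing_table if ip in prefix]
    return max(matches, key=lambda prefix: prefix.prefixlen, default=None)


def resolve(name, nameservers, routing_table):
    """Pretend DNS lookup: it can only proceed if a nameserver is routable."""
    for ns in nameservers:
        if best_route(routing_table, ns) is not None:
            return f"{name}: query sent to {ns}"
    return f"{name}: SERVFAIL (no route to any nameserver)"


# Hypothetical prefixes that contain the nameservers' addresses.
routing_table = {
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
}
nameservers = ["192.0.2.53", "198.51.100.53"]

before = resolve("example.com", nameservers, routing_table)
routing_table.clear()  # the routes are withdrawn
after = resolve("example.com", nameservers, routing_table)

print(before)
print(after)
```

Once the routes are gone, every lookup fails even though nothing about the DNS records themselves has changed, which is why the same outage could be reported as both a DNS issue and a BGP issue.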
From trusted source: Person on FB recovery effort said the outage was from a routine BGP update gone wrong. But the update blocked remote users from reverting changes, and people with physical access didn’t have network/logical access. So blocked at both ends from reversing it.
— briankrebs (@briankrebs) October 4, 2021
In the most recent update, Facebook attributed the problem to an internal command error.
“During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command,” Facebook shared.
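The quoted statement does not describe how the audit tool works, so the following is only a guess at the general shape of such a safeguard. Every name here, the link labels, and the 25 percent threshold are hypothetical. The idea is simply to estimate how much of the backbone a command would affect and to refuse it when the impact is too large.

```python
def audit(links_to_disable, backbone_links, max_fraction=0.25):
    """Return True if the command touches an acceptable share of the backbone.

    The threshold is an assumed value chosen for illustration.
    """
    backbone = set(backbone_links)
    if not backbone:
        return False  # nothing known about the network: refuse by default
    affected = backbone & set(links_to_disable)
    return len(affected) / len(backbone) <= max_fraction


def run_maintenance(links_to_disable, backbone_links, apply):
    """Run the change only if the audit passes; otherwise raise an error."""
    if not audit(links_to_disable, backbone_links):
        raise PermissionError("refused: command would disable too much of the backbone")
    apply(links_to_disable)


# Hypothetical backbone with four links between data centers.
backbone_links = ["dc1-dc2", "dc2-dc3", "dc3-dc4", "dc4-dc1"]
disabled = []

run_maintenance(["dc1-dc2"], backbone_links, disabled.extend)  # allowed
try:
    run_maintenance(backbone_links, backbone_links, disabled.extend)
except PermissionError as err:
    print(err)  # the command that would disable every link is blocked
```

A check like this is only as reliable as its own code, which is the point of Facebook's explanation: when the safeguard itself has a bug, the dangerous command goes through.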
The outage has thus been explained from both an internal and an external perspective, and the incident raises broader questions about security and vulnerability.
According to Alla Valente, Senior Analyst for Security & Risk (Risk Management) at Forrester, “In Facebook’s quest to integrate its products and underlying technical infrastructure into a single platform is the concentration risk it creates for the company, where a single risk event that produces a cascading effect – in this case, the inability of their machines to talk to one another brought the company to a standstill. Concentration risk is one of the top systemic risks for 2021 that Forrester identified early this year. And Facebook’s size, market share, and ubiquity make it a system into itself. If the company doesn’t get better at managing its risks across the organization, it stands to lose its tight hold it’s been struggling for years to maintain.”