Comment: Facebook Locked The Doors And Left The Key Inside – Insight Into The Outage

BACKGROUND:

As many of us witnessed yesterday, WhatsApp, Instagram and Facebook all went down in a major outage. The three apps – all owned by Facebook and running on shared infrastructure – completely stopped working shortly before 5pm. Other products in the same family of apps, such as Facebook Workplace, also stopped working. Visitors to the Facebook website simply saw an error page or a message that their browser could not connect. The WhatsApp and Instagram apps continued to open, but did not show new content, including any messages sent or received during the problems.

Cybersecurity experts weigh in on Facebook's latest outage.

Ron Bradley
InfoSec Expert
October 6, 2021 4:59 pm

<p><strong>The Human Element is the Most Vulnerable Element</strong></p>
<p>While it’s too soon to confirm, it’s widely believed the recent outage at Facebook was related to DNS configurations and/or BGP routes. So what does this mean? DNS stands for Domain Name System and BGP for Border Gateway Protocol. Think of it this way: when you want driving directions to your favorite restaurant, you may or may not know the address (DNS), but that’s OK, because the address is static and not likely to change. You then rely on your smart device to get directions with the fastest route for you (BGP). The same is true for Internet traffic.</p>
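The name-to-address step in that analogy can be sketched as a toy lookup table. This is a deliberate simplification of real DNS, and the addresses below are illustrative only, not Facebook's actual records:

```python
# Toy DNS: map a hostname (the restaurant's name) to an address
# (its street address). Real DNS is a distributed, hierarchical
# database, but the lookup step works conceptually like this.
TOY_DNS_TABLE = {
    "facebook.com": "157.240.1.35",   # illustrative address only
    "example.com": "93.184.216.34",   # illustrative address only
}

def resolve(hostname: str) -> str:
    """Return the address for a hostname, or raise if it is unknown."""
    try:
        return TOY_DNS_TABLE[hostname]
    except KeyError:
        # Roughly analogous to an NXDOMAIN response from a resolver
        raise LookupError(f"no record for {hostname}")

print(resolve("facebook.com"))
```

During the outage, Facebook's authoritative DNS servers became unreachable, so real resolvers ended up in the `LookupError` branch: the name existed, but nobody could answer for it.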
<p>How does this relate back to Facebook and the human element? Business computer “street addresses” rarely (if ever) change, especially at the global scale of Facebook. Millions of users asked their phone or computer to take them to Facebook, and the route was unknown, too busy, or inaccessible (happens all the time in L.A.; traffic there is brutal). DNS servers and BGP routers are closely guarded assets due to their criticality. Imagine closing down the Golden Gate Bridge or the Lincoln Tunnel during rush hour. Internet routers, switches, firewalls, and DNS servers don’t change configuration without human action. Whether it was intentional or accidental, internal or external, the fact remains it was a major outage, and I’m certain Facebook is deep in the throes of a root cause analysis.</p>
<p>Now more than ever, third-party risk management practices must ensure basic IT security tenets such as change control, privileged access management, logging and reporting, and intrusion detection/prevention, along with all of the other layers of the security onion which envelop them. The trust-but-verify model works, but you have to do both. It would be fascinating to see their SOC 2 Type II report.</p>

Paul (PJ) Norris, Senior Systems Engineer
InfoSec Expert
October 6, 2021 4:54 pm

<p><strong>Facebook locked the doors and left the key inside</strong></p>
<p>Facebook experienced a global outage that lasted over six hours before services were restored. This impacted WhatsApp, Instagram and other services that depend on Facebook authentication, such as Facebook’s Oculus VR headset. But for a technology giant such as Facebook, why did it take so long to identify the root cause and resolve the issue?</p>
<p>\"As information unfolds, they identified the issue pretty quickly, but due to the nature of the issue, they practically locked themselves out.</p>
<p>Around 15.40 UTC on Monday 4<sup>th</sup> October, a change was made to BGP – the Border Gateway Protocol. BGP is the technology by which ISPs share information about which providers are responsible for routing Internet traffic to which specific groups of Internet addresses. In other words, Facebook inadvertently removed its ability to tell the world where it lives.</p>
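The effect of withdrawing BGP announcements can be illustrated with a toy routing table. This is a sketch only: the prefixes and peer names below are assumed for illustration, not Facebook's real announcements:

```python
import ipaddress

# Toy route table: announced prefixes -> next hop. A real BGP router
# holds many such routes learned from its peers.
routes = {
    ipaddress.ip_network("157.240.0.0/16"): "peer-facebook",  # illustrative
    ipaddress.ip_network("93.184.216.0/24"): "peer-example",  # illustrative
}

def next_hop(addr: str):
    """Longest-match is omitted for brevity: return any covering route."""
    ip = ipaddress.ip_address(addr)
    for prefix, hop in routes.items():
        if ip in prefix:
            return hop
    return None  # no route: packets to this address are dropped

print(next_hop("157.240.1.35"))                      # reachable while announced
routes.pop(ipaddress.ip_network("157.240.0.0/16"))   # the withdrawal
print(next_hop("157.240.1.35"))                      # now there is nowhere to send traffic
```

Once the prefixes were withdrawn, every router on the Internet ended up in the `return None` branch for Facebook's addresses, which is why the services vanished globally at once.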
<p>Backing out the change was not easy, though, since Facebook uses its own in-house communication and email services, which were themselves impacted by the outage. With people working remotely during the pandemic, this was a big issue. Those who were on site at the data centres and offices trying to back out the change were unable to access the environments, as the door access control system was also down due to the outage.</p>
<p>So the question always comes down to: “could this have been avoided?” It’s evident at this early stage that Facebook had a single point of failure that cascaded into a significant and costly outage for the technology giant. Any changes, especially to critical services, should be tested and double-checked before implementation. The circumstances of this BGP change are unclear at this point in time, so it’s speculative how this happened.</p>
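One form of "double-checked before implementation" is an automated pre-deployment audit that refuses obviously dangerous changes. The guard below is a hypothetical sketch of that idea, not any real Facebook tooling:

```python
# Hypothetical change-control guard: reject a proposed routing update
# that would withdraw every announced prefix in a single step, which
# is almost never an intentional change for a production network.
def audit_change(current_prefixes: set, proposed_prefixes: set) -> bool:
    """Return True if the proposed announcement set is safe to apply."""
    if current_prefixes and not proposed_prefixes:
        return False  # refuses a total withdrawal
    return True

# Shrinking the announcement passes; withdrawing everything does not.
print(audit_change({"157.240.0.0/16"}, {"157.240.0.0/17"}))
print(audit_change({"157.240.0.0/16"}, set()))
```

A real audit would check far more (peer sessions, reachability of management paths, staged rollout), but even a crude invariant like this one turns a silent global withdrawal into a blocked change awaiting human review.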
<p>Disaster recovery programmes should be put in place and tested regularly. Since Facebook depended on its own in-house services and communications, as well as its access control systems, it should have been obvious that the failure of that one system would have a cascading effect.</p>

Information Security Buzz