Asana: February 5 & 6
- Duration: Two consecutive outages, with the second lasting roughly 20 minutes
- Symptoms: Service unavailability and degraded performance
- Cause: A configuration change overloaded server logs on February 5, causing servers to restart. A second outage with similar characteristics occurred the next day.
- Takeaways: “This pair of outages highlights the complexity of modern systems and how difficult it is to test for every possible interaction scenario,” ThousandEyes reported. Following the incidents, Asana transitioned to staged configuration rollouts.
Slack: February 26
- Duration: 9 hours
- Symptoms: Users could log in and browse channels, but experienced issues sending and receiving messages.
- Cause: Problems with a maintenance activity in Slack’s database systems caused heavy traffic to overload the database.
- Takeaways: “At first glance, everything looked fine at Slack: network connectivity was good, there were no latency issues, and no packet loss,” according to ThousandEyes. Only by combining multiple diagnostic observations could investigators determine that the true source was the database system, later confirmed by Slack.
X: March 10
- Duration: Several hours, with various services experiencing downtime
- Symptoms: The platform appeared “down,” with users experiencing connection failures similar to a distributed denial-of-service (DDoS) attack.
- Cause: Network failures occurred, with significant packet loss and connection errors during the TCP signaling phase. “Connection errors typically indicate a deeper problem at the network layer,” according to ThousandEyes.
- Takeaways: ThousandEyes detected traffic being dropped before sessions could be established, but there were no visible BGP route changes, which would typically occur during DDoS mitigation. “It was a network-level failure, but not what it might have first appeared,” ThousandEyes noted.
Zoom: April 16
- Duration: Roughly two hours
- Symptoms: All Zoom services were unavailable globally.
- Cause: Zoom’s name server (NS) records disappeared from the top-level domain (TLD) nameservers, making the service unreachable despite healthy infrastructure.
- Takeaways: “Although the servers themselves were healthy throughout and were answering correctly when queried directly, the DNS resolvers couldn’t find them because of the missing records,” ThousandEyes reported. The incident highlights how failures above a company’s Domain Name System (DNS) layer can completely knock out services (see the sketch below).
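To make this failure mode concrete, here is a minimal sketch, using the third-party dnspython package, of how an operator might check whether a zone’s NS delegation is still present at the TLD nameservers. The domain and TLD server names are illustrative stand-ins, not details from the incident.

```python
# Minimal sketch (assumes the third-party dnspython package is installed):
# query a TLD nameserver directly to see whether a zone's NS delegation exists.
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

DOMAIN = "example.com"              # stand-in for the zone to check
TLD_SERVER = "a.gtld-servers.net"   # one of the .com TLD nameservers

def delegation_present(domain: str, tld_server: str) -> bool:
    """Return True if the TLD server hands back NS records for the domain."""
    tld_ip = dns.resolver.resolve(tld_server, "A")[0].to_text()
    query = dns.message.make_query(domain, dns.rdatatype.NS)
    response = dns.query.udp(query, tld_ip, timeout=5.0)
    # A healthy delegation returns NS records in the answer or authority
    # section; if they are absent, resolvers cannot locate the zone even
    # though the zone's own servers may be answering perfectly well.
    return any(rrset.rdtype == dns.rdatatype.NS
               for rrset in response.answer + response.authority)

if __name__ == "__main__":
    print("NS delegation found:", delegation_present(DOMAIN, TLD_SERVER))
```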
Spotify: April 16
- Duration: More than two hours
- Symptoms: The application’s front end loaded normally, but tracks and videos wouldn’t play properly.
- Cause: Backend service issues, while network connectivity, DNS, and CDN “all appeared healthy.”
- Takeaways: “The vital signs were all good: connectivity, DNS, and CDN all appeared healthy,” according to ThousandEyes, which added that the incident illustrated how “server-side failures can quietly cripple core functionality while giving the appearance that everything is working normally.”
Google Cloud: June 12
- Duration: More than two and a half hours
- Symptoms: Users couldn’t use Google to authenticate to third-party apps such as Spotify and Fitbit; knock-on effects impacted Cloudflare services and downstream applications.
- Cause: An invalid automated update disrupted the company’s identity and access management (IAM) system.
- Takeaways: “What you had was a three-tier cascade: Google’s failure led to Cloudflare problems, which affected downstream applications relying on Cloudflare,” ThousandEyes explained, adding that the incident is a “reminder to trace a fault all the way back to the source” (see the sketch below).
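As a rough illustration of tracing a fault back to its source, the sketch below walks a hypothetical dependency map from a failing application up to the furthest unhealthy upstream component. The service names and health data are invented for the example, not taken from the incident.

```python
# Minimal sketch: follow unhealthy upstream dependencies to find the root cause.
# All names and health states below are hypothetical.
DEPENDS_ON = {
    "downstream-app": "cloudflare-workers",
    "cloudflare-workers": "google-iam",
    "spotify-login": "google-iam",
}
UNHEALTHY = {"downstream-app", "cloudflare-workers", "google-iam"}

def trace_root_cause(service: str) -> str:
    """Walk up the dependency chain while the upstream is also unhealthy."""
    current = service
    while DEPENDS_ON.get(current) in UNHEALTHY:
        current = DEPENDS_ON[current]
    return current

print(trace_root_cause("downstream-app"))  # -> google-iam
```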
Cloudflare: July 14
- Duration: More than one hour
- Symptoms: Traffic couldn’t reach numerous websites and apps that rely on Cloudflare’s 1.1.1.1 DNS resolver.
- Cause: A configuration error introduced weeks earlier was triggered by an unrelated change, causing Cloudflare’s BGP route announcements to disappear from the global internet routing table.
- Takeaways: “With no valid routes, traffic couldn’t reach Cloudflare’s 1.1.1.1 DNS resolver,” ThousandEyes reported, adding that the incident highlights “how flaws in configuration updates don’t always trigger an immediate crisis, instead storing up problems for later” (see the sketch below).
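The following minimal sketch, with hypothetical names, illustrates the general pattern described above: a flawed configuration entry sits dormant because nothing re-evaluates it when it is added, then an unrelated change forces a full rollout and the latent flaw takes effect.

```python
# Minimal sketch of a latent configuration error (all names are hypothetical).
from dataclasses import dataclass, field

@dataclass
class RouteController:
    announced: set = field(default_factory=set)   # prefixes currently announced
    bindings: dict = field(default_factory=dict)  # prefix -> service topology

    def bind(self, prefix: str, topology: str) -> None:
        """Record a binding; nothing is re-evaluated until the next rollout."""
        self.bindings[prefix] = topology

    def rollout(self, active_topologies: set) -> None:
        """Recompute announcements from the full config (runs on any change)."""
        self.announced = {p for p, topo in self.bindings.items()
                          if topo in active_topologies}

ctrl = RouteController()
ctrl.bind("192.0.2.0/24", "production")
ctrl.rollout({"production"})
print(ctrl.announced)            # {'192.0.2.0/24'} -- everything looks fine

# Weeks earlier, in effect: the prefix is mistakenly rebound to a
# non-production topology. Nothing breaks yet, because no rollout happens.
ctrl.bind("192.0.2.0/24", "test-only")

# Later, an unrelated change triggers a rollout and the latent flaw fires.
ctrl.rollout({"production"})
print(ctrl.announced)            # set() -- the routes are withdrawn
```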
- Duration: More than two hours
- Symptoms: The company’s mobile app, website, and ATMs all went down simultaneously.
- Cause: A shared backend dependency failed, affecting all customer touchpoints, ThousandEyes estimated.
- Takeaways: “The fact that three different channels with three different frontend technologies failed eliminates app or UI issues,” ThousandEyes noted, explaining that this incident demonstrated “how a single failure can instantly disable every customer touchpoint, and why it’s vital to check all signals before reaching for remedies.”
Microsoft Azure: October 9 & 29
- Duration: Both incidents lasted several hours
- Symptoms: The first outage affected users in the EMEA region with slowdowns and failures; the second impacted users worldwide with HTTP 503 errors and connection timeouts.
- Cause: The October 9 incident was caused by software defects that crashed edge sites in the EMEA region; the October 29 outage was triggered by a configuration change.
- Takeaways: “Together, these two outages illustrate an important distinction: infrastructure failures tend to be regional, with only certain customers affected, while configuration errors often hit all regions simultaneously,” according to ThousandEyes.
AWS: October 20
- Duration: More than 15 hours for some customers
- Symptoms: Long, global service disruptions affected major customers, including Slack, Atlassian, and Snapchat.
- Cause: A failure in the US-EAST-1 region, but global services such as IAM and DynamoDB Global Tables depended on that regional endpoint, meaning the outage propagated worldwide.
- Takeaways: “The incident highlights how a failure in a single, centralized service can ripple outward through dependency chains that aren’t always apparent from architecture diagrams,” ThousandEyes noted.
Cloudflare: November 18
- Duration: Several hours of intermittent, global instability
- Symptoms: Intermittent service disruptions rather than a complete outage
- Cause: A bad configuration file in Cloudflare’s Bot Management system exceeded a hard-coded limit, causing proxies to fail as they loaded the oversized file on staggered five-minute cycles.
- Takeaways: “Because the proxies refreshed configurations on staggered five-minute cycles, we didn’t see a lights-on/lights-off outage, but intermittent, global instability,” ThousandEyes reported, noting that the incident revealed how a distributed edge combined with staggered updates can create intermittent problems (see the simulation sketch below).
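The intermittent pattern follows from the refresh schedule itself. The short simulation below, using invented numbers, shows how proxies that reload configuration at staggered points in a five-minute cycle fail gradually after a bad file is published, rather than all at once.

```python
# Minimal simulation of staggered configuration refreshes (invented numbers).
import math
import random

REFRESH_PERIOD = 300   # seconds: each proxy reloads its config every 5 minutes
NUM_PROXIES = 200
BAD_CONFIG_AT = 600    # time (s) at which the oversized file is published
SIM_END = 1200

random.seed(1)
# Each proxy's refresh schedule starts at a random offset within the cycle.
offsets = [random.uniform(0, REFRESH_PERIOD) for _ in range(NUM_PROXIES)]

def first_bad_refresh(offset: float) -> float:
    """First scheduled refresh at or after the bad file became available."""
    cycles = max(math.ceil((BAD_CONFIG_AT - offset) / REFRESH_PERIOD), 0)
    return offset + cycles * REFRESH_PERIOD

failures = sorted(first_bad_refresh(o) for o in offsets)

# The failing population grows over roughly one refresh period instead of
# dropping all at once, which is why the outage looks intermittent rather
# than a clean lights-on/lights-off event.
for t in range(BAD_CONFIG_AT, SIM_END + 1, 60):
    down = sum(1 for f in failures if f <= t)
    print(f"t={t:4d}s  proxies failed: {down}/{NUM_PROXIES}")
```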
Lessons learned in 2025
ThousandEyes highlighted several takeaways for network operations teams looking to improve their resilience in 2026:
Don’t rely on single symptoms: they can be misleading. The true cause of a disruption can emerge from combinations of signals. “If the network looks healthy but users are experiencing issues, the problem may be in the backend,” according to ThousandEyes. “Simultaneous failures across channels point to shared dependencies, while intermittent failures may indicate rollout or edge problems.” (A rough sketch of this kind of signal triage follows these takeaways.)
Focus on rapid detection and response. The complexity of modern systems means it’s unrealistic to prevent every possible issue through testing alone. “Instead, focus on building rapid detection and response capabilities, using strategies such as staged rollouts and clear communication with stakeholders,” ThousandEyes stated.
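As a rough illustration of combining signals rather than reacting to any single symptom, the sketch below turns the heuristics quoted above into a simple triage function; the field names and example values are assumptions made for the illustration.

```python
# Minimal sketch of signal-combination triage (heuristics only, invented fields).
from dataclasses import dataclass

@dataclass
class Signals:
    network_healthy: bool   # connectivity, latency, and packet loss all fine
    users_affected: bool    # real user-facing errors or degradation
    channels_affected: int  # how many independent frontends are failing
    intermittent: bool      # failures come and go rather than being constant

def hypothesis(s: Signals) -> str:
    if not s.network_healthy:
        return "network-layer issue (routing, DNS, or path problems)"
    if s.users_affected and s.channels_affected > 1:
        return "shared backend dependency behind every customer touchpoint"
    if s.users_affected and s.intermittent:
        return "staged rollout or edge-node problem"
    if s.users_affected:
        return "backend or application-layer issue despite a healthy network"
    return "no clear fault pattern; keep correlating signals"

# Example: the Slack-style pattern of a healthy network but unhappy users.
print(hypothesis(Signals(network_healthy=True, users_affected=True,
                         channels_affected=1, intermittent=False)))
```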
