Asana: February 5 & 6
- Duration: Two consecutive outages, with the second lasting roughly 20 minutes
- Symptoms: Service unavailability and degraded performance
- Cause: A configuration change overloaded server logs on February 5, causing servers to restart. A second outage with similar characteristics occurred the next day.
- Takeaways: “This pair of outages highlights the complexity of modern systems and how difficult it is to test for every possible interaction scenario,” ThousandEyes reported. Following the incidents, Asana transitioned to staged configuration rollouts.
Slack: February 26
- Duration: 9 hours
- Symptoms: Users could log in and browse channels, but experienced issues sending and receiving messages.
- Cause: Problems with a maintenance activity in Slack’s database systems caused heavy traffic to overload the database.
- Takeaways: “At first glance, everything looked fine at Slack: network connectivity was good, there were no latency issues, and no packet loss,” according to ThousandEyes. Only by combining multiple diagnostic observations could investigators determine that the true source was the database system, later confirmed by Slack.
X: March 10
- Duration: Several hours, with various services experiencing downtime
- Symptoms: The platform appeared “down,” with users experiencing connection failures similar to a distributed denial-of-service (DDoS) attack.
- Cause: Network failures occurred, with significant packet loss and connection errors during the TCP signaling phase. “Connection errors typically indicate a deeper problem at the network layer,” according to ThousandEyes.
- Takeaways: ThousandEyes detected traffic being dropped before sessions could be established, but there were no visible BGP route changes, which would typically occur during DDoS mitigation. “It was a network-level failure, but not what it might have first appeared,” ThousandEyes noted.
Zoom: April 16
- Duration: Roughly two hours
- Symptoms: All Zoom services were unavailable globally.
- Cause: Zoom’s name server (NS) records disappeared from the top-level domain (TLD) nameservers, making the service unreachable despite healthy infrastructure.
- Takeaways: “Although the servers themselves were healthy throughout and were answering correctly when queried directly, the DNS resolvers couldn’t find them because of the missing records,” ThousandEyes reported. The incident highlights how failures above a company’s Domain Name System (DNS) layer can completely knock out services (see the sketch below).
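To make this failure mode concrete, here is a minimal sketch, using the third-party dnspython package, of how an operator might check whether a zone’s NS delegation is still present at the TLD nameservers. The domain and TLD server names are illustrative stand-ins, not details from the incident.

```python
# Minimal sketch (assumes the third-party dnspython package is installed):
# query a TLD nameserver directly to see whether a zone's NS delegation exists.
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

DOMAIN = "example.com"              # stand-in for the zone to check
TLD_SERVER = "a.gtld-servers.net"   # one of the .com TLD nameservers

def delegation_present(domain: str, tld_server: str) -> bool:
    """Return True if the TLD server hands back NS records for the domain."""
    tld_ip = dns.resolver.resolve(tld_server, "A")[0].to_text()
    query = dns.message.make_query(domain, dns.rdatatype.NS)
    response = dns.query.udp(query, tld_ip, timeout=5.0)
    # A healthy delegation returns NS records in the answer or authority
    # section; if they are absent, resolvers cannot locate the zone even
    # though the zone's own servers may be answering perfectly well.
    return any(rrset.rdtype == dns.rdatatype.NS
               for rrset in response.answer + response.authority)

if __name__ == "__main__":
    print("NS delegation found:", delegation_present(DOMAIN, TLD_SERVER))
```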
Spotify: April 16
- Duration: More than two hours
- Symptoms: The application’s front end loaded normally, but tracks and videos wouldn’t play properly.
- Cause: Backend service issues, while network connectivity, DNS, and CDN “all appeared healthy.”
- Takeaways: “The vital signs were all good: connectivity, DNS, and CDN all appeared healthy,” according to ThousandEyes, which added that the incident illustrated how “server-side failures can quietly cripple core functionality while giving the appearance that everything is working normally.”
Google Cloud: June 12
- Duration: More than two and a half hours
- Symptoms: Users couldn’t use Google to authenticate to third-party apps such as Spotify and Fitbit; knock-on effects impacted Cloudflare services and downstream applications.
- Cause: An invalid automated update disrupted the company’s identity and access management (IAM) system.
- Takeaways: “What you had was a three-tier cascade: Google’s failure led to Cloudflare problems, which affected downstream applications relying on Cloudflare,” ThousandEyes explained, adding that the incident is a “reminder to trace a fault all the way back to the source” (see the sketch below).
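As a rough illustration of tracing a fault back to its source, the sketch below walks a hypothetical dependency map from a failing application up to the furthest unhealthy upstream component. The service names and health data are invented for the example, not taken from the incident.

```python
# Minimal sketch: follow unhealthy upstream dependencies to find the root cause.
# All names and health states below are hypothetical.
DEPENDS_ON = {
    "downstream-app": "cloudflare-workers",
    "cloudflare-workers": "google-iam",
    "spotify-login": "google-iam",
}
UNHEALTHY = {"downstream-app", "cloudflare-workers", "google-iam"}

def trace_root_cause(service: str) -> str:
    """Walk up the dependency chain while the upstream is also unhealthy."""
    current = service
    while DEPENDS_ON.get(current) in UNHEALTHY:
        current = DEPENDS_ON[current]
    return current

print(trace_root_cause("downstream-app"))  # -> google-iam
```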
Cloudflare: July 14
- Duration: More than one hour
- Symptoms: Traffic couldn’t reach numerous websites and apps that rely on Cloudflare’s 1.1.1.1 DNS resolver.
- Cause: A configuration error introduced weeks earlier was triggered by an unrelated change, causing Cloudflare’s BGP route announcements to disappear from the global internet routing table.
- Takeaways: “With no valid routes, traffic couldn’t reach Cloudflare’s 1.1.1.1 DNS resolver,” ThousandEyes reported, adding that the incident highlights “how flaws in configuration updates don’t always trigger an immediate crisis, instead storing up problems for later” (see the sketch below).
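The following minimal sketch, with hypothetical names, illustrates the general pattern described above: a flawed configuration entry sits dormant because nothing re-evaluates it when it is added, then an unrelated change forces a full rollout and the latent flaw takes effect.

```python
# Minimal sketch of a latent configuration error (all names are hypothetical).
from dataclasses import dataclass, field

@dataclass
class RouteController:
    announced: set = field(default_factory=set)   # prefixes currently announced
    bindings: dict = field(default_factory=dict)  # prefix -> service topology

    def bind(self, prefix: str, topology: str) -> None:
        """Record a binding; nothing is re-evaluated until the next rollout."""
        self.bindings[prefix] = topology

    def rollout(self, active_topologies: set) -> None:
        """Recompute announcements from the full config (runs on any change)."""
        self.announced = {p for p, topo in self.bindings.items()
                          if topo in active_topologies}

ctrl = RouteController()
ctrl.bind("192.0.2.0/24", "production")
ctrl.rollout({"production"})
print(ctrl.announced)            # {'192.0.2.0/24'} -- everything looks fine

# Weeks earlier, in effect: the prefix is mistakenly rebound to a
# non-production topology. Nothing breaks yet, because no rollout happens.
ctrl.bind("192.0.2.0/24", "test-only")

# Later, an unrelated change triggers a rollout and the latent flaw fires.
ctrl.rollout({"production"})
print(ctrl.announced)            # set() -- the routes are withdrawn
```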
- Duration: More than two hours
- Symptoms: The company’s mobile app, website, and ATMs all went down simultaneously.
- Cause: A shared backend dependency failed, affecting all customer touchpoints, ThousandEyes estimated.
- Takeaways: “The fact that three different channels with three different frontend technologies failed eliminates app or UI issues,” ThousandEyes noted, explaining that this incident demonstrated “how a single failure can instantly disable every customer touchpoint, and why it’s vital to check all signals before reaching for remedies.”
Microsoft Azure: October 9 & 29
- Duration: Both incidents lasted several hours
- Symptoms: The first outage affected users in the EMEA region with slowdowns and failures; the second impacted users worldwide with HTTP 503 errors and connection timeouts.
- Cause: The October 9 incident was caused by software defects that crashed edge sites in the EMEA region; the October 29 outage was triggered by a configuration change.
- Takeaways: “Together, these two outages illustrate an important distinction: infrastructure failures tend to be regional, with only certain customers affected, while configuration errors often hit all regions simultaneously,” according to ThousandEyes.
AWS: October 20
- Duration: More than 15 hours for some customers
- Symptoms: Long, global service disruptions affected major customers, including Slack, Atlassian, and Snapchat.
- Cause: A failure in the US-EAST-1 region, but global services such as IAM and DynamoDB Global Tables depended on that regional endpoint, meaning the outage propagated worldwide.
- Takeaways: “The incident highlights how a failure in a single, centralized service can ripple outward through dependency chains that aren’t always apparent from architecture diagrams,” ThousandEyes noted.
Cloudflare: November 18
- Duration: Several hours of intermittent, global instability
- Symptoms: Intermittent service disruptions rather than a complete outage
- Cause: A bad configuration file in Cloudflare’s Bot Management system exceeded a hard-coded limit, causing proxies to fail as they loaded the oversized file on staggered five-minute cycles.
- Takeaways: “Because the proxies refreshed configurations on staggered five-minute cycles, we didn’t see a lights-on/lights-off outage, but intermittent, global instability,” ThousandEyes reported, noting that the incident revealed how a distributed edge combined with staggered updates can create intermittent problems (see the simulation sketch below).
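The intermittent pattern follows from the refresh schedule itself. The short simulation below, using invented numbers, shows how proxies that reload configuration at staggered points in a five-minute cycle fail gradually after a bad file is published, rather than all at once.

```python
# Minimal simulation of staggered configuration refreshes (invented numbers).
import math
import random

REFRESH_PERIOD = 300   # seconds: each proxy reloads its config every 5 minutes
NUM_PROXIES = 200
BAD_CONFIG_AT = 600    # time (s) at which the oversized file is published
SIM_END = 1200

random.seed(1)
# Each proxy's refresh schedule starts at a random offset within the cycle.
offsets = [random.uniform(0, REFRESH_PERIOD) for _ in range(NUM_PROXIES)]

def first_bad_refresh(offset: float) -> float:
    """First scheduled refresh at or after the bad file became available."""
    cycles = max(math.ceil((BAD_CONFIG_AT - offset) / REFRESH_PERIOD), 0)
    return offset + cycles * REFRESH_PERIOD

failures = sorted(first_bad_refresh(o) for o in offsets)

# The failing population grows over roughly one refresh period instead of
# dropping all at once, which is why the outage looks intermittent rather
# than a clean lights-on/lights-off event.
for t in range(BAD_CONFIG_AT, SIM_END + 1, 60):
    down = sum(1 for f in failures if f <= t)
    print(f"t={t:4d}s  proxies failed: {down}/{NUM_PROXIES}")
```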
Lessons learned in 2025
ThousandEyes highlighted several takeaways for network operations teams looking to improve their resilience in 2026:
Don’t rely on single symptoms: they can be misleading. The true cause of a disruption can emerge from combinations of signals. “If the network looks healthy but users are experiencing issues, the problem may be in the backend,” according to ThousandEyes. “Simultaneous failures across channels point to shared dependencies, while intermittent failures may indicate rollout or edge problems.” (A rough sketch of this kind of signal triage follows these takeaways.)
Focus on rapid detection and response. The complexity of modern systems means it’s unrealistic to prevent every possible issue through testing alone. “Instead, focus on building rapid detection and response capabilities, using strategies such as staged rollouts and clear communication with stakeholders,” ThousandEyes stated.
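As a rough illustration of combining signals rather than reacting to any single symptom, the sketch below turns the heuristics quoted above into a simple triage function; the field names and example values are assumptions made for the illustration.

```python
# Minimal sketch of signal-combination triage (heuristics only, invented fields).
from dataclasses import dataclass

@dataclass
class Signals:
    network_healthy: bool   # connectivity, latency, and packet loss all fine
    users_affected: bool    # real user-facing errors or degradation
    channels_affected: int  # how many independent frontends are failing
    intermittent: bool      # failures come and go rather than being constant

def hypothesis(s: Signals) -> str:
    if not s.network_healthy:
        return "network-layer issue (routing, DNS, or path problems)"
    if s.users_affected and s.channels_affected > 1:
        return "shared backend dependency behind every customer touchpoint"
    if s.users_affected and s.intermittent:
        return "staged rollout or edge-node problem"
    if s.users_affected:
        return "backend or application-layer issue despite a healthy network"
    return "no clear fault pattern; keep correlating signals"

# Example: the Slack-style pattern of a healthy network but unhappy users.
print(hypothesis(Signals(network_healthy=True, users_affected=True,
                         channels_affected=1, intermittent=False)))
```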
