Episode

Is It Broken Everywhere or Just for Me with Omri Sass

Podcast: Screaming in the Cloud
Published: Jan 22, 2026
Duration seconds: 1867
Processing state: processed
Canonical source: https://share.transistor.fm/s/eae3ff44
Audio: https://dts.podtrac.com/redirect.mp3/media.transistor.fm/eae3ff44/ba6763df.mp3
JSON: /v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass
Markdown: /podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md

Actions

POST https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md
Read the agent-friendly Markdown representation of this episode resource.

Summary

Distinguishing between a local code failure and a global cloud outage is critical for rapid incident response. Omri Sass explains how Datadog built updog.ai, which applies machine learning to real-world data to detect service outages across major providers like AWS and Cloudflare.

Topics

Cloud Infrastructure
Observability
Incident Response
Machine Learning
AWS
SaaS Reliability
Site Reliability Engineering
Cloud Outages

Highlights

Main idea: Updog.ai uses massive amounts of real-world data from thousands of computers to detect outages, rather than relying on unreliable user reports
Practical takeaway: Identifying a global provider outage immediately allows engineers to avoid wasting time debugging local code during a 3 AM incident
Failure mode: Manual endpoint testing does not work at scale; instead, use anomaly detection to spot shifts in latency and error rates
Industry trend: The centralization of infrastructure in a few hyperscalers means a single provider failure can cause massive, simultaneous global outages
Technical challenge: Building a reliable detector requires sophisticated ML models to distinguish one-off environment changes from true service outages
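
To make the anomaly-detection highlight concrete, here is a minimal Python sketch that flags samples deviating sharply from a trailing baseline. It is a simple rolling z-score, not the models discussed in the episode; the function name, window size, threshold, and sample values are all illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0, min_sigma=1e-9):
    """Return indices of samples that deviate from the trailing window.

    Hypothetical helper: a rolling z-score, chosen for illustration only.
    `samples` could be per-minute error rates or latencies.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        # Floor the spread so a perfectly flat baseline does not divide by zero.
        sigma = max(stdev(baseline), min_sigma)
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up error rates with one obvious spike at index 6.
error_rates = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.250, 0.012]
spikes = find_anomalies(error_rates)
```

A single-series check like this cannot tell a provider outage from a change in one environment, which is the gap the technical-challenge highlight describes.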

Chapters

3:40 The 3 AM Decision: The critical distinction between a local environment issue and a global cloud outage during an incident.
5:55 Detecting EC2 Outages via Anomaly Detection: How shifts in error rates and latency in Datadog's own systems revealed underlying AWS infrastructure failures.
8:05 The Need for High-Level Visibility: Why engineers need an 'above the fold' view of service health to avoid chasing ghosts during outages.
10:25 The Reality of Cloud Provider Failures: Moving past skepticism to understand the actual impact and scale of modern cloud outages.
22:00 Refining Detection with Machine Learning: How Datadog uses proprietary ML models to distinguish between true outages and localized environment changes.
24:10 Using Observability to Gate Deployments: How external service health data can automatically pause or gate software deployments during instability.
28:45 The Risks of Infrastructure Centralization: How the concentration of services in major hyperscalers creates new, large-scale systemic risks.
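
The deployment-gating idea from the 24:10 chapter can be sketched in Python as below. This is an assumption-laden illustration: `ProviderHealth`, `gate_deployment`, and the provider names are hypothetical, and no real Datadog or updog.ai API is used or implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderHealth:
    """Hypothetical health report for one external dependency."""
    name: str
    degraded: bool


def gate_deployment(dependencies, health_reports):
    """Return (allowed, blocking), where blocking lists degraded dependencies.

    Dependencies with no report are treated as healthy in this sketch;
    a stricter gate could block on unknown status instead.
    """
    degraded = {report.name for report in health_reports if report.degraded}
    blocking = sorted(set(dependencies) & degraded)
    return (not blocking, blocking)


# Made-up reports: one degraded provider, one healthy.
reports = [
    ProviderHealth("aws-ec2", degraded=True),
    ProviderHealth("cloudflare", degraded=False),
]
allowed, blocking = gate_deployment(["aws-ec2", "cloudflare"], reports)
```

In a pipeline, a false `allowed` value would pause the rollout and surface `blocking` to the on-call engineer rather than fail silently.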