Episode
Is It Broken Everywhere or Just for Me with Omri Sass
- Podcast
- Screaming in the Cloud
- Published
- Jan 22, 2026
- Duration seconds
- 1867
- Processing state
- processed
- Canonical source
- https://share.transistor.fm/s/eae3ff44
Actions
POST https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md
Read the agent-friendly Markdown representation of this episode resource.
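The two actions above can be sketched in Python as prepared requests. This is a minimal sketch that only builds the requests from the URLs listed on this page; whether the endpoints expect a request body, extra headers, or authentication is not stated here, so none are assumed. The function names are hypothetical.

```python
from urllib.request import Request

# URL components copied from the actions listed on this page.
BASE = "https://stenobird.com"
PODCAST = "screaming-in-the-cloud"
EPISODE = "is-it-broken-everywhere-or-just-for-me-with-omri-sass"


def transcription_request() -> Request:
    """Build (but do not send) the POST that requests transcript generation."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    return Request(url, method="POST")


def markdown_request() -> Request:
    """Build (but do not send) the GET for the Markdown representation."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")
```

Passing either object to `urllib.request.urlopen` would send it. Because the POST is described as idempotent, repeating it should be safe.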
Summary
Distinguishing between a local code failure and a global cloud outage is critical for rapid incident response. Omri Sass explains how Datadog built updog.ai, which applies machine learning to real-world data to detect service outages across major providers such as AWS and Cloudflare.
Topics
- Cloud Infrastructure
- Observability
- Incident Response
- Machine Learning
- AWS
- SaaS Reliability
- Site Reliability Engineering
- Cloud Outages
Highlights
- Main idea: Updog.ai uses massive amounts of real-world data from thousands of computers to detect outages, rather than relying on unreliable user reports
- Practical takeaway: Identifying a global provider outage immediately allows engineers to avoid wasting time debugging local code during a 3 AM incident
- Failure mode: Manual endpoint testing is impossible at scale; instead, use anomaly detection to spot shifts in latency and error rates
- Industry trend: The centralization of infrastructure in a few hyperscalers means a single provider failure can cause massive, simultaneous global outages
- Technical challenge: Building a reliable detector requires sophisticated ML models to distinguish one-off environment changes from true service outages
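To make the anomaly-detection and filtering ideas in the highlights concrete, here is a minimal sketch. It uses a plain z-score against a recent baseline as a stand-in; it is not Datadog's model, which the episode describes only as proprietary ML. The function names, the threshold of 3.0, and the 50% agreement rule are all illustrative assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it sits more than `threshold` sample standard
    deviations above the mean of `history` (e.g. recent error rates)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu  # flat baseline: any increase stands out
    return (current - mu) / sigma > threshold


def looks_like_provider_outage(signals, min_fraction=0.5):
    """`signals` maps an environment name to (history, current).

    A single anomalous environment is treated as a local change; an outage
    is suspected only when at least `min_fraction` of environments are
    anomalous at the same time. This rule is an assumption for illustration.
    """
    if not signals:
        return False
    flagged = sum(is_anomalous(h, c) for h, c in signals.values())
    return flagged / len(signals) >= min_fraction
```

With a baseline such as `[0.01, 0.012, 0.009, 0.011, 0.010]`, a current error rate of `0.2` is flagged, while `0.011` is not. Requiring agreement across environments is one simple way to separate a one-off environment change from a shared provider failure.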
Chapters
- 3:40 The 3 AM Decision: The critical distinction between a local environment issue and a global cloud outage during an incident.
- 5:55 Detecting EC2 Outages via Anomaly Detection: How shifts in error rates and latency in Datadog's own systems revealed underlying AWS infrastructure failures.
- 8:05 The Need for High-Level Visibility: Why engineers need an 'above the fold' view of service health to avoid chasing ghosts during outages.
- 10:25 The Reality of Cloud Provider Failures: Moving past skepticism to understand the actual impact and scale of modern cloud outages.
- 22:00 Refining Detection with Machine Learning: How Datadog uses proprietary ML models to distinguish between true outages and localized environment changes.
- 24:10 Using Observability to Gate Deployments: Using external service health data to automatically pause or gate software deployments during instability.
- 28:45 The Risks of Infrastructure Centralization: How the concentration of services in major hyperscalers creates new, large-scale systemic risks.
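The deployment-gating idea from the 24:10 chapter can be sketched as a small check that runs before a release. This is a sketch under stated assumptions: `ProviderHealth` and `should_deploy` are hypothetical names, and the health records are supplied by the caller rather than fetched from any real updog.ai or Datadog API, since no such interface is described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderHealth:
    """Hypothetical health record for one external provider or service."""
    name: str
    degraded: bool


def should_deploy(dependencies, health):
    """Return (ok, blockers).

    `dependencies` is the set of provider names this deployment relies on.
    `health` is an iterable of ProviderHealth records from some external feed.
    The deploy is blocked only by degraded providers that it depends on.
    """
    blockers = [h.name for h in health if h.degraded and h.name in dependencies]
    return (not blockers, blockers)
```

For example, a service that depends on `{"aws-ec2"}` would be held back while `aws-ec2` is reported as degraded, but not by a degraded provider it never calls. Returning the list of blockers lets a pipeline explain why it paused.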