# Is It Broken Everywhere or Just for Me with Omri Sass

Page: https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass
Text version: https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md
Podcast: [Screaming in the Cloud](https://stenobird.com/podcast/screaming-in-the-cloud)
Published: 2026-01-22T11:00:00+00:00
Episode link: https://share.transistor.fm/s/eae3ff44
Audio file: https://dts.podtrac.com/redirect.mp3/media.transistor.fm/eae3ff44/ba6763df.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass
Duration seconds: 1867

## Resource

Distinguishing between a local code failure and a global cloud outage is critical for rapid incident response. Omri Sass explains how Datadog built updog.ai to use real-world data and machine learning to detect service outages across major providers like AWS and Cloudflare.

## Highlights

- Main idea: Updog.ai uses massive amounts of real-world data from thousands of computers to detect outages, rather than relying on unreliable user reports
- Practical takeaway: Identifying a global provider outage immediately allows engineers to avoid wasting time debugging local code during a 3 AM incident
- Failure mode: Manual endpoint testing does not scale; instead, use anomaly detection to spot shifts in latency and error rates
- Industry trend: The centralization of infrastructure in a few hyperscalers means a single provider failure can cause massive, simultaneous global outages
- Technical challenge: Building a reliable detector requires sophisticated ML models to distinguish one-off environment changes from true service outages

## Topics

Cloud Infrastructure, Observability, Incident Response, Machine Learning, AWS, SaaS Reliability, Site Reliability Engineering, Cloud Outages

## Chapters

- 3:40 — The 3 AM Decision: The critical distinction between a local environment issue and a global cloud outage during an incident.
- 5:55 — Detecting EC2 Outages via Anomaly Detection: How shifts in error rates and latency in Datadog's own systems revealed underlying AWS infrastructure failures.
- 8:05 — The Need for High-Level Visibility: Why engineers need an 'above the fold' view of service health to avoid chasing ghosts during outages.
- 10:25 — The Reality of Cloud Provider Failures: Moving past skepticism to understand the actual impact and scale of modern cloud outages.
- 22:00 — Refining Detection with Machine Learning: How Datadog uses proprietary ML models to distinguish between true outages and localized environment changes.
- 24:10 — Using Observability to Gate Deployments: Using external service health data to automatically pause or gate software deployments during instability.
- 28:45 — The Risks of Infrastructure Centralization: How the concentration of services in major hyperscalers creates new, large-scale systemic risks.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
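
An agent that needs this episode's transcript could invoke the `request_transcript` action listed under Actions roughly as follows. This is a minimal sketch using only the Python standard library. The helper name `build_transcript_request` is hypothetical, and the sketch assumes the endpoint needs no request body, authentication, or extra headers, none of which this page specifies.

```python
from urllib.request import Request

# Episode API URL as given in the JSON field of this page.
EPISODE_API = (
    "https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/"
    "episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass"
)


def build_transcript_request(episode_api: str = EPISODE_API) -> Request:
    """Build, but do not send, the POST for the request_transcript action."""
    # Assumption: no body, auth, or custom headers are required;
    # the page does not document any.
    return Request(f"{episode_api}/transcription-requests", method="POST")


req = build_transcript_request()
print(req.get_method(), req.full_url)
```

To actually send it, pass `req` to `urllib.request.urlopen`. The response format is not documented here, so check the status code before relying on the body. Because the action is described as idempotent, repeating the request for the same episode should be safe.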