# Is It Broken Everywhere or Just for Me with Omri Sass

Page: https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass
Text version: https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md
Podcast: [Screaming in the Cloud](https://stenobird.com/podcast/screaming-in-the-cloud)
Published: 2026-01-22T11:00:00+00:00
Episode link: https://share.transistor.fm/s/eae3ff44
Audio file: https://dts.podtrac.com/redirect.mp3/media.transistor.fm/eae3ff44/ba6763df.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass
Duration seconds: 1867

## Resource

Distinguishing a local code failure from a global cloud outage is critical for rapid incident response. Omri Sass explains how Datadog built updog.ai, which applies machine learning to real-world data to detect service outages across major providers such as AWS and Cloudflare.

## Highlights
- Main idea: Updog.ai uses massive amounts of real-world data from thousands of computers to detect outages, rather than relying on unreliable user reports
- Practical takeaway: Identifying a global provider outage right away lets engineers avoid wasting time debugging local code during a 3 AM incident
- Failure mode: Manually testing endpoints does not work at scale; instead, use anomaly detection to spot shifts in latency and error rates
- Industry trend: The centralization of infrastructure in a few hyperscalers means a single provider failure can cause massive, simultaneous global outages
- Technical challenge: Building a reliable detector requires sophisticated ML models to filter out one-off environment changes from true service outages
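
The anomaly-detection and filtering ideas in the highlights above can be illustrated with a minimal Python sketch. It is not Datadog's model, which the episode summary describes as proprietary ML; it swaps in a plain z-score test instead. The function names, the threshold of 3.0, and the 50% agreement rule are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], current: float, threshold: float = 3.0) -> bool:
    """Return True if `current` is more than `threshold` sample standard
    deviations above the mean of `baseline` (a simple z-score test)."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: any increase counts as a shift.
        return current > mu
    return (current - mu) / sigma > threshold


def looks_like_provider_outage(
    environments: dict[str, tuple[list[float], float]],
    min_fraction: float = 0.5,  # assumed agreement rule, not from the episode
) -> bool:
    """Treat the signal as provider-wide only when at least `min_fraction`
    of independent environments are anomalous at the same time, so a
    one-off change in a single environment is ignored."""
    if not environments:
        return False
    flagged = sum(
        is_anomalous(baseline, current) for baseline, current in environments.values()
    )
    return flagged / len(environments) >= min_fraction
```

Each entry in `environments` maps a hypothetical environment name to its recent error-rate baseline and its latest reading. Requiring agreement across environments is one simple way to separate a shared provider failure from a single team's bad deploy.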

## Topics

Cloud Infrastructure, Observability, Incident Response, Machine Learning, AWS, SaaS Reliability, Site Reliability Engineering, Cloud Outages

## Chapters
- 3:40 — The 3 AM Decision: The critical distinction between a local environment issue and a global cloud outage during an incident.
- 5:55 — Detecting EC2 Outages via Anomaly Detection: How shifts in error rates and latency in Datadog's own systems revealed underlying AWS infrastructure failures.
- 8:05 — The Need for High-Level Visibility: Why engineers need an 'above the fold' view of service health to avoid chasing ghosts during outages.
- 10:25 — The Reality of Cloud Provider Failures: Moving past skepticism to understand the actual impact and scale of modern cloud outages.
- 22:00 — Refining Detection with Machine Learning: How Datadog uses proprietary ML models to distinguish between true outages and localized environment changes.
- 24:10 — Using Observability to Gate Deployments: Using external service health data to automatically pause or gate software deployments during instability.
- 28:45 — The Risks of Infrastructure Centralization: How the concentration of services in major hyperscalers creates new, large-scale systemic risks.
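
The deployment-gating idea from the 24:10 chapter could look roughly like the following Python sketch. The `Health` values, the dependency names, and the `deployment_gate` function are hypothetical; this page does not describe how updog.ai exposes its data, so the sketch assumes the health statuses have already been fetched from some source.

```python
from enum import Enum


class Health(Enum):
    # Hypothetical status levels; not taken from any real API.
    OK = "ok"
    DEGRADED = "degraded"
    OUTAGE = "outage"


def deployment_gate(
    statuses: dict[str, Health], dependencies: set[str]
) -> tuple[bool, list[str]]:
    """Return (allowed, blockers) for a deployment.

    A dependency blocks the deploy when it is not reported as OK.
    A dependency with no known status also blocks (fail closed).
    """
    blockers = sorted(
        name for name in dependencies if statuses.get(name) is not Health.OK
    )
    return (not blockers, blockers)
```

Failing closed on unknown dependencies is a design choice: it trades some false pauses for not shipping while visibility is missing. A team that prefers availability of the pipeline could treat unknown as OK instead.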

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
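
As a rough illustration, an agent could build the two requests listed above with Python's standard library. The URLs and HTTP methods come from this page; the empty POST body, the helper names, and the 10-second timeout are assumptions, and the response format is not documented here, so `send` returns the raw status and bytes without interpreting them.

```python
import urllib.request

BASE = "https://stenobird.com"
EPISODE_PATH = (
    "/podcast/screaming-in-the-cloud/"
    "is-it-broken-everywhere-or-just-for-me-with-omri-sass"
)
API_PATH = (
    "/v1/public/podcasts/screaming-in-the-cloud/episodes/"
    "is-it-broken-everywhere-or-just-for-me-with-omri-sass"
)


def build_request_transcript() -> urllib.request.Request:
    """Build the request_transcript action (POST).

    Assumption: the endpoint accepts an empty body.
    """
    url = BASE + API_PATH + "/transcription-requests"
    return urllib.request.Request(url, data=b"", method="POST")


def build_read_markdown() -> urllib.request.Request:
    """Build the read_markdown action (GET)."""
    return urllib.request.Request(BASE + EPISODE_PATH + ".md", method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, bytes]:
    """Send a request and return (HTTP status, raw response body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

Because the action is described as idempotent, calling `send(build_request_transcript())` more than once should be safe, though the exact status codes returned on repeat calls are not stated on this page.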

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.