Episode

Is It Broken Everywhere or Just for Me with Omri Sass

Podcast
Screaming in the Cloud
Published
Jan 22, 2026
Duration seconds
1867
Processing state
processed
Canonical source
https://share.transistor.fm/s/eae3ff44
Audio
https://dts.podtrac.com/redirect.mp3/media.transistor.fm/eae3ff44/ba6763df.mp3
JSON
/v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass
Markdown
/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/is-it-broken-everywhere-or-just-for-me-with-omri-sass/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/screaming-in-the-cloud/is-it-broken-everywhere-or-just-for-me-with-omri-sass.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

Distinguishing between a local code failure and a global cloud outage is critical for rapid incident response. Omri Sass explains how Datadog built updog.ai, which applies machine learning to real-world data to detect service outages across major providers like AWS and Cloudflare.

Topics

  • Cloud Infrastructure
  • Observability
  • Incident Response
  • Machine Learning
  • AWS
  • SaaS Reliability
  • Site Reliability Engineering
  • Cloud Outages

Highlights

  • Main idea: Updog.ai uses massive amounts of real-world data from thousands of computers to detect outages, rather than relying on unreliable user reports
  • Practical takeaway: Recognizing a global provider outage right away lets engineers avoid wasting time debugging local code during a 3 AM incident
  • Failure mode: Manual endpoint testing does not scale; instead, use anomaly detection to spot shifts in latency and error rates
  • Industry trend: The centralization of infrastructure in a few hyperscalers means a single provider failure can cause massive, simultaneous global outages
  • Technical challenge: Building a reliable detector requires sophisticated ML models to distinguish one-off environment changes from true service outages
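
The anomaly-detection and filtering ideas in the highlights above can be illustrated with a small sketch. This is not Datadog's method (the episode notes describe its models only as proprietary ML); it is a plain trailing-window z-score, assuming error rates arrive as evenly spaced samples per environment. The function names, the `window`, `threshold`, and `min_fraction` parameters, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(error_rates, window=5, threshold=3.0):
    """Return indices whose value sits more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(error_rates)):
        baseline = error_rates[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all counts as a deviation.
            if error_rates[i] != mu:
                anomalies.append(i)
            continue
        if abs(error_rates[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def looks_like_provider_outage(series_by_env, min_fraction=0.5,
                               window=5, threshold=3.0):
    """Treat the latest sample as a provider-level signal only if it is
    anomalous in at least `min_fraction` of independent environments.
    A spike in a single environment is assumed to be a local change."""
    if not series_by_env:
        return False
    flagged = 0
    for series in series_by_env.values():
        last_index = len(series) - 1
        if last_index in find_anomalies(series, window, threshold):
            flagged += 1
    return flagged / len(series_by_env) >= min_fraction


# Hypothetical error-rate samples for three unrelated environments.
steady = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011]
spiking = [0.010, 0.012, 0.011, 0.009, 0.010, 0.250]

one_off = {"env-a": spiking, "env-b": steady, "env-c": steady}
widespread = {"env-a": spiking, "env-b": spiking, "env-c": steady}
```

Here `looks_like_provider_outage(one_off)` is false because only one of three environments spikes, while `looks_like_provider_outage(widespread)` is true. Requiring agreement across environments is one simple way to separate a local change from a shared upstream failure; a production detector would need far more care around seasonality and noisy baselines.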

Chapters

  1. 3:40 The 3 AM Decision: The critical distinction between a local environment issue and a global cloud outage during an incident.
  2. 5:55 Detecting EC2 Outages via Anomaly Detection: How shifts in error rates and latency in Datadog's own systems revealed underlying AWS infrastructure failures.
  3. 8:05 The Need for High-Level Visibility: Why engineers need an 'above the fold' view of service health to avoid chasing ghosts during outages.
  4. 10:25 The Reality of Cloud Provider Failures: Moving past skepticism to understand the actual impact and scale of modern cloud outages.
  5. 22:00 Refining Detection with Machine Learning: How Datadog uses proprietary ML models to distinguish between true outages and localized environment changes.
  6. 24:10 Using Observability to Gate Deployments: Using external service health data to automatically pause or gate software deployments during instability.
  7. 28:45 The Risks of Infrastructure Centralization: How the concentration of services in major hyperscalers creates new, large-scale systemic risks.
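
The deployment gating described in chapter 6 can be sketched as a small check that runs before a rollout. This is an illustration under stated assumptions, not an updog.ai or Datadog API: the health data is assumed to already be available as a mapping from dependency name to a status string, and the status values, dependency names, and `evaluate_deploy_gate` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    proceed: bool
    blocking: tuple  # names of dependencies that blocked the deploy


def evaluate_deploy_gate(dependency_health, required):
    """Allow a deploy only when every required dependency reports
    "healthy". A dependency with no reported status is treated as
    blocking, so the gate fails closed when data is missing."""
    blocking = tuple(sorted(
        name for name in required
        if dependency_health.get(name, "unknown") != "healthy"
    ))
    return GateDecision(proceed=not blocking, blocking=blocking)


# Hypothetical snapshot of external provider health.
health = {"aws-ec2": "healthy", "cloudflare": "degraded"}

decision = evaluate_deploy_gate(health, ["aws-ec2", "cloudflare"])
if not decision.proceed:
    print("Deploy paused; unhealthy dependencies:", ", ".join(decision.blocking))
```

Failing closed on missing data is a design choice: it trades some blocked deploys for not shipping a change while the state of a dependency is unknown. A team that prefers availability of the pipeline could instead treat an unknown status as healthy.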