Episode

What it's really like to run AGI safety at Google DeepMind (and where I disagree with 'doomers') | Rohin Shah

Podcast: 80,000 Hours Podcast
Published: Jun 2, 2026
Duration: 10107 seconds (2:48:27)
Processing state: not_requested
Canonical source: https://80000hours.org/podcast/episodes/rohin-shah-google-deepmind-agi-safety/?utm_campaign=podcast__rohin-shah&utm_source=80000+Hours+Podcast&utm_medium=podcast
Audio: https://media.transistor.fm/384dada1/af00734a.mp3
JSON: /v1/public/podcasts/80-000-hours-podcast-747608/episodes/what-it-s-really-like-to-run-agi-safety-at-google-deepmind-and-where-i-disagree-with-doomers-rohin-shah
Markdown: /podcast/80-000-hours-podcast-747608/what-it-s-really-like-to-run-agi-safety-at-google-deepmind-and-where-i-disagree-with-doomers-rohin-shah.md

Summary

Most people working on AI safety think that, without a massive effort, AI systems will probably end up with goals catastrophically different from humanity’s. Today’s guest, Rohin Shah (head of AGI Safety and Alignment at Google DeepMind, and an AI safety researcher since 2017), disagrees. “There is no particularly compelling argument that this is the thing that happens by default,” Rohin explains. “There’s a lot of arguments that are suggestive that maybe it could happen, such that you should find it plausible. That’s sufficient to justify a significant amount of effort into averting it, which is why I work in the area I do. But none of them rise to the level of, ‘I’m expecting this to happen by default.’”

Take the worry that AIs will accidentally be trained to be deceptive. Sure, it’s possible. But we’re not running reinforcement learning over year-long trajectories; for now, we’re running it over a week at most. The natural prediction is that models learn to grab short-term reward, not that they develop the ambitious long-horizon goals required for convergent power-seeking.

What about current examples of models lying and scheming? Rohin has looked into the details, and most don’t closely resemble what we actually fear: a competent AI pursuing an ambitious misaligned goal. Anthropic’s “alignment faking” results, for instance, show a model trying to preserve its trained values against modification, which is arguably what it was trained to do.

Rohin also expects we’ll see problems coming. There’s some generalisation risk at the point where AIs become powerful enough to actually take over, but the underlying challenges, such as overseeing superhuman systems and interpretability, are things we can iterate on now.

Host Rob Wiblin pushes back on the case for AI optimism, and they also…