# The Hidden Challenges of Running AI at Scale in Production

- Page: https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production
- Text version: https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production.md
- Podcast: [The Data Exchange with Ben Lorica](https://stenobird.com/podcast/the-data-exchange-with-ben-lorica)
- Published: 2026-03-12T11:00:00+00:00
- Episode link: https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/18789806-the-hidden-challenges-of-running-ai-at-scale-in-production.mp3
- Audio file: https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/18789806-the-hidden-challenges-of-running-ai-at-scale-in-production.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/the-hidden-challenges-of-running-ai-at-scale-in-production
- Duration seconds: 1941

## Resource

Moving AI from pilot to production requires a fundamental shift from experimentation to managing complex, multi-node infrastructure. Chen Goldberg explains how optimizing 'goodput' and observability is critical for scaling AI workloads effectively.

## Highlights

- Main idea: Scaling AI requires moving beyond single-node thinking to managing complex multi-node orchestration and networking
- Practical takeaway: Focus on 'goodput' (the actual time GPUs spend performing useful work) by optimizing data throughput and caching
- Failure mode: Relying on bad benchmarks or high-level abstractions without visibility into the underlying hardware bottlenecks
- Main idea: The transition to AI-first clouds is driven by the need for specialized hardware orchestration that general-purpose clouds lack
- Practical takeaway: Use AI-driven observability to unify telemetry across storage, network, and workloads to accelerate troubleshooting

## Topics

AI Infrastructure, GPU Computing, Cloud Engineering, Machine Learning Operations, Distributed Systems, Kubernetes, Data Observability, CoreWeave

## Chapters

- 1:00 - The Reality of AI Production: Debunking the myth that AI is stuck in the pilot phase and discussing the shift toward real-world production use cases.
- 3:30 - Choosing an AI-First Cloud: When enterprises should move away from established general-purpose cloud providers toward specialized AI infrastructure.
- 8:20 - Optimizing GPU Goodput: How to maximize compute efficiency by addressing bottlenecks in data volume, throughput, and caching mechanisms.
- 10:40 - The Complexity of Multi-Node Systems: The engineering challenges introduced by moving from single-node tasks to highly available, distributed AI orchestration.
- 15:20 - Unified Observability and Mission Control: Using integrated telemetry to gain transparency into the entire stack, from storage to workload performance.
- 27:30 - Navigating Technical Debt and Career Growth: Advice for engineers on leveraging new AI tools to augment expertise rather than replacing the need for deep domain knowledge.
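
The 'goodput' takeaway in the highlights can be made concrete with a small calculation. The sketch below is illustrative only: it assumes goodput is expressed as the fraction of allocated GPU time spent on useful work, and the class, field names, and time categories (`GpuTimeBreakdown`, `data_stall_s`, and so on) are hypothetical, not taken from the episode.

```python
from dataclasses import dataclass


@dataclass
class GpuTimeBreakdown:
    """Hypothetical accounting of where allocated GPU time went, in seconds."""
    useful_compute_s: float   # time spent on useful work
    data_stall_s: float       # time waiting on data loading or cache misses
    other_overhead_s: float   # restarts, idle gaps, and similar overhead

    @property
    def total_s(self) -> float:
        return self.useful_compute_s + self.data_stall_s + self.other_overhead_s


def goodput_fraction(breakdown: GpuTimeBreakdown) -> float:
    """Return useful time divided by total allocated time (an assumed definition)."""
    total = breakdown.total_s
    if total <= 0:
        raise ValueError("total allocated time must be positive")
    return breakdown.useful_compute_s / total


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    sample = GpuTimeBreakdown(useful_compute_s=60.0, data_stall_s=30.0, other_overhead_s=10.0)
    print(f"goodput: {goodput_fraction(sample):.0%}")
```

Under this assumed definition, reducing data stalls through better throughput or caching raises the fraction without adding any GPUs.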

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/the-hidden-challenges-of-running-ai-at-scale-in-production/transcription-requests` - Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production.md` - Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
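
As an illustration of the two actions listed under Actions, the following is a minimal sketch using only Python's standard library. The URLs and HTTP methods come from this page. Everything else is an assumption: the page does not document request bodies, headers, authentication, or response formats, so the sketch sends an empty `POST` and treats responses as plain text. The helper names are hypothetical.

```python
import urllib.request

BASE = "https://stenobird.com"
PODCAST = "the-data-exchange-with-ben-lorica"
EPISODE = "the-hidden-challenges-of-running-ai-at-scale-in-production"


def build_request_transcript() -> urllib.request.Request:
    """Build the POST for request_transcript (empty body is an assumption)."""
    url = f"{BASE}/v1/public/podcasts/{PODCAST}/episodes/{EPISODE}/transcription-requests"
    return urllib.request.Request(url, method="POST")


def build_read_markdown() -> urllib.request.Request:
    """Build the GET for read_markdown."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return urllib.request.Request(url, method="GET")


def send(req: urllib.request.Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, body decoded as text)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Reading the Markdown does not enqueue transcription, so the POST is explicit.
    status, body = send(build_read_markdown())
    print(status, len(body))
    status, body = send(build_request_transcript())
    print(status)
```

Because the page describes `request_transcript` as idempotent, repeating the `POST` should be safe, though the exact response to a repeated request is not specified here.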