Episode

The Hidden Challenges of Running AI at Scale in Production

Podcast: The Data Exchange with Ben Lorica
Published: Mar 12, 2026
Duration seconds: 1941
Processing state: processed
Canonical source: https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/18789806-the-hidden-challenges-of-running-ai-at-scale-in-production.mp3
Audio: https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/18789806-the-hidden-challenges-of-running-ai-at-scale-in-production.mp3
JSON: /v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/the-hidden-challenges-of-running-ai-at-scale-in-production
Markdown: /podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production.md

Actions

POST https://stenobird.com/v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/the-hidden-challenges-of-running-ai-at-scale-in-production/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production.md
Read the agent-friendly Markdown representation of this episode resource.
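A minimal Python sketch of how a client might prepare the two actions above. It only builds the request objects and does not send them. The empty POST body, the absence of authentication headers, and the helper names are assumptions for illustration; the response format is not described here.

```python
import urllib.request

# URLs copied from the Actions list above.
TRANSCRIPTION_URL = (
    "https://stenobird.com/v1/public/podcasts/the-data-exchange-with-ben-lorica"
    "/episodes/the-hidden-challenges-of-running-ai-at-scale-in-production"
    "/transcription-requests"
)
MARKDOWN_URL = (
    "https://stenobird.com/podcast/the-data-exchange-with-ben-lorica"
    "/the-hidden-challenges-of-running-ai-at-scale-in-production.md"
)


def build_transcription_request() -> urllib.request.Request:
    """Hypothetical helper: prepare the POST that requests a transcript.

    Assumption: the endpoint accepts an empty body and needs no auth header.
    """
    return urllib.request.Request(TRANSCRIPTION_URL, data=b"", method="POST")


def build_markdown_request() -> urllib.request.Request:
    """Hypothetical helper: prepare the GET for the Markdown representation."""
    return urllib.request.Request(MARKDOWN_URL, method="GET")


if __name__ == "__main__":
    for req in (build_transcription_request(), build_markdown_request()):
        # Print the method and URL instead of sending anything over the network.
        print(req.get_method(), req.full_url)
```

To actually send a prepared request, pass it to `urllib.request.urlopen`. Because the POST is described as idempotent, repeating it for the same episode should be safe.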

Summary

Moving AI from pilot to production requires a fundamental shift from experimentation to managing complex, multi-node infrastructure. Chen Goldberg explains why optimizing 'goodput' and improving observability are critical for scaling AI workloads effectively.

Topics

AI Infrastructure
GPU Computing
Cloud Engineering
Machine Learning Operations
Distributed Systems
Kubernetes
Data Observability
CoreWeave

Highlights

Main idea: Scaling AI requires moving beyond single-node thinking to managing complex multi-node orchestration and networking
Practical takeaway: Focus on 'goodput'—the actual time GPUs spend performing useful work—by optimizing data throughput and caching
Failure mode: Relying on bad benchmarks or high-level abstractions without visibility into the underlying hardware bottlenecks
Main idea: The transition to AI-first clouds is driven by the need for specialized hardware orchestration that general-purpose clouds lack
Practical takeaway: Use AI-driven observability to unify telemetry across storage, network, and workloads to accelerate troubleshooting
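As a rough illustration of the 'goodput' takeaway above, the sketch below expresses goodput as the fraction of allocated GPU time spent on useful work. The fraction formulation, the `GpuInterval` type, and the sample durations are assumptions made for this example and are not figures from the episode.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GpuInterval:
    """Hypothetical record of one stretch of allocated GPU time."""
    seconds: float
    useful: bool  # True when the GPU was doing productive compute


def goodput_fraction(intervals: Iterable[GpuInterval]) -> float:
    """Return useful GPU time divided by total allocated GPU time."""
    intervals = list(intervals)
    total = sum(i.seconds for i in intervals)
    if total == 0:
        return 0.0  # no allocated time, so report zero rather than divide by zero
    useful = sum(i.seconds for i in intervals if i.useful)
    return useful / total


if __name__ == "__main__":
    # Illustrative timeline: compute, then a stall waiting on data, then a restart.
    timeline = [
        GpuInterval(seconds=60.0, useful=True),
        GpuInterval(seconds=30.0, useful=False),  # e.g. waiting on input data
        GpuInterval(seconds=10.0, useful=False),  # e.g. recovering from a failure
    ]
    print(f"goodput: {goodput_fraction(timeline):.2f}")
```

Under this framing, faster data throughput and better caching raise goodput by shrinking the non-useful intervals rather than by speeding up the compute itself.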

Chapters

1:00 The Reality of AI Production: Debunking the myth that AI is stuck in the pilot phase and discussing the shift toward real-world production use cases.
3:30 Choosing an AI-First Cloud: When enterprises should move away from established general-purpose cloud providers toward specialized AI infrastructure.
8:20 Optimizing GPU Goodput: How to maximize compute efficiency by addressing bottlenecks in data volume, throughput, and caching mechanisms.
10:40 The Complexity of Multi-Node Systems: The engineering challenges introduced by moving from single-node tasks to highly available, distributed AI orchestration.
15:20 Unified Observability and Mission Control: Using integrated telemetry to gain transparency into the entire stack, from storage to workload performance.
27:30 Navigating Technical Debt and Career Growth: Advice for engineers on using new AI tools to augment their expertise rather than replace the need for deep domain knowledge.