Episode

Fixing GPU Starvation in Large-Scale Distributed Training

Podcast: MLOps.community
Published: Apr 3, 2026
Duration seconds: 3168
Processing state: processed
Canonical source: https://podcasters.spotify.com/pod/show/mlops/episodes/Fixing-GPU-Starvation-in-Large-Scale-Distributed-Training-e3hcn48
Audio: https://anchor.fm/s/174cb1b8/podcast/play/117905992/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-3%2F421342927-44100-2-a2a86203b9d3d.mp3
JSON: /v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training
Markdown: /podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md

Actions

POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md
Read the agent-friendly Markdown representation of this episode resource.

Summary

Scaling machine learning models is limited less often by model architecture than by the underlying infrastructure and data I/O bottlenecks. Kashish Mittal details how optimizing data loading and caching layers can drastically increase GPU utilization and reduce training time.

Topics

MLOps
Distributed Training
GPU Utilization
Data Engineering
Machine Learning Infrastructure
Petastorm
System Profiling
Large-scale Systems

Highlights

Main idea: Infrastructure, specifically data loading and I/O, is the primary hidden constraint when scaling ML workloads
Practical takeaway: Redesigning caching layers to bypass CPU transformation bottlenecks can boost GPU utilization from low levels to over 60%
Failure mode: Increasing parallelism too aggressively can introduce non-determinism and label skew, degrading model quality
Trade-off: Balancing latency and throughput in serving means weighing larger batch sizes against the serialization cost of fetching features
Lesson: Efficient distributed training requires a full-stack profiling approach to identify whether the bottleneck lies in the producer, consumer, or network
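
As a rough illustration of the profiling lesson above, the sketch below splits each training step into time spent waiting for the next batch (the producer side) and time spent in the step itself (the consumer side). It is not taken from the episode: `profile_input_stall`, `stall_fraction`, and the injectable `clock` parameter are hypothetical names chosen for this example, and the approach assumes `train_step` blocks until the device has finished its work.

```python
import time
from typing import Callable, Iterable, Tuple


def profile_input_stall(
    batches: Iterable,
    train_step: Callable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (seconds waiting on data, seconds in compute) for one pass.

    Assumption: train_step is synchronous. On accelerators that launch work
    asynchronously, the step must block until the device is done, otherwise
    compute time is undercounted and shows up as data wait instead.
    """
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(it)  # time here is the consumer waiting on the producer
        except StopIteration:
            break
        t1 = clock()
        train_step(batch)     # time here is the consumer doing useful work
        t2 = clock()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s


def stall_fraction(wait_s: float, compute_s: float) -> float:
    """Fraction of wall time spent waiting for input (0.0 if nothing ran)."""
    total = wait_s + compute_s
    return wait_s / total if total > 0 else 0.0
```

A high stall fraction suggests the data pipeline or the network is the bottleneck; a low one points back at the training step. Passing a fake `clock` makes the measurement deterministic in tests.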

Chapters

1:00 The Economics of AI Infrastructure: A discussion of the rising cost of compute and how to decide between human engineers and AI agents.
4:55 The Hidden Constraint: Data I/O: Why scaling models is often an infrastructure problem rather than an architectural one, focusing on data bottlenecks.
8:40 Hardware-Specific Optimization: Exploring the differences between TPU and GPU environments and how hardware affects optimization strategies.
12:30 Profiling the Data Pipeline: A deep dive into the producer-consumer architecture of Petastorm and identifying where data gets clogged.
16:40 Implementing Local Caching: Strategies for using local disk/SSD to cache data and reduce expensive remote file system calls.
24:30 The Cost of Batching: Analyzing the trade-offs between CPU and GPU execution and the overhead of increasing batch sizes.
28:35 Risks of High Parallelism: How extreme parallelism can lead to data order non-determinism and impact model training stability.
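
The local caching idea from the 16:40 chapter can be sketched as a read-through cache on local disk. This is a minimal illustration under stated assumptions, not the implementation discussed in the episode: `cached_fetch`, `fetch_remote`, and `cache_dir` are hypothetical names, and the remote file system is represented by any callable that returns bytes for a key.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(key: str, fetch_remote: Callable[[str], bytes], cache_dir: str) -> bytes:
    """Return the bytes for key, reading local disk first and remote only on a miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary remote paths map to safe local file names.
    path = cache / hashlib.sha256(key.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()  # cache hit: no remote call

    data = fetch_remote(key)      # cache miss: one expensive remote read
    # Write to a temp file and rename, so concurrent readers never see a partial file.
    fd, tmp = tempfile.mkstemp(dir=str(cache))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return data
```

The sketch leaves out eviction and cache invalidation, which a real training cluster would need once the dataset no longer fits on local SSD.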