# Fixing GPU Starvation in Large-Scale Distributed Training

- Page: https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training
- Text version: https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md
- Podcast: [MLOps.community](https://stenobird.com/podcast/mlops-community)
- Published: 2026-04-03T17:00:28+00:00
- Episode link: https://podcasters.spotify.com/pod/show/mlops/episodes/Fixing-GPU-Starvation-in-Large-Scale-Distributed-Training-e3hcn48
- Audio file: https://anchor.fm/s/174cb1b8/podcast/play/117905992/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-3%2F421342927-44100-2-a2a86203b9d3d.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training
- Duration seconds: 3168

## Resource

Scaling machine learning models is rarely limited by model architecture; the limit is usually the underlying infrastructure and data I/O bottlenecks. Kashish Mittal details how optimizing data loading and caching layers can drastically increase GPU utilization and reduce training time.

## Highlights

- Main idea: Infrastructure, specifically data loading and I/O, is the primary hidden constraint when scaling ML workloads
- Practical takeaway: Redesigning caching layers to bypass CPU transformation bottlenecks can boost GPU utilization from low levels to over 60%
- Failure mode: Increasing parallelism too aggressively can introduce non-determinism and label skewness, degrading model quality
- Trade-off: Balancing latency and throughput in serving requires deciding between larger batch sizes and the serialization costs of fetching features
- Lesson: Efficient distributed training requires a full-stack profiling approach to identify whether the bottleneck lies in the producer, consumer, or network

## Topics

MLOps, Distributed Training, GPU Utilization, Data Engineering, Machine Learning Infrastructure, Petastorm, System Profiling, Large-scale Systems

## Chapters

- 1:00 - The Economics of AI Infrastructure: A discussion on the rising costs of compute and the decision-making process regarding human engineers versus AI agents.
- 4:55 - The Hidden Constraint: Data I/O: Why scaling models is often an infrastructure problem rather than an architectural one, focusing on data bottlenecks.
- 8:40 - Hardware-Specific Optimization: Exploring the differences between TPU and GPU environments and how hardware affects optimization strategies.
- 12:30 - Profiling the Data Pipeline: A deep dive into the producer-consumer architecture of Petastorm and identifying where data gets clogged.
- 16:40 - Implementing Local Caching: Strategies for using local disk/SSD to cache data and reduce expensive remote file system calls.
- 24:30 - The Cost of Batching: Analyzing the trade-offs between CPU and GPU execution and the overhead of increasing batch sizes.
- 28:35 - Risks of High Parallelism: How extreme parallelism can lead to data order non-determinism and impact model training stability.

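
The profiling lesson in the highlights (is the producer or the consumer the bottleneck?) can be illustrated with a minimal Python sketch. It is not taken from the episode: `StepTiming`, `time_steps`, `starvation_fraction`, `classify`, and the 0.2 threshold are all hypothetical names and values chosen for illustration. It separates only data-wait time from compute time and does not isolate network cost.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StepTiming:
    data_wait_s: float  # time the loop spent blocked waiting for the next batch
    compute_s: float    # time spent inside the training step


def time_steps(batches: Iterable, train_step: Callable) -> List[StepTiming]:
    """Record, per step, how long the loop waited for data vs. computed."""
    timings = []
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the producer side is behind
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a real accelerator, train_step must block until the device
        # finishes, otherwise compute time is under-measured.
        train_step(batch)
        t2 = time.perf_counter()
        timings.append(StepTiming(t1 - t0, t2 - t1))
    return timings


def starvation_fraction(timings: List[StepTiming]) -> float:
    """Share of wall-clock time spent waiting on data (0.0 if no steps)."""
    wait = sum(t.data_wait_s for t in timings)
    total = sum(t.data_wait_s + t.compute_s for t in timings)
    return wait / total if total else 0.0


def classify(timings: List[StepTiming], threshold: float = 0.2) -> str:
    """Hypothetical rule: above the threshold, the data side is the bottleneck."""
    if starvation_fraction(timings) > threshold:
        return "producer-bound"
    return "consumer-bound"
```

A high starvation fraction suggests looking at data loading, transformation, or caching before touching the model; a low one suggests the training step itself dominates.
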
## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training/transcription-requests` - Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md` - Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.