# Fixing GPU Starvation in Large-Scale Distributed Training

Page: https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training
Text version: https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md
Podcast: [MLOps.community](https://stenobird.com/podcast/mlops-community)
Published: 2026-04-03T17:00:28+00:00
Episode link: https://podcasters.spotify.com/pod/show/mlops/episodes/Fixing-GPU-Starvation-in-Large-Scale-Distributed-Training-e3hcn48
Audio file: https://anchor.fm/s/174cb1b8/podcast/play/117905992/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-3%2F421342927-44100-2-a2a86203b9d3d.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training
Duration seconds: 3168

## Resource

Scaling machine learning models is limited more often by the underlying infrastructure and data I/O bottlenecks than by model architecture. Kashish Mittal details how optimizing data loading and caching layers can drastically increase GPU utilization and reduce training time.

## Highlights
- Main idea: Infrastructure, specifically data loading and I/O, is the primary hidden constraint when scaling ML workloads
- Practical takeaway: Redesigning caching layers to bypass CPU transformation bottlenecks can boost GPU utilization from low levels to over 60%
- Failure mode: Increasing parallelism too aggressively can introduce non-deterministic data ordering and label skew, degrading model quality
- Trade-off: Balancing latency and throughput in serving means weighing the gains from larger batch sizes against the serialization costs of fetching features
- Lesson: Efficient distributed training requires a full-stack profiling approach to identify whether the bottleneck lies in the producer, consumer, or network
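
The producer/consumer distinction in the last highlight can be made concrete with a small timing harness. The following is a minimal sketch, not the method described in the episode: `profile_loop` and `slow_producer` are hypothetical names, and the sleeps stand in for real data loading and a real training step. It splits wall-clock time into time spent waiting for the next batch (producer-bound) and time spent in the step (consumer-bound).

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def profile_loop(batches: Iterable[T], step: Callable[[T], None]) -> Tuple[float, float]:
    """Return (seconds spent waiting for data, seconds spent inside step)."""
    wait = 0.0
    compute = 0.0
    it: Iterator[T] = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the producer prepares a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # the consumer: one training step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_producer(n: int, delay: float) -> Iterator[int]:
    """Hypothetical stand-in for a data loader that takes `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    wait, compute = profile_loop(slow_producer(20, 0.01), lambda b: time.sleep(0.002))
    print(f"share of time waiting on data: {wait / (wait + compute):.0%}")
```

A high wait share suggests the accelerator is being starved by the input pipeline. On a real GPU the step is usually launched asynchronously, so the step timing is only meaningful if the device is synchronized before the second timestamp (for example with `torch.cuda.synchronize()` in PyTorch).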

## Topics

MLOps, Distributed Training, GPU Utilization, Data Engineering, Machine Learning Infrastructure, Petastorm, System Profiling, Large-scale Systems

## Chapters
- 1:00 — The Economics of AI Infrastructure: A discussion on the rising costs of compute and the decision-making process regarding human engineers versus AI agents.
- 4:55 — The Hidden Constraint: Data I/O: Why scaling models is often an infrastructure problem rather than an architectural one, focusing on data bottlenecks.
- 8:40 — Hardware-Specific Optimization: Exploring the differences between TPU and GPU environments and how hardware affects optimization strategies.
- 12:30 — Profiling the Data Pipeline: A deep dive into the producer-consumer architecture of Petastorm and identifying where data gets clogged.
- 16:40 — Implementing Local Caching: Strategies for using local disk/SSD to cache data and reduce expensive remote file system calls.
- 24:30 — The Cost of Batching: Analyzing the trade-offs between CPU and GPU execution and the overhead of increasing batch sizes.
- 28:35 — Risks of High Parallelism: How extreme parallelism can lead to data order non-determinism and impact model training stability.
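
The local caching idea from the 16:40 chapter can be sketched as a read-through cache on local disk. This is an illustrative sketch under assumptions, not Petastorm's cache API or the design discussed in the episode: `LocalDiskCache` is a hypothetical class, and `fetch` is a placeholder for whatever expensive remote file system read the pipeline performs.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable


class LocalDiskCache:
    """Read-through cache: serve from local disk, fall back to a remote fetch."""

    def __init__(self, cache_dir: str, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # hypothetical remote read, e.g. one object-store call
        self.hits = 0
        self.misses = 0

    def _path(self, key: str) -> Path:
        # Hash the key so arbitrary remote paths map to safe local file names.
        return self.cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()

    def get(self, key: str) -> bytes:
        path = self._path(key)
        if path.exists():
            self.hits += 1
            return path.read_bytes()
        self.misses += 1
        data = self.fetch(key)
        # Write to a temporary file and rename it into place, so a concurrent
        # reader never observes a partially written cache entry.
        fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
        return data
```

The first epoch pays the remote cost once per key; later epochs read from local SSD. A production cache would also need an eviction policy and a size limit, which this sketch omits.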

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
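
An agent could invoke `request_transcript` along the following lines. This is a minimal sketch using only the Python standard library: the endpoint URL is the one listed above, while the helper names are hypothetical. It assumes the endpoint accepts an empty request body and needs no authentication, which this page does not state, and the response format is not documented here, so only the HTTP status is returned.

```python
import urllib.request

EPISODE_API = (
    "https://stenobird.com/v1/public/podcasts/mlops-community/episodes/"
    "fixing-gpu-starvation-in-large-scale-distributed-training"
)


def build_transcript_request() -> urllib.request.Request:
    """Build the POST request for the request_transcript action (not yet sent)."""
    return urllib.request.Request(f"{EPISODE_API}/transcription-requests", method="POST")


def request_transcript(timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    Assumption: an empty body with no credentials is accepted by the endpoint.
    """
    with urllib.request.urlopen(build_transcript_request(), timeout=timeout) as resp:
        return resp.status
```

Because the action is described as idempotent, repeating the call for the same episode should be safe.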

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.