Episode
Move K8s Stateful Pods Between Nodes
- Published: Oct 9, 2025
- Duration seconds: 2819
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/devops-and-docker-talk-cloud-native-interviews-and-tooling/episodes/move-k8s-stateful-pods-between-nodes/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/devops-and-docker-talk-cloud-native-interviews-and-tooling/move-k8s-stateful-pods-between-nodes.md
Read the agent-friendly Markdown representation of this episode resource.
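As a minimal sketch of the POST action above, the request can be assembled from the podcast and episode slugs. The helper name `build_transcription_request` is hypothetical, and the empty request body and absence of authentication headers are assumptions: this page does not state what the endpoint requires.

```python
from urllib.parse import quote
from urllib.request import Request

# Base path taken from the POST URL listed under Actions.
BASE_URL = "https://stenobird.com/v1/public/podcasts"


def build_transcription_request(podcast_slug: str, episode_slug: str) -> Request:
    """Build (but do not send) the transcript-generation POST request.

    Hypothetical helper: assumes no body and no auth headers are needed.
    """
    url = (
        f"{BASE_URL}/{quote(podcast_slug, safe='')}"
        f"/episodes/{quote(episode_slug, safe='')}/transcription-requests"
    )
    return Request(url, method="POST")


req = build_transcription_request(
    "devops-and-docker-talk-cloud-native-interviews-and-tooling",
    "move-k8s-stateful-pods-between-nodes",
)
print(req.get_method(), req.full_url)
```

The request could then be sent with `urllib.request.urlopen(req)`. Because the action is described as idempotent, repeating the call for the same episode should not queue duplicate work, although the response format is not documented here.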
Summary
Explore the technical mechanics of live migrating Kubernetes pods between nodes without downtime or data loss. This deep dive covers how Cast AI maintains TCP connections, memory state, and IP addresses during real-time transitions.
Topics
- Kubernetes
- Live Migration
- Cloud Native
- DevOps
- Container Networking
- Stateful Workloads
- Cast AI
- Infrastructure Automation
Highlights
- Main idea: Live migration solves the 'stateful workload' problem by moving running pod data and memory between nodes
- Practical takeaway: Use live migration for seamless hardware maintenance, OS patching, and optimizing bin packing without service interruptions
- Failure mode: Network bandwidth constraints and high-throughput disk replication can significantly increase migration latency
- Technical challenge: Maintaining persistent IP addresses and TCP connections requires custom CNI plugin integration
- Future trend: The evolution of live migration will likely extend to managing spot instance interruptions and on-premise Kubernetes environments
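The bandwidth failure mode in the highlights can be made concrete with a back-of-the-envelope estimate. The sketch below computes only a lower bound on the time needed to copy a pod's memory once over a link; the function name and the example figures (8 GiB of state, 10 Gbit/s and 1 Gbit/s links) are illustrative assumptions, not numbers from the episode, and the model ignores protocol overhead, disk replication, and memory that changes during the copy.

```python
def transfer_seconds(state_gib: float, bandwidth_gbps: float) -> float:
    """Lower bound on seconds needed to copy `state_gib` GiB of state
    over a link with `bandwidth_gbps` Gbit/s of usable bandwidth."""
    if state_gib < 0 or bandwidth_gbps <= 0:
        raise ValueError("state must be >= 0 and bandwidth must be > 0")
    bits = state_gib * 1024**3 * 8          # GiB -> bits
    return bits / (bandwidth_gbps * 1e9)    # Gbit/s -> bit/s


# Illustrative comparison: the same 8 GiB pod on a fast and a slow link.
for gbps in (10, 1):
    print(f"{gbps:>2} Gbit/s: {transfer_seconds(8, gbps):.1f} s")
```

Under these assumptions the same pod takes roughly ten times longer to move on the slower link, which is why available bandwidth tends to dominate migration latency for memory-heavy workloads.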
Chapters
- 1:00 The Problem with Pod Restarts: Discussing the risks of outages when pods are forced to restart or redeploy during node maintenance.
- 7:40 Solving Stateful Workload Challenges: Addressing the difficulty of managing stateful sets and daemonsets in Kubernetes clusters.
- 11:30 Infrastructure Efficiency and Bin Packing: Analyzing why Kubernetes clusters often suffer from low CPU utilization and how automation helps.
- 21:50 Networking and Bandwidth Constraints: Evaluating how network traffic and bandwidth impact the speed of memory replication during migration.
- 25:40 Cloud Provider Roadmap: A look at the timeline for expanding live migration support to EKS, GKE, and on-premise solutions.
- 29:10 Live Migration for Spot Instances: Discussing the potential for using live migration to handle the dynamic nature of spot instance availability.
- 39:40 The Engineering Behind the Migration: A deep dive into the year-long engineering effort required to snapshot workloads and move memory state.