Episode

Move K8s Stateful Pods Between Nodes

Podcast: DevOps and Docker Talk: Cloud Native Interviews and Tooling
Published: Oct 9, 2025
Duration seconds: 2682
Processing state: processed
Canonical source: https://podcast.bretfisher.com/episodes/move-k8s-stateful-pods-between-nodes
Audio: https://media.transistor.fm/dc3be907/cf37a395.mp3
JSON: /v1/public/podcasts/devops-and-docker-talk-cloud-native-interviews-and-tooling/episodes/move-k8s-stateful-pods-between-nodes
Markdown: /podcast/devops-and-docker-talk-cloud-native-interviews-and-tooling/move-k8s-stateful-pods-between-nodes.md

Actions

POST https://stenobird.com/v1/public/podcasts/devops-and-docker-talk-cloud-native-interviews-and-tooling/episodes/move-k8s-stateful-pods-between-nodes/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/devops-and-docker-talk-cloud-native-interviews-and-tooling/move-k8s-stateful-pods-between-nodes.md
Read the agent-friendly Markdown representation of this episode resource.
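
A minimal sketch of the POST action above, using only the Python standard library. The URL is copied from the listing. The empty body, the `Accept` header, and the helper name `build_transcription_request` are assumptions, since the request and response formats are not documented here. The code only builds the request and sends nothing.

```python
import urllib.request

# URL copied from the Actions listing above.
TRANSCRIPTION_URL = (
    "https://stenobird.com/v1/public/podcasts/"
    "devops-and-docker-talk-cloud-native-interviews-and-tooling/"
    "episodes/move-k8s-stateful-pods-between-nodes/transcription-requests"
)


def build_transcription_request(url: str = TRANSCRIPTION_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request. Hypothetical helper."""
    # Assumption: the endpoint accepts an empty body; the listing does not say.
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Accept": "application/json"},  # assumed, not documented
    )


request = build_transcription_request()
print(request.get_method(), request.full_url)
```

To send it, pass the request to `urllib.request.urlopen`. The action is described as idempotent, so repeating the call for the same episode should be safe, though the response shape would need checking against the live API.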

Summary

Explore the technical mechanics of live migrating Kubernetes pods between nodes without downtime or data loss. This deep dive covers how Cast AI maintains TCP connections, memory state, and IP addresses during real-time transitions.

Topics

Kubernetes
Live Migration
Cloud Native
DevOps
Container Networking
Stateful Workloads
Cast AI
Infrastructure Automation

Highlights

Main idea: Live migration solves the 'stateful workload' problem by moving a running pod's data and memory state between nodes instead of restarting the pod
Practical takeaway: Use live migration for seamless hardware maintenance, OS patching, and optimizing bin packing without service interruptions
Failure mode: Network bandwidth constraints and high-throughput disk replication can significantly increase migration latency
Technical challenge: Maintaining persistent IP addresses and TCP connections requires custom CNI plugin integration
Future trend: Live migration will likely extend to handling spot instance interruptions and on-premises Kubernetes environments
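
The bandwidth failure mode above can be made concrete with back-of-the-envelope arithmetic. The sketch below is a rough lower bound under stated assumptions, not a description of how Cast AI performs or measures migration. It assumes a single full copy of the pod's memory over a link with known usable bandwidth, and it ignores memory pages that change during the copy, disk replication, and protocol overhead. The function name and the example figures are hypothetical.

```python
def min_transfer_seconds(memory_gib: float, bandwidth_gbps: float) -> float:
    """Lower bound on the time to copy a pod's memory once over the network.

    memory_gib: memory to copy, in GiB (2**30 bytes each).
    bandwidth_gbps: usable link bandwidth, in gigabits per second (1e9 bits/s).
    """
    if memory_gib < 0:
        raise ValueError("memory_gib must be non-negative")
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth_gbps must be positive")
    bits_to_copy = memory_gib * 2**30 * 8
    return bits_to_copy / (bandwidth_gbps * 1e9)


# Hypothetical example: the same 8 GiB pod on a fast link and a slow link.
for gbps in (10.0, 1.0):
    seconds = min_transfer_seconds(8.0, gbps)
    print(f"8 GiB at {gbps:g} Gbit/s: at least {seconds:.1f} s")
```

Under these assumptions, a tenfold drop in usable bandwidth means a tenfold longer copy, which is why a congested network stretches migration time.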

Chapters

1:00 The Problem with Pod Restarts: Discussing the risks of outages when pods are forced to restart or redeploy during node maintenance.
7:40 Solving Stateful Workload Challenges: Addressing the difficulty of managing StatefulSets and DaemonSets in Kubernetes clusters.
11:30 Infrastructure Efficiency and Bin Packing: Analyzing why Kubernetes clusters often suffer from low CPU utilization and how automation helps.
21:50 Networking and Bandwidth Constraints: Evaluating how network traffic and bandwidth impact the speed of memory replication during migration.
25:40 Cloud Provider Roadmap: A look at the timeline for expanding live migration support to EKS, GKE, and on-premises solutions.
29:10 Live Migration for Spot Instances: Discussing the potential for using live migration to handle the dynamic nature of spot instance availability.
39:40 The Engineering Behind the Migration: A deep dive into the year-long engineering effort required to snapshot workloads and move memory state.
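
The bin packing idea from the 11:30 chapter can be illustrated with a generic heuristic. The sketch below uses first-fit decreasing, a textbook approach chosen here for illustration. It is not the Kubernetes scheduler's algorithm or Cast AI's, and it considers CPU requests only. The function name, pod sizes, and node capacity are hypothetical.

```python
def first_fit_decreasing(pod_cpu_millicores, node_capacity_millicores):
    """Place pod CPU requests on nodes using the first-fit decreasing heuristic.

    Returns a list of nodes, where each node is the list of requests placed on it.
    """
    nodes = []
    # Placing the largest requests first tends to leave fewer unusable gaps.
    for request in sorted(pod_cpu_millicores, reverse=True):
        if request > node_capacity_millicores:
            raise ValueError(f"request {request}m exceeds node capacity")
        for node in nodes:
            if sum(node) + request <= node_capacity_millicores:
                node.append(request)  # first existing node with enough room
                break
        else:
            nodes.append([request])  # nothing fits, so open a new node
    return nodes


# Hypothetical workload: six pods on nodes with 4000m of allocatable CPU.
placement = first_fit_decreasing([500, 1500, 1000, 250, 750, 2000], 4000)
for index, node in enumerate(placement, start=1):
    print(f"node {index}: {node} -> {sum(node) / 4000:.0%} requested")
```

A real cluster rarely reaches a packing like this on its own, because pods that cannot be moved safely stay where they first landed. Being able to relocate running pods is what lets an automated optimizer act on a tighter placement.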