Episode

Move K8s Stateful Pods Between Nodes

Podcast
DevOps and Docker Talk: Cloud Native Interviews and Tooling
Published
Oct 9, 2025
Duration seconds
2819
Processing state
processed
Canonical source
https://podcast.bretfisher.com/episodes/move-k8s-stateful-pods-between-nodes
Audio
https://media.transistor.fm/dc3be907/cf37a395.mp3
JSON
/v1/public/podcasts/devops-and-docker-talk-cloud-native-interviews-and-tooling/episodes/move-k8s-stateful-pods-between-nodes
Markdown
/podcast/devops-and-docker-talk-cloud-native-interviews-and-tooling/move-k8s-stateful-pods-between-nodes.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/devops-and-docker-talk-cloud-native-interviews-and-tooling/episodes/move-k8s-stateful-pods-between-nodes/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/devops-and-docker-talk-cloud-native-interviews-and-tooling/move-k8s-stateful-pods-between-nodes.md
    Read the agent-friendly Markdown representation of this episode resource.
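
The two actions above could be prepared from a script roughly as follows. This is a minimal sketch using only the Python standard library. The URLs and HTTP methods come from the listing above; the empty POST body, the absence of authentication headers, and the helper names (`build_transcription_request`, `build_markdown_request`, `send`) are assumptions for illustration, since the listing does not specify them.

```python
import urllib.request

# URLs copied from the Actions list above.
BASE = "https://stenobird.com"
EPISODE_PATH = (
    "/v1/public/podcasts/devops-and-docker-talk-cloud-native-interviews-and-tooling"
    "/episodes/move-k8s-stateful-pods-between-nodes"
)
MARKDOWN_PATH = (
    "/podcast/devops-and-docker-talk-cloud-native-interviews-and-tooling"
    "/move-k8s-stateful-pods-between-nodes.md"
)


def build_transcription_request() -> urllib.request.Request:
    """Prepare the POST that requests transcript generation.

    Assumption: the endpoint accepts an empty body and needs no auth header.
    """
    url = BASE + EPISODE_PATH + "/transcription-requests"
    return urllib.request.Request(url, data=b"", method="POST")


def build_markdown_request() -> urllib.request.Request:
    """Prepare the GET for the Markdown representation of the episode."""
    return urllib.request.Request(BASE + MARKDOWN_PATH, method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, bytes]:
    """Send a prepared request and return (status code, response body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

Calling `send(build_markdown_request())` would fetch the Markdown page. Because the POST is described as idempotent, repeating `send(build_transcription_request())` should not queue duplicate work, though that behavior is taken from the description above rather than verified here.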

Summary

Explore the technical mechanics of live-migrating Kubernetes pods between nodes without downtime or data loss. This deep dive covers how Cast AI preserves TCP connections, memory state, and IP addresses while a running pod moves to another node.

Topics

  • Kubernetes
  • Live Migration
  • Cloud Native
  • DevOps
  • Container Networking
  • Stateful Workloads
  • Cast AI
  • Infrastructure Automation

Highlights

  • Main idea: Live migration solves the 'stateful workload' problem by moving a running pod's data and memory state between nodes
  • Practical takeaway: Use live migration for seamless hardware maintenance, OS patching, and optimizing bin packing without service interruptions
  • Failure mode: Network bandwidth constraints and high-throughput disk replication can significantly increase migration latency
  • Technical challenge: Maintaining persistent IP addresses and TCP connections requires custom CNI plugin integration
  • Future trend: Live migration will likely extend to handling spot instance interruptions and on-premises Kubernetes environments
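
To make the bandwidth failure mode above concrete, here is a back-of-the-envelope sketch of how long it takes to copy a pod's memory once over the network. It is plain arithmetic (bits to move divided by link speed), not a description of how Cast AI implements migration. The function name and the example figures (8 GiB of memory, 10 and 1 Gbit/s links) are hypothetical, and the estimate is a lower bound: it ignores re-copying memory pages that change during the transfer, protocol overhead, and disk replication.

```python
def estimate_transfer_seconds(memory_gib: float, bandwidth_gbit_per_s: float) -> float:
    """Lower-bound time to copy `memory_gib` of RAM once over a link.

    memory_gib: memory to move, in GiB (2**30 bytes each).
    bandwidth_gbit_per_s: usable link bandwidth, in Gbit/s (10**9 bits/s).
    """
    if memory_gib < 0:
        raise ValueError("memory_gib must be non-negative")
    if bandwidth_gbit_per_s <= 0:
        raise ValueError("bandwidth_gbit_per_s must be positive")
    bits_to_move = memory_gib * 1024**3 * 8
    return bits_to_move / (bandwidth_gbit_per_s * 1e9)


# Hypothetical example: a pod holding 8 GiB of memory.
fast_link = estimate_transfer_seconds(8, 10)  # about 6.9 seconds
slow_link = estimate_transfer_seconds(8, 1)   # about 68.7 seconds
```

Under these assumptions a tenfold drop in available bandwidth means a tenfold longer copy, which is why a congested network can stretch a migration.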

Chapters

  1. 1:00 The Problem with Pod Restarts: Discussing the risks of outages when pods are forced to restart or redeploy during node maintenance.
  2. 7:40 Solving Stateful Workload Challenges: Addressing the difficulty of managing StatefulSets and DaemonSets in Kubernetes clusters.
  3. 11:30 Infrastructure Efficiency and Bin Packing: Analyzing why Kubernetes clusters often suffer from low CPU utilization and how automation helps.
  4. 21:50 Networking and Bandwidth Constraints: Evaluating how network traffic and bandwidth impact the speed of memory replication during migration.
  5. 25:40 Cloud Provider Roadmap: A look at the timeline for expanding live migration support to EKS, GKE, and on-premises solutions.
  6. 29:10 Live Migration for Spot Instances: Discussing the potential for using live migration to handle the dynamic nature of spot instance availability.
  7. 39:40 The Engineering Behind the Migration: A deep dive into the year-long engineering effort required to snapshot workloads and move memory state.