Episode

Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale

Podcast
MLOps.community
Published
Feb 10, 2026
Duration seconds
3444
Processing state
processed
Canonical source
https://podcasters.spotify.com/pod/show/mlops/episodes/Software-Engineering-in-the-Age-of-Coding-Agents-Testing--Evals--and-Shipping-Safely-at-Scale-e3eta9q
Audio
https://anchor.fm/s/174cb1b8/podcast/play/115304186/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-10%2F417834561-44100-2-32c1411bf9507.mp3
JSON
/v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale
Markdown
/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md
    Read the agent-friendly Markdown representation of this episode resource.
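
The two actions above can be sketched as plain HTTP requests. This is a minimal sketch, not a documented client: the URLs and methods come from the list above, while the empty POST body, the absence of authentication headers, and the helper names are assumptions. The requests are only constructed here, not sent.

```python
from urllib.request import Request

BASE = "https://stenobird.com"
PODCAST = "mlops-community"
EPISODE = (
    "software-engineering-in-the-age-of-coding-agents-"
    "testing-evals-and-shipping-safely-at-scale"
)


def transcription_request() -> Request:
    """Build the POST that requests low-priority transcript generation."""
    url = (
        f"{BASE}/v1/public/podcasts/{PODCAST}"
        f"/episodes/{EPISODE}/transcription-requests"
    )
    # Assumption: the endpoint accepts an empty body and needs no auth header.
    return Request(url, data=b"", method="POST")


def markdown_request() -> Request:
    """Build the GET for the Markdown representation of this episode."""
    url = f"{BASE}/podcast/{PODCAST}/{EPISODE}.md"
    return Request(url, method="GET")


if __name__ == "__main__":
    for req in (transcription_request(), markdown_request()):
        print(req.get_method(), req.full_url)
```

Either request could be sent with `urllib.request.urlopen(req)`. The response format is not described on this page, so any parsing would need to be checked against a real response.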

Summary

Engineering agentic systems requires a hybrid approach that combines traditional, deterministic software engineering with non-deterministic machine learning. This discussion explores how to manage complexity, evaluate performance, and maintain trust in autonomous AI workflows.

Topics

  • AI Agents
  • Software Engineering
  • LLM Evaluation
  • MLOps
  • Agentic Workflows
  • Observability
  • Cybersecurity AI
  • System Architecture

Highlights

  • Main idea: Agentic systems are a hybrid of deterministic software engineering and non-deterministic predictive modeling
  • Practical takeaway: Avoid over-engineering multi-agent graphs; use a single agent with well-defined 'skills' written in traditional code whenever possible
  • Failure mode: Over-reliance on LLMs for logic that can be handled by pure business logic leads to increased costs and decreased testability
  • Practical takeaway: Implement 'LLM as a judge' and integration tests that use real customer data to evaluate agent performance at scale
  • Failure mode: Neglecting the UX of observability; users need clear audit trails and 'reasoning' visibility to trust autonomous actions
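
The "single agent with skills" takeaway can be illustrated with a small registry of ordinary functions. This is a sketch under stated assumptions, not the architecture described in the episode: the skill names, the refund and order-total examples, and `fake_model` (a stand-in for a real LLM call) are all hypothetical. The point it shows is that the model only chooses which skill to run, while the logic itself stays in deterministic, unit-testable code.

```python
from typing import Callable, Dict, List, Tuple

# Registry of skills: plain functions holding the business logic.
SKILLS: Dict[str, Callable[[dict], str]] = {}


def skill(name: str):
    """Decorator that registers a function as a named skill."""
    def register(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        SKILLS[name] = fn
        return fn
    return register


@skill("refund_eligibility")
def refund_eligibility(args: dict) -> str:
    # Pure business logic: no model call, so it is cheap and testable.
    return "eligible" if args["days_since_purchase"] <= 30 else "not eligible"


@skill("order_total")
def order_total(args: dict) -> str:
    return f"{sum(args['prices']):.2f}"


Chooser = Callable[[str, List[str]], Tuple[str, dict]]


def run_agent(choose_skill: Chooser, request: str) -> str:
    """Single agent loop: the model picks a skill, the code executes it."""
    name, args = choose_skill(request, sorted(SKILLS))
    if name not in SKILLS:
        raise ValueError(f"model selected unknown skill: {name}")
    return SKILLS[name](args)


def fake_model(request: str, available: List[str]) -> Tuple[str, dict]:
    # Hypothetical stand-in for an LLM that returns a skill name and arguments.
    if "refund" in request.lower():
        return "refund_eligibility", {"days_since_purchase": 12}
    return "order_total", {"prices": [19.99, 5.01]}


if __name__ == "__main__":
    print(run_agent(fake_model, "Can I get a refund?"))
    print(run_agent(fake_model, "What is my order total?"))
```

Because each skill is an ordinary function, it can be covered by conventional unit tests, and only the routing step depends on model behavior.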

Chapters

  1. 1:00 The Value of AI Coding Assistants: An exploration of how tools like Claude Code and Cursor are significantly increasing developer velocity and the ROI of AI-driven coding.
  2. 5:20 The Shift to Hybrid Engineering: Discussing the transition from traditional software requirements to managing systems that blend deterministic code with probabilistic prompts.
  3. 18:10 Observability and Audit Trails: The necessity of building transparent audit trails so users can understand the reasoning behind an agent's specific actions.
  4. 31:05 Language Sensitivity in Prompts: How subtle changes in prompt wording can trigger different reasoning paths and the importance of versioning prompt lineage.
  5. 39:45 Evaluating Agentic Workflows: Strategies for implementing two-level evaluations: unit-test-style integration tests and large-scale evaluations against real-world data.
  6. 52:50 Architectural Minimalism: A critique of complex multi-agent graphs and the argument for keeping architectures simple by managing context effectively within a single agent.
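
Chapters 4 and 5 mention prompt lineage and two-level evaluation. The sketch below shows one way those ideas could fit together, and it is an assumption rather than the approach described in the episode: prompts are identified by a content hash with a pointer to their parent version, and an evaluation run reports an aggregate pass rate tagged with the prompt id. `stub_agent`, `stub_judge`, and the alert cases are hypothetical stand-ins for a real agent, an LLM judge, and real data.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class PromptVersion:
    """A prompt identified by its content hash, linked to its parent."""
    text: str
    parent_id: Optional[str] = None

    @property
    def id(self) -> str:
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def evaluate(
    prompt: PromptVersion,
    agent: Callable[[str], str],
    judge: Callable[[Dict[str, str], str], bool],
    cases: List[Dict[str, str]],
    threshold: float,
) -> dict:
    """Level 2: run every case, let the judge score it, report the pass rate."""
    passed = sum(1 for case in cases if judge(case, agent(case["input"])))
    rate = passed / len(cases)
    return {"prompt_id": prompt.id, "pass_rate": rate, "ok": rate >= threshold}


def stub_agent(text: str) -> str:
    # Hypothetical stand-in for the agent under test.
    return "malicious" if "powershell" in text.lower() else "benign"


def stub_judge(case: Dict[str, str], output: str) -> bool:
    # Hypothetical stand-in for an LLM judge: exact match against a label.
    return output == case["expected"]


# Hypothetical evaluation set, not data from the episode.
CASES = [
    {"input": "PowerShell download cradle", "expected": "malicious"},
    {"input": "User logged in from office", "expected": "benign"},
    {"input": "Encoded macro in attachment", "expected": "malicious"},
    {"input": "Nightly backup finished", "expected": "benign"},
]

v1 = PromptVersion("Classify the alert as benign or malicious.")
v2 = PromptVersion(v1.text + " Explain why.", parent_id=v1.id)

if __name__ == "__main__":
    # Level 1: a unit-test-style check on one known case.
    assert stub_agent(CASES[0]["input"]) == "malicious"
    # Level 2: aggregate evaluation, tagged with the prompt version.
    print(evaluate(v2, stub_agent, stub_judge, CASES, threshold=0.7))
```

Recording the prompt id alongside each result makes it possible to trace a change in pass rate back to a specific wording change.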