Episode

Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale

Podcast: MLOps.community
Published: Feb 10, 2026
Duration seconds: 3444
Canonical source: https://podcasters.spotify.com/pod/show/mlops/episodes/Software-Engineering-in-the-Age-of-Coding-Agents-Testing--Evals--and-Shipping-Safely-at-Scale-e3eta9q
Audio: https://anchor.fm/s/174cb1b8/podcast/play/115304186/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-10%2F417834561-44100-2-32c1411bf9507.mp3
JSON: /v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale
Markdown: /podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md

Actions

POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md
Read the agent-friendly Markdown representation of this episode resource.

Summary

Engineering agentic systems requires a hybrid of traditional software engineering and non-deterministic machine learning. This discussion explores how to manage complexity, evaluate performance, and maintain trust in autonomous AI workflows.

Topics

AI Agents
Software Engineering
LLM Evaluation
MLOps
Agentic Workflows
Observability
Cybersecurity AI
System Architecture

Highlights

Main idea: Agentic systems are a hybrid of deterministic software engineering and non-deterministic predictive modeling
Practical takeaway: Avoid over-engineering multi-agent graphs; use a single agent with well-defined 'skills' written in traditional code whenever possible
Failure mode: Over-reliance on LLMs for logic that can be handled by pure business logic leads to increased costs and decreased testability
Practical takeaway: Combine 'LLM as a judge' scoring with integration tests that run on real customer data to evaluate agent performance at scale
Failure mode: Neglecting the UX of observability. Users need clear audit trails and visibility into the agent's reasoning to trust autonomous actions
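
A minimal sketch of the single-agent-with-skills idea from the highlights above, assuming skills are plain functions and every action is written to an audit trail. The names `Agent`, `AuditEntry`, `refund_amount`, and `fake_router` are hypothetical and do not come from the episode. The router is a keyword stub standing in for a model call, so the example runs without any LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AuditEntry:
    # One record per action, so a user can see what ran and why.
    skill: str
    request: str
    reasoning: str
    result: str


@dataclass
class Agent:
    # Skills are deterministic functions that can be unit tested on their own.
    skills: Dict[str, Callable[[str], str]]
    # Stand-in for the model call: returns (skill name, stated reasoning).
    choose_skill: Callable[[str], Tuple[str, str]]
    audit_log: List[AuditEntry] = field(default_factory=list)

    def run(self, request: str) -> str:
        name, reasoning = self.choose_skill(request)
        if name not in self.skills:
            raise KeyError(f"unknown skill: {name}")
        result = self.skills[name](request)
        self.audit_log.append(AuditEntry(name, request, reasoning, result))
        return result


def refund_amount(request: str) -> str:
    # Pure business logic: cap the refund at 5000 cents, no model involved.
    cents = int(request.split()[-1])
    return f"refund approved: {min(cents, 5000)} cents"


def fake_router(request: str) -> Tuple[str, str]:
    # Hypothetical placeholder for an LLM routing decision.
    if request.startswith("refund"):
        return "refund", "request starts with 'refund'"
    return "unknown", "no matching skill"


agent = Agent(skills={"refund": refund_amount}, choose_skill=fake_router)
```

Calling `agent.run("refund 7200")` applies the cap in ordinary code and appends one `AuditEntry` holding the chosen skill and the router's stated reasoning. In this sketch only the routing step would be non-deterministic in a real system.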

Chapters

1:00 The Value of AI Coding Assistants: How tools like Claude Code and Cursor are increasing developer velocity, and what that means for the ROI of AI-driven coding.
5:20 The Shift to Hybrid Engineering: Discussing the transition from traditional software requirements to managing systems that blend deterministic code with probabilistic prompts.
18:10 Observability and Audit Trails: The necessity of building transparent audit trails so users can understand the reasoning behind an agent's specific actions.
31:05 Language Sensitivity in Prompts: How subtle changes in prompt wording can trigger different reasoning paths and the importance of versioning prompt lineage.
39:45 Evaluating Agentic Workflows: Strategies for implementing two-level evaluations: unit-test-style integration tests and large-scale evaluations against real-world data.
52:50 Architectural Minimalism: A critique of complex multi-agent graphs and the argument for keeping architectures simple by managing context effectively within a single agent.
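
The prompt lineage and two-level evaluation chapters above can be illustrated with a small sketch. It rests on two assumptions that are not from the episode: a prompt version is identified by a hash of its text, and the large-scale level reports a pass rate over labelled cases. `PromptVersion`, `pass_rate`, `toy_agent`, and `exact_match_judge` are hypothetical names. The judge here is an exact-match stub where a real setup might call an LLM judge.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class PromptVersion:
    text: str
    # Version id of the prompt this one was edited from, if any.
    parent: Optional[str] = None

    @property
    def version_id(self) -> str:
        # Any change in wording yields a different id.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def pass_rate(
    agent: Callable[[str], str],
    judge: Callable[[str, str], bool],
    cases: List[Tuple[str, str]],
) -> float:
    # Large-scale level: fraction of (input, expected) cases the judge accepts.
    if not cases:
        return 0.0
    passed = sum(1 for inp, expected in cases if judge(agent(inp), expected))
    return passed / len(cases)


def toy_agent(text: str) -> str:
    # Hypothetical stand-in for the system under test.
    return text.strip().lower()


def exact_match_judge(output: str, expected: str) -> bool:
    # Stub judge; an LLM-based judge would replace this comparison.
    return output == expected


v1 = PromptVersion("Classify the ticket.")
v2 = PromptVersion("Classify the ticket carefully.", parent=v1.version_id)

# Unit-test-style level: one fixed input with a hard assertion.
assert toy_agent(" Billing ") == "billing"

# Large-scale level: a score over many cases, tracked per prompt version.
cases = [(" Billing ", "billing"), ("LOGIN", "login"), ("other", "misc")]
score = pass_rate(toy_agent, exact_match_judge, cases)
```

Storing `score` next to `v2.version_id` and `v2.parent` is one way to tie a change in results back to the specific wording change that produced it.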