# Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale

Page: https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale
Text version: https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md
Podcast: [MLOps.community](https://stenobird.com/podcast/mlops-community)
Published: 2026-02-10T18:00:07+00:00
Episode link: https://podcasters.spotify.com/pod/show/mlops/episodes/Software-Engineering-in-the-Age-of-Coding-Agents-Testing--Evals--and-Shipping-Safely-at-Scale-e3eta9q
Audio file: https://anchor.fm/s/174cb1b8/podcast/play/115304186/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-10%2F417834561-44100-2-32c1411bf9507.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale
Duration seconds: 3444

## Resource

Engineering agentic systems requires a hybrid of traditional software engineering and non-deterministic machine learning. This discussion explores how to manage complexity, evaluate agent performance, and maintain user trust in autonomous AI workflows.

## Highlights
- Main idea: Agentic systems are a hybrid of deterministic software engineering and non-deterministic predictive modeling
- Practical takeaway: Avoid over-engineering multi-agent graphs; use a single agent with well-defined 'skills' written in traditional code whenever possible
- Failure mode: Relying on LLMs for decisions that plain business logic can handle increases cost and reduces testability
- Practical takeaway: Implement 'LLM as a judge' evaluations alongside integration tests that use real customer data to evaluate agent performance at scale
- Failure mode: Neglecting the UX of observability; users need clear audit trails and 'reasoning' visibility to trust autonomous actions
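
One way to picture the "skills written in traditional code" takeaway is a small registry of deterministic functions that a single agent can call by name. This is a minimal sketch, not the approach described in the episode: the skill names, the refund rule, and `run_skill` are all hypothetical, and the model call that would choose a skill is left out.

```python
from typing import Callable, Dict


# Hypothetical skills: plain deterministic functions, testable without a model.
def refund_amount(order_total: float, days_since_purchase: int) -> float:
    """Illustrative business rule: full refund within 30 days, otherwise none."""
    return order_total if days_since_purchase <= 30 else 0.0


def normalize_email(raw: str) -> str:
    """Trim whitespace and lowercase an email address."""
    return raw.strip().lower()


# The agent would only pick a skill name and arguments; the logic lives in code.
SKILLS: Dict[str, Callable[..., object]] = {
    "refund_amount": refund_amount,
    "normalize_email": normalize_email,
}


def run_skill(name: str, **kwargs: object) -> object:
    """Dispatch a skill by name, rejecting anything outside the registry."""
    try:
        skill = SKILLS[name]
    except KeyError:
        raise ValueError(f"unknown skill: {name!r}") from None
    return skill(**kwargs)
```

Because the rule lives in `refund_amount` rather than in a prompt, it can be covered by ordinary unit tests and costs nothing per call.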

## Topics

AI Agents, Software Engineering, LLM Evaluation, MLOps, Agentic Workflows, Observability, Cybersecurity AI, System Architecture

## Chapters
- 1:00 — The Value of AI Coding Assistants: An exploration of how tools like Claude Code and Cursor are significantly increasing developer velocity and the ROI of AI-driven coding.
- 5:20 — The Shift to Hybrid Engineering: Discussing the transition from traditional software requirements to managing systems that blend deterministic code with probabilistic prompts.
- 18:10 — Observability and Audit Trails: The necessity of building transparent audit trails so users can understand the reasoning behind an agent's specific actions.
- 31:05 — Language Sensitivity in Prompts: How subtle changes in prompt wording can trigger different reasoning paths and the importance of versioning prompt lineage.
- 39:45 — Evaluating Agentic Workflows: Strategies for implementing two-level evaluations: unit-test-style integration tests and large-scale evaluations against real-world data.
- 52:50 — Architectural Minimalism: A critique of complex multi-agent graphs and the argument for keeping architectures simple by managing context effectively within a single agent.
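
The two-level evaluation idea from the 39:45 chapter could be sketched roughly as follows. Everything here is an assumption for illustration: `Case`, `exact_match_judge`, `pass_rate`, and `stub_agent` are hypothetical names, the stub stands in for a real agent, and in practice the judge callable might wrap an LLM call rather than a string comparison.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


# A judge takes a case and the agent's output and returns True if acceptable.
Judge = Callable[[Case, str], bool]


def exact_match_judge(case: Case, output: str) -> bool:
    """Deterministic judge; an LLM-backed judge could share this signature."""
    return output.strip().lower() == case.expected.strip().lower()


def pass_rate(agent: Callable[[str], str], cases: Iterable[Case], judge: Judge) -> float:
    """Level 2: run the agent over a whole dataset and report the pass fraction."""
    results = [judge(case, agent(case.prompt)) for case in cases]
    if not results:
        raise ValueError("no cases to evaluate")
    return sum(results) / len(results)


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent so the sketch runs offline."""
    return "blocked" if "malware" in prompt else "allowed"


# Level 1: a single unit-test-style check on one known case.
single = Case("scan malware sample", "blocked")
assert exact_match_judge(single, stub_agent(single.prompt))
```

The same `pass_rate` function can then be pointed at a larger set of recorded cases, with a threshold on the result acting as a release gate.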

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
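
As a rough illustration, the two actions above could be prepared with Python's standard library like this. The URLs are the ones listed in this section; the helper names are hypothetical, and sending an empty POST body is an assumption, since the expected payload and any required headers are not documented here.

```python
import urllib.request

BASE = "https://stenobird.com"
SLUG = (
    "software-engineering-in-the-age-of-coding-agents-"
    "testing-evals-and-shipping-safely-at-scale"
)


def read_markdown_request() -> urllib.request.Request:
    """Build the GET request for the Markdown representation of the episode."""
    url = f"{BASE}/podcast/mlops-community/{SLUG}.md"
    return urllib.request.Request(url, method="GET")


def request_transcript_request() -> urllib.request.Request:
    """Build the POST request that asks for transcript generation."""
    url = (
        f"{BASE}/v1/public/podcasts/mlops-community/episodes/"
        f"{SLUG}/transcription-requests"
    )
    # Assumption: an empty body is accepted; the payload format is not specified.
    return urllib.request.Request(url, data=b"", method="POST")
```

Either request can be sent with `urllib.request.urlopen(...)`. Since `request_transcript` is described as idempotent, repeating the POST should not queue duplicate work.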

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.