$ head -n 1
A look at an evaluation stack for coding agents: provider smoke tests, transform contract tests, end-to-end tasks, metrics, and reporting patterns for catching regressions.
$ grep -i "why evals"
Coding-agent quality is hard to judge from one successful task. The useful question is whether a setup keeps working across providers, prompts, tool formats, and repeated runs.
$ grep -i "layers"
The stack is layered: contract tests for provider transforms and tool-call parsing, smoke tests for quick provider validation, and heavier end-to-end tasks that exercise real-world bug-fix behavior. A sketch of the contract-test layer follows below.
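As a minimal sketch of that contract-test layer (the file path, the parse_tool_calls helper, and the message shape are illustrative assumptions, not the article's actual code):
$ cat tests/test_tool_call_contract.py
# Contract test: a provider transform must turn a raw assistant message
# into the agent's normalized tool-call structure, whatever the provider's quirks.
from agent.transforms import parse_tool_calls  # hypothetical transform under test

def test_openai_style_tool_call_is_normalized():
    raw = {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_1",
            "function": {"name": "read_file", "arguments": '{"path": "README.md"}'},
        }],
    }
    calls = parse_tool_calls(raw)
    assert len(calls) == 1
    assert calls[0].name == "read_file"
    assert calls[0].arguments == {"path": "README.md"}  # arguments decoded from JSON

def test_plain_text_reply_yields_no_tool_calls():
    # A reply with no tool calls should parse to an empty list, not crash.
    assert parse_tool_calls({"role": "assistant", "content": "done"}) == []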
Metrics such as pass@k, pass^k, and flakiness help separate capability from reliability. A model that sometimes solves a task is different from a model that solves it consistently.
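A minimal sketch of those metrics over repeated boolean pass/fail runs (the file path and function names are illustrative; pass@k uses the standard unbiased estimator, while pass^k and the flakiness rate use one reasonable definition each, not necessarily the article's):
$ cat eval/metrics.py
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k sampled runs passes, given c passes out of n runs."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """Chance that all k runs pass, estimated from the observed pass rate."""
    return (c / n) ** k

def flakiness_rate(runs_per_task: dict[str, list[bool]]) -> float:
    """Fraction of tasks whose repeated runs disagree (some pass, some fail)."""
    mixed = sum(1 for runs in runs_per_task.values() if any(runs) and not all(runs))
    return mixed / len(runs_per_task)

# Example: 7 passes out of 10 runs looks capable but not reliable.
print(round(pass_at_k(10, 7, 3), 3))   # ~0.992: likely to succeed at least once in 3 tries
print(round(pass_pow_k(10, 7, 3), 3))  # ~0.343: unlikely to succeed three times in a row
The gap between those two numbers is the point: pass@k rewards occasional success, pass^k only rewards consistency.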
$ grep -i "evaluation discipline"
The useful habit is measuring agent behavior before trusting it with larger automation. A small eval stack cannot prove a model is safe, but it can catch broken provider transforms, brittle tool-call handling, and regressions that would otherwise surface during real work.