Thinking about which agent behaviors are worth protecting with evals raises a question: surely not all behaviors are the same?
Fundamentally, it boils down to three categories:
- Happy paths
- Edge cases, and
- Failure modes
In any user journey, people are expected to breeze through the happy paths and not encounter any hiccups along the way. In software, these are typically protected via unit, integration and end-to-end tests.
With agents, however, the inversion is that the happy path can be traversed in an effectively unbounded number of ways: the request arrives as a natural-language prompt, and the outputs are sampled from an LLM.
The more your happy path veers into unspecified or ambiguous intent, the more it starts to look like a collection of edge cases.
And yes, clever guardrailing, well-defined tool-use schemas, and better agent ergonomics are the usual mitigations. But manually iterating over that continuous space is arduous, and it often won't reveal blind spots until much later.
The 2x2
Applying Rumsfeld's classic 2x2 matrix over knowns and unknowns illuminates the types of agent behaviors we should protect in our codebases. It surprisingly lines up, like so: happy paths are your known-knowns, edge cases the known-unknowns, and failure modes the unknown-unknowns you simply haven't discovered yet.
- Known-knowns: behavior you can state and want to protect.
- Known-unknowns: a dimension you worry about but haven't pinned down.
- Unknown-knowns: implicit behaviors nobody wrote down.
- Unknown-unknowns: failures you didn't know to look for.
Evals push behaviors up and left into known-knowns (tests).
This framing has given me a clearer mental model of where to focus at each phase of the agent development lifecycle. For example, early on you'd care only about known-knowns; over time, you invest in bringing unknown-unknowns to light.
Leftward
Once you pin behaviors down, the whole endeavor really becomes one of surfacing those latent unknown-unknown failures and promoting them to known-knowns as evals/tests that live in your codebase.
How that'd typically work in practice:
- Unknown-unknown → mine for failures → known-unknown → you decide if it's a genuine bug → pinned regression as a test (now a known-known)
- Known-unknown → scenario probe resolves it → known-known
To recap, failures you haven't discovered yet are unknown-unknowns, behaviors you protect are known-knowns, and edge cases are the known-unknown region you choose to probe.
Think of it like a conveyor belt moving behaviors leftward continuously.
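To make the conveyor belt concrete, here is a minimal sketch of the promotion flow as a tiny state machine. The names (`Quadrant`, `Behavior`, `surface`, `pin`) are hypothetical and exist only for illustration; a real setup would likely tie `pin` to an actual test or eval file in your repo.

```python
from dataclasses import dataclass, field
from enum import Enum


class Quadrant(Enum):
    UNKNOWN_UNKNOWN = "unknown-unknown"  # failure nobody has looked for yet
    KNOWN_UNKNOWN = "known-unknown"      # suspected issue, not yet pinned down
    KNOWN_KNOWN = "known-known"          # behavior pinned by a test or eval


@dataclass
class Behavior:
    description: str
    quadrant: Quadrant
    history: list = field(default_factory=list)  # quadrants already left behind

    def move(self, expected: Quadrant, target: Quadrant) -> None:
        # Only allow the transitions described in the flow; anything else is a bug.
        if self.quadrant is not expected:
            raise ValueError(
                f"expected {expected.value}, got {self.quadrant.value}"
            )
        self.history.append(self.quadrant)
        self.quadrant = target


def surface(behavior: Behavior) -> None:
    """Mining traces found the failure: unknown-unknown -> known-unknown."""
    behavior.move(Quadrant.UNKNOWN_UNKNOWN, Quadrant.KNOWN_UNKNOWN)


def pin(behavior: Behavior, is_genuine_bug: bool) -> bool:
    """Triage or probe result: pin as a regression test only if it is real."""
    if not is_genuine_bug:
        return False  # stays a known-unknown, or gets dropped by the caller
    behavior.move(Quadrant.KNOWN_UNKNOWN, Quadrant.KNOWN_KNOWN)
    return True


# Usage: one failure travelling the full belt.
b = Behavior("agent retries a failed tool call forever", Quadrant.UNKNOWN_UNKNOWN)
surface(b)
pin(b, is_genuine_bug=True)
print(b.quadrant.value, [q.value for q in b.history])
```

The second path in the list (a scenario probe resolving a known-unknown) is just `pin` called on a behavior that started life as a known-unknown, with no `surface` step.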
In this framing, it becomes clear that observability and evals are really two sides of the same coin: you need to first record and observe the offending behaviors to then guard against them.
As agents leave trails of their work, mining for instances of behavioral misalignment and then course-correcting will become table stakes.
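As a rough illustration of what mining those trails might look like, the sketch below scans one recorded trace for two cheap signals: a failed tool call, and the same call repeated back-to-back with identical arguments. The `Step` schema, the `mine_trace` function, and the `max_repeats` threshold are all assumptions made for this example, not any particular tracing library's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str   # name of the tool the agent called
    args: str   # serialized arguments, used only for equality checks
    ok: bool    # whether the call succeeded


def mine_trace(trace_id: str, steps: list, max_repeats: int = 3) -> list:
    """Return human-readable findings for one recorded trace."""
    findings = []
    previous = None
    run_length = 0
    for index, step in enumerate(steps):
        if not step.ok:
            findings.append(f"{trace_id}: step {index} tool '{step.tool}' failed")
        # Track consecutive identical calls as a crude loop detector.
        key = (step.tool, step.args)
        run_length = run_length + 1 if key == previous else 1
        previous = key
        if run_length == max_repeats:  # report each run once, at the threshold
            findings.append(
                f"{trace_id}: '{step.tool}' repeated {max_repeats}x with identical args"
            )
    return findings


# Usage: a trace with one loop and one failure.
trace = [Step("search", "q=a", True)] * 3 + [Step("write", "f", False)]
for finding in mine_trace("t1", trace):
    print(finding)
```

Each finding is only a candidate: it lands in known-unknown territory until someone decides whether it is a genuine bug worth pinning.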
Protect your happy paths.