How to Test Autonomous Agents: Evaluation Frameworks for Production Reliability
You cannot unit test an LLM. You can systematically evaluate agent behaviour, tool call accuracy, output consistency, and failure recovery. Here is the evaluation framework we apply before every production agentic deployment.
Agent evaluation requires measuring four dimensions simultaneously: task completion rate, tool call accuracy, output quality, and failure recovery behaviour. No single framework covers all four — production teams combine RAGAS for RAG pipeline metrics, Braintrust for human-feedback-based quality scoring, and LangSmith for execution tracing. The foundation of all of it is a well-constructed golden dataset that represents real production inputs.
Why You Cannot Unit Test an LLM Agent
Unit testing assumes deterministic behaviour: given input X, expect output Y. LLMs are probabilistic — the same input produces different outputs across runs, and the 'correctness' of an output is often a matter of degree, not binary pass/fail. A test that asserts exact string equality on an LLM output will be brittle, noisy, and misleading.
This does not mean you cannot evaluate agents systematically. It means the evaluation paradigm has to shift from assertion-based testing to statistical measurement over a sample of inputs. Instead of 'this response equals the expected string,' you measure 'across 200 test inputs, this agent completes the task correctly 94% of the time, makes the correct tool call 97% of the time, and recovers gracefully from tool failures 89% of the time.'
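The statistical framing can be made concrete with a confidence interval around the observed pass rate. This is an illustrative sketch using the Wilson score interval; the function name and the 188/200 numbers are ours, not from any specific framework:

```python
import math

def pass_rate_with_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Observed pass rate plus a 95% Wilson score interval around it."""
    p = passes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return p, centre - margin, centre + margin

rate, low, high = pass_rate_with_ci(188, 200)
# 94% observed, but the interval spans roughly 0.90 to 0.97: a 200-example
# dataset cannot reliably distinguish 94% from 92%, which is why regression
# gates compare against a baseline with a margin rather than a point value.
```

The width of that interval is also the quantitative argument for growing the golden dataset over time: quadrupling the sample roughly halves the interval.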
The distinction matters because it changes how you think about regressions. A prompt change that shifts task completion rate from 94% to 88% is a regression worth reverting. But the specific outputs will still differ run to run — and that is fine. The goal is statistical reliability, not deterministic reproducibility. Once teams accept this, they stop fighting the probabilistic nature of LLMs and start building evaluation infrastructure that actually works.
The Four Evaluation Dimensions for Production Agents
Task Completion Rate
The percentage of test inputs for which the agent successfully achieves the stated goal. This is the top-level metric — everything else is diagnostic. Define 'task complete' precisely: for a support routing agent, completion means a ticket is routed with a valid queue and priority assigned. Measure this across your golden dataset on every deployment. A target of 95%+ is achievable for well-scoped workflows; below 90% means the agent is not production-ready.
Tool Call Accuracy
The percentage of tool calls made with correct parameters and in the correct sequence. This is separate from task completion because an agent can complete a task while making suboptimal or incorrect tool calls along the way — which matters for downstream systems, cost, and auditability. Measure: (a) correct tool selected, (b) correct parameters passed, (c) unnecessary tool calls avoided. Log every tool call with its parameters during evaluation runs.
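A minimal scorer for the three checks above might look like the following sketch. The dict shape (hypothetical `tool` and `params` keys) is an assumption about how your agent logs calls, not a standard format:

```python
def score_tool_calls(expected: list[dict], actual: list[dict]) -> dict:
    """Score (a) tool selection, (b) parameter correctness, (c) call economy.
    Each call is assumed to be a dict like {"tool": "route_ticket", "params": {...}}."""
    if not expected:
        # No calls expected: perfect score iff the agent also made none
        return {"selection": float(not actual), "params": 1.0, "economy": float(not actual)}
    paired = list(zip(expected, actual))
    selection = sum(e["tool"] == a["tool"] for e, a in paired) / len(expected)
    params = sum(
        e["tool"] == a["tool"] and e["params"] == a["params"] for e, a in paired
    ) / len(expected)
    # Extra calls beyond the expected sequence pull economy below 1.0
    economy = min(1.0, len(expected) / len(actual)) if actual else 0.0
    return {"selection": selection, "params": params, "economy": economy}
```

Reporting the three components separately is deliberate: a wrong parameter and an unnecessary extra call are different failure modes with different fixes.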
Output Quality
The semantic quality of the agent's final output — assessed either by LLM-as-judge (auto-eval) or human reviewers (Braintrust). For RAG agents, RAGAS provides specific metrics: faithfulness (is the answer grounded in retrieved context?), answer relevancy (does it address the query?), and context recall (was the relevant context retrieved?). For generation agents, LLM-as-judge scoring on rubric dimensions (accuracy, completeness, tone) is the scalable approach.
Failure Recovery Behaviour
How the agent behaves when tools fail, context is ambiguous, or it reaches an impasse. Inject controlled failures into your test environment: simulate tool timeouts, return empty results, provide contradictory context. Measure whether the agent retries appropriately, falls back gracefully, escalates to human review when appropriate, or halts cleanly rather than producing a confident-sounding wrong answer.
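A failure-injection wrapper for eval runs can be as simple as the sketch below. The `with_injected_failures` name and the 30% rate are illustrative choices, not a fixed recommendation:

```python
import random

def with_injected_failures(tool_fn, failure_rate: float = 0.3, rng=None):
    """Wrap a tool so a fraction of eval-run calls simulate failures.
    Half of the injected failures are timeouts, half are empty result sets."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        roll = rng.random()
        if roll < failure_rate / 2:
            raise TimeoutError("injected: tool timed out")  # should trigger retry or escalation
        if roll < failure_rate:
            return []  # injected empty result: should trigger fallback, not a confident guess
        return tool_fn(*args, **kwargs)

    return wrapped

# Wrap tools only in the eval environment, never in production:
flaky_search = with_injected_failures(lambda query: ["doc1 text"], failure_rate=0.3)
```

Run the golden dataset through the wrapped tools and score the same four dimensions; the delta against the clean run is your recovery metric.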
Building a Golden Dataset
A golden dataset is the single most valuable investment in agent reliability. It is a curated set of (input, expected_behaviour) pairs drawn from real production data. Here is how we build one:
Sample from production logs
Run your agent in shadow mode (process real inputs but do not act on outputs) for 1-2 weeks. Sample 200-500 inputs across the full distribution — do not cherry-pick easy cases. Include ambiguous inputs, edge cases, and known hard cases.
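Sampling across the full distribution can be enforced with a small stratified sampler. The `intent` field and the category names in the usage are hypothetical; any categorical field on your logs works:

```python
import random

def stratified_sample(logs: list[dict], n: int, key: str = "intent", seed: int = 7) -> list[dict]:
    """Sample ~n shadow-mode inputs proportionally to category frequency,
    guaranteeing at least one example from every category seen in the logs."""
    rng = random.Random(seed)
    by_cat: dict[str, list[dict]] = {}
    for row in logs:
        by_cat.setdefault(row.get(key, "unknown"), []).append(row)
    sample: list[dict] = []
    for cat, rows in by_cat.items():
        k = max(1, round(n * len(rows) / len(logs)))  # floor of one per category
        sample.extend(rng.sample(rows, min(k, len(rows))))
    return sample
```

The `max(1, ...)` floor is the point: rare-but-real categories are exactly the ones a naive random sample silently drops.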
Label by domain experts, not engineers
The people who know what 'correct' looks like for your workflow are domain experts, not the engineers who built the agent. A support routing agent's golden labels should be verified by senior support engineers. Budget 3-5 minutes per example for labelling.
Record expected tool call sequences, not just outputs
For each golden input, record not just the expected final output but the expected tool calls and their order. This gives you the data needed to evaluate tool call accuracy separately from output quality.
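Concretely, a labelled entry consistent with the `{"examples": [...]}` file shape the harness below loads might look like this. The ticket content, queue names, and metadata fields are illustrative:

```python
import json

# Field names match the GoldenExample dataclass used by the evaluation harness;
# the content itself is a made-up support-routing example.
golden_example = {
    "input": "Customer reports they were double-charged on their March invoice.",
    "expected_output": "Ticket routed to the billing queue with priority=high.",
    "expected_tool_calls": ["search_kb", "route_ticket"],  # order matters
    "context_docs": ["Refund policy v4: duplicate charges are refunded within 5 days."],
    "metadata": {"source": "shadow-run", "labelled_by": "senior-support"},
}

blob = json.dumps({"examples": [golden_example]}, indent=2)
```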
Version your golden dataset with your codebase
Store the golden dataset in your repo alongside the agent code. When you make a prompt change, run the full golden dataset evaluation and diff the metrics against the previous version before merging. This is your regression test for agent behaviour.
Agent Evaluation Harness with RAGAS and Braintrust
import json
import asyncio
import time
from dataclasses import dataclass, field
from typing import Callable

from langsmith import Client as LangSmithClient
from ragas import evaluate as ragas_evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset
from braintrust import Eval, Score  # for exporting low-confidence cases to human review (not shown)
from openai import OpenAI


@dataclass
class GoldenExample:
    input: str
    expected_output: str
    expected_tool_calls: list[str]  # ordered list of expected tool names
    context_docs: list[str]  # relevant docs (for RAG eval)
    metadata: dict = field(default_factory=dict)


@dataclass
class EvalResult:
    input: str
    actual_output: str
    actual_tool_calls: list[str]
    retrieved_context: list[str]
    task_complete: bool
    tool_call_accuracy: float
    faithfulness_score: float
    relevancy_score: float
    latency_ms: float


class AgentEvaluationHarness:
    """
    Evaluation harness combining RAGAS metrics, tool call accuracy,
    and Braintrust human eval integration.
    """

    def __init__(self, agent_fn: Callable, golden_dataset_path: str):
        self.agent_fn = agent_fn  # async fn(input: str) -> dict
        self.golden_dataset = self._load_golden(golden_dataset_path)
        self.langsmith = LangSmithClient()  # execution traces land in LangSmith
        self.openai = OpenAI()

    def _load_golden(self, path: str) -> list[GoldenExample]:
        with open(path) as f:
            raw = json.load(f)
        return [GoldenExample(**ex) for ex in raw["examples"]]

    def _tool_call_accuracy(self, expected: list[str], actual: list[str]) -> float:
        if not expected:
            return 1.0 if not actual else 0.0
        correct = sum(1 for e, a in zip(expected, actual) if e == a)
        coverage = correct / len(expected)
        # Penalise missing or extra calls: 10% per call of length mismatch
        length_penalty = max(0.0, 1.0 - abs(len(actual) - len(expected)) * 0.1)
        return coverage * length_penalty

    def _llm_judge_task_complete(self, input_: str, output: str, expected: str) -> bool:
        prompt = f"""Does the following agent output successfully address the task?
Task input: {input_}
Expected behaviour: {expected}
Actual output: {output}
Answer only YES or NO."""
        resp = self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=5,
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    async def run_single(self, example: GoldenExample) -> EvalResult:
        start = time.perf_counter()
        result = await self.agent_fn(example.input)
        latency_ms = (time.perf_counter() - start) * 1000

        actual_output = result.get("output", "")
        actual_tools = result.get("tool_calls", [])
        retrieved_ctx = result.get("retrieved_context", [])

        tool_acc = self._tool_call_accuracy(example.expected_tool_calls, actual_tools)
        task_ok = self._llm_judge_task_complete(
            example.input, actual_output, example.expected_output
        )

        # RAGAS evaluation for RAG quality
        ragas_data = Dataset.from_dict({
            "question": [example.input],
            "answer": [actual_output],
            "contexts": [retrieved_ctx],
            "ground_truth": [example.expected_output],
        })
        ragas_result = ragas_evaluate(
            ragas_data, metrics=[faithfulness, answer_relevancy, context_recall]
        )

        return EvalResult(
            input=example.input,
            actual_output=actual_output,
            actual_tool_calls=actual_tools,
            retrieved_context=retrieved_ctx,
            task_complete=task_ok,
            tool_call_accuracy=tool_acc,
            faithfulness_score=ragas_result["faithfulness"],
            relevancy_score=ragas_result["answer_relevancy"],
            latency_ms=latency_ms,
        )

    async def run_full_eval(self) -> dict:
        # Runs all examples concurrently; add a semaphore if rate limits bite
        results = await asyncio.gather(
            *[self.run_single(ex) for ex in self.golden_dataset]
        )
        n = len(results)
        latencies = sorted(r.latency_ms for r in results)
        summary = {
            "n": n,
            "task_completion_rate": sum(r.task_complete for r in results) / n,
            "avg_tool_call_accuracy": sum(r.tool_call_accuracy for r in results) / n,
            "avg_faithfulness": sum(r.faithfulness_score for r in results) / n,
            "avg_relevancy": sum(r.relevancy_score for r in results) / n,
            "p50_latency_ms": latencies[n // 2],
            "p99_latency_ms": latencies[min(n - 1, int(n * 0.99))],
        }
        print(json.dumps(summary, indent=2))
        return summary


# Usage — wire in your agent function
async def my_agent(input_text: str) -> dict:
    # Replace with your actual agent call
    return {"output": "...", "tool_calls": ["search_kb", "route_ticket"],
            "retrieved_context": ["doc1 text", "doc2 text"]}


if __name__ == "__main__":
    harness = AgentEvaluationHarness(
        agent_fn=my_agent,
        golden_dataset_path="eval/golden_dataset_v3.json"
    )
    asyncio.run(harness.run_full_eval())

Evaluation harness combining RAGAS faithfulness/relevancy metrics with tool call accuracy scoring and LLM-as-judge task completion assessment. Run this against your golden dataset on every prompt change or model upgrade. The run_full_eval summary gives you the four key metrics needed to make a go/no-go deployment decision.
LLM-as-judge evaluation (using GPT-4o-mini to assess whether another model's output is correct) is 10-20x cheaper than human annotation and achieves 85-90% agreement with expert human labels on most well-defined tasks. Use GPT-4o-mini for binary task completion judgments in CI/CD pipelines, and reserve human annotation via Braintrust for the 10-15% of cases where the LLM judge is uncertain or for calibrating new rubrics.
Evaluation-Driven Deployment Checklist
Establish baseline metrics before any deployment
Run your evaluation harness against the golden dataset on the current production version to establish baseline numbers. All future deployments are evaluated against this baseline. Never deploy to production without a baseline to compare against.
Gate deployments on task completion rate
Set a hard threshold: no deployment goes live if task completion rate drops more than 2 percentage points from baseline. For most production agents, this means requiring 93%+ task completion rate minimum.
Run eval on every prompt change in CI/CD
Add the evaluation harness to your CI pipeline. A PR that changes a system prompt should automatically run the golden dataset eval and post metrics as a PR comment. Treat a metric regression as a test failure.
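A minimal CI gate over the harness summary might look like the sketch below. The metric keys match the `run_full_eval` summary; the file paths and the exact failure messages are illustrative:

```python
import json
import sys

MAX_DROP = 0.02   # checklist: no more than a 2-point drop from baseline
FLOOR = 0.93      # checklist: hard minimum task completion rate

def gate(baseline: dict, current: dict) -> list[str]:
    """Compare a fresh eval summary against the stored baseline; return failures."""
    failures = []
    if current["task_completion_rate"] < FLOOR:
        failures.append("task completion below the 93% floor")
    if baseline["task_completion_rate"] - current["task_completion_rate"] > MAX_DROP:
        failures.append("task completion regressed more than 2 points from baseline")
    if current["avg_tool_call_accuracy"] < baseline["avg_tool_call_accuracy"] - MAX_DROP:
        failures.append("tool call accuracy regressed from baseline")
    return failures

def main(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as b, open(current_path) as c:
        problems = gate(json.load(b), json.load(c))
    for p in problems:
        print(f"EVAL GATE FAILED: {p}")
    return 1 if problems else 0  # a non-zero exit code fails the CI job

# In CI: sys.exit(main("eval/baseline_metrics.json", "eval/current_metrics.json"))
```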
Grow the golden dataset continuously
Add 10-20 new examples to the golden dataset monthly, drawn from production cases that were escalated, flagged, or reviewed. The golden dataset should grow to 500+ examples over time to catch subtle regressions.
Track metrics over time with dashboards
Export eval metrics to your observability stack (Grafana, Datadog, or Braintrust's built-in dashboard). Trending metrics over time reveal gradual model drift — a real phenomenon as base model providers update underlying models.
Inductivee's Evaluation Baseline Requirements
Before we sign off on any agentic deployment going into production, we require a golden dataset of at least 200 examples, a task completion rate of 93%+, tool call accuracy of 95%+, and a RAGAS faithfulness score of 0.85+ for any agent that incorporates retrieval. These thresholds reflect what we have found separates 'impressive demo' from 'reliable production system' across our 40+ production deployments.
The number we care most about in practice is tool call accuracy. An agent that sometimes generates a plausible-sounding but incorrect final answer is annoying. An agent that calls the wrong tool or passes incorrect parameters can corrupt downstream systems, send the wrong email, or write incorrect data to a database. Tool call accuracy is the metric most directly connected to production safety, and it is the one teams most often skip measuring.
For teams just starting out: build the golden dataset first, before writing a single line of agent code. The process of labelling what 'correct' looks like will surface ambiguities in the requirements that would otherwise show up as production failures.
Frequently Asked Questions
How do you evaluate LLM agent reliability?
Measure four dimensions across a golden dataset of real production inputs: task completion rate, tool call accuracy, output quality, and failure recovery behaviour. Treat the results as statistics over a sample, not pass/fail assertions.
What is RAGAS and when should I use it?
RAGAS is an evaluation library for RAG pipelines. Use it whenever your agent incorporates retrieval: it scores faithfulness (is the answer grounded in retrieved context?), answer relevancy, and context recall.
What task completion rate is acceptable for a production agent?
95%+ is achievable for well-scoped workflows; below 90% means the agent is not production-ready. Gate deployments on any drop of more than 2 percentage points from your baseline.
How do I build an LLM evaluation golden dataset?
Sample 200-500 inputs from shadow-mode production runs, have domain experts label expected outputs and tool call sequences, version the dataset alongside your code, and grow it by 10-20 examples per month.
What is LLM-as-judge evaluation?
Using one model (e.g. GPT-4o-mini) to score another model's output against a rubric or expected behaviour. It is far cheaper than human annotation and agrees well with expert labels on well-defined tasks; reserve human review for the cases where the judge is uncertain.
Written By
Inductivee Team
Agentic AI Engineering Team
The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.
Inductivee is a remote-first agentic AI engineering firm. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
Engineer This With Inductivee
The engineering patterns in this article are what our team builds into production every day. Explore the related service to see how we deliver this capability at enterprise scale.
Agentic Custom Software Engineering
We engineer autonomous agentic systems that orchestrate enterprise workflows and unlock the hidden liquidity of your proprietary data.
Autonomous Agentic SaaS
Agentic SaaS development and autonomous platform engineering — we build SaaS products whose core loop is powered by LangGraph and CrewAI agents that execute workflows, not just manage them.
Related Articles
Agentic Workflow Automation: Moving Beyond Single-Task AI to End-to-End Orchestration
Enterprise AI Governance: Building the Framework Before You Desperately Need It
RAG Evaluation in Production: Building Continuous Quality Monitoring
Ready to Build This Into Your Enterprise?
Inductivee engineers agentic systems, RAG pipelines, and enterprise data liquidity solutions. Let's scope your project.
Start a Project