
How to Test Autonomous Agents: Evaluation Frameworks for Production Reliability

You cannot unit test an LLM. You can systematically evaluate agent behaviour, tool call accuracy, output consistency, and failure recovery. Here is the evaluation framework we apply before every production agentic deployment.

Inductivee Team · AI Engineering · September 25, 2025 (updated April 15, 2026) · 13 min read
TL;DR

Agent evaluation requires measuring four dimensions simultaneously: task completion rate, tool call accuracy, output quality, and failure recovery behaviour. No single framework covers all four — production teams combine RAGAS for RAG pipeline metrics, Braintrust for human-feedback-based quality scoring, and LangSmith for execution tracing. The foundation of all of it is a well-constructed golden dataset that represents real production inputs.

Why You Cannot Unit Test an LLM Agent

Unit testing assumes deterministic behaviour: given input X, expect output Y. LLMs are probabilistic — the same input produces different outputs across runs, and the 'correctness' of an output is often a matter of degree, not binary pass/fail. A test that asserts exact string equality on an LLM output will be brittle, noisy, and misleading.

This does not mean you cannot evaluate agents systematically. It means the evaluation paradigm has to shift from assertion-based testing to statistical measurement over a sample of inputs. Instead of 'this response equals the expected string,' you measure 'across 200 test inputs, this agent completes the task correctly 94% of the time, makes the correct tool call 97% of the time, and recovers gracefully from tool failures 89% of the time.'

The distinction matters because it changes how you think about regressions. A prompt change that shifts task completion rate from 94% to 88% is a regression worth reverting. But the specific outputs will still differ run to run — and that is fine. The goal is statistical reliability, not deterministic reproducibility. Once teams accept this, they stop fighting the probabilistic nature of LLMs and start building evaluation infrastructure that actually works.
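
In practice, the shift looks like this: instead of asserting on an exact output, a regression test runs the agent over a sample of golden inputs and asserts on the aggregate completion rate. Here is a minimal pytest-style sketch; run_agent and is_task_complete are hypothetical placeholders for your own agent call and completion check.

python
# Minimal sketch: assert on an aggregate rate, not on exact output strings.
# run_agent() and is_task_complete() are hypothetical placeholders for your
# own agent invocation and completion check.
import json


def run_agent(input_text: str) -> str:
    # Placeholder: invoke your agent and return its final output.
    return ""


def is_task_complete(output: str, expected: str) -> bool:
    # Placeholder: an LLM-as-judge or rule-based completion check.
    return bool(output.strip())


def test_task_completion_rate_above_gate():
    with open("eval/golden_dataset_v3.json") as f:
        examples = json.load(f)["examples"]

    completed = sum(
        1 for ex in examples
        if is_task_complete(run_agent(ex["input"]), ex["expected_output"])
    )
    rate = completed / len(examples)
    # Statistical assertion: the gate is an aggregate threshold, not string equality.
    assert rate >= 0.93, f"task completion rate {rate:.1%} is below the 93% gate"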

The Four Evaluation Dimensions for Production Agents

Task Completion Rate

The percentage of test inputs for which the agent successfully achieves the stated goal. This is the top-level metric — everything else is diagnostic. Define 'task complete' precisely: for a support routing agent, completion means a ticket is routed with a valid queue and priority assigned. Measure this across your golden dataset on every deployment. A target of 95%+ is achievable for well-scoped workflows; below 90% means the agent is not production-ready.
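
To make 'task complete' concrete for the support routing example, the check might look like the sketch below; the queue names, priority levels, and result fields are illustrative assumptions, not a fixed schema.

python
# Sketch of a precise 'task complete' check for a support routing agent.
# Queue names, priority levels, and result fields are illustrative assumptions.
VALID_QUEUES = {"billing", "technical", "account", "escalations"}
VALID_PRIORITIES = {"p1", "p2", "p3", "p4"}


def is_routing_complete(agent_result: dict) -> bool:
    # Complete only if a ticket was routed with a valid queue AND a priority.
    return (
        agent_result.get("ticket_id") is not None
        and agent_result.get("queue") in VALID_QUEUES
        and agent_result.get("priority") in VALID_PRIORITIES
    )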

Tool Call Accuracy

The percentage of tool calls made with correct parameters and in the correct sequence. This is separate from task completion because an agent can complete a task while making suboptimal or incorrect tool calls along the way — which matters for downstream systems, cost, and auditability. Measure: (a) correct tool selected, (b) correct parameters passed, (c) unnecessary tool calls avoided. Log every tool call with its parameters during evaluation runs.
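
A scoring function covering all three checks might look like the following sketch; representing each call as a (tool_name, params) pair and the 0.1-per-extra-call penalty are assumptions.

python
# Sketch of a parameter-aware tool call score covering checks (a), (b), (c).
# Each call is represented as a (tool_name, params) pair; this shape and the
# 0.1-per-extra-call penalty are assumptions.
def tool_call_score(expected: list[tuple[str, dict]],
                    actual: list[tuple[str, dict]]) -> float:
    if not expected:
        return 1.0 if not actual else 0.0

    correct = 0
    for (exp_name, exp_params), (act_name, act_params) in zip(expected, actual):
        if exp_name != act_name:
            continue  # (a) wrong tool selected at this step
        # (b) every expected parameter must be present with the expected value
        if all(act_params.get(k) == v for k, v in exp_params.items()):
            correct += 1

    coverage = correct / len(expected)
    # (c) penalise unnecessary (or missing) calls relative to the expected sequence
    length_penalty = max(0.0, 1.0 - abs(len(actual) - len(expected)) * 0.1)
    return coverage * length_penalty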

Output Quality

The semantic quality of the agent's final output — assessed either by LLM-as-judge (auto-eval) or human reviewers (Braintrust). For RAG agents, RAGAS provides specific metrics: faithfulness (is the answer grounded in retrieved context?), answer relevancy (does it address the query?), and context recall (was the relevant context retrieved?). For generation agents, LLM-as-judge scoring on rubric dimensions (accuracy, completeness, tone) is the scalable approach.
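
For the rubric approach, a judge call might look like the following sketch using the OpenAI SDK; the rubric wording, the 1-5 scale, and the model choice are assumptions you should calibrate against human labels.

python
# Sketch of rubric-based LLM-as-judge scoring for generation agents.
# The rubric dimensions, 1-5 scale, and model choice are assumptions.
import json

from openai import OpenAI

client = OpenAI()


def judge_output_quality(task: str, output: str) -> dict:
    prompt = f"""Score the output on a 1-5 scale for each dimension:
accuracy, completeness, tone. Respond with JSON only, for example
{{"accuracy": 4, "completeness": 5, "tone": 3}}.

Task: {task}
Output: {output}"""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)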

Failure Recovery Behaviour

How the agent behaves when tools fail, context is ambiguous, or it reaches an impasse. Inject controlled failures into your test environment: simulate tool timeouts, return empty results, provide contradictory context. Measure whether the agent retries appropriately, falls back gracefully, escalates to human review when appropriate, or halts cleanly rather than producing a confident-sounding wrong answer.
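
One way to inject those failures deterministically is to wrap each tool in a fault-injecting shim during evaluation runs; the wrapper interface and failure rates below are assumptions, not a specific library.

python
# Sketch of deterministic fault injection for failure-recovery testing.
# The wrapper interface and failure rates are assumptions, not a specific library.
import random


class FaultInjectingTool:
    """Wraps a real tool function and injects controlled failures."""

    def __init__(self, tool_fn, timeout_rate=0.1, empty_rate=0.1, seed=42):
        self.tool_fn = tool_fn
        self.timeout_rate = timeout_rate
        self.empty_rate = empty_rate
        self.rng = random.Random(seed)  # seeded so eval runs are repeatable

    def __call__(self, *args, **kwargs):
        roll = self.rng.random()
        if roll < self.timeout_rate:
            raise TimeoutError("injected tool timeout")  # simulated timeout
        if roll < self.timeout_rate + self.empty_rate:
            return []  # simulated empty result set
        return self.tool_fn(*args, **kwargs)

Score the same golden dataset through the fault-injected tools; the delta against the clean run is your failure recovery measurement.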

Building a Golden Dataset

A golden dataset is the single most valuable investment in agent reliability. It is a curated set of (input, expected_behaviour) pairs drawn from real production data. Here is how we build one:

Sample from production logs

Run your agent in shadow mode (process real inputs but do not act on outputs) for 1-2 weeks. Sample 200-500 inputs across the full distribution — do not cherry-pick easy cases. Include ambiguous inputs, edge cases, and known hard cases.
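
A simple way to draw that sample without cherry-picking is a seeded uniform sample over the shadow-mode log, topped up with known hard cases; the JSON-lines log format with one record per input is an assumption.

python
# Sketch of sampling golden-dataset candidates from shadow-mode logs.
# A JSON-lines log with one record per agent input is an assumption.
import json
import random


def sample_candidates(log_path: str, n: int = 300, seed: int = 7) -> list[dict]:
    with open(log_path) as f:
        records = [json.loads(line) for line in f]
    # Uniform random sample across the full input distribution, no cherry-picking;
    # append known hard cases and escalations on top of this baseline.
    return random.Random(seed).sample(records, min(n, len(records)))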

Label by domain experts, not engineers

The people who know what 'correct' looks like for your workflow are domain experts, not the engineers who built the agent. A support routing agent's golden labels should be verified by senior support engineers. Budget 3-5 minutes per example for labelling.

Record expected tool call sequences, not just outputs

For each golden input, record not just the expected final output but the expected tool calls and their order. This gives you the data needed to evaluate tool call accuracy separately from output quality.
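
To make this concrete, a single golden dataset entry with both an expected output and an expected tool call sequence could look like the record below; field names match the GoldenExample structure in the harness further down, and the values are invented for illustration.

json
{
  "examples": [
    {
      "input": "Customer reports being double-charged on their May invoice",
      "expected_output": "Ticket routed to the billing queue with priority p2",
      "expected_tool_calls": ["search_kb", "route_ticket"],
      "context_docs": ["Billing FAQ: duplicate charge handling"],
      "metadata": {"source": "shadow_mode_2025_09", "labelled_by": "senior_support"}
    }
  ]
}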

Version your golden dataset with your codebase

Store the golden dataset in your repo alongside the agent code. When you make a prompt change, run the full golden dataset evaluation and diff the metrics against the previous version before merging. This is your regression test for agent behaviour.

Agent Evaluation Harness with RAGAS and Braintrust

python
import json
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Any
from langsmith import Client as LangSmithClient
from ragas import evaluate as ragas_evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset
from braintrust import Eval, Score
from openai import OpenAI


@dataclass
class GoldenExample:
    input: str
    expected_output: str
    expected_tool_calls: list[str]  # ordered list of expected tool names
    context_docs: list[str]  # relevant docs (for RAG eval)
    metadata: dict = field(default_factory=dict)


@dataclass
class EvalResult:
    input: str
    actual_output: str
    actual_tool_calls: list[str]
    retrieved_context: list[str]
    task_complete: bool
    tool_call_accuracy: float
    faithfulness_score: float
    relevancy_score: float
    latency_ms: float


class AgentEvaluationHarness:
    """
    Evaluation harness combining RAGAS metrics, tool call accuracy,
    and Braintrust human eval integration.
    """

    def __init__(self, agent_fn: Callable, golden_dataset_path: str):
        self.agent_fn = agent_fn  # async fn(input: str) -> dict
        self.golden_dataset = self._load_golden(golden_dataset_path)
        self.langsmith = LangSmithClient()
        self.openai = OpenAI()

    def _load_golden(self, path: str) -> list[GoldenExample]:
        with open(path) as f:
            raw = json.load(f)
        return [GoldenExample(**ex) for ex in raw["examples"]]

    def _tool_call_accuracy(self, expected: list[str], actual: list[str]) -> float:
        if not expected:
            return 1.0 if not actual else 0.0
        correct = sum(1 for e, a in zip(expected, actual) if e == a)
        coverage = correct / len(expected)
        length_penalty = max(0.0, 1.0 - abs(len(actual) - len(expected)) * 0.1)
        return coverage * length_penalty

    def _llm_judge_task_complete(self, input_: str, output: str, expected: str) -> bool:
        prompt = f"""Does the following agent output successfully address the task?
Task input: {input_}
Expected behaviour: {expected}
Actual output: {output}
Answer only YES or NO."""
        resp = self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=5,
        )
        return resp.choices[0].message.content.strip().upper() == "YES"

    async def run_single(self, example: GoldenExample) -> EvalResult:
        import time
        start = time.perf_counter()
        result = await self.agent_fn(example.input)
        latency_ms = (time.perf_counter() - start) * 1000

        actual_output = result.get("output", "")
        actual_tools = result.get("tool_calls", [])
        retrieved_ctx = result.get("retrieved_context", [])

        tool_acc = self._tool_call_accuracy(example.expected_tool_calls, actual_tools)
        task_ok = self._llm_judge_task_complete(
            example.input, actual_output, example.expected_output
        )

        # RAGAS evaluation for RAG quality
        ragas_data = Dataset.from_dict({
            "question": [example.input],
            "answer": [actual_output],
            "contexts": [retrieved_ctx],
            "ground_truth": [example.expected_output],
        })
        ragas_result = ragas_evaluate(
            ragas_data, metrics=[faithfulness, answer_relevancy, context_recall]
        )

        return EvalResult(
            input=example.input,
            actual_output=actual_output,
            actual_tool_calls=actual_tools,
            retrieved_context=retrieved_ctx,
            task_complete=task_ok,
            tool_call_accuracy=tool_acc,
            faithfulness_score=ragas_result["faithfulness"],
            relevancy_score=ragas_result["answer_relevancy"],
            latency_ms=latency_ms,
        )

    async def run_full_eval(self) -> dict:
        results = await asyncio.gather(
            *[self.run_single(ex) for ex in self.golden_dataset]
        )
        n = len(results)
        summary = {
            "n": n,
            "task_completion_rate": sum(r.task_complete for r in results) / n,
            "avg_tool_call_accuracy": sum(r.tool_call_accuracy for r in results) / n,
            "avg_faithfulness": sum(r.faithfulness_score for r in results) / n,
            "avg_relevancy": sum(r.relevancy_score for r in results) / n,
            "p50_latency_ms": sorted(r.latency_ms for r in results)[n // 2],
            "p99_latency_ms": sorted(r.latency_ms for r in results)[int(n * 0.99)],
        }
        print(json.dumps(summary, indent=2))
        return summary


# Usage — wire in your agent function
async def my_agent(input_text: str) -> dict:
    # Replace with your actual agent call
    return {"output": "...", "tool_calls": ["search_kb", "route_ticket"],
            "retrieved_context": ["doc1 text", "doc2 text"]}


if __name__ == "__main__":
    harness = AgentEvaluationHarness(
        agent_fn=my_agent,
        golden_dataset_path="eval/golden_dataset_v3.json"
    )
    asyncio.run(harness.run_full_eval())

Evaluation harness combining RAGAS faithfulness/relevancy metrics with tool call accuracy scoring and LLM-as-judge task completion assessment. Run this against your golden dataset on every prompt change or model upgrade. The run_full_eval summary gives you the headline metrics (task completion rate, tool call accuracy, faithfulness, relevancy, and latency percentiles) needed to make a go/no-go deployment decision; failure recovery behaviour still requires separate fault-injection runs as described above.

Tip

LLM-as-judge evaluation (using GPT-4o-mini to assess whether another model's output is correct) is 10-20x cheaper than human annotation and achieves 85-90% agreement with expert human labels on most well-defined tasks. Use GPT-4o-mini for binary task completion judgments in CI/CD pipelines, and reserve human annotation via Braintrust for the 10-15% of cases where the LLM judge is uncertain or for calibrating new rubrics.
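
A lightweight way to implement that split is to let the judge answer YES, NO, or UNSURE and queue the UNSURE cases for human review. In the sketch below, the three-way protocol and the queue_for_human_review placeholder are assumptions; the human-review step would typically log to a tool such as Braintrust.

python
# Sketch of routing uncertain LLM-judge verdicts to human review.
# The YES/NO/UNSURE protocol and queue_for_human_review() are assumptions.
from openai import OpenAI

client = OpenAI()


def queue_for_human_review(task: str, output: str, expected: str) -> None:
    # Placeholder: export the case to your human-review tooling of choice.
    print(f"needs human review: {task[:60]}")


def judge_or_escalate(task: str, output: str, expected: str) -> bool | None:
    prompt = (
        "Does the output satisfy the expected behaviour? "
        "Answer YES, NO, or UNSURE only.\n"
        f"Task: {task}\nExpected: {expected}\nOutput: {output}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    if verdict not in ("YES", "NO"):
        queue_for_human_review(task, output, expected)
        return None  # undecided: a human will label this case
    return verdict == "YES"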

Evaluation-Driven Deployment Checklist

1

Establish baseline metrics before any deployment

Run your evaluation harness against the golden dataset on the current production version to establish baseline numbers. All future deployments are evaluated against this baseline. Never deploy to production without a baseline to compare against.

2

Gate deployments on task completion rate

Set a hard threshold: no deployment goes live if task completion rate drops more than 2 percentage points from baseline. For most production agents, this means requiring 93%+ task completion rate minimum.

3

Run eval on every prompt change in CI/CD

Add the evaluation harness to your CI pipeline. A PR that changes a system prompt should automatically run the golden dataset eval and post metrics as a PR comment. Treat a metric regression as a test failure; a minimal gate script is sketched after this checklist.

4

Grow the golden dataset continuously

Add 10-20 new examples to the golden dataset monthly, drawn from production cases that were escalated, flagged, or reviewed. The golden dataset should grow to 500+ examples over time to catch subtle regressions.

5

Track metrics over time with dashboards

Export eval metrics to your observability stack (Grafana, Datadog, or Braintrust's built-in dashboard). Trending metrics over time reveal gradual model drift — a real phenomenon as base model providers update underlying models.
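
A minimal regression gate for steps 1-3 can be a short script that compares the latest eval summary against the stored baseline and fails the pipeline on a drop of more than 2 percentage points; the file names below are assumptions.

python
# Sketch of a CI regression gate. Assumes run_full_eval() wrote its summary to
# eval/latest_summary.json and the production baseline lives in
# eval/baseline_summary.json (both file names are assumptions).
import json
import sys

MAX_COMPLETION_DROP = 0.02  # hard gate: at most 2 percentage points below baseline


def main() -> None:
    with open("eval/baseline_summary.json") as f:
        baseline = json.load(f)
    with open("eval/latest_summary.json") as f:
        current = json.load(f)

    drop = baseline["task_completion_rate"] - current["task_completion_rate"]
    print(f"baseline={baseline['task_completion_rate']:.1%} "
          f"current={current['task_completion_rate']:.1%} drop={drop:.1%}")

    if drop > MAX_COMPLETION_DROP:
        print("FAIL: task completion regression exceeds the 2-point budget")
        sys.exit(1)  # fails the CI job and blocks the merge
    print("PASS: within the regression budget")


if __name__ == "__main__":
    main()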

Inductivee's Evaluation Baseline Requirements

Before we sign off on any agentic deployment going into production, we require a golden dataset of at minimum 200 examples, a task completion rate of 93%+, tool call accuracy of 95%+, and RAGAS faithfulness score of 0.85+ for any agent that incorporates retrieval. These thresholds reflect what we have found separates 'impressive demo' from 'reliable production system' across 200+ deployments.

The number we care most about in practice is tool call accuracy. An agent that sometimes generates a plausible-sounding but incorrect final answer is annoying. An agent that calls the wrong tool or passes incorrect parameters can corrupt downstream systems, send the wrong email, or write incorrect data to a database. Tool call accuracy is the metric most directly connected to production safety, and it is the one teams most often skip measuring.

For teams just starting out: build the golden dataset first, before writing a single line of agent code. The process of labelling what 'correct' looks like will surface ambiguities in the requirements that would otherwise show up as production failures.

Frequently Asked Questions

How do you evaluate LLM agent reliability?

Agent reliability is measured across four dimensions: task completion rate (percentage of inputs where the agent achieves its goal), tool call accuracy (correct tool selection and parameterisation), output quality (faithfulness and relevance via RAGAS or LLM-as-judge), and failure recovery (graceful handling of tool failures and ambiguous inputs). A golden dataset of 200+ labelled examples is the foundation for all four measurements.

What is RAGAS and when should I use it?

RAGAS is an evaluation framework specifically for retrieval-augmented generation pipelines. It measures faithfulness (is the answer grounded in retrieved documents), answer relevancy (does the answer address the question), and context recall (were the relevant documents retrieved). Use RAGAS for any agent that uses a RAG component — it quantifies the two most common failure modes: hallucination and poor retrieval.

What task completion rate is acceptable for a production agent?

For well-scoped enterprise workflow automation agents, 93%+ task completion rate on a representative golden dataset is the threshold we require before production deployment. Below 90% indicates the agent is not production-ready. The acceptable threshold is lower for lower-stakes workflows (draft generation) and higher for workflows with downstream system writes (ticket routing, PO creation).

How do I build an LLM evaluation golden dataset?

Sample 200-500 inputs from production logs or shadow-mode runs across the full input distribution. Have domain experts (not engineers) label each example with expected outputs and expected tool call sequences. Store the dataset versioned alongside your agent code and run it automatically on every prompt change. Grow it by 10-20 examples monthly from escalated or flagged production cases.

What is LLM-as-judge evaluation?

LLM-as-judge uses a capable model (typically GPT-4o or GPT-4o-mini) to assess whether another model's output meets defined quality criteria, replacing human annotation for large-scale evaluation. It achieves 85-90% agreement with expert human labels on well-defined tasks at 10-20x lower cost. Use it for automated CI/CD eval pipelines, and reserve human annotation for calibrating new rubrics or spot-checking uncertain cases.

Written By

Inductivee Team — AI Engineering at Inductivee

The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.

Agentic AI Architecture · Multi-Agent Orchestration · LangChain · LangGraph · CrewAI · Microsoft AutoGen

Inductivee is a remote-first agentic AI engineering firm with 40+ production deployments across 25+ enterprises since 2012. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
