
RAG Evaluation in Production: Building Continuous Quality Monitoring

RAG pipeline quality degrades silently as your data changes, embedding models update, and query distributions shift. Here is the continuous evaluation architecture — RAGAS, golden datasets, and drift detection — that keeps production RAG honest.

Inductivee Team · AI Engineering · January 15, 2026 (updated April 15, 2026) · 13 min read
TL;DR

RAG pipelines degrade along four axes — retrieval failure, hallucination, outdated context, and query-answer mismatch — and most of these degradations are invisible in your application error logs. RAGAS 0.2 provides the metric framework (faithfulness, answer relevancy, context precision, context recall) to quantify each failure mode. A continuous evaluation pipeline that runs these metrics on every deployment against a golden dataset is the only way to maintain production RAG quality as your corpus and usage evolve.

The Silent Degradation Problem in Production RAG

A RAG pipeline deployed in January does not behave the same way in June. Your knowledge base grows and becomes inconsistent. An embedding model gets updated upstream. The query distribution shifts as more users find the system. A new document format gets ingested that your chunking strategy handles poorly. None of these trigger a 500 error. Your uptime monitoring stays green. But the quality of answers quietly declines, and you find out when a user complains or an audit reveals a significant factual error.

This is the fundamental production challenge with RAG: it is a probabilistic pipeline with no built-in correctness assertions. A SQL query that returns wrong data throws an error or returns an empty result set. A RAG pipeline that retrieves wrong context and hallucinates an answer returns HTTP 200 with a confident, coherent, wrong answer.

The solution is continuous evaluation: a pipeline that runs quality metrics against a representative golden dataset on every deployment, tracks metrics over time, and alerts when any metric degrades beyond a threshold. This is MLOps discipline applied to retrieval-augmented generation, and it is not optional for any production RAG system that carries business-critical workloads.

The Four RAG Failure Modes Evaluation Must Catch

1. Retrieval Failure

The retriever does not return the documents needed to answer the question. This can happen because the relevant document was never indexed, the chunking strategy split a key passage across chunk boundaries, the query vocabulary differs from document vocabulary (the semantic gap problem), or the vector index has drifted from the embedding model. Measured by context recall: what fraction of the ground-truth relevant passages appear in the retrieved context?
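The context recall formula reduces to a simple ratio. The sketch below is an illustrative simplification: it matches ground-truth passages by substring containment, whereas RAGAS uses an LLM judge to check each ground-truth sentence against the retrieved context.

```python
def context_recall(ground_truth_passages: list[str], retrieved_chunks: list[str]) -> float:
    """Fraction of ground-truth passages found in the retrieved context.

    Simplified: substring containment stands in for the LLM judge
    that RAGAS uses to verify each ground-truth sentence.
    """
    if not ground_truth_passages:
        return 1.0
    context = "\n".join(retrieved_chunks)
    hits = sum(1 for passage in ground_truth_passages if passage in context)
    return hits / len(ground_truth_passages)
```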

2. Hallucination

The LLM generates claims that are not supported by the retrieved context. This is distinct from retrieval failure — the context may be correct but the model still generates unsupported statements. More common when the retrieved context is partially relevant (it mentions related topics but not the specific answer), when context is long and the model attends to the wrong sections, or when the model's parametric knowledge conflicts with the retrieved context. Measured by faithfulness: what fraction of the generated claims are grounded in the retrieved context?
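The faithfulness computation is likewise a ratio once claim decomposition and claim verification are available. In this sketch both are injected as stand-ins (RAGAS performs each with LLM calls), so only the scoring logic is shown:

```python
from typing import Callable

def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer claims grounded in the retrieved context.

    `claims` would come from an LLM decomposing the answer into atomic
    statements; `is_supported` would be an LLM judge. Both are injected
    here so the scoring logic stays visible and testable.
    """
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

# Toy verifier: substring match stands in for the LLM judge.
score = faithfulness_score(
    ["Paris is the capital of France", "Paris has 10M residents"],
    "Paris is the capital of France.",
    lambda claim, ctx: claim in ctx,
)
```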

3. Outdated Context

The retriever returns documents that were accurate when ingested but are now stale. A policy document updated three months ago that has not been re-ingested will appear in search results and produce confident but incorrect answers. This requires document freshness tracking in your index — every document should have a last-modified timestamp, and documents beyond a staleness threshold should be flagged or removed. Not directly captured by RAGAS metrics, but detectable via golden dataset regression when your ground-truth answers were generated from current document versions.

4. Query-Answer Mismatch

The generated answer is factually consistent with the context but does not actually address the user's query. This happens when the retriever returns topically related but not directly relevant content, and the LLM generates a plausible-sounding answer that drifts from the original question. Measured by answer relevancy: how well does the generated answer address the original query, independent of context? This metric uses the generation itself as a signal — if the answer could have been generated for a different question, relevancy is low.
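The reverse-question approach can be sketched as a mean cosine similarity. The reverse questions and their embeddings are assumed to come from upstream LLM and embedding calls; only the scoring step is shown:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def answer_relevancy(
    query_embedding: list[float],
    reverse_question_embeddings: list[list[float]],
) -> float:
    """Mean cosine similarity between the original query and questions
    an LLM generated *from the answer*. If the answer drifts from the
    query, the reverse questions drift too, and the mean drops.
    """
    if not reverse_question_embeddings:
        return 0.0
    sims = [cosine(query_embedding, e) for e in reverse_question_embeddings]
    return sum(sims) / len(sims)
```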

RAGAS 0.2 Metrics: Definition, Formula, and Failure Mode

| Metric | What It Measures | Score Range | Failure Mode Caught | Implementation |
|---|---|---|---|---|
| Faithfulness | Fraction of answer claims supported by retrieved context | 0.0 – 1.0 | Hallucination | LLM decomposes answer into claims, verifies each against context |
| Answer Relevancy | How well the answer addresses the original query | 0.0 – 1.0 | Query-answer mismatch | LLM generates reverse-questions from answer, computes similarity to original query |
| Context Precision | Fraction of retrieved chunks that are actually relevant | 0.0 – 1.0 | Noisy retrieval | LLM assesses relevance of each retrieved chunk to the query |
| Context Recall | Fraction of ground-truth relevant information present in context | 0.0 – 1.0 | Retrieval failure | LLM checks ground-truth answer sentences against retrieved context |
| Answer Correctness | Semantic + factual similarity to ground-truth answer | 0.0 – 1.0 | Overall quality | Weighted combination of embedding-based semantic similarity and LLM-based factual similarity |

Golden Dataset Construction Methodology

A golden dataset is a collection of (query, ground-truth answer, relevant document references) triples that represents the realistic query distribution for your RAG system. Building a high-quality golden dataset is 80% of the work of a continuous evaluation pipeline.

Sampling Queries from Production Logs

The highest-value source for golden dataset queries is production query logs (anonymized and filtered for PII). Cluster queries using embedding similarity to identify the major query types in your system, then sample proportionally from each cluster. This ensures your golden dataset reflects actual user behavior, not what you imagined users would ask at design time. Aim for 100-200 queries minimum; 500+ for a comprehensive evaluation suite.
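Once queries are clustered (e.g. k-means over query embeddings, assumed to have happened upstream), proportional sampling is a few lines. A sketch with a hypothetical `clustered_queries` mapping from cluster label to queries:

```python
import random

def proportional_sample(
    clustered_queries: dict[str, list[str]],  # cluster label -> queries
    total: int,
    seed: int = 42,
) -> list[str]:
    """Sample golden-dataset candidates proportionally to cluster size.

    Guarantees at least one sample per cluster so rare-but-real query
    types are not dropped entirely from the golden dataset.
    """
    rng = random.Random(seed)
    n_all = sum(len(qs) for qs in clustered_queries.values())
    sample = []
    for label, queries in clustered_queries.items():
        k = max(1, round(total * len(queries) / n_all))
        sample.extend(rng.sample(queries, min(k, len(queries))))
    return sample
```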

Generating Ground-Truth Answers

Ground-truth answers must be written or reviewed by domain experts, not generated by the same LLM you are evaluating. Use gpt-4o or Claude Sonnet to generate candidate answers, but have human reviewers validate and correct them. For RAGAS context recall, you also need ground-truth relevant document references — the specific passages that should appear in the retrieved context to answer each query correctly. This is the most labor-intensive part of golden dataset construction.

Synthetic Dataset Generation with RAGAS TestsetGenerator

RAGAS 0.2 includes a TestsetGenerator that synthesizes (query, ground-truth, context) triples from your document corpus. It generates queries of different types: simple factoid questions, multi-hop reasoning questions, and abstractive synthesis questions. Synthetic datasets are faster to create but have lower distributional fidelity than production-sampled datasets. Use synthetic data to bootstrap coverage when you have limited production logs, then refine with human-validated production samples over time.

RAGAS Evaluation Pipeline with Golden Dataset and Pass/Fail Thresholds

python
import json
import os
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


# --- Configuration ---
@dataclass
class EvalThresholds:
    faithfulness: float = 0.85
    answer_relevancy: float = 0.80
    context_precision: float = 0.75
    context_recall: float = 0.75
    answer_correctness: float = 0.70


@dataclass
class EvalResult:
    passed: bool
    scores: dict[str, float]
    failures: list[str]
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    dataset_size: int = 0


# --- RAG pipeline stub (replace with your actual pipeline) ---
class ProductionRAGPipeline:
    """Stub for your production RAG pipeline."""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    def run(self, query: str) -> dict:
        """Returns {answer, contexts} for a given query."""
        # In production: call your actual retriever and LLM
        docs = self.retriever.get_relevant_documents(query)
        contexts = [d.page_content for d in docs]
        # LLM call with retrieved context
        context_str = "\n\n".join(contexts)
        prompt = f"Context:\n{context_str}\n\nQuestion: {query}\n\nAnswer:"
        response = self.llm.invoke(prompt)
        return {"answer": response.content, "contexts": contexts}


# --- Golden dataset loader ---
def load_golden_dataset(path: str) -> list[dict]:
    """
    Golden dataset format:
    [{"question": str, "ground_truth": str, "reference_contexts": [str]}]
    """
    with open(path) as f:
        return json.load(f)


# --- Evaluation runner ---
class RAGEvaluator:
    def __init__(
        self,
        rag_pipeline: ProductionRAGPipeline,
        thresholds: Optional[EvalThresholds] = None,
    ):
        self.pipeline = rag_pipeline
        self.thresholds = thresholds or EvalThresholds()

        eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
        eval_embeddings = LangchainEmbeddingsWrapper(
            OpenAIEmbeddings(model="text-embedding-3-small")
        )
        # ragas.metrics exports these as ready-made metric instances
        self.metrics = [
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
            answer_correctness,
        ]
        for m in self.metrics:
            m.llm = eval_llm
            if hasattr(m, "embeddings"):
                m.embeddings = eval_embeddings

    def _build_eval_dataset(self, golden: list[dict]) -> Dataset:
        questions, answers, contexts, ground_truths = [], [], [], []
        for item in golden:
            result = self.pipeline.run(item["question"])
            questions.append(item["question"])
            answers.append(result["answer"])
            contexts.append(result["contexts"])
            ground_truths.append(item["ground_truth"])
        return Dataset.from_dict({
            "question": questions,
            "answer": answers,
            "contexts": contexts,
            "ground_truth": ground_truths,
        })

    def run_evaluation(self, golden_dataset_path: str) -> EvalResult:
        golden = load_golden_dataset(golden_dataset_path)
        print(f"Running evaluation on {len(golden)} golden samples...")

        eval_dataset = self._build_eval_dataset(golden)
        results = evaluate(eval_dataset, metrics=self.metrics)
        scores = {
            "faithfulness": float(results["faithfulness"]),
            "answer_relevancy": float(results["answer_relevancy"]),
            "context_precision": float(results["context_precision"]),
            "context_recall": float(results["context_recall"]),
            "answer_correctness": float(results["answer_correctness"]),
        }

        failures = []
        thresholds = vars(self.thresholds)
        for metric, score in scores.items():
            threshold = thresholds.get(metric, 0.0)
            if score < threshold:
                failures.append(f"{metric}: {score:.3f} < threshold {threshold:.3f}")

        result = EvalResult(
            passed=len(failures) == 0,
            scores=scores,
            failures=failures,
            dataset_size=len(golden),
        )

        self._write_report(result)
        return result

    def _write_report(self, result: EvalResult) -> None:
        report_path = f"eval_report_{result.timestamp[:10]}.json"
        with open(report_path, "w") as f:
            json.dump({
                "passed": result.passed,
                "timestamp": result.timestamp,
                "dataset_size": result.dataset_size,
                "scores": result.scores,
                "failures": result.failures,
                "thresholds": vars(self.thresholds),
            }, f, indent=2)
        status = "PASSED" if result.passed else "FAILED"
        print(f"\nEvaluation {status}")
        for metric, score in result.scores.items():
            flag = " *** BELOW THRESHOLD" if any(metric in f for f in result.failures) else ""
            print(f"  {metric}: {score:.3f}{flag}")
        if result.failures:
            print(f"\nFailed metrics: {result.failures}")

A self-contained RAGAS 0.2 evaluation pipeline: runs all five core metrics against a golden dataset JSON file, compares against configurable thresholds, and writes a structured JSON report. Integrate this as a CI step — fail the deployment if any metric is below threshold.

Tip

Run RAGAS evaluations using a separate, dedicated LLM (GPT-4o at temperature=0) from the one being evaluated. RAGAS metrics are themselves LLM-based — the evaluator LLM grades the answers produced by your pipeline LLM. If you use the same model for both roles, you introduce self-grading bias: the model is more likely to assess its own outputs as faithful and relevant. This inflates faithfulness and answer relevancy scores by 5-12% in our benchmarks. Always use a different model family or at minimum a different model version for evaluation.

Continuous Evaluation Pipeline: CI/CD Integration

RAG evaluation should run on every deployment, not just at release milestones.

1. Add evaluation as a deployment gate

Integrate the RAGAS evaluation runner as a CI pipeline step that runs after deployment to a staging environment. The pipeline should load the current golden dataset, run the full evaluation, and fail the deployment if any metric drops below threshold. Store the evaluation report as a CI artifact for trend tracking.
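One way to wire the gate, sketched as a small helper that consumes the JSON report the evaluation runner writes (the `passed` and `failures` keys match that report format) and converts it into a CI exit code via `sys.exit(ci_gate(path))`:

```python
import json

def ci_gate(report_path: str) -> int:
    """Deployment-gate exit code from a written eval report.

    Returns non-zero when any metric failed, so the CI step fails
    and blocks promotion from staging.
    """
    with open(report_path) as f:
        report = json.load(f)
    if report["passed"]:
        print("Evaluation gate passed")
        return 0
    print("Evaluation gate FAILED:")
    for failure in report["failures"]:
        print(f"  {failure}")
    return 1
```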

2. Track metric trends over time

Write evaluation results to a time-series store (InfluxDB, Prometheus, or a simple PostgreSQL table with a timestamp index). Build a dashboard (Grafana, Metabase) that shows each metric over time. A single evaluation result is a snapshot; the trend reveals whether your pipeline is improving, stable, or slowly degrading.
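A minimal sketch of the time-series write, using SQLite as a stand-in for whichever store you choose; the table and column names are illustrative:

```python
import sqlite3

def record_scores(db_path: str, timestamp: str, scores: dict[str, float]) -> None:
    """Append one evaluation run to a simple time-series table.

    A plain table with a timestamp column is enough to back a Grafana
    or Metabase trend dashboard; swap in PostgreSQL/InfluxDB as needed.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_scores (ts TEXT, metric TEXT, score REAL)"
    )
    conn.executemany(
        "INSERT INTO eval_scores VALUES (?, ?, ?)",
        [(timestamp, metric, score) for metric, score in scores.items()],
    )
    conn.commit()
    conn.close()
```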

3. Alert on metric regression

Set alerts when any metric drops more than 5% week-over-week or falls below the absolute threshold. Route alerts to the engineering team, not just the monitoring dashboard. Context recall drops often precede faithfulness drops — retrieval failure causes the LLM to hallucinate to fill the gap, so monitoring context recall gives you early warning of downstream quality issues.

4. Expand the golden dataset continuously

Schedule a monthly review to add 20-30 new golden dataset samples from recent production queries. Prioritize queries that generated negative user feedback — these are the edge cases your current golden dataset underrepresents. A golden dataset that only reflects launch-day query patterns will not detect quality regressions on newer use cases.

Inductivee's RAG Quality Monitoring Practice

Every production RAG system we deliver includes a RAGAS evaluation pipeline as part of the deployment infrastructure, not as an afterthought. The golden dataset is built during the development phase — when domain experts and engineers are already collaborating — because constructing it post-deployment, when the budget for that work has evaporated, is when evaluation gets skipped entirely.

The most common quality regression we observe in production RAG systems is context recall degradation caused by knowledge base growth without corresponding re-chunking and re-indexing. As a corpus grows from 1,000 to 50,000 documents, the ANN index characteristics change, retrieval diversity decreases, and the same queries that scored well at launch start missing relevant passages. This is invisible without continuous evaluation — it shows up in RAGAS context recall scores 2-3 months before it shows up in user feedback volume.

For teams currently running RAG without evaluation: start with a minimal 50-question golden dataset on your highest-traffic query patterns and a single RAGAS run today. The baseline you establish now is more valuable than a perfect evaluation setup you build three months from now.

Frequently Asked Questions

What is RAGAS and what does it measure in a RAG pipeline?

RAGAS (Retrieval Augmented Generation Assessment) is an open-source evaluation framework that measures RAG pipeline quality across five metrics: faithfulness (are generated claims grounded in retrieved context?), answer relevancy (does the answer address the query?), context precision (is the retrieved context focused on the query?), context recall (does the context contain the needed information?), and answer correctness (does the answer match the ground truth?). RAGAS uses LLMs as judges to compute these metrics, enabling reference-free evaluation without requiring human annotators for every evaluation run.

What is a golden dataset for RAG evaluation and how do you build one?

A golden dataset is a collection of (query, ground-truth answer, relevant document references) triples that represents the realistic query distribution for your RAG system. Build it by sampling queries from production logs, clustering by topic, and having domain experts write or validate ground-truth answers. Aim for 100-200 samples minimum, prioritizing queries from high-traffic query clusters and queries that have generated negative user feedback. RAGAS TestsetGenerator can synthesize additional samples from your document corpus to bootstrap coverage.

How often should RAG evaluation run in production?

RAG evaluation should run on every deployment to staging as a deployment gate — failing the deployment if any metric drops below threshold. Additionally, run a scheduled weekly evaluation against the live production system to detect quality drift caused by knowledge base changes, embedding model updates, or query distribution shifts that are invisible in deployment-gate evaluations. Monthly golden dataset reviews should add new samples from recent production queries.

What causes RAG faithfulness scores to decrease in production?

Faithfulness decreases when the LLM generates claims not supported by the retrieved context. Common causes: retrieved context is too long and the model attends to the wrong sections, retrieved documents are topically related but do not directly answer the query (causing the model to fill gaps from parametric memory), prompt changes that reduce the grounding instruction strength, or model updates that change generation behavior. Monitor faithfulness alongside context precision — low context precision (noisy retrieval) is often the root cause of low faithfulness.

Can RAGAS be used without ground-truth answers?

Yes, partially. Faithfulness, answer relevancy, and context precision are reference-free metrics that do not require ground-truth answers — they evaluate the relationship between the query, retrieved context, and generated answer. Context recall and answer correctness require ground-truth answers because they measure coverage of known-correct information. For initial evaluation without a full golden dataset, running the three reference-free metrics provides meaningful signal on retrieval noise and hallucination rates.

Written By

Inductivee Team — AI Engineering at Inductivee


Agentic AI Engineering Team

The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.

Agentic AI Architecture · Multi-Agent Orchestration · LangChain · LangGraph · CrewAI · Microsoft AutoGen

Inductivee is a remote-first agentic AI engineering firm with 40+ production deployments across 25+ enterprises since 2012. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
