RAG Evaluation in Production: Building Continuous Quality Monitoring
RAG pipeline quality degrades silently as your data changes, embedding models update, and query distributions shift. Here is the continuous evaluation architecture — RAGAS, golden datasets, and drift detection — that keeps production RAG honest.
RAG pipelines degrade along four axes — retrieval failure, hallucination, outdated context, and query-answer mismatch — and most of these degradations are invisible in your application error logs. RAGAS 0.2 provides the metric framework (faithfulness, answer relevancy, context precision, context recall) to quantify each failure mode. A continuous evaluation pipeline that runs these metrics on every deployment against a golden dataset is the only way to maintain production RAG quality as your corpus and usage evolve.
The Silent Degradation Problem in Production RAG
A RAG pipeline deployed in January does not behave the same way in June. Your knowledge base grows and becomes inconsistent. An embedding model gets updated upstream. The query distribution shifts as more users find the system. A new document format gets ingested that your chunking strategy handles poorly. None of these trigger a 500 error. Your uptime monitoring stays green. But the quality of answers quietly declines, and you find out when a user complains or an audit reveals a significant factual error.
This is the fundamental production challenge with RAG: it is a probabilistic pipeline with no built-in correctness assertions. A SQL query that returns wrong data throws an error or returns an empty result set. A RAG pipeline that retrieves wrong context and hallucinates an answer returns HTTP 200 with a confident, coherent, wrong answer.
The solution is continuous evaluation: a pipeline that runs quality metrics against a representative golden dataset on every deployment, tracks metrics over time, and alerts when any metric degrades beyond a threshold. This is MLOps discipline applied to retrieval-augmented generation, and it is not optional for any production RAG system that carries business-critical workloads.
The Four RAG Failure Modes Evaluation Must Catch
1. Retrieval Failure
The retriever does not return the documents needed to answer the question. This can happen because the relevant document was never indexed, the chunking strategy split a key passage across chunk boundaries, the query vocabulary differs from document vocabulary (the semantic gap problem), or the vector index has drifted from the embedding model. Measured by context recall: what fraction of the ground-truth relevant passages appear in the retrieved context?
2. Hallucination
The LLM generates claims that are not supported by the retrieved context. This is distinct from retrieval failure — the context may be correct but the model still generates unsupported statements. More common when the retrieved context is partially relevant (it mentions related topics but not the specific answer), when context is long and the model attends to the wrong sections, or when the model's parametric knowledge conflicts with the retrieved context. Measured by faithfulness: what fraction of the generated claims are grounded in the retrieved context?
3. Outdated Context
The retriever returns documents that were accurate when ingested but are now stale. A policy document updated three months ago that has not been re-ingested will appear in search results and produce confident but incorrect answers. This requires document freshness tracking in your index — every document should have a last-modified timestamp, and documents beyond a staleness threshold should be flagged or removed. Not directly captured by RAGAS metrics, but detectable via golden dataset regression when your ground-truth answers were generated from current document versions.
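Freshness tracking needs nothing more than timestamps on indexed documents. A minimal sketch of the flagging step — the 90-day threshold and the document dict shape are illustrative assumptions, not part of RAGAS:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical staleness check: documents older than the threshold are
# flagged so they can be re-ingested or excluded from retrieval.
def flag_stale_documents(docs: list[dict], staleness_days: int = 90) -> list[dict]:
    cutoff = datetime.now(timezone.utc) - timedelta(days=staleness_days)
    return [
        d for d in docs
        if datetime.fromisoformat(d["last_modified"]) < cutoff
    ]

docs = [
    {"id": "policy-a", "last_modified": "2020-01-01T00:00:00+00:00"},
    {"id": "policy-b", "last_modified": datetime.now(timezone.utc).isoformat()},
]
stale = flag_stale_documents(docs)
print([d["id"] for d in stale])  # ['policy-a']
```

In practice this runs as a scheduled job against the vector store's metadata, and flagged documents feed the re-ingestion queue.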
4. Query-Answer Mismatch
The generated answer is factually consistent with the context but does not actually address the user's query. This happens when the retriever returns topically related but not directly relevant content, and the LLM generates a plausible-sounding answer that drifts from the original question. Measured by answer relevancy: how well does the generated answer address the original query, independent of context? This metric uses the generation itself as a signal — if the answer could have been generated for a different question, relevancy is low.
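Stripped of the LLM machinery, the reverse-question mechanism reduces to a mean cosine similarity between the original query embedding and the embeddings of questions generated from the answer. The toy 2-D vectors below stand in for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Answer relevancy (simplified): embed N questions generated *from the answer*,
# then average their similarity to the original query embedding.
def answer_relevancy_sketch(query_emb, reverse_question_embs):
    return sum(cosine(query_emb, q) for q in reverse_question_embs) / len(reverse_question_embs)

query = [1.0, 0.0]
on_topic = [[0.9, 0.1], [1.0, 0.05]]   # reverse-questions close to the query
off_topic = [[0.1, 0.9], [0.0, 1.0]]   # the answer drifted from the question
print(answer_relevancy_sketch(query, on_topic) > answer_relevancy_sketch(query, off_topic))  # True
```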
RAGAS 0.2 Metrics: Definition, Formula, and Failure Mode
| Metric | What It Measures | Score Range | Failure Mode Caught | Implementation |
|---|---|---|---|---|
| Faithfulness | Fraction of answer claims supported by retrieved context | 0.0 – 1.0 | Hallucination | LLM decomposes answer into claims, verifies each against context |
| Answer Relevancy | How well the answer addresses the original query | 0.0 – 1.0 | Query-answer mismatch | LLM generates reverse-questions from answer, computes similarity to original query |
| Context Precision | Fraction of retrieved chunks that are actually relevant | 0.0 – 1.0 | Noisy retrieval | LLM assesses relevance of each retrieved chunk to the query |
| Context Recall | Fraction of ground-truth relevant information present in context | 0.0 – 1.0 | Retrieval failure | LLM checks ground-truth answer sentences against retrieved context |
| Answer Correctness | Semantic + factual similarity to ground-truth answer | 0.0 – 1.0 | Overall quality | Weighted combination of claim-level factual F1 and embedding-based semantic similarity to the ground truth |
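As a concrete instance of the faithfulness formula, the verification step reduces to a simple ratio once each extracted claim has a supported/unsupported label. In RAGAS those labels come from the evaluator LLM; here they are hand-assigned to show the arithmetic:

```python
# Faithfulness = supported claims / total claims extracted from the answer.
# Labels are assigned by hand for illustration; RAGAS derives them by
# checking each claim against the retrieved context with an LLM.
claims = [
    ("The refund window is 30 days", True),           # grounded in context
    ("Refunds are processed within 48 hours", True),  # grounded in context
    ("Gift cards are refundable", False),             # not in retrieved context
]
faithfulness_score = sum(supported for _, supported in claims) / len(claims)
print(round(faithfulness_score, 3))  # 0.667
```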
Golden Dataset Construction Methodology
A golden dataset is a collection of (query, ground-truth answer, relevant document references) triples that represents the realistic query distribution for your RAG system. Building a high-quality golden dataset is 80% of the work of a continuous evaluation pipeline.
Sampling Queries from Production Logs
The highest-value source for golden dataset queries is production query logs (anonymized and filtered for PII). Cluster queries using embedding similarity to identify the major query types in your system, then sample proportionally from each cluster. This ensures your golden dataset reflects actual user behavior, not what you imagined users would ask at design time. Aim for 100-200 queries minimum; 500+ for a comprehensive evaluation suite.
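Proportional sampling is a few lines once cluster labels exist. The labels below are assumed to come from an upstream embedding-clustering step (e.g. k-means over query embeddings); the query strings are synthetic:

```python
import random
from collections import defaultdict

# Hypothetical input: (query, cluster_id) pairs from an upstream
# embedding-clustering step. Sample from each cluster in proportion
# to its share of production traffic.
def sample_proportional(labeled_queries, n_total, seed=42):
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for query, cluster_id in labeled_queries:
        clusters[cluster_id].append(query)
    total = len(labeled_queries)
    sample = []
    for cluster_id, queries in clusters.items():
        k = max(1, round(n_total * len(queries) / total))  # at least 1 per cluster
        sample.extend(rng.sample(queries, min(k, len(queries))))
    return sample

logs = [(f"pricing q{i}", "pricing") for i in range(70)] + \
       [(f"refund q{i}", "refunds") for i in range(30)]
golden_candidates = sample_proportional(logs, n_total=10)
print(len(golden_candidates))  # 10: ~7 pricing, ~3 refund queries
```

The `max(1, ...)` floor guarantees rare-but-real query types are never sampled out entirely.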
Generating Ground-Truth Answers
Ground-truth answers must be written or reviewed by domain experts, not generated by the same LLM you are evaluating. Use gpt-4o or Claude Sonnet to generate candidate answers, but have human reviewers validate and correct them. For RAGAS context recall, you also need ground-truth relevant document references — the specific passages that should appear in the retrieved context to answer each query correctly. This is the most labor-intensive part of golden dataset construction.
Synthetic Dataset Generation with RAGAS TestsetGenerator
RAGAS 0.2 includes a TestsetGenerator that synthesizes (query, ground-truth, context) triples from your document corpus. It generates queries of different types: simple factoid questions, multi-hop reasoning questions, and abstractive synthesis questions. Synthetic datasets are faster to create but have lower distributional fidelity than production-sampled datasets. Use synthetic data to bootstrap coverage when you have limited production logs, then refine with human-validated production samples over time.
RAGAS Evaluation Pipeline with Golden Dataset and Pass/Fail Thresholds
```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


# --- Configuration ---
@dataclass
class EvalThresholds:
    faithfulness: float = 0.85
    answer_relevancy: float = 0.80
    context_precision: float = 0.75
    context_recall: float = 0.75
    answer_correctness: float = 0.70


@dataclass
class EvalResult:
    passed: bool
    scores: dict[str, float]
    failures: list[str]
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    dataset_size: int = 0


# --- RAG pipeline stub (replace with your actual pipeline) ---
class ProductionRAGPipeline:
    """Stub for your production RAG pipeline."""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    def run(self, query: str) -> dict:
        """Returns {answer, contexts} for a given query."""
        # In production: call your actual retriever and LLM
        docs = self.retriever.get_relevant_documents(query)
        contexts = [d.page_content for d in docs]
        # LLM call with retrieved context
        context_str = "\n\n".join(contexts)
        prompt = f"Context:\n{context_str}\n\nQuestion: {query}\n\nAnswer:"
        response = self.llm.invoke(prompt)
        return {"answer": response.content, "contexts": contexts}


# --- Golden dataset loader ---
def load_golden_dataset(path: str) -> list[dict]:
    """
    Golden dataset format:
    [{"question": str, "ground_truth": str, "reference_contexts": [str]}]
    """
    with open(path) as f:
        return json.load(f)


# --- Evaluation runner ---
class RAGEvaluator:
    def __init__(
        self,
        rag_pipeline: ProductionRAGPipeline,
        thresholds: Optional[EvalThresholds] = None,
    ):
        self.pipeline = rag_pipeline
        self.thresholds = thresholds or EvalThresholds()
        eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
        eval_embeddings = LangchainEmbeddingsWrapper(
            OpenAIEmbeddings(model="text-embedding-3-small")
        )
        # The names imported from ragas.metrics are already metric instances
        self.metrics = [
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
            answer_correctness,
        ]
        # Attach a dedicated evaluator LLM, distinct from the pipeline LLM
        for m in self.metrics:
            m.llm = eval_llm
            if hasattr(m, "embeddings"):
                m.embeddings = eval_embeddings

    def _build_eval_dataset(self, golden: list[dict]) -> Dataset:
        questions, answers, contexts, ground_truths = [], [], [], []
        for item in golden:
            result = self.pipeline.run(item["question"])
            questions.append(item["question"])
            answers.append(result["answer"])
            contexts.append(result["contexts"])
            ground_truths.append(item["ground_truth"])
        return Dataset.from_dict({
            "question": questions,
            "answer": answers,
            "contexts": contexts,
            "ground_truth": ground_truths,
        })

    def run_evaluation(self, golden_dataset_path: str) -> EvalResult:
        golden = load_golden_dataset(golden_dataset_path)
        print(f"Running evaluation on {len(golden)} golden samples...")
        eval_dataset = self._build_eval_dataset(golden)
        results = evaluate(eval_dataset, metrics=self.metrics)
        scores = {
            "faithfulness": float(results["faithfulness"]),
            "answer_relevancy": float(results["answer_relevancy"]),
            "context_precision": float(results["context_precision"]),
            "context_recall": float(results["context_recall"]),
            "answer_correctness": float(results["answer_correctness"]),
        }
        failures = []
        thresholds = vars(self.thresholds)
        for metric, score in scores.items():
            threshold = thresholds.get(metric, 0.0)
            if score < threshold:
                failures.append(f"{metric}: {score:.3f} < threshold {threshold:.3f}")
        result = EvalResult(
            passed=len(failures) == 0,
            scores=scores,
            failures=failures,
            dataset_size=len(golden),
        )
        self._write_report(result)
        return result

    def _write_report(self, result: EvalResult) -> None:
        report_path = f"eval_report_{result.timestamp[:10]}.json"
        with open(report_path, "w") as f:
            json.dump({
                "passed": result.passed,
                "timestamp": result.timestamp,
                "dataset_size": result.dataset_size,
                "scores": result.scores,
                "failures": result.failures,
                "thresholds": vars(self.thresholds),
            }, f, indent=2)
        status = "PASSED" if result.passed else "FAILED"
        print(f"\nEvaluation {status}")
        for metric, score in result.scores.items():
            flag = " *** BELOW THRESHOLD" if any(metric in f for f in result.failures) else ""
            print(f"  {metric}: {score:.3f}{flag}")
        if result.failures:
            print(f"\nFailed metrics: {result.failures}")
```
A self-contained RAGAS 0.2 evaluation pipeline: runs all five core metrics against a golden dataset JSON file, compares against configurable thresholds, and writes a structured JSON report. Integrate this as a CI step — fail the deployment if any metric is below threshold.
Run RAGAS evaluations using a separate, dedicated LLM (GPT-4o at temperature=0) from the one being evaluated. RAGAS metrics are themselves LLM-based — the evaluator LLM grades the answers produced by your pipeline LLM. If you use the same model for both roles, you introduce self-grading bias: the model is more likely to assess its own outputs as faithful and relevant. This inflates faithfulness and answer relevancy scores by 5-12% in our benchmarks. Always use a different model family or at minimum a different model version for evaluation.
Continuous Evaluation Pipeline: CI/CD Integration
RAG evaluation should run on every deployment, not just at release milestones.
Add evaluation as a deployment gate
Integrate the RAGAS evaluation runner as a CI pipeline step that runs after deployment to a staging environment. The pipeline should load the current golden dataset, run the full evaluation, and fail the deployment if any metric drops below threshold. Store the evaluation report as a CI artifact for trend tracking.
Track metric trends over time
Write evaluation results to a time-series store (InfluxDB, Prometheus, or a simple PostgreSQL table with a timestamp index). Build a dashboard (Grafana, Metabase) that shows each metric over time. A single evaluation result is a snapshot; the trend reveals whether your pipeline is improving, stable, or slowly degrading.
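For teams without existing time-series infrastructure, a single indexed table is enough to start. A minimal sketch using SQLite (the schema and column names are our own; swap in PostgreSQL for shared access):

```python
import sqlite3

# Minimal trend store: one row per (run, metric). An index on run_at
# keeps trend queries fast as history accumulates.
conn = sqlite3.connect(":memory:")  # use a file or a PostgreSQL table in practice
conn.execute("""
    CREATE TABLE eval_scores (
        run_at  TEXT NOT NULL,
        metric  TEXT NOT NULL,
        score   REAL NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_run_at ON eval_scores (run_at)")

scores = {"faithfulness": 0.91, "context_recall": 0.78}
conn.executemany(
    "INSERT INTO eval_scores (run_at, metric, score) VALUES (?, ?, ?)",
    [("2025-01-15T10:00:00Z", m, s) for m, s in scores.items()],
)

# The kind of query a dashboard panel would run: scores over time per metric.
rows = conn.execute(
    "SELECT metric, score FROM eval_scores ORDER BY run_at DESC"
).fetchall()
print(rows)
```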
Alert on metric regression
Set alerts when any metric drops more than 5% week-over-week or falls below the absolute threshold. Route alerts to the engineering team, not just the monitoring dashboard. Context recall drops often precede faithfulness drops — retrieval failure causes the LLM to hallucinate to fill the gap, so monitoring context recall gives you early warning of downstream quality issues.
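The alerting rule is easy to state in code. A sketch where the 5% relative drop and the absolute floor are the thresholds named above, applied to a per-metric score history:

```python
def should_alert(history: list[float], absolute_floor: float,
                 max_relative_drop: float = 0.05) -> bool:
    """Alert when the latest score breaches the absolute threshold or
    drops more than max_relative_drop vs. the previous run."""
    latest = history[-1]
    if latest < absolute_floor:
        return True
    if len(history) >= 2 and history[-2] > 0:
        relative_drop = (history[-2] - latest) / history[-2]
        if relative_drop > max_relative_drop:
            return True
    return False

print(should_alert([0.90, 0.89], absolute_floor=0.75))  # False: within tolerance
print(should_alert([0.90, 0.83], absolute_floor=0.75))  # True: ~7.8% drop
print(should_alert([0.80, 0.74], absolute_floor=0.75))  # True: below floor
```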
Expand the golden dataset continuously
Schedule a monthly review to add 20-30 new golden dataset samples from recent production queries. Prioritize queries that generated negative user feedback — these are the edge cases your current golden dataset underrepresents. A golden dataset that only reflects launch-day query patterns will not detect quality regressions on newer use cases.
Inductivee's RAG Quality Monitoring Practice
Every production RAG system we deliver includes a RAGAS evaluation pipeline as part of the deployment infrastructure, not as an afterthought. The golden dataset is built during the development phase — when domain experts and engineers are already collaborating — because constructing it post-deployment, when the budget for that work has evaporated, is when evaluation gets skipped entirely.
The most common quality regression we observe in production RAG systems is context recall degradation caused by knowledge base growth without corresponding re-chunking and re-indexing. As a corpus grows from 1,000 to 50,000 documents, the ANN index characteristics change, retrieval diversity decreases, and the same queries that scored well at launch start missing relevant passages. This is invisible without continuous evaluation — it shows up in RAGAS context recall scores 2-3 months before it shows up in user feedback volume.
For teams currently running RAG without evaluation: start with a minimal 50-question golden dataset on your highest-traffic query patterns and a single RAGAS run today. The baseline you establish now is more valuable than a perfect evaluation setup you build three months from now.
Frequently Asked Questions
What is RAGAS and what does it measure in a RAG pipeline?
What is a golden dataset for RAG evaluation and how do you build one?
How often should RAG evaluation run in production?
What causes RAG faithfulness scores to decrease in production?
Can RAGAS be used without ground-truth answers?
Written By
Inductivee Team
Agentic AI Engineering Team
The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.
Inductivee is a remote-first agentic AI engineering firm with 40+ production deployments across 25+ enterprises since 2012. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
Engineer This With Inductivee
The engineering patterns in this article are what our team builds into production every day. Explore the related service to see how we deliver this capability at enterprise scale.
Cognitive Web Portals
Enterprise RAG portals and natural-language gateways — we turn your enterprise data into an interactive, self-service AI assistant grounded in your own knowledge.
Cognitive Data Platforms
Cognitive data platforms and generative BI engineering — we transform raw enterprise data into a reasoning knowledge base for LLMs and autonomous agents. Built on vector databases, semantic ETL, and conversational analytics.
Related Articles
RAG Pipeline Architecture for the Enterprise: Five Layers Beyond the Basic Chatbot
Semantic Search for Enterprise Knowledge Bases: Engineering Beyond Full-Text
How to Test Autonomous Agents: Evaluation Frameworks for Production Reliability
Ready to Build This Into Your Enterprise?
Inductivee engineers agentic systems, RAG pipelines, and enterprise data liquidity solutions. Let's scope your project.
Start a Project