
The quality of a RAG system is determined by two axes: whether retrieval is pulling the right documents, and whether the generated answer faithfully reflects that context. LLM-as-a-Judge is an approach that delegates this quality assessment itself to a large language model. We begin by examining the limitations of conventional evaluation methods and the reasoning behind using an LLM as a judge.
Traditional methods for measuring RAG answer quality fall into two broad categories: human evaluation and rule-based evaluation. Both hit a wall once a RAG system reaches production scale.
Human evaluation involves having people read responses and score them for accuracy and usefulness. The quality of judgment is high, but as the number of items to evaluate grows into the hundreds or thousands, time and cost skyrocket, and scoring criteria drift between evaluators. Re-scoring every item with each release is simply not practical.
Rule-based evaluation mechanically measures surface-level overlap with reference answers using metrics like BLEU or ROUGE. It is fast and reproducible, but it tends to score answers that are semantically correct yet differently worded too low, while scoring answers that merely look similar but are factually wrong too high. It is ill-suited for RAG evaluation, which demands semantic correctness.
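To make this failure mode concrete, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplified stand-in rather than the official ROUGE implementation, and the function name and example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (not from any real dataset).
reference = "employees receive 10 days of paid leave after 6 months"
paraphrase = "staff get ten vacation days once they have worked half a year"
wrong = "employees receive 20 days of paid leave after 6 months"

paraphrase_score = unigram_f1(paraphrase, reference)
wrong_score = unigram_f1(wrong, reference)
print(f"correct paraphrase: {paraphrase_score:.2f}")
print(f"wrong but similar:  {wrong_score:.2f}")
```

The factually wrong sentence differs from the reference by a single token and therefore scores far higher than the correct paraphrase, which is exactly the inversion described above.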
In short, human evaluation does not scale, and rule-based evaluation cannot capture meaning. LLM-as-a-Judge fills this gap.
LLM-as-a-Judge is an approach that uses a large language model capable of understanding the meaning of responses as an evaluator (judge), delivering human-like assessments automatically and at scale.
The judge LLM is given the question, the retrieved context, and the generated answer, and is asked to return a score or pass/fail verdict based on predefined criteria (e.g., faithfulness to context, whether the question is answered). This makes it possible to evaluate far more items in a short time than human reviewers could manage, and because the criteria are fixed in the prompt, variance between evaluators is kept in check.
Unlike rule-based evaluation, a key advantage is that answers can be scored correctly even when phrased differently, as long as the meaning is right. On the other hand, since the judge is itself an LLM, it carries inherent weaknesses — biases and scoring inconsistencies discussed later. This is precisely why the design and validation of evaluation criteria are what ultimately determine quality.
While LLM-as-a-Judge can be implemented from scratch, using an evaluation framework is the common practice in real-world settings. The most prominent example is RAGAS, which provides both RAG-specific evaluation metrics and the mechanism for measuring them with an LLM judge, all in one package.
RAGAS comes standard with metrics well-suited to RAG — such as faithfulness, answer relevancy, and context precision — and internally calls a judge LLM to compute scores. It gets you up and running faster than building your own prompt and scoring logic from scratch, and it standardizes metric definitions across the board.
There are other general-purpose evaluation libraries as well, but this article focuses on RAGAS, which is widely used for RAG evaluation, to walk through the implementation steps for LLM-as-a-Judge. Note that the meaning of each metric and how to leverage them for improving RAG accuracy are covered in detail in How to Improve RAG Accuracy; this article keeps its focus on automating evaluation.
The accuracy and reproducibility of an evaluation pipeline depend on the groundwork laid before any metric is run. Here we confirm three prerequisites: clarifying what is being evaluated, selecting the judge LLM, and preparing the test set.
The first thing to do is clarify what you are evaluating. A RAG pipeline is divided into two stages — "retrieval" and "generation" — and evaluation metrics correspond to each. Without deciding which output of which stage to measure, you cannot apply the metrics even after selecting them.
Concretely, you need to be in a position to log at least the following three things for evaluation: the user's question, the context (chunks) retrieved by the search, and the final generated answer. Depending on the metric, a ground truth (reference) answer may also be required.
If your existing RAG system only stores answers, you will need to modify it to log the retrieved context as well. Without the context, metrics that measure retrieval quality cannot be calculated. Evaluation design begins with organizing this output format.
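As a rough sketch of what such a log entry could look like, the record below bundles the three required fields plus an optional reference answer and serializes them as one JSON line. The class name EvalRecord and the helper to_jsonl_line are hypothetical; adapt the shape to whatever logging your system already has.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalRecord:
    """One logged RAG interaction (hypothetical structure)."""
    question: str
    contexts: list[str]                  # chunks actually returned by retrieval
    answer: str                          # final generated answer
    ground_truth: Optional[str] = None   # only needed by some metrics


def to_jsonl_line(record: EvalRecord) -> str:
    """Serialize a record as one JSON line, keeping non-ASCII text readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = EvalRecord(
    question="What is the number of paid leave days granted under the work rules?",
    contexts=["10 days are granted after 6 months of employment..."],
    answer="10 days are granted after 6 months of employment.",
)
line = to_jsonl_line(record)
print(line)
```

Appending one such line per request gives you a file that can later be loaded and converted into an evaluation dataset.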
The basic principle for selecting an LLM to use as a judge is to choose one with judgment capability equal to or greater than the model that generated the answers being evaluated. Using a less capable model as the judge makes the evaluation itself unreliable. Frontier models such as GPT, Claude, and Gemini are candidates.
When selecting a model, verify the following three points. First, does it have a context length sufficient to handle long inputs (passing the question, retrieved context, and answer together)? Second, can it return structured output — such as scores or JSON — to stabilize evaluations? Third, can it handle the API rate limits and costs required by the volume of evaluations?
If internal data is being passed through for evaluation, data handling also requires attention. For highly confidential data, consider using a local LLM running on-premises or in a private environment as the judge, rather than an external API.
The reliability of an evaluation is largely determined by the quality of the test set (golden dataset). This is evaluation data that bundles together "anticipated questions" and, where necessary, "expected answers and the supporting context."
A good test set satisfies three conditions: it reflects the distribution of questions that real users will ask; it includes not only easy questions but also ambiguous, compound, and adversarial ones; and it is continuously updated from production data. Building a test set from questions that developers come up with on their own will fail to capture the failures that occur in production.
More examples yield more stable results, but a practical approach is to start with tens to hundreds of examples covering representative use cases, then continuously add failure cases discovered in production. Note that if the test set is constructed incorrectly, the evaluation results become unreliable. Typical failures of this kind are covered in detail in a later section.
There are many evaluation metrics for RAG, but measuring all of them indiscriminately makes interpretation difficult. Here we will look in order at representative metrics scored by a judge, their correspondence to RAGAS, and how to prioritize them by use case.
When evaluating RAG with LLM-as-a-Judge, the three central metrics are as follows. Each differs in terms of the evaluation perspective — that is, what the judge looks at when scoring.
Faithfulness: Whether the generated answer is grounded in the retrieved context. The judge checks whether each claim in the answer can be derived from the context, and scores it lower if the answer states information not present in the context (i.e., hallucination).
Answer Relevancy: Whether the answer properly addresses the intent of the question. The judge penalizes answers that are accurate but off-topic, or verbose with an unclear main point.
Context Precision: How much of the retrieved context actually contains what is needed to answer the question. This is a metric that measures retrieval quality, not generation quality.
Faithfulness and Answer Relevancy assess the quality of generation, while Context Precision assesses the quality of retrieval. The value of measuring these metrics separately lies in the ability to isolate which stage a problem originates from.
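The isolation idea can be sketched as a small lookup from metric to pipeline stage. The function name failing_stages and the 0.7 cutoff are illustrative assumptions, not values recommended by RAGAS or by this article.

```python
# Which stage each metric speaks for, following the split described above.
GENERATION_METRICS = ("faithfulness", "answer_relevancy")
RETRIEVAL_METRICS = ("context_precision",)


def failing_stages(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return the pipeline stages whose metrics fall below the threshold.

    The default threshold is a placeholder for illustration only.
    Metrics missing from `scores` are ignored.
    """
    stages = []
    if any(scores[m] < threshold for m in RETRIEVAL_METRICS if m in scores):
        stages.append("retrieval")
    if any(scores[m] < threshold for m in GENERATION_METRICS if m in scores):
        stages.append("generation")
    return stages


example = {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_precision": 0.41}
print(failing_stages(example))
```

In this example only retrieval is flagged, which suggests looking at chunking, search, or reranking before touching the generation prompt.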
The metrics described in the previous section are implemented as standard metrics in RAGAS, which calculates them automatically by invoking an LLM judge internally. Faithfulness corresponds to the faithfulness metric, Answer Relevancy to answer_relevancy, and retrieval quality to context_precision and context_recall.
Even when implementing LLM-as-a-Judge from scratch, keeping these correspondences in mind provides a useful guide for metric design. When using RAGAS, verify what data (question, context, answer, ground truth) each metric requires, and ensure your test set satisfies those inputs.
What each metric's numerical score means, and how to translate low scores into RAG improvements (such as reranking or hybrid search), is systematically explained in How to Improve RAG Accuracy: Reducing Hallucinations and Hybrid Search. This article focuses specifically on "how to automatically score these metrics using a judge and integrate them into operations."
Not all metrics need to be weighted equally. Depending on the use case for your RAG system, the metrics you should prioritize first will differ.
For use cases where accuracy is critically important—such as internal regulations, legal compliance, or medical and financial applications—faithfulness should be the top priority. This is because answering with information not found in the context (hallucination) represents the greatest risk. On the other hand, for use cases where clarity of response matters, such as customer support or internal help desks, the weight given to answer relevancy should be increased.
For knowledge bases with large numbers of documents where noise is likely to be mixed in, context precision should be monitored to continuously track retrieval quality.
Once priorities are established, reflect them in your pass/fail thresholds. For example, setting different thresholds per metric based on the risk profile of the use case—such as "a faithfulness score below the baseline is a failure"—is the practical approach. Demanding high performance across all metrics simultaneously tends to dilute the focus of improvement efforts.
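One way to encode such priorities is a per-use-case table of minimum scores. The use-case names and every number below are placeholders chosen for illustration; real thresholds should come from your own score history and risk assessment.

```python
# Hypothetical minimum scores per use case; all values are placeholders.
THRESHOLDS = {
    "compliance": {"faithfulness": 0.90, "answer_relevancy": 0.70},
    "support": {"faithfulness": 0.80, "answer_relevancy": 0.85},
    "large_kb": {"faithfulness": 0.80, "context_precision": 0.80},
}


def failed_metrics(use_case: str, scores: dict[str, float]) -> list[str]:
    """Return the metrics that fall below the minimum for this use case.

    A metric that is required but missing from `scores` counts as failed.
    """
    limits = THRESHOLDS[use_case]
    return sorted(m for m, minimum in limits.items() if scores.get(m, 0.0) < minimum)


scores = {"faithfulness": 0.85, "answer_relevancy": 0.90, "context_precision": 0.75}
print(failed_metrics("compliance", scores))
print(failed_metrics("support", scores))
```

The same scores fail the stricter faithfulness bar for the compliance use case while passing for support, which is the point of setting thresholds per risk profile.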
From here, we walk through the process of implementing an LLM-as-a-Judge evaluation pipeline using RAGAS in four steps. We will build it up in sequence, from environment setup through judge prompt design to score aggregation and output.
Start by setting up the evaluation environment. Prepare a Python environment with RAGAS and the LLM client for the judge (the SDK for whichever model you intend to use).
Next, format your evaluation data to match what RAGAS expects. The four core elements are: question, retrieved context, generated answer, and (if needed) ground truth. In RAGAS, these are passed as a dataset with each element as a column.
    from datasets import Dataset

    # One row per evaluated question; every column is a list of equal length.
    # Column names can differ between RAGAS releases, so check the version you install.
    data = {
        "question": ["What is the number of paid leave days granted under the work rules?"],
        # contexts holds a list of retrieved chunks for each question
        "contexts": [["10 days are granted after 6 months of employment..."]],
        "answer": ["10 days are granted after 6 months of employment."],
        "ground_truth": ["10 days granted after 6 months of continuous service."],
    }
    dataset = Dataset.from_dict(data)

The key point here is to put the actually retrieved context into contexts. If you pass idealized context rather than the actual retrieval results from production, the evaluation will not reflect retrieval quality. Once the input format is fixed, the next step is to design the judge itself.
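In practice the data usually arrives as one record per request rather than as ready-made columns. The sketch below pivots such records into the column layout used above; the helper name to_ragas_columns and the record shape are assumptions for illustration.

```python
def to_ragas_columns(records: list[dict]) -> dict[str, list]:
    """Pivot per-request records into column lists of equal length.

    Each record is assumed to carry "question", "contexts", and "answer",
    plus an optional "ground_truth" (empty string when absent).
    """
    columns: dict[str, list] = {
        "question": [],
        "contexts": [],
        "answer": [],
        "ground_truth": [],
    }
    for record in records:
        columns["question"].append(record["question"])
        columns["contexts"].append(list(record["contexts"]))
        columns["answer"].append(record["answer"])
        columns["ground_truth"].append(record.get("ground_truth", ""))
    return columns


logged = [
    {
        "question": "What is the number of paid leave days granted under the work rules?",
        "contexts": ["10 days are granted after 6 months of employment..."],
        "answer": "10 days are granted after 6 months of employment.",
    },
]
columns = to_ragas_columns(logged)
print(columns["contexts"])
```

The resulting dictionary can then be handed to Dataset.from_dict in the same way as the hand-written example.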
The quality of LLM-as-a-Judge is largely determined by the design of the prompt passed to the judge. When using RAGAS's standard metrics, you can rely on the internal prompts; however, if you want to score based on custom criteria, you will need to design your own judge prompt.
There are three key points in the design. First, write the scoring criteria in terms of concrete, specific aspects rather than vague adjectives. Instead of asking "Is this a good answer?", ask "Is each claim in the answer supported by the context?" Second, fix the output to a structured format such as a score or JSON so that results can be aggregated mechanically downstream. Third, having the judge also output its reasoning allows you to trace the cause of low scores later.
    You are a strict evaluator. Determine whether the following answer is based solely on the provided context.
    If the answer contains information not found in the context, set faithful=false.
    Output only in JSON format: {"faithful": true/false, "reason": "..."}

Whether to use a graduated scale such as 1–5 or a binary pass/fail depends on the use case. Graduated scoring makes it easier to track trends, while binary scoring makes it easier to automate pass/fail judgments.
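A custom judge needs two pieces of glue around a prompt like this: something that attaches the question, context, and answer, and something that turns the judge's reply into data. The sketch below shows one possible shape for both. The function names, the section labels in the prompt, and the tolerance for extra text around the JSON are all assumptions; the actual call to the judge model is left out because it depends on the SDK you use.

```python
import json

JUDGE_INSTRUCTIONS = (
    "You are a strict evaluator. Determine whether the following answer is "
    "based solely on the provided context.\n"
    "If the answer contains information not found in the context, set faithful=false.\n"
    'Output only in JSON format: {"faithful": true/false, "reason": "..."}'
)


def build_judge_prompt(question: str, contexts: list[str], answer: str) -> str:
    """Append the material to be judged below the fixed instructions."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(contexts, start=1))
    return (
        f"{JUDGE_INSTRUCTIONS}\n\n"
        f"Question:\n{question}\n\n"
        f"Context:\n{numbered}\n\n"
        f"Answer:\n{answer}"
    )


def parse_verdict(raw: str) -> dict:
    """Extract and validate the JSON verdict from raw judge output.

    Tolerates extra text before or after the JSON object, since models
    do not always follow "output only JSON" exactly.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(raw[start:end + 1])
    if not isinstance(verdict.get("faithful"), bool):
        raise ValueError("'faithful' must be true or false")
    return {"faithful": verdict["faithful"], "reason": str(verdict.get("reason", ""))}


prompt = build_judge_prompt("Q?", ["chunk one", "chunk two"], "An answer.")
verdict = parse_verdict('Verdict: {"faithful": false, "reason": "cites a figure not in context"}')
print(verdict)
```

Raising on malformed output, rather than silently defaulting to a pass, keeps judge formatting failures visible in the aggregated results.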
Once individual scoring is complete, aggregate the results across the entire test set and produce a report that humans can review and interpret. In RAGAS, calling evaluate allows you to compute scores for each metric across the entire dataset in one pass.
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_precision

    # Each metric calls the judge LLM internally; dataset is the Dataset built earlier.
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )
    print(result)

The aggregated output should not only show average scores per metric, but also make it easy to extract individual low-scoring cases. Looking at averages alone won't tell you which questions failed or why.
A report structured around per-metric averages, comparisons against thresholds, and a list of failing cases makes it easier to drive actionable improvements. By integrating this output into the CI/CD pipeline described later, evaluation can become a continuous process rather than a one-time exercise.
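A report with that structure can be assembled from per-question scores in a few lines. This is a sketch that assumes you already have one dictionary per question containing the question text and a score per metric; the function name build_report, the sample scores, and the thresholds are invented for illustration.

```python
from statistics import mean


def build_report(rows: list[dict], thresholds: dict[str, float]) -> dict:
    """Summarize per-question scores into averages, pass flags, and failures."""
    report = {"averages": {}, "passed": {}, "failing_cases": []}
    for metric, minimum in thresholds.items():
        average = mean(row[metric] for row in rows)
        report["averages"][metric] = round(average, 3)
        report["passed"][metric] = average >= minimum
    for row in rows:
        # List every metric on which this individual question falls short.
        low = [m for m, minimum in thresholds.items() if row[m] < minimum]
        if low:
            report["failing_cases"].append(
                {"question": row["question"], "low_metrics": low}
            )
    return report


# Invented per-question scores and thresholds.
rows = [
    {"question": "q1", "faithfulness": 0.9, "answer_relevancy": 0.8},
    {"question": "q2", "faithfulness": 0.5, "answer_relevancy": 0.9},
]
thresholds = {"faithfulness": 0.8, "answer_relevancy": 0.7}
report = build_report(rows, thresholds)
print(report)
```

The failing_cases list is what makes the report actionable: it points to the specific questions to inspect, together with the metrics on which they fell short.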
LLM-as-a-Judge is a powerful approach, but because the judge itself is an LLM, it comes with its own inherent pitfalls. This section covers three typical failure modes that can undermine trust in evaluation results, along with strategies to avoid them. These issues are specific to this technique, and overlooking them can lead to a situation where you are evaluating but quality never improves.
LLMs used as judges are known to exhibit several biases. The most common include a tendency to score longer answers more favorably, a preference for options presented first, and a self-evaluation bias where the model scores its own generated outputs more leniently.
The self-evaluation problem deserves particular attention. If the same model is used both to generate answers and to act as the judge, evaluations risk being overly lenient. Effective countermeasures include using a different model family as the judge, and scoring with multiple models and cross-checking the results.
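Cross-checking can be as simple as collecting a pass/fail verdict from each judge and flagging disagreement. The sketch below assumes the verdicts have already been obtained; the judge names and the tie-breaking rule (a tie counts as a fail) are illustrative choices.

```python
def cross_check(verdicts: dict[str, bool]) -> dict:
    """Combine pass/fail verdicts from several judges for one answer.

    `verdicts` maps a judge name to its verdict. A tie is treated as a
    fail, which is a deliberately conservative (and arbitrary) choice.
    """
    passes = sum(verdicts.values())
    fails = len(verdicts) - passes
    return {
        "majority_pass": passes > fails,
        "unanimous": passes == 0 or fails == 0,
    }


# Hypothetical judge names and verdicts.
result_a = cross_check({"judge_a": True, "judge_b": True, "judge_c": False})
result_b = cross_check({"judge_a": True, "judge_b": False})
print(result_a)
print(result_b)
```

Non-unanimous cases are good candidates for human review, since they are where an individual judge's bias is most likely to matter.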
While bias cannot be eliminated entirely, its impact can be reduced. Make the scoring criteria concrete to narrow the judge's discretion, periodically check whether answer length or ordering is influencing results, and at key checkpoints, compare against human evaluations to verify the judge's validity. Rather than trusting the judge blindly, it is important to treat the judge itself as something that must also be evaluated.
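Two of these checks are easy to automate once scores are collected: the correlation between answer length and judge score, and the agreement rate between judge and human labels. The sketch below uses plain Python; the function names and sample data are invented, and a strong correlation is only a hint to investigate, not proof of bias, because longer answers may genuinely be better.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def length_score_correlation(answers: list[str], scores: list[float]) -> float:
    """Correlation between answer length in words and judge score."""
    lengths = [float(len(answer.split())) for answer in answers]
    return pearson(lengths, scores)


def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the judge and the human reviewer agree."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Invented sample data.
answers = ["short", "a bit longer answer", "a much longer and more detailed answer here"]
scores = [2.0, 3.0, 5.0]
correlation = length_score_correlation(answers, scores)
agreement = agreement_rate([True, True, False, True], [True, False, False, True])
print(f"length/score correlation: {correlation:.2f}")
print(f"judge/human agreement:    {agreement:.2f}")
```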
When scores are unstable across repeated evaluations of the same answer, the cause is most often ambiguity in the judge prompt.
Instructions like "rate the quality of this answer out of 10" leave it entirely up to the judge to determine what warrants a high score, causing results to vary from run to run. Breaking the scoring criteria down into specific, concrete dimensions and having the judge assess each one individually significantly reduces this variability.
Practical techniques for improving stability include setting a low temperature to stabilize outputs, using few-shot examples to show the judge the scoring criteria and expected output format, and scoring the same case multiple times to verify consistency.
Before beginning large-scale evaluation, always verify with a small number of cases that "repeated scoring of the same input produces stable results." If this is unstable and you proceed to bulk evaluation anyway, it becomes impossible to tell whether differences in scores reflect differences in quality or simply noise from the judge.
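A stability check of this kind amounts to calling the judge several times on the same input and looking at the spread. In the sketch below the judge is any callable that takes a prompt and returns a numeric score; the stub judge only replays canned values so the example runs without an API, and all names are hypothetical.

```python
from statistics import pstdev
from typing import Callable


def score_spread(judge: Callable[[str], float], prompt: str, runs: int = 5) -> dict:
    """Score the same prompt repeatedly and report how much the scores vary."""
    scores = [judge(prompt) for _ in range(runs)]
    return {
        "scores": scores,
        "range": max(scores) - min(scores),
        "stdev": pstdev(scores),
    }


def make_stub_judge(outputs: list[float]) -> Callable[[str], float]:
    """Stand-in for a real judge call: replays canned scores in order."""
    iterator = iter(outputs)
    return lambda prompt: next(iterator)


stable = score_spread(make_stub_judge([4, 4, 4]), "same input", runs=3)
noisy = score_spread(make_stub_judge([2, 5, 3]), "same input", runs=3)
print(stable)
print(noisy)
```

If the range on identical input is comparable to the score differences you hope to detect between releases, the judge prompt needs tightening before bulk evaluation is worth running.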
Test set contamination (data contamination) refers to a state in which the data used for evaluation makes results appear artificially better than they actually are. In LLM-as-a-Judge, this typically occurs in two ways.
The first is when the questions or ground-truth answers from the test set inadvertently leak into the RAG pipeline—for example, as source documents or prompt examples. This amounts to giving the system questions it already knows the answers to, making it impossible to measure real-world performance. Evaluation data and the knowledge or examples provided to the RAG system must be kept strictly separate.
The second is when the test set becomes stale and diverges from production data. If the questions actually coming in from users have changed but evaluation is still based on old questions, scores may look good while the system fails in practice.
The countermeasures are to physically isolate evaluation data from the RAG knowledge base, and to regularly incorporate real-world failure cases discovered in production into the test set to keep evaluations current. Contaminated evaluation can sometimes be more dangerous than having no evaluation at all—because good numbers breed complacency.
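The first kind of leak can be partly caught with an automated check that looks for test questions or reference answers appearing verbatim in the knowledge base. The sketch below does a normalized substring match; the function names and sample documents are invented, and paraphrased leaks would need fuzzy or embedding-based matching that is not shown here.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    without_punctuation = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", without_punctuation).strip()


def find_leaks(test_items: list[str], documents: dict[str, str]) -> list[tuple[str, str]]:
    """Return (test item, document id) pairs where the item appears verbatim."""
    normalized_docs = {doc_id: normalize(body) for doc_id, body in documents.items()}
    leaks = []
    for item in test_items:
        needle = normalize(item)
        if not needle:
            continue
        for doc_id, body in normalized_docs.items():
            if needle in body:
                leaks.append((item, doc_id))
    return leaks


# Invented knowledge-base documents keyed by a hypothetical document id.
documents = {
    "faq.md": "Q: How many paid leave days are granted?  A: 10 days after 6 months.",
    "rules.md": "Paid leave is granted according to length of service.",
}
leaks = find_leaks(["How many paid leave days are granted?"], documents)
print(leaks)
```

Running a check like this whenever the knowledge base or the test set changes turns the separation rule into something enforced rather than assumed.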
A one-time evaluation has limited value. Because RAG quality fluctuates with data and model updates, a mechanism that automatically runs evaluations with each change is necessary. The overall design for integrating an evaluation pipeline into CI/CD for continuous evaluation is also covered in Enterprise RAG Implementation Guide for Production. Here, we outline the key points for running LLM-as-a-Judge as automated evaluation.
A typical continuous evaluation setup automatically triggers evaluations in response to changes in code, prompts, or data. If you are using GitHub, you can run evaluation scripts triggered by pull requests or merges into specific branches.
    name: rag-eval
    on:
      pull_request:
        branches: [main]
    jobs:
      evaluate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: pip install ragas datasets
          - run: python eval/run_ragas.py
            env:
              LLM_API_KEY: ${{ secrets.LLM_API_KEY }}

The API key for the judge LLM is passed securely as a repository secret. Since evaluations incur API costs, it is practical to control frequency, for example by running per pull request rather than on every commit. When the test set is large, separating a lightweight evaluation that covers only the relevant changes from a periodic full evaluation allows you to balance cost and coverage.
The final key to making automated evaluation valuable is managing thresholds (pass/fail criteria) and alerts. Simply recording scores is not enough—you will not notice when quality degrades.
First, define pass/fail thresholds for each metric and configure CI to fail when a score falls below them. For example, a pull request where faithfulness drops below the baseline would be blocked from merging. This prevents changes that degrade quality from reaching production.
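A gate like this boils down to comparing the current scores with a stored baseline and returning a non-zero exit code on regression. The sketch below assumes both sets of scores are already loaded as dictionaries; the function name gate and the 0.02 tolerance are illustrative, and how you persist the baseline is up to your pipeline.

```python
def gate(current: dict[str, float], baseline: dict[str, float], tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 if no metric regressed, 1 otherwise.

    A metric regresses when it drops more than `tolerance` below its
    baseline. A metric missing from `current` is treated as 0.0.
    """
    regressions = {
        metric: (expected, current.get(metric, 0.0))
        for metric, expected in baseline.items()
        if current.get(metric, 0.0) < expected - tolerance
    }
    for metric, (expected, actual) in sorted(regressions.items()):
        print(f"FAIL {metric}: {actual:.3f} is below baseline {expected:.3f}")
    return 1 if regressions else 0


# Invented scores for illustration.
baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85}
exit_code = gate({"faithfulness": 0.80, "answer_relevancy": 0.86}, baseline)
print("exit code:", exit_code)
# An evaluation script run in CI could finish with sys.exit(exit_code)
# so that the job fails and the merge is blocked.
```

The small tolerance absorbs judge noise so that run-to-run jitter does not block unrelated changes.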
Start with fixed thresholds and adjust them based on real-world conditions. Thresholds that are too strict will block legitimate changes, while thresholds that are too lenient will allow degradation to go unnoticed. Review historical score trends and calibrate the thresholds to a level appropriate for the risk profile of your use case.
In addition to CI failure notifications, it is advisable to have a mechanism that periodically samples production scores to detect degradation. By running evaluation as both a "pre-release gate" and an "in-production monitor," you can maintain RAG quality on an ongoing basis. LLM-as-a-Judge is the core technique that enables this automation.
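The in-production monitor can be sketched as two small pieces: drawing a random sample of logged interactions to score, and comparing the mean of the sampled scores against a baseline. The function names, sample size, and allowed drop below are placeholders, and the scoring of the sample itself (for example with the judge pipeline above) is omitted.

```python
import random
from statistics import mean
from typing import Optional


def sample_for_scoring(records: list, k: int, seed: Optional[int] = None) -> list:
    """Pick up to k logged interactions at random for judge scoring."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def is_degraded(recent_scores: list[float], baseline: float, max_drop: float = 0.05) -> bool:
    """True when the sampled mean falls more than max_drop below the baseline."""
    return mean(recent_scores) < baseline - max_drop


# Stand-ins for logged production interactions.
logged_ids = list(range(100))
sample = sample_for_scoring(logged_ids, k=10, seed=0)
alert = is_degraded([0.70, 0.72, 0.68], baseline=0.85)
print(len(sample), alert)
```

An alert from a check like this is a prompt to look at recent failures and feed them back into the test set, closing the loop between monitoring and the pre-release gate.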
Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.