
Improving RAG accuracy refers to a set of techniques that go beyond relying solely on vector search, combining full-text search (BM25), knowledge graphs, and reranking to suppress hallucinations and raise answer quality to a production-ready level. A RAG system that performed well enough during PoC suddenly falls apart when faced with the volume of data and diversity of questions in production — this is a wall that nearly every RAG developer hits. This article is aimed at developers and AI practitioners who have completed an initial RAG build, and explains — following the improvement sequence of "evaluation metric design → retrieval improvement → generation control" — what to work on and in what order to improve accuracy, at the implementation level. Rather than blindly adding components, the approach presented here is to build incrementally while verifying the effect of each change with concrete numbers.
The primary reason a RAG system that achieved high accuracy during PoC breaks down in production is not the intelligence of the model, but that "retrieval is failing to surface the necessary documents." This section first organizes the structural reasons behind stagnating accuracy from two perspectives: the limitations of vector search and the sources of hallucination.
Vector search excels at capturing semantic similarity between sentences, but tends to miss cases where an exact keyword match is required. The most common examples are part numbers, product codes, personal names, regulation numbers, and organization-specific abbreviations — information where the string itself matters more than its meaning. For instance, given the query "specifications for part number ABC-1200," an embedding vector may judge "ABC-1100" or "specifications for a similar product" as semantically close, causing the chunk containing the exact part number to fail to rank at the top.
Another pitfall is vocabulary mismatch between the query and the document. A user may write "retirement allowance" while the internal policy document says "retirement benefit." The meaning is the same, but the wording differs, and depending on the embedding model's training data the similarity score may stay low. Combining vector search with a full-text method such as BM25, described later, can narrow this gap when the two phrasings share a term (here, "retirement"), because the keyword match still pulls the document into the candidate set. Neither semantic search nor keyword search is superior to the other. In practice they are complementary, because each misses a different type of information.
Hallucination in RAG (generation not grounded in facts) is more often caused by steps upstream of the generation model than by the model's own tendency to "make things up." When broken down in practice, the causes generally fall into three categories.
The first is retrieval failure. If the document containing the basis for an answer is not in the retrieval results in the first place, the model will try to fill the gap with knowledge from pretraining, producing plausible-sounding errors. The second is insufficient context despite successful retrieval. If chunks are too small and the surrounding context is cut off, or only part of a table is passed to the model, the model will fill in the missing information with guesswork. The third is insufficient prompt control. If grounding instructions such as "if the answer is not found in the retrieved results, respond with 'I don't know'" are weak, the model will generate an answer even without a basis for it.
The critical point is that the remedies for these three causes are entirely different. Retrieval failure calls for hybrid search or reranking; insufficient context calls for chunk design improvements; insufficient control calls for prompt and output constraints — unless you identify which category applies before taking action, you will waste time on misguided fixes. Rather than lumping "too many hallucinations" together as a single problem, classifying the cause is the starting point for improvement.
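For the third cause, the grounding instruction can be made explicit in the prompt itself. The following is a minimal sketch rather than a recommended template: the function name `build_grounded_prompt`, the wording of the instructions, and the `[n]` citation convention are all assumptions made for illustration.

```python
from typing import Sequence

# Assumption: the exact refusal phrase is arbitrary; what matters is that it is fixed.
REFUSAL = "I don't know"


def build_grounded_prompt(question: str, chunks: Sequence[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved chunks."""
    # Number each chunk so the model can point at the passage it relied on.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n"
        f"If the context does not contain the answer, reply exactly: {REFUSAL}\n"
        "Cite the passage numbers you used, for example [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Keeping the refusal phrase fixed also makes it easy to count refusals in logs, which helps separate "the model declined" from "the model guessed" during evaluation.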
The first step in improving accuracy is not improving retrieval or generation — it is making things measurable. Without the ability to track numerically how much each fix improves the system, improvement becomes a matter of intuition. This section explains how to design evaluation metrics with production operation in mind, and how to combine automated evaluation with qualitative evaluation.
In RAG evaluation, the standard practice is to measure generation and retrieval separately. The three representative metrics below each detect different types of problems.
| Metric | What It Measures | Low Score Suggests |
|---|---|---|
| Faithfulness | Whether the answer is grounded in the retrieved results | Hallucination / insufficient control |
| Answer Relevancy | Whether the answer addresses the question | Verbose or off-target generation |
| Context Recall | Whether the necessary documents were retrieved | Retrieval misses |
Faithfulness indicates the proportion of claims in the answer that are supported by the retrieved context. A low score is a sign that the model is speaking without grounding. Answer Relevancy measures how well the answer hits the mark relative to the question. Even if the context is correct, a verbose or vague answer will lower this score. Context Recall shows how much of the information needed for the correct answer was captured at the retrieval stage, directly reflecting the quality of the retrieval component.
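As a rough illustration of the definitions above, both Faithfulness and Context Recall reduce to a proportion once each claim or reference fact has been judged. The sketch below assumes the per-item verdicts already exist (from a human reviewer or an LLM judge). The function names and the choice to return 0.0 for an empty list are assumptions, and this is not how any particular framework implements the metrics.

```python
from typing import Sequence


def proportion(flags: Sequence[bool]) -> float:
    """Share of True values; 0.0 for an empty list (an assumption, not a standard)."""
    return sum(flags) / len(flags) if flags else 0.0


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """claim_supported[i] is True if claim i in the answer is backed by the context."""
    return proportion(claim_supported)


def context_recall(fact_retrieved: Sequence[bool]) -> float:
    """fact_retrieved[i] is True if reference fact i appears in the retrieved context."""
    return proportion(fact_retrieved)


# Four claims in the answer, three of them supported by the retrieved chunks.
print(faithfulness([True, True, False, True]))  # 0.75
```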
Examining these three metrics separately lets you tell whether the problem lies in generation or in retrieval. If Faithfulness is high but Context Recall is low, the focus for improvement is on the retrieval side, not generation. The key is to read the metrics by root cause rather than collapsing them into a single composite score.
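Reading the metrics by root cause can be written down as a simple rule set. This is a sketch under assumptions: the 0.7 threshold is a hypothetical placeholder to be calibrated on your own evaluation dataset, and the labels are illustrative.

```python
from typing import List


def diagnose(
    faithfulness: float,
    answer_relevancy: float,
    context_recall: float,
    threshold: float = 0.7,  # hypothetical cut-off; calibrate on your own data
) -> List[str]:
    """Map low metric scores to the component most likely responsible."""
    suspects: List[str] = []
    if context_recall < threshold:
        suspects.append("retrieval")
    if faithfulness < threshold:
        suspects.append("grounding / prompt control")
    if answer_relevancy < threshold:
        suspects.append("generation (verbose or off-target)")
    return suspects


# High Faithfulness but low Context Recall points at the retrieval side.
print(diagnose(faithfulness=0.9, answer_relevancy=0.85, context_recall=0.4))
```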
Manually scoring the above metrics every time is not practical. This is where automated evaluation frameworks, most notably RAGAS, come in. These frameworks take a question, the retrieved context, and the generated answer (along with a reference answer when needed) as input, and use an LLM as a judge (LLM-as-a-judge) to compute scores for each metric.
A useful starting point for operations is to build an evaluation dataset (golden set) consisting of roughly 50–100 questions representative of expected production queries. Deliberately include representative questions, edge cases, and questions the system has previously failed to answer well. By running this dataset every time you change the retrieval method or chunk size, you can quantitatively determine whether a change is an improvement or a regression.
That said, automated scoring by an LLM is not infallible. Scores vary with the scoring model and the scoring prompt, so comparisons are only meaningful when both are held fixed. It is safer to track relative changes before and after an intervention than to rely on absolute values. Where a score looks unusually high or low, manually inspect a few actual responses to avoid over-trusting automated evaluation.
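Tracking relative change on the golden set can be as simple as comparing per-question scores from two runs. The sketch below assumes each run is a mapping from a question ID to one metric score produced under identical judging conditions. The function name `compare_runs` and the 0.05 tolerance are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Union


def compare_runs(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,  # hypothetical noise margin for LLM-judge scores
) -> Dict[str, Union[float, List[str]]]:
    """Compare two runs over the same golden set, question by question."""
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = {q: candidate[q] - baseline[q] for q in shared}
    return {
        # Overall trend across the golden set.
        "mean_delta": mean(deltas.values()) if deltas else 0.0,
        # Questions that got clearly worse and deserve a manual look.
        "regressions": [q for q in shared if deltas[q] < -tolerance],
    }
```

Listing the regressed questions, not just the mean, is what makes the manual spot check practical: a change can raise the average while breaking a handful of important queries.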
Even if automated evaluation scores improve, that does not necessarily mean the user experience has gotten better. Qualities that are difficult to capture numerically—such as clarity of response, tone, and appropriate level of detail—need to be supplemented with qualitative human evaluation.
What tends to work well in practice is a routine of sampling several dozen logs from actual usage on a regular cadence (e.g., weekly) and having a team member review them for quality. Rather than simply marking responses as pass/fail, classifying poor responses by cause—"was it a retrieval failure, a generation failure, or an ambiguous question?"—naturally surfaces what should be improved next.
Another effective approach is to collect user feedback (👍/👎 buttons or free-text comments) and prioritize questions marked with 👎 for inclusion in the evaluation dataset. Using quantitative metrics to monitor overall trends while using qualitative evaluation to dig into individual failures—this two-pronged approach helps avoid the situation where you chase scores while drifting away from what users actually experience. Think of evaluation not as something you build once and finish, but as an asset that grows by absorbing failure cases over time.
Hybrid search is an approach that combines keyword search (BM25) with vector search, integrating the scores from both to determine the final search results. For many RAG systems, it is the most cost-effective single step for improving retrieval accuracy. This section covers the mechanics of score fusion and two key implementation considerations: chunk design and multilingual support.
BM25 (a classical full-text ranking function based on term frequency and inverse document frequency) and vector search score documents on entirely different scales. BM25 scores have no fixed upper bound, while cosine similarity is bounded between -1 and 1 and, for typical text embeddings, usually falls between 0 and 1. Simply adding the two scores produces meaningless results, because the larger scale dominates the other.
The widely used solution is RRF (Reciprocal Rank Fusion). RRF discards the absolute score values and uses only the rank assigned by each retrieval method. Specifically, for each document it computes 1 / (k + rank) for each retrieval method and sums those values to determine the final ranking (k is a constant that dampens the influence of top-ranked results, and a value around 60 is commonly used).
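The RRF calculation described above fits in a few lines. This is a minimal sketch: the function name `rrf_fuse` and the document IDs are made up for illustration, and ties are broken by document ID only to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; a document missing from a list contributes nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID (an illustrative choice).
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["d1", "d2", "d3"]      # hypothetical BM25 result order
vector_ranking = ["d3", "d1", "d4"]    # hypothetical vector search result order
print(rrf_fuse([bm25_ranking, vector_ranking]))  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (`d1`, `d3`) rise above those found by only one method, which is the behavior that makes RRF a reasonable default.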
| Fusion Method | Basis for Fusion | Characteristics |
|---|---|---|
| Raw score addition | Simple sum of scores | Sensitive to scale differences; difficult to tune |
| Weighted sum | Weighted combination after normalization | High flexibility but requires tuning |
| RRF | Sum of reciprocal ranks | Scale-independent; simple to implement and stable |
RRF has few parameters and is robust regardless of whether vector search or BM25 dominates for a given dataset, making it the easiest fusion method to start with. The pragmatic approach is to first establish a baseline with RRF, then add weighting or reranking if the evaluation dataset reveals it is insufficient—in that order.
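If the RRF baseline turns out to be insufficient, the weighted-sum row of the table can be sketched as follows. Min-max normalization is only one possible choice, and the names `min_max`, `weighted_fuse`, and the weight `alpha` are assumptions for illustration.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the range 0..1 so that different scales become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong (an assumption).
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fuse(
    bm25: Dict[str, float], vector: Dict[str, float], alpha: float = 0.5
) -> Dict[str, float]:
    """alpha weights BM25; (1 - alpha) weights vector search."""
    b, v = min_max(bm25), min_max(vector)
    # A document missing from one result set scores 0.0 on that side.
    return {
        doc_id: alpha * b.get(doc_id, 0.0) + (1 - alpha) * v.get(doc_id, 0.0)
        for doc_id in b.keys() | v.keys()
    }
```

Unlike RRF, this introduces `alpha` as a parameter that has to be tuned against the evaluation dataset, which is the flexibility-versus-effort trade-off noted in the table.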
Chunk design (the unit into which documents are divided) sets the ceiling on retrieval quality. If chunks are too large, irrelevant information mixes in and creates noise; if they are too small, context is severed and meaning breaks down. There is no universal number, but a practical starting point is roughly 300–800 tokens per chunk, adjusted using an evaluation dataset.
Introducing overlap between adjacent chunks can mitigate the problem of information being cut off at sentence or paragraph boundaries. An overlap of around 10–20% serves as a general guideline. However, increasing overlap causes the index to grow larger and results in the same content appearing in multiple hits, making search results redundant—so increasing it indiscriminately should be avoided.
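A sliding window with overlap can be sketched as below. The example assumes the text has already been tokenized into a list. The defaults (500 tokens with 75 overlapping, or 15%) are simply picked from within the ranges mentioned above and are not recommendations.

```python
from typing import List


def chunk_with_overlap(
    tokens: List[str], size: int = 500, overlap: int = 75
) -> List[List[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once the window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks


# Whitespace splitting stands in for a real tokenizer here (an assumption).
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, size=4, overlap=1):
    print(chunk)
```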
A more effective approach is to split documents along their structure—headings, paragraphs, and bullet points—rather than cutting mechanically by character count. Techniques such as keeping tables intact within a single chunk and prepending headings to chunk beginnings to preserve context can improve search hit quality even with the same token count. Chunk design should not be treated as a one-time decision; it should be continuously revisited by examining failed responses.
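Structure-aware splitting with heading prepending can be sketched as follows. The example assumes Markdown-style input, where headings start with `#` and blocks are separated by blank lines. Real documents (PDF, HTML, Word) would need their own structure extraction first, and the function name `split_by_structure` is hypothetical.

```python
from typing import List


def split_by_structure(markdown_text: str) -> List[str]:
    """Split on headings and blank lines, prepending the current heading to each chunk."""
    chunks: List[str] = []
    heading = ""
    block: List[str] = []

    def flush() -> None:
        # Emit the accumulated block together with the heading it belongs to.
        if block:
            body = "\n".join(block)
            chunks.append(f"{heading}\n{body}" if heading else body)
            block.clear()

    for line in markdown_text.splitlines():
        if line.startswith("#"):
            flush()
            heading = line.strip()
        elif not line.strip():
            flush()
        else:
            # Consecutive lines (including table rows) stay in the same chunk.
            block.append(line)
    flush()
    return chunks
```

Because table rows are consecutive lines without blank lines between them, a table remains inside a single chunk. A size limit on top of this, for blocks that exceed the token budget, is left out of the sketch.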
Compared to Japanese and English, low-resource languages such as Lao and Thai have several pitfalls that tend to degrade RAG accuracy. The first is tokenization and word segmentation. Because Thai and Lao do not place spaces between words, incorrect segmentation breaks down the units of retrieval, making BM25 keyword matching less effective. Whether or not language-appropriate segmentation processing is applied upstream makes a significant difference in results.
The second is language coverage of embedding models. Even embedding models that claim multilingual support often have a small proportion of Southeast Asian languages in their training data, meaning semantic search accuracy may not match that of major languages. Before deployment, search quality must always be verified using real data in the target language.
The third is mixed-script text. Documents in the field often contain technical terms written in English, or multiple languages mixed within a single sentence. This is precisely where hybrid search combining BM25—which excels at keyword matching—proves effective. The lower the resource level of the language, the greater the benefit of a hybrid configuration over semantic search alone. It is important not to carry over assumptions built for major languages without question.
Graph RAG is a technique that not only vectorizes documents but also maintains entities (people, organizations, products, etc.) and their relationships as a knowledge graph, traversing those relationships to construct answers. While it excels at questions that span multiple documents, its implementation cost is also high. The following clarifies when the investment is worthwhile.
Standard RAG retrieves individual chunks that are close to the query. As a result, it struggles with questions that require integrating relationships across multiple documents. For example, a question such as "Which departments were involved in defects with Product A, and what similar cases have those departments handled in the past?" requires traversing the relationships among multiple entities—product, defect, department, and case—and a simple similarity search can only gather fragmented information.
Graph RAG extracts entities and relationships from documents to build a knowledge graph, and at retrieval time traverses that graph to retrieve scattered information together with its relational context. As a result, response quality tends to improve for cross-cutting and bird's-eye questions such as "Summarize the overall picture" or "Explain the relationship between A and B."
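The traversal step can be illustrated with a toy graph. The sketch below assumes entity and relation extraction has already produced (subject, relation, object) triples. The entities and relation names are invented for illustration, and a production system would typically use a graph database rather than an in-memory dictionary.

```python
from collections import deque
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_index(triples: List[Triple]) -> Dict[str, List[Triple]]:
    """Index every triple under both of its endpoints so edges can be followed either way."""
    index: Dict[str, List[Triple]] = {}
    for triple in triples:
        subject, _, obj = triple
        index.setdefault(subject, []).append(triple)
        index.setdefault(obj, []).append(triple)
    return index


def traverse(index: Dict[str, List[Triple]], start: str, max_hops: int = 2) -> List[Triple]:
    """Breadth-first walk that collects the triples reachable within max_hops of start."""
    seen = {start}
    facts: List[Triple] = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_hops:
            continue
        for triple in index.get(node, []):
            if triple not in facts:
                facts.append(triple)
            subject, _, obj = triple
            other = obj if subject == node else subject
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    return facts


# Hypothetical triples mirroring the "Product A" question above.
triples = [
    ("Defect-17", "affects", "Product A"),
    ("QA Dept", "handled", "Defect-17"),
    ("QA Dept", "handled", "Case-2021-03"),
    ("Sales Dept", "handled", "Case-2019-11"),
]
for fact in traverse(build_index(triples), "Product A", max_hops=3):
    print(fact)
```

The collected triples would then be passed to the generation model as relational context, alongside or instead of raw chunks.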
Conversely, for fact-checking questions that are self-contained within a single document (e.g., "What is the expiration date of this regulation?"), the benefit of graph construction is minimal. It is useful to remember that for Graph RAG, the key determinant of effectiveness is whether relationship traversal is actually required.
The drawback of Graph RAG is cost. An additional step is required to accurately extract entities and relationships from documents (often performed using an LLM), and both extraction cost and construction time scale with document volume. Schema design for the graph and the workflow for rebuilding the graph when documents are updated also add operational overhead.
For this reason, jumping straight to Graph RAG from the outset is often not the best approach. From a cost-effectiveness perspective, the practical sequence is to first establish a solid foundation with hybrid search and reranking, and only then—once accuracy has plateaued on "relationship traversal questions" even with that foundation—consider Graph RAG in a targeted manner.
Useful criteria for the decision include: (1) the proportion of cross-cutting and relational questions among all queries, (2) document update frequency (high frequency makes reconstruction costs heavy), and (3) whether sufficient accuracy in entity extraction can be guaranteed. Attempting full-scale deployment before these conditions are met tends to result in a situation where construction and maintenance costs outweigh accuracy gains. Since concrete costs depend heavily on document volume, model pricing, and operational structure, it is recommended to conduct a PoC on a small subset to measure both effectiveness and effort before making a final decision.
Teams whose RAG accuracy fails to improve often turn out to have skipped the same steps. Here we focus on two particularly frequent failures, skipping reranking and index staleness, and present the cause and countermeasure for each.
When accuracy plateaus even after implementing hybrid search, the cause is often the omission of reranking. BM25 and vector search play the role of quickly gathering "candidates," and there is no guarantee that the most relevant documents will appear at the top.
Reranking is the process of rescoring the top 20–50 candidates retrieved by search using a more precise model (such as a cross-encoder that evaluates query-document pairs together), pushing the truly relevant documents to the top. If the first-stage retrieval is "broad and fast," reranking handles "narrow and accurate." This two-stage approach significantly improves the quality of the context ultimately passed to generation.
A common pitfall is a setup that passes the top 5 search results directly to generation. The chunk containing the correct answer may have been among the top 20 candidates, but it is lost because the results are cut to the top 5 without reranking. The countermeasure is straightforward: retrieve a larger pool of candidates in search (recall-focused), then secure precision through reranking (precision-focused) before passing the results to generation. The computational cost increases, but because this step contributes greatly to accuracy, skipping it should be a last resort.
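The two-stage flow can be sketched with a pluggable scoring function. In the example below, `overlap_score` is a deliberately crude stand-in for a cross-encoder, used only so the sketch runs without external models. The function names and the sample documents are assumptions.

```python
from typing import Callable, List, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> List[str]:
    """Rescore first-stage candidates with a more precise scorer and keep the best top_n."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, doc: str) -> float:
    """Toy scorer: share of query words found in the document (NOT a cross-encoder)."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Hypothetical first-stage output: 20 candidates with the right chunk at position 13.
query = "specifications for part number ABC-1200"
candidates = [f"general note {i}" for i in range(20)]
candidates[12] = "specifications for part number ABC-1200"

print(candidates[:5])                                   # naive top 5 misses the answer
print(rerank(query, candidates, overlap_score)[0])      # reranking surfaces it
```

In a real pipeline the `score` argument would wrap a cross-encoder or a hosted reranking API, and the size of the candidate pool becomes a latency-versus-recall knob.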
Accuracy issues stem not only from search algorithms, but also from data freshness. If the source documents have been updated but the index has not kept pace, RAG will confidently return "outdated information." When handling information that is expected to change—such as prices, specifications, regulations, or inventory—this is a failure with significant real-world consequences.
The countermeasure lies in designing an update pipeline suited to the nature of each document. For frequently updated data, set up a mechanism that detects changes and re-indexes only the affected chunks incrementally. Rebuilding everything from scratch each time is costly and tends to cause update delays.
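One way to detect which chunks need re-indexing is to compare content hashes. The sketch below assumes the index stores a hash per chunk ID and that chunk IDs are stable across document versions. Both are assumptions about your pipeline, and `plan_reindex` is a hypothetical name.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    indexed_hashes: Dict[str, str], current_chunks: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (chunk IDs to embed and upsert, chunk IDs to delete from the index)."""
    to_upsert = sorted(
        chunk_id
        for chunk_id, text in current_chunks.items()
        # New chunks have no stored hash; changed chunks have a different one.
        if indexed_hashes.get(chunk_id) != content_hash(text)
    )
    to_delete = sorted(
        chunk_id for chunk_id in indexed_hashes if chunk_id not in current_chunks
    )
    return to_upsert, to_delete
```

Only the chunks in `to_upsert` need to be re-embedded, which keeps the update cost proportional to what actually changed rather than to the size of the corpus.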
In addition, adopting a practice of including the source (document name and update date) alongside answers allows users to judge the freshness of the information, reducing the risk of taking outdated information at face value. A setup that cannot include "as of what point in time" in its answers is precarious for production use. Index freshness should be regarded as a factor that affects operational reliability just as much as the search algorithm itself.
Here we provide decision-making guidance on two questions that are especially common among developers and AI practitioners working on RAG accuracy improvement.
To state the conclusion first: in most cases, you should try improving RAG first. Fine-tuning (a technique that additionally trains the model itself) is effective for adjusting "behavior"—such as fixing writing style or format, and handling specialized terminology—but it is not suited to solving the problem of "correctly answering with up-to-date facts." Factual freshness and the presentation of evidence are roles for the RAG side.
First, establish evaluation metrics and improve retrieval quality through hybrid search, reranking, and chunk design. If issues with response tone or format remain even after that, then consider fine-tuning—this order offers the best balance of cost and effectiveness. The two are not competing approaches; it is more accurate to view them as complementary tools that solve different problems.
While response tendencies vary across generative models, the single greatest factor determining RAG accuracy is retrieval quality, not model selection. No matter how capable a model is, it cannot answer correctly if the supporting documents are absent from the retrieval results.
That said, differences between models do emerge in areas such as instruction adherence (how reliably a model follows directives like "do not answer if the information is not in the retrieved results"), handling of long contexts, and output consistency. In practice, the most reliable approach is to compare multiple models under identical conditions using your own evaluation dataset and select based on scores for Faithfulness and Answer Relevancy. Focusing on refining retrieval while keeping the model fixed yields a better return on investment for accuracy improvements. Note that generative models are updated rapidly, so it is advisable to re-evaluate the latest version available at the time of selection.
Improving RAG accuracy is not a matter of adding components on a whim. It is an improvement process with a defined order.
By following this sequence of "evaluation → retrieval improvement → generation control," you can build accuracy incrementally, verifying the effect of each measure with concrete numbers rather than relying on intuition. In our production RAG deployment support, we have found that following this sequence—though it may appear roundabout—is the most reliable path. If you are experiencing challenges with RAG accuracy, we recommend starting by establishing an evaluation foundation.
Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.