
Your RAG PoC worked. But the moment you deployed to production, accuracy dropped and users called it "unusable." If that pattern sounds familiar, this article was written to close that gap. We'll walk through, step by step, how to boost retrieval accuracy by 15–30% using Hybrid search that combines dense vector search with BM25, and how to build a dynamic retrieval pipeline with Agentic RAG that adapts to query intent, including implementation patterns and evaluation metric design. Drawing on the low-resource-language challenges we encountered when building a Lao-language RAG chatbot, we'll organize the key design decisions needed to move from PoC to production.

There is a fundamental difference between PoC environments and production environments in terms of data quality, quantity, and diversity. In a PoC, validation is performed using "data that works well," whereas in production, a large volume of unexpected queries comes in. This section breaks down the structure of the accuracy gap.
The first is data homogeneity. In a PoC, you test with tens to hundreds of pre-formatted internal documents. In production, however, tens of thousands of documents pour in with varying formats and tones — PDFs, Markdown, HTML, meeting minutes, chat logs, and more. A misaligned chunking boundary alone can cause the context needed for an answer to be lost.
The second is query diversity. PoC test queries tend to be skewed toward "questions with known correct answers" written by developers. Production users throw in ambiguous questions, compound questions, and even questions that fall outside the scope of RAG.
The third is the absence of evaluation. In a PoC, things can pass with a vague sense that they seem good enough, but in production, without quantitative accuracy metrics, you won't notice degradation. When our company built a Lao-language RAG system, we verified accuracy during the PoC using only translation tests on English documents — but the actual queries from Lao-speaking users contained colloquial expressions and dialects, making them an entirely different beast from the PoC scores.
Accuracy degradation occurs at three major layers.
At the retrieval layer, the limitations of Dense search alone become apparent. Technical terms and proper nouns may not be placed in proximity within the Embedding space, causing semantically relevant documents to be missed. Conversely, there are cases where chunks that are superficially similar but contextually different are returned at the top of results.
At the chunking layer, information fragmentation caused by fixed-length splitting becomes a problem. Chunks are cut off in the middle of tables or lists, producing chunks whose meaning cannot be understood without the surrounding context.
At the generation layer, passing low-quality retrieval results to the LLM increases hallucinations and off-target responses. No matter how high-performing the LLM is, if retrieval accuracy is poor, the output will not improve. RAG accuracy is determined first and foremost by retrieval.

This section outlines the environment and background knowledge required to implement the steps in this article.
| Component | Recommended Technology | Role |
|---|---|---|
| Vector DB | Qdrant / Weaviate / pgvector | Storage and retrieval of dense vectors |
| Full-text search | Elasticsearch / OpenSearch / pgvector + pg_bigm | BM25 scoring |
| Embedding model | OpenAI text-embedding-3-large / Cohere embed-v4 / multilingual models | Text → vector conversion |
| LLM | Claude / GPT | Answer generation / agent reasoning |
| Orchestration | LangChain / LlamaIndex / Mastra | Pipeline control |
| Evaluation | Ragas / DeepEval | Automated evaluation pipeline |
A pgvector + PostgreSQL setup is well-suited for starting small. In our own Lao-language RAG implementation, we took the approach of starting with pgvector first, then considering migration to a dedicated vector DB as scale demands grew.
This article assumes working knowledge of the basic RAG pipeline: embedding generation, vector search, and prompt-based answer generation. If you are unsure about any of these, it is recommended to first build a solid foundation with the Lao Language AI Chatbot RAG Guide.

Chunking is the foundation of RAG accuracy. Move beyond the PoC mindset of "just split at 500 tokens for now" and design a strategy tailored to the structure of your documents.
There is no universal correct answer for chunk size. However, there are practical guidelines.
In the author's experience, the most efficient approach was to start with 400 tokens and an overlap of 50 tokens, then adjust while monitoring evaluation metrics. I have seen projects waste two weeks trying to determine optimal values upfront, but tuning without an evaluation pipeline in place is a waste of time.
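As a concrete starting point, the 400-token / 50-token-overlap baseline described above can be sketched as a simple sliding-window splitter. This is a minimal illustration, assuming tokenization has already been done upstream; the function name `chunk_text` is ours, not from any particular library.

```python
def chunk_text(tokens: list[str], size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split a token sequence into fixed-size chunks with overlap.

    Starting point from the article: size=400, overlap=50, then tune
    against evaluation metrics rather than guessing upfront.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

Each chunk after the first repeats the final 50 tokens of its predecessor, which keeps sentences that straddle a boundary retrievable from at least one chunk.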
Metadata assignment to chunks is what largely determines the accuracy of production RAG.
```json
{
  "chunk_id": "doc-123-chunk-5",
  "source": "internal_regulations_v3.2.pdf",
  "section": "Chapter 4: Information Security",
  "page": 12,
  "last_updated": "2026-01-15",
  "department": "Information Systems",
  "document_type": "policy"
}
```

Implementing metadata-based pre-filtering allows you to narrow down the search scope before executing Dense/BM25 retrieval. For a query like "the latest internal regulations on information security," filtering by `document_type = "policy"` and `section LIKE '%Security%'` before searching yields dramatically higher accuracy than searching across all chunks.
When we built a Lao-language RAG chatbot, the biggest challenge in chunking was word boundary detection. Like Thai, Lao does not use spaces to separate words, and tokenizers designed for English would cut chunk boundaries in the middle of sentences.
Ultimately, we implemented a custom splitter that uses Lao sentence delimiters (the equivalent of 。) as the primary split point, with byte length as the secondary split. This single change improved retrieval accuracy (Hit Rate@5) from 0.42 to 0.61. For low-resource languages, using a general-purpose chunking library out of the box will not yield good accuracy. Language-specific preprocessing is a prerequisite for production-quality results.
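The two-stage strategy described above can be sketched as follows. This is a simplified illustration of the approach, not our production splitter: the delimiter set is passed in as a parameter because identifying the right Lao sentence delimiters was precisely the empirical part of the work, and the byte budget (`max_bytes`) is a hypothetical value.

```python
def split_sentences_then_pack(text: str, delimiters: set[str], max_bytes: int = 1200) -> list[str]:
    """Primary split on sentence delimiters, secondary split by UTF-8 byte length."""
    # Stage 1: cut on language-specific sentence-ending characters.
    sentences, buf = [], ""
    for ch in text:
        buf += ch
        if ch in delimiters:
            sentences.append(buf)
            buf = ""
    if buf:
        sentences.append(buf)

    # Stage 2: pack whole sentences into chunks under a byte budget,
    # so chunk boundaries never fall mid-sentence. A single sentence
    # longer than max_bytes is kept whole here (handle separately if needed).
    chunks, current = [], ""
    for s in sentences:
        if current and len((current + s).encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = s
        else:
            current += s
    if current:
        chunks.append(current)
    return chunks
```

The test below uses ASCII `"."` as the delimiter purely for illustration; in practice the delimiter set holds the Lao sentence-ending characters.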

Dense vector search excels at semantic similarity but struggles with exact matches for proper nouns and keywords. BM25 is the opposite. Hybrid search, which combines both, has been reported in multiple benchmarks to improve accuracy by 15–30% over single-method search.
| Characteristic | Dense (Vector Search) | BM25 (Full-Text Search) |
|---|---|---|
| Semantic similarity | Strong | Weak |
| Keyword matching | Weak | Strong |
| Technical terminology | Depends on embedding quality | Reliable with exact match |
| Multilingual support | Supported via multilingual embedding models | Requires per-language analyzers |
| Scalability | Fast with ANN | Fast with inverted index |
In practice, the greatest benefit is felt when users submit queries containing unique identifiers such as product codes or article numbers. Dense search alone may miss queries like "specifications for product code ABC-1234," whereas BM25 can match them precisely. On the other hand, Dense search excels at handling vague queries such as "documents similar to last year's security incident response policy."
There are two main methods for score integration in Hybrid Search.
Reciprocal Rank Fusion (RRF) integrates scores using only the rank of each search result. It requires no score normalization, making it easy to combine search engines with different score distributions.
RRF_score(d) = Σ 1 / (k + rank_i(d))
k is typically set to 60. This simplicity is RRF's strength — fewer tuning parameters yield more stable results.
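The formula above translates almost directly into code. Here is a minimal sketch of RRF over any number of ranked result lists (document IDs only, no scores needed), using the conventional k = 60:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists with Reciprocal Rank Fusion: score(d) = sum 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both the Dense and BM25 lists accumulates two large reciprocal terms and rises above documents found by only one method.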
Weighted linear combination normalizes each score before taking a weighted average.
hybrid_score(d) = α × dense_score(d) + (1 - α) × bm25_score(d)
α is often set around 0.7 (favoring Dense), but the optimal value varies depending on the domain and query type. A practical approach is to determine α via grid search in an evaluation pipeline.
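A minimal sketch of the weighted variant, using min-max normalization before combining (other normalizations such as z-score are equally valid; the function names are ours):

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so Dense and BM25 distributions are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_hybrid(dense: dict[str, float], bm25: dict[str, float], alpha: float = 0.7) -> list[str]:
    """hybrid_score(d) = alpha * dense(d) + (1 - alpha) * bm25(d), missing scores treated as 0."""
    dn, bn = minmax(dense), minmax(bm25)
    combined = {
        d: alpha * dn.get(d, 0.0) + (1 - alpha) * bn.get(d, 0.0)
        for d in set(dn) | set(bn)
    }
    return sorted(combined, key=combined.get, reverse=True)
```

The `alpha` default of 0.7 mirrors the Dense-favoring setting mentioned above; grid-searching it against an evaluation set is the principled way to fix the value.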
The author's recommendation is to start by implementing RRF to establish an accuracy baseline, then consider migrating to weighted linear combination if there is room for further improvement. There is no need to spend time tuning α from the outset — that can wait until sufficient evaluation data has been accumulated.
Hybrid search fully contained within PostgreSQL using pgvector + pg_bigm offers a simple infrastructure configuration and low operational costs.
```sql
-- Dense search (pgvector)
SELECT id, 1 - (embedding <=> query_embedding) AS dense_score
FROM chunks
ORDER BY embedding <=> query_embedding
LIMIT 20;

-- Full-text search (ts_rank shown as a stand-in; pg_bigm or pgroonga for BM25-style scoring)
SELECT id, ts_rank(to_tsvector(content), plainto_tsquery('search query')) AS bm25_score
FROM chunks
WHERE to_tsvector(content) @@ plainto_tsquery('search query')
ORDER BY bm25_score DESC
LIMIT 20;
```

Scores are merged using RRF, and the top 5–10 results are passed to the LLM. In our internal benchmarks, Hit Rate@5 for Dense alone was 0.72, whereas Hybrid (RRF) improved it to 0.87. The improvement was particularly pronounced for queries containing proper nouns, with a dramatic gain from 0.58 to 0.84.

Traditional RAG follows a fixed pipeline of "query → retrieval → generation." Agentic RAG extends this by enabling LLM agents to dynamically determine retrieval strategies. Recognized by Gartner as a notable trend, this approach significantly improves accuracy for complex queries.
In conventional RAG (Naive RAG), the user's query is used directly as the search query. However, compound queries such as "Compare last year's security incident response with this year's preventive measures" cannot cover all the necessary information in a single search.
In Agentic RAG, an LLM agent autonomously makes decisions such as: whether to decompose a compound query into sub-queries, which retrieval strategy to apply (Dense, BM25, Hybrid, or a metadata filter), whether the retrieved results are sufficient to answer, and whether to rewrite the query and search again.
This is not merely a technical improvement, but a paradigm shift in RAG architecture. The retrieval pipeline transitions from something "fixed programmatically" to something "configured by an agent according to the situation."
The Agentic RAG agent is equipped with the following tools.
```typescript
const tools = [
  {
    name: "hybrid_search",
    description: "Executes a Dense + BM25 Hybrid search",
    parameters: { query: "string", filters: "object", top_k: "number" }
  },
  {
    name: "metadata_filter",
    description: "Narrows down documents by metadata conditions",
    parameters: { department: "string", doc_type: "string", date_range: "object" }
  },
  {
    name: "summarize_results",
    description: "Summarizes search results and evaluates information sufficiency",
    parameters: { chunks: "array", original_query: "string" }
  },
  {
    name: "refine_query",
    description: "Rewrites the query and performs a new search when results are insufficient",
    parameters: { original_query: "string", missing_info: "string" }
  }
];
```

The key to tool design is writing specific descriptions for each tool. Since the agent selects tools by reading their descriptions, vague wording leads to frequent incorrect tool selections. Instead of writing "performs a search," write something like "executes a Dense + BM25 Hybrid search and returns a list of chunks with relevance scores."
The following shows an actual Agentic RAG flow.

```text
User: "Among the ISO 27001 controls,
       list the ones our company has not yet addressed."

Agent reasoning:
1. Need a list of ISO 27001 controls
   → metadata_filter(doc_type="standard", title="ISO 27001")
2. Need our company's compliance status
   → hybrid_search("information security compliance status controls")
3. Cross-reference search results to identify unaddressed controls
   → summarize_results(chunks, original_query)
4. Results insufficient (only partial compliance status found)
   → refine_query("department-level security measures implementation status report")
5. Integrate additional results and generate final response
```

In this way, Agentic RAG autonomously repeats multiple rounds of search and evaluation for a single query. Where the PoC's fixed pipeline would perform a single search for "ISO 27001 unaddressed controls" and stop, this approach expands that into multi-stage searches adapted to context.
However, multi-step reasoning increases latency and cost. In production, it is essential to set an upper limit on the number of steps (3–5) and impose a timeout (e.g., 30 seconds).
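The step cap and timeout can be enforced with a simple control loop around the agent. This is a sketch of the guardrail only; `agent_step` (one LLM-driven tool call returning new chunks) and `is_sufficient` (the sufficiency check, e.g. backed by `summarize_results`) are hypothetical callbacks standing in for the real agent.

```python
import time

MAX_STEPS = 5      # upper limit on reasoning/search steps (article suggests 3-5)
TIMEOUT_S = 30.0   # hard wall-clock budget per query

def run_agentic_rag(query: str, agent_step, is_sufficient) -> list:
    """Repeat search steps until results suffice, within step and time budgets.

    On budget exhaustion we fall back to answering with whatever context
    was gathered, rather than failing the request outright.
    """
    start = time.monotonic()
    context: list = []
    for _ in range(MAX_STEPS):
        if time.monotonic() - start > TIMEOUT_S:
            break  # time budget exhausted
        context.extend(agent_step(query, context))
        if is_sufficient(context):
            break  # agent judges the gathered context answerable
    return context
```

Keeping the budget in plain application code, outside the agent's own reasoning, ensures a runaway agent cannot talk itself past the limit.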

The biggest challenge with production RAG is that "accuracy degradation often goes unnoticed." By integrating an evaluation pipeline into CI/CD, we can build a mechanism to detect accuracy degradation before deployment.
RAG evaluation metrics need to cover both retrieval and generation aspects.
| Metric | Measurement Target | Meaning |
|---|---|---|
| Context Relevance | Retrieval | Whether the retrieved chunks are relevant to the query |
| Faithfulness | Generation | Whether the answer is faithful to the content of the retrieved chunks (hallucination detection) |
| Answer Correctness | Generation | Whether the answer matches the correct answer |
| Hit Rate@K | Retrieval | Whether the correct chunk is included in the top K results |
| MRR (Mean Reciprocal Rank) | Retrieval | The average of the reciprocal ranks of the correct chunks |
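The two retrieval metrics in the table are cheap to compute yourself once you have ranked results and gold chunk IDs per query. A minimal sketch (one gold chunk per query is assumed for simplicity):

```python
def hit_rate_at_k(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose correct chunk appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)

def mrr(results: list[list[str]], gold: list[str]) -> float:
    """Mean reciprocal rank of the correct chunk; a query scores 0 if it is absent."""
    total = 0.0
    for ranked, g in zip(results, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)
```

These two require only retrieval logs, so they can run far more often (and more cheaply) than the LLM-judged metrics like Faithfulness.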
The metric the author values most is Faithfulness. In enterprise use cases, hallucinations (answers that differ from the facts) represent the greatest risk and can instantly destroy user trust. A rule has been established to block deployment whenever Faithfulness falls below 0.95.
Using Ragas or DeepEval, you can build an evaluation pipeline in a relatively short period of time.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy, answer_correctness

# Test dataset (questions + ground truth + retrieved contexts + generated answers)
result = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, context_relevancy, answer_correctness],
)
```

Creating the test dataset is the most time-consuming part. In addition to the test cases used during the PoC, continuously add cases from production logs where users gave feedback indicating poor answer quality. Having a minimum of 100 test cases, and ideally 500 or more, will yield a statistically reliable evaluation.
The key to integrating into CI/CD is to establish threshold-based gates.
At our company, we have built a flow that combines this with a HITL (Human-in-the-Loop) mechanism, where humans review cases that fall below the threshold in automated evaluations. Automated evaluation is not a silver bullet, but when it comes to early detection of degradation, it is overwhelmingly faster than human-only review.
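A threshold gate like the one described above is a few lines of code in a CI job. The specific thresholds below are illustrative (only the 0.95 Faithfulness floor comes from the article; the other two are hypothetical values to tune against your own risk tolerance):

```python
# Faithfulness floor of 0.95 is from the article; the others are placeholders.
THRESHOLDS = {
    "faithfulness": 0.95,
    "context_relevance": 0.80,
    "answer_correctness": 0.85,
}

def gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failure messages); a missing metric counts as a failure."""
    failures = [
        f"{metric}: {scores.get(metric, 0.0):.3f} < {threshold:.2f}"
        for metric, threshold in THRESHOLDS.items()
        if scores.get(metric, 0.0) < threshold
    ]
    return (not failures, failures)
```

In CI, a non-empty failure list blocks the deployment and, in a HITL flow, routes the failing cases to a human reviewer.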

Here are three failure patterns I have repeatedly observed in the transition from PoC to production.
There are many cases where selecting an Embedding model based solely on English benchmark scores leads to poor accuracy with Japanese or multilingual documents. In our Lao RAG system, OpenAI's text-embedding-3-large performed excellently in English, but Cohere's multilingual model scored 12 points higher in Hit Rate for Lao.
Solution: Always benchmark using your own domain data. Public benchmarks (such as MTEB) are merely reference values, and rankings can change significantly when domain-specific terminology is prevalent.
Inserting a reranker (Cross-Encoder) after Hybrid search can be expected to yield a further 5–15% improvement in accuracy. However, some projects conclude that "Hybrid alone is sufficient" and hit a ceiling in accuracy.
A reranker is a model that re-scores the top 20–50 search results by pairing them with the query. While it is more computationally expensive than a Bi-Encoder (Embedding), its accuracy is overwhelmingly higher. Cohere Rerank and bge-reranker-v2 are practical options.
Solution: Make a three-stage pipeline the standard configuration: Hybrid search → reranker → pass the top 5 results to the LLM. The increase in latency is approximately 100–300ms, which is acceptable for enterprise use cases.
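The rerank stage of that pipeline reduces to one function. This sketch keeps the scoring model pluggable: `score_fn` takes (query, chunk) pairs and returns one relevance score per pair, a shape that, for example, sentence-transformers' `CrossEncoder.predict` satisfies; the test below uses a trivial stub scorer.

```python
def rerank(query: str, chunks: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Re-score (query, chunk) pairs with a cross-encoder-style scorer, keep top_n.

    score_fn: callable taking a list of (query, chunk) pairs and returning
    one relevance score per pair (e.g. a sentence-transformers CrossEncoder's
    predict method, or a thin wrapper over the Cohere Rerank API).
    """
    scores = score_fn([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Feed it the top 20–50 Hybrid results and pass only the surviving 5 to the LLM; that keeps the added latency bounded while capturing most of the reranker's accuracy gain.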
When you start passing large numbers of chunks to an LLM to improve search accuracy, token consumption spikes rapidly. A context of 10 chunks × 500 tokens = 5,000 tokens drives up the cost per request by several cents, which can translate into a difference of thousands of dollars per month.
Mitigation: Use a reranker to carefully select the top results, and limit the chunks passed to the LLM to 3–5. Another effective approach is to run Abstractive Summarization within the search pipeline to compress token count. Cost optimization must be pursued in parallel with accuracy tuning.

Depending on the domain and data characteristics, improvements of 15–30% in Hit Rate@5 compared to Dense alone have been confirmed in multiple academic papers and our own empirical measurements. The improvement is particularly significant for queries containing proper nouns and numerical values. However, the improvement is not uniform across all query types, and for semantically ambiguous queries, performance may be roughly equivalent to Dense alone.
While conventional RAG responds in 1–3 seconds, Agentic RAG can take 5–15 seconds. Practical countermeasures include using streaming responses to reduce perceived latency, setting an upper limit on the number of steps, and routing simple queries to Naive RAG by pre-evaluating query complexity. Not every query needs to be handled agentically.
Practical, but additional effort is required. From our experience with Lao RAG, three points are particularly important: selection of the Embedding model (a multilingual model is essential), language-aware chunking (word boundary detection), and preparation of evaluation data (manual annotation). Before giving up with generic tools and concluding that "accuracy is insufficient," it is worth trying language-specific preprocessing. For more details, please refer to the Lao Language AI Chatbot Construction Guide.

Finally, a recap of the design decisions to review when transitioning from PoC to production.
Productionizing RAG is not a problem solved by any single technology. By accumulating the right design decisions at each layer — chunking, retrieval, reranking, generation, and evaluation — you can close the accuracy gap between PoC and production. The most pragmatic approach is to start by introducing Hybrid search and an evaluation pipeline, then progressively expand toward Agentic RAG as query complexity demands.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).