
Your RAG PoC worked. But the moment you deployed to production, accuracy dropped and users called it "unusable." If that pattern sounds familiar, this article was written to close that gap. We'll walk through, step by step, how to boost retrieval accuracy by 15–30% using Hybrid search that combines dense vector search with BM25, and how to build a dynamic retrieval pipeline with Agentic RAG that adapts to query intent, including implementation patterns and evaluation metric design. Drawing on the low-resource-language challenges we encountered when building a Lao-language RAG chatbot, we'll organize the key design decisions needed to move from PoC to production.

There is a fundamental difference between PoC environments and production environments in terms of data quality, quantity, and diversity. In a PoC, validation is performed using "data that works well," whereas in production, a large volume of unexpected queries comes in. This section breaks down the structure of the accuracy gap.
The first is data homogeneity. In a PoC, you test with tens to hundreds of pre-formatted internal documents. In production, however, tens of thousands of documents pour in with varying formats and tones — PDFs, Markdown, HTML, meeting minutes, chat logs, and more. A misaligned chunking boundary alone can cause the context needed for an answer to be lost.
The second is query diversity. PoC test queries tend to be skewed toward "questions with known correct answers" written by developers. Production users throw in ambiguous questions, compound questions, and even questions that fall outside the scope of RAG.
The third is the absence of evaluation. In a PoC, things can pass with a vague sense that they seem good enough, but in production, without quantitative accuracy metrics, you won't notice degradation. When our company built a Lao-language RAG system, we verified accuracy during the PoC using only translation tests on English documents — but the actual queries from Lao-speaking users contained colloquial expressions and dialects, making them an entirely different beast from the PoC scores.
Accuracy degradation occurs at three major layers.
At the retrieval layer, the limitations of Dense search alone become apparent. Technical terms and proper nouns may not be placed in proximity within the Embedding space, causing semantically relevant documents to be missed. Conversely, there are cases where chunks that are superficially similar but contextually different are returned at the top of results.
At the chunking layer, information fragmentation caused by fixed-length splitting becomes a problem. Chunks are cut off in the middle of tables or lists, producing chunks whose meaning cannot be understood without the surrounding context.
At the generation layer, passing low-quality retrieval results to the LLM increases hallucinations and off-target responses. No matter how high-performing the LLM is, if retrieval accuracy is poor, the output will not improve. RAG accuracy is determined first and foremost by retrieval.

This section outlines the environment and background knowledge required to implement the steps in this article.
| Component | Recommended Technology | Role |
|---|---|---|
| Vector DB | Qdrant / Weaviate / pgvector | Storage and retrieval of dense vectors |
| Full-text search | Elasticsearch / OpenSearch / pgvector + pg_bigm | BM25 scoring |
| Embedding model | OpenAI text-embedding-3-large / Cohere embed-v4 / multilingual models | Text → vector conversion |
| LLM | Claude / GPT | Answer generation / agent reasoning |
| Orchestration | LangChain / LlamaIndex / Mastra | Pipeline control |
| Evaluation | Ragas / DeepEval | Automated evaluation pipeline |
A pgvector + PostgreSQL setup is well-suited for starting small. In our own Lao-language RAG implementation, we took the approach of starting with pgvector first, then considering migration to a dedicated vector DB as scale demands grew.
This article assumes working knowledge of the basic RAG pipeline: embedding generation, vector search, and prompt-based answer generation. If you are unsure about any of these, it is recommended to first build a solid foundation with the Lao Language AI Chatbot RAG Guide.

Chunking is the foundation of RAG accuracy. Move beyond the PoC mindset of "just split at 500 tokens for now" and design a strategy tailored to the structure of your documents.
There is no universal correct answer for chunk size. However, there are practical guidelines.
In the author's experience, the most efficient approach was to start with 400 tokens and an overlap of 50 tokens, then adjust while monitoring evaluation metrics. I have seen projects waste two weeks trying to determine optimal values upfront, but tuning without an evaluation pipeline in place is a waste of time.
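As a concrete starting point, the 400-token / 50-token-overlap baseline described above can be sketched as a simple sliding-window splitter. This is a minimal illustration, assuming tokenization has already been done upstream; the function name `chunk_text` is ours, not from any particular library.

```python
def chunk_text(tokens: list[str], size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split a token sequence into fixed-size chunks with overlap.

    Starting point from the article: size=400, overlap=50, then tune
    against evaluation metrics rather than guessing upfront.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

Each chunk after the first repeats the final 50 tokens of its predecessor, which keeps sentences that straddle a boundary retrievable from at least one chunk.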
Metadata assignment to chunks is what largely determines the accuracy of production RAG.
```json
{
  "chunk_id": "doc-123-chunk-5",
  "source": "internal_regulations_v3.2.pdf",
  "section": "Chapter 4: Information Security",
  "page": 12,
  "last_updated": "2026-01-15",
  "department": "Information Systems",
  "document_type": "policy"
}
```

Implementing metadata-based pre-filtering allows you to narrow down the search scope before executing Dense/BM25 retrieval. For a query like "the latest internal regulations on information security," filtering by `document_type = "policy"` and `section LIKE '%Security%'` before searching yields dramatically higher accuracy than searching across all chunks.
When we built a Lao-language RAG chatbot, the biggest challenge in chunking was word boundary detection. Like Thai, Lao does not use spaces to separate words, and tokenizers designed for English would cut chunk boundaries in the middle of sentences.
Ultimately, we implemented a custom splitter that uses Lao sentence delimiters (the equivalent of 。) as the primary split point, with byte length as the secondary split. This single change improved retrieval accuracy (Hit Rate@5) from 0.42 to 0.61. For low-resource languages, using a general-purpose chunking library out of the box will not yield good accuracy. Language-specific preprocessing is a prerequisite for production-quality results.
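The two-stage strategy described above can be sketched as follows. This is a simplified illustration of the approach, not our production splitter: the delimiter set is passed in as a parameter because identifying the right Lao sentence delimiters was precisely the empirical part of the work, and the byte budget (`max_bytes`) is a hypothetical value.

```python
def split_sentences_then_pack(text: str, delimiters: set[str], max_bytes: int = 1200) -> list[str]:
    """Primary split on sentence delimiters, secondary split by UTF-8 byte length."""
    # Stage 1: cut on language-specific sentence-ending characters.
    sentences, buf = [], ""
    for ch in text:
        buf += ch
        if ch in delimiters:
            sentences.append(buf)
            buf = ""
    if buf:
        sentences.append(buf)

    # Stage 2: pack whole sentences into chunks under a byte budget,
    # so chunk boundaries never fall mid-sentence. A single sentence
    # longer than max_bytes is kept whole here (handle separately if needed).
    chunks, current = [], ""
    for s in sentences:
        if current and len((current + s).encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = s
        else:
            current += s
    if current:
        chunks.append(current)
    return chunks
```

The test below uses ASCII `"."` as the delimiter purely for illustration; in practice the delimiter set holds the Lao sentence-ending characters.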

Dense vector search excels at semantic similarity but struggles with exact matches for proper nouns and keywords. BM25 is the opposite. Hybrid search, which combines both, has been reported in multiple benchmarks to improve accuracy by 15–30% over single-method search.
| Characteristic | Dense (Vector Search) | BM25 (Full-Text Search) |
|---|---|---|
| Semantic similarity | Strong | Weak |
| Keyword matching | Weak | Strong |
| Technical terminology | Depends on embedding quality | Reliable with exact match |
| Multilingual support | Supported via multilingual embedding models | Requires per-language analyzers |
| Scalability | Fast with ANN | Fast with inverted index |
In practice, the greatest benefit is felt when users submit queries containing unique identifiers such as product codes or article numbers. Dense search alone may miss queries like "specifications for product code ABC-1234," whereas BM25 can match them precisely. On the other hand, Dense search excels at handling vague queries such as "documents similar to last year's security incident response policy."
There are two main methods for score integration in Hybrid Search.
Reciprocal Rank Fusion (RRF) integrates scores using only the rank of each search result. It requires no score normalization, making it easy to combine search engines with different score distributions.
RRF_score(d) = Σ 1 / (k + rank_i(d))
k is typically set to 60. This simplicity is RRF's strength — fewer tuning parameters yield more stable results.
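The formula above translates almost directly into code. Here is a minimal sketch of RRF over any number of ranked result lists (document IDs only, no scores needed), using the conventional k = 60:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists with Reciprocal Rank Fusion: score(d) = sum 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both the Dense and BM25 lists accumulates two large reciprocal terms and rises above documents found by only one method.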
Weighted linear combination normalizes each score before taking a weighted average.
hybrid_score(d) = α × dense_score(d) + (1 - α) × bm25_score(d)
α is often set around 0.7 (favoring Dense), but the optimal value varies depending on the domain and query type. A practical approach is to determine α via grid search in an evaluation pipeline.
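A minimal sketch of the weighted variant, using min-max normalization before combining (other normalizations such as z-score are equally valid; the function names are ours):

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so Dense and BM25 distributions are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_hybrid(dense: dict[str, float], bm25: dict[str, float], alpha: float = 0.7) -> list[str]:
    """hybrid_score(d) = alpha * dense(d) + (1 - alpha) * bm25(d), missing scores treated as 0."""
    dn, bn = minmax(dense), minmax(bm25)
    combined = {
        d: alpha * dn.get(d, 0.0) + (1 - alpha) * bn.get(d, 0.0)
        for d in set(dn) | set(bn)
    }
    return sorted(combined, key=combined.get, reverse=True)
```

The `alpha` default of 0.7 mirrors the Dense-favoring setting mentioned above; grid-searching it against an evaluation set is the principled way to fix the value.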
The author's recommendation is to start by implementing RRF to establish an accuracy baseline, then consider migrating to weighted linear combination if there is room for further improvement. There is no need to spend time tuning α from the outset — that can wait until sufficient evaluation data has been accumulated.
Hybrid search fully contained within PostgreSQL using pgvector + pg_bigm offers a simple infrastructure configuration and low operational costs.
```sql
-- Dense search (pgvector)
SELECT id, 1 - (embedding <=> query_embedding) AS dense_score
FROM chunks
ORDER BY embedding <=> query_embedding
LIMIT 20;

-- Full-text search (ts_rank shown as a stand-in; pg_bigm or pgroonga for BM25-style scoring)
SELECT id, ts_rank(to_tsvector(content), plainto_tsquery('search query')) AS bm25_score
FROM chunks
WHERE to_tsvector(content) @@ plainto_tsquery('search query')
ORDER BY bm25_score DESC
LIMIT 20;
```

Scores are merged using RRF, and the top 5–10 results are passed to the LLM. In our internal benchmarks, Hit Rate@5 for Dense alone was 0.72, whereas Hybrid (RRF) improved it to 0.87. The improvement was particularly pronounced for queries containing proper nouns, with a dramatic gain from 0.58 to 0.84.

Traditional RAG follows a fixed pipeline of "query → retrieval → generation." Agentic RAG extends this by enabling LLM agents to dynamically determine retrieval strategies. Recognized by Gartner as a notable trend, this approach significantly improves accuracy for complex queries.
In conventional RAG (Naive RAG), the user's query is used directly as the search query. However, compound queries such as "Compare last year's security incident response with this year's preventive measures" cannot cover all the necessary information in a single search.
In Agentic RAG, an LLM agent autonomously makes decisions such as: whether to decompose a compound query into sub-queries, which retrieval strategy to apply (Dense, BM25, Hybrid, or a metadata filter), whether the retrieved results are sufficient to answer, and whether to rewrite the query and search again.
This is not merely a technical improvement, but a paradigm shift in RAG architecture. The retrieval pipeline transitions from something "fixed programmatically" to something "configured by an agent according to the situation."
The Agentic RAG agent is equipped with the following tools.
```typescript
const tools = [
  {
    name: "hybrid_search",
    description: "Executes a Dense + BM25 Hybrid search",
    parameters: { query: "string", filters: "object", top_k: "number" }
  },
  {
    name: "metadata_filter",
    description: "Narrows down documents by metadata conditions",
    parameters: { department: "string", doc_type: "string", date_range: "object" }
  },
  {
    name: "summarize_results",
    description: "Summarizes search results and evaluates information sufficiency",
    parameters: { chunks: "array", original_query: "string" }
  },
  {
    name: "refine_query",
    description: "Rewrites the query and performs a new search when results are insufficient",
    parameters: { original_query: "string", missing_info: "string" }
  }
];
```

The key to tool design is writing specific descriptions for each tool. Since the agent selects tools by reading their descriptions, vague wording leads to frequent incorrect tool selections. Instead of writing "performs a search," write something like "executes a Dense + BM25 Hybrid search and returns a list of chunks with relevance scores."
The following shows an actual Agentic RAG flow.

```text
User: "Among the ISO 27001 controls,
       list the ones our company has not yet addressed."

Agent reasoning:
1. Need a list of ISO 27001 controls
   → metadata_filter(doc_type="standard", title="ISO 27001")
2. Need our company's compliance status
   → hybrid_search("information security compliance status controls")
3. Cross-reference search results to identify unaddressed controls
   → summarize_results(chunks, original_query)
4. Results insufficient (only partial compliance status found)
   → refine_query("department-level security measures implementation status report")
5. Integrate additional results and generate final response
```

In this way, Agentic RAG autonomously repeats multiple rounds of search and evaluation for a single query. Where the PoC's fixed pipeline would perform a single search for "ISO 27001 unaddressed controls" and stop, this approach expands that into multi-stage searches adapted to context.
However, multi-step reasoning increases latency and cost. In production, it is essential to set an upper limit on the number of steps (3–5) and impose a timeout (e.g., 30 seconds).
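The step cap and timeout can be enforced with a simple control loop around the agent. This is a sketch of the guardrail only; `agent_step` (one LLM-driven tool call returning new chunks) and `is_sufficient` (the sufficiency check, e.g. backed by `summarize_results`) are hypothetical callbacks standing in for the real agent.

```python
import time

MAX_STEPS = 5      # upper limit on reasoning/search steps (article suggests 3-5)
TIMEOUT_S = 30.0   # hard wall-clock budget per query

def run_agentic_rag(query: str, agent_step, is_sufficient) -> list:
    """Repeat search steps until results suffice, within step and time budgets.

    On budget exhaustion we fall back to answering with whatever context
    was gathered, rather than failing the request outright.
    """
    start = time.monotonic()
    context: list = []
    for _ in range(MAX_STEPS):
        if time.monotonic() - start > TIMEOUT_S:
            break  # time budget exhausted
        context.extend(agent_step(query, context))
        if is_sufficient(context):
            break  # agent judges the gathered context answerable
    return context
```

Keeping the budget in plain application code, outside the agent's own reasoning, ensures a runaway agent cannot talk itself past the limit.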

The biggest challenge with production RAG is that "accuracy degradation often goes unnoticed." By integrating an evaluation pipeline into CI/CD, we can build a mechanism to detect accuracy degradation before deployment.
RAG evaluation metrics need to cover both retrieval and generation aspects.
| Metric | Measurement Target | Meaning |
|---|---|---|
| Context Relevance | Retrieval | Whether the retrieved chunks are relevant to the query |
| Faithfulness | Generation | Whether the answer is faithful to the content of the retrieved chunks (hallucination detection) |
| Answer Correctness | Generation | Whether the answer matches the correct answer |
| Hit Rate@K | Retrieval | Whether the correct chunk is included in the top K results |
| MRR (Mean Reciprocal Rank) | Retrieval | The average of the reciprocal ranks of the correct chunks |
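The two retrieval metrics in the table are cheap to compute yourself once you have ranked results and gold chunk IDs per query. A minimal sketch (one gold chunk per query is assumed for simplicity):

```python
def hit_rate_at_k(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose correct chunk appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)

def mrr(results: list[list[str]], gold: list[str]) -> float:
    """Mean reciprocal rank of the correct chunk; a query scores 0 if it is absent."""
    total = 0.0
    for ranked, g in zip(results, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)
```

These two require only retrieval logs, so they can run far more often (and more cheaply) than the LLM-judged metrics like Faithfulness.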
The metric the author values most is Faithfulness. In enterprise use cases, hallucinations (answers that differ from the facts) represent the greatest risk and can instantly destroy user trust. A rule has been established to block deployment whenever Faithfulness falls below 0.95.
Using Ragas or DeepEval, you can build an evaluation pipeline in a relatively short period of time.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy, answer_correctness

# Test dataset (questions + ground truth + retrieved contexts + generated answers)
result = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, context_relevancy, answer_correctness],
)
```

Creating the test dataset is the most time-consuming part. In addition to the test cases used during the PoC, continuously add cases from production logs where users gave feedback indicating poor answer quality. Having a minimum of 100 test cases, and ideally 500 or more, will yield a statistically reliable evaluation.
The key to integrating into CI/CD is to establish threshold-based gates.
At our company, we have built a flow that combines this with a HITL (Human-in-the-Loop) mechanism, where humans review cases that fall below the threshold in automated evaluations. Automated evaluation is not a silver bullet, but when it comes to early detection of degradation, it is overwhelmingly faster than human-only review.
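A threshold gate like the one described above is a few lines of code in a CI job. The specific thresholds below are illustrative (only the 0.95 Faithfulness floor comes from the article; the other two are hypothetical values to tune against your own risk tolerance):

```python
# Faithfulness floor of 0.95 is from the article; the others are placeholders.
THRESHOLDS = {
    "faithfulness": 0.95,
    "context_relevance": 0.80,
    "answer_correctness": 0.85,
}

def gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failure messages); a missing metric counts as a failure."""
    failures = [
        f"{metric}: {scores.get(metric, 0.0):.3f} < {threshold:.2f}"
        for metric, threshold in THRESHOLDS.items()
        if scores.get(metric, 0.0) < threshold
    ]
    return (not failures, failures)
```

In CI, a non-empty failure list blocks the deployment and, in a HITL flow, routes the failing cases to a human reviewer.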

Here are three failure patterns I have repeatedly observed in the transition from PoC to production.
There are many cases where selecting an Embedding model based solely on English benchmark scores leads to poor accuracy with Japanese or multilingual documents. In our Lao RAG system, OpenAI's text-embedding-3-large performed excellently in English, but Cohere's multilingual model scored 12 points higher in Hit Rate for Lao.
Solution: Always benchmark using your own domain data. Public benchmarks (such as MTEB) are merely reference values, and rankings can change significantly when domain-specific terminology is prevalent.
Inserting a reranker (Cross-Encoder) after Hybrid search can be expected to yield a further 5–15% improvement in accuracy. However, some projects conclude that "Hybrid alone is sufficient" and hit a ceiling in accuracy.
A reranker is a model that re-scores the top 20–50 search results by pairing them with the query. While it is more computationally expensive than a Bi-Encoder (Embedding), its accuracy is overwhelmingly higher. Cohere Rerank and bge-reranker-v2 are practical options.
Solution: Make a three-stage pipeline the standard configuration: Hybrid search → reranker → pass the top 5 results to the LLM. The increase in latency is approximately 100–300ms, which is acceptable for enterprise use cases.
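The rerank stage of that pipeline reduces to one function. This sketch keeps the scoring model pluggable: `score_fn` takes (query, chunk) pairs and returns one relevance score per pair, a shape that, for example, sentence-transformers' `CrossEncoder.predict` satisfies; the test below uses a trivial stub scorer.

```python
def rerank(query: str, chunks: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Re-score (query, chunk) pairs with a cross-encoder-style scorer, keep top_n.

    score_fn: callable taking a list of (query, chunk) pairs and returning
    one relevance score per pair (e.g. a sentence-transformers CrossEncoder's
    predict method, or a thin wrapper over the Cohere Rerank API).
    """
    scores = score_fn([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Feed it the top 20–50 Hybrid results and pass only the surviving 5 to the LLM; that keeps the added latency bounded while capturing most of the reranker's accuracy gain.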
When you start passing large numbers of chunks to an LLM to improve search accuracy, token consumption spikes rapidly. A context of 10 chunks × 500 tokens = 5,000 tokens drives up the cost per request by several cents, which can translate into a difference of thousands of dollars per month.
Mitigation: Use a reranker to carefully select the top results, and limit the chunks passed to the LLM to 3–5. Another effective approach is to run Abstractive Summarization within the search pipeline to compress token count. Cost optimization must be pursued in parallel with accuracy tuning.

Depending on the domain and data characteristics, improvements of 15–30% in Hit Rate@5 compared to Dense alone have been confirmed in multiple academic papers and our own empirical measurements. The improvement is particularly significant for queries containing proper nouns and numerical values. However, the improvement is not uniform across all query types, and for semantically ambiguous queries, performance may be roughly equivalent to Dense alone.
While conventional RAG responds in 1–3 seconds, Agentic RAG can take 5–15 seconds. Practical countermeasures include using streaming responses to reduce perceived latency, setting an upper limit on the number of steps, and routing simple queries to Naive RAG by pre-evaluating query complexity. Not every query needs to be handled agentically.
Practical, but additional effort is required. From our experience with Lao RAG, three points are particularly important: selection of the Embedding model (a multilingual model is essential), language-aware chunking (word boundary detection), and preparation of evaluation data (manual annotation). Before giving up with generic tools and concluding that "accuracy is insufficient," it is worth trying language-specific preprocessing. For more details, please refer to the Lao Language AI Chatbot Construction Guide.

Finally, a recap of the design decisions to review when transitioning from PoC to production.
Productionizing RAG is not a problem solved by any single technology. By accumulating the right design decisions at each layer — chunking, retrieval, reranking, generation, and evaluation — you can close the accuracy gap between PoC and production. The most pragmatic approach is to start by introducing Hybrid search and an evaluation pipeline, then progressively expand toward Agentic RAG as query complexity demands.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).