Cross-Lingual Embeddings Fail for Low-Resource Languages — Empirical Findings on Lao and Multilingual RAG

June 23, 2026

Lead

Cross-lingual embedding search retrieves documents in one language using a query in another, relying solely on semantic similarity rather than translation. Multilingual embedding models are expected to make this possible within a single vector space that spans languages. However, when we measured a knowledge search (RAG) infrastructure handling Japanese, English, Thai, and Lao, cross-lingual search was nearly non-functional for one language only: Lao, a low-resource language. This article presents primary data on cross-lingual separation performance by language, a structural explanation of the root cause, and a practical solution in the form of "language-specific indexes." The intended audience is engineers designing multilingual knowledge search or RAG systems. Readers should come away with two things: the understanding that the cross-lingual assumption does not hold for every language, and a procedure for empirically measuring separation performance for each target language.

Prerequisites for Multilingual RAG and Cross-Lingual Retrieval

Multilingual RAG often operates on the assumption that "feeding all languages into a single embedding model will connect them semantically across language boundaries." This assumption generally holds for high- and medium-resource languages, but breaks down for low-resource languages. This section first reviews how cross-lingual search works and then shows where that assumption becomes precarious.

What Is Cross-Lingual Retrieval?

Cross-lingual retrieval embeds both the query and the document into the same vector space, even when they are written in different languages, and determines relevance from the distance (similarity) between them. For example, when a question posed in Thai retrieves a Japanese-language manual, it is because the Thai sentence and the semantically equivalent Japanese sentence are placed at nearby coordinates in that space. If this holds, a single index can serve multilingual users without preparing knowledge in each language separately. Because this dramatically reduces the cost of multilingual support, many multilingual RAG systems adopt this assumption.

The appeal of the approach is that adding supported languages does not require adding indexes. If a single internal manual written in Japanese can also be searched in English or Thai, knowledge management can be consolidated into one system. However, this convenience rests on the implicit assumption that all languages are neatly aligned in the same space. When that assumption breaks down for a particular language, searches begin to miss silently, without anyone noticing.
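As a minimal sketch of the "nearby coordinates" idea, the snippet below scores a document against two queries. It assumes cosine similarity as the metric (the article does not state which similarity function was used), and the three-dimensional vectors are invented for illustration; real embedding models produce far higher-dimensional vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, for illustration only.
doc_ja = [0.9, 0.1, 0.2]            # Japanese manual
query_aligned = [0.8, 0.2, 0.3]     # query in a language aligned with Japanese
query_isolated = [-0.1, 0.9, 0.0]   # query in a language that sits in a separate region

sim_aligned = cosine_similarity(query_aligned, doc_ja)
sim_isolated = cosine_similarity(query_isolated, doc_ja)
print(f"aligned: {sim_aligned:.2f}, isolated: {sim_isolated:.2f}")
```

The aligned query lands close to the document, while the isolated query is nearly orthogonal to it even if both sentences mean the same thing. Nothing in the computation raises an error in the second case, which is why the miss is silent.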

Where Does the "Connected in One Space" Assumption Break Down?

Our own chat and knowledge infrastructure initially relied on a single vector space to handle search across all languages. Since it worked as expected for Japanese, English, and Thai, there was no reason to question this assumption. The problem surfaced when we began validating knowledge search in Lao. Lao queries returned no results at all, even for documents that clearly should have matched. At first we suspected indexing issues or preprocessing bugs, but the cause ran deeper: the embedding model was failing to align Lao with the other languages in the same space. This prompted us to measure separation performance language by language.

In hindsight, the success of the working languages had masked the problem. Seeing synonymous sentences rank at the top for Japanese, English, and Thai, we had jumped to the conclusion that multilingual support was sufficient. In reality, the assumption had completely broken down for Lao, which we had never tested. The generalization that "if it works for major languages, it works for all languages" was the first assumption we should have discarded.

Our Measurements: Cross-Lingual Separation Performance by Language

The measurement results were clear. Japanese, English, and Thai functioned cross-lingually, while Lao alone produced similarity scores lower than those of unrelated sentences, making it effectively orthogonal to the other languages. All values below were measured by our team using the same multilingual embedding model (higher similarity scores for synonymous sentences indicate better performance).

Cross-Lingual Comparison: Lao Scores Only 0.03

The documents were held fixed in Japanese, and we measured the similarity between a query in each language and the synonymous Japanese document. Unrelated documents score approximately 0.21, so a synonym score at or below this level can be considered "not connected in meaning."

Conclusion: Japanese, English, and Thai all exceed the noise baseline (approx. 0.21) by a sufficient margin, whereas Lao scores 0.03—lower than noise—and is nearly orthogonal.

Query Language	Similarity to Synonymous Document	Judgment
Japanese	0.61	✅
English	0.55	✅
Thai	0.45	✅ (sufficiently above noise)
Lao	0.03	❌ (lower than unrelated documents)

The fact that Thai functions at 0.45 is significant. Thai and Lao share closely related scripts and vocabulary, yet their cross-lingual performance is entirely different. The intuition that "similar languages should behave the same way" does not hold.
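The judgment column above can be reproduced with a few lines of arithmetic. This is a sketch using the values from the table and the 0.21 noise baseline; the function names are illustrative.

```python
NOISE_BASELINE = 0.21  # approximate similarity of unrelated documents (from the text)

# Query language -> similarity to the synonymous Japanese document (from the table)
measured = {"Japanese": 0.61, "English": 0.55, "Thai": 0.45, "Lao": 0.03}

def margins(scores, baseline):
    """Margin of each language's synonym score over the noise baseline."""
    return {lang: round(sim - baseline, 2) for lang, sim in scores.items()}

def disconnected(scores, baseline):
    """Languages whose synonym score does not exceed the noise baseline."""
    return [lang for lang, sim in scores.items() if sim <= baseline]

print(margins(measured, NOISE_BASELINE))
print(disconnected(measured, NOISE_BASELINE))
```

Thai clears the baseline by 0.24, while Lao sits 0.18 below it, so Lao is the only language flagged.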

Lao Queries Fail to Reach Any Language

Next, similarity was measured from Lao queries to synonymous documents in each language. If Lao were connected to other languages even indirectly—via English, for instance—a high value should appear in at least one language.

Document Language (Lao query)	Similarity to Synonymous Document
Lao (same language)	0.80 ✅
Thai	0.13
English	0.065
Japanese	0.064

The results showed a high value of 0.80 only for Lao documents in the same language, with virtually no reach to any other language. No indirect connection via English was observed either. Lao vectors are clustered in a region isolated from all other languages. Conversely, intra-Lao retrieval is extremely effective, and this is where a path to a solution lies.

Intra-Language Discriminability and Comparison with the Previous Model

Saying "Lao is broken" does not mean everything about Lao fails. When measuring how well synonymous and unrelated documents can be separated within the same language (discriminability = similarity of synonyms − similarity of unrelated documents), Lao does function.

Language	Intra-language Discriminability
Japanese	0.49
English	0.46
Thai	0.64
Lao	0.34

Lao's 0.34 is the lowest of the four languages, but it is still enough to distinguish synonymous documents from unrelated ones. The value is lower because unrelated documents also yield relatively high similarity scores in Lao (the space appears somewhat compressed).

We also ran the same comparison on a different, older general-purpose embedding model. Cross-lingual performance for Lao was already broken in that model (synonyms 0.14 < unrelated 0.25), while intra-language discriminability for Lao improved from 0.18 in the older model to 0.34 in the current one. This suggests that the cross-lingual breakdown is not a bug specific to one model but a structural limitation common to low-resource languages.
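The discriminability metric is a plain difference of means, sketched below. Applying the same difference to the older model's cross-lingual figures is our own extension for illustration; the article defines the metric for the intra-language case.

```python
from statistics import mean

def discriminability(synonym_sims, unrelated_sims):
    """Mean synonym similarity minus mean unrelated similarity."""
    return mean(synonym_sims) - mean(unrelated_sims)

# Averages quoted in the text for the older model's cross-lingual Lao scores.
old_model_cross = discriminability([0.14], [0.25])
inverted = old_model_cross < 0  # a negative value means synonyms score below noise

# Intra-language values reported in the table (already expressed as differences).
intra = {"Japanese": 0.49, "English": 0.46, "Thai": 0.64, "Lao": 0.34}
weakest = min(intra, key=intra.get)

print(round(old_model_cross, 2), inverted, weakest)
```

A positive value means the two groups can be separated, and a negative one means the ordering is inverted, which is the cross-lingual Lao situation in the older model.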

Why Do Only Low-Resource Languages Break Down?

The key insight is that "the ability to recognize semantic similarity within the same language" and "the ability to align meaning across languages into the same coordinate space" are distinct capabilities, and only the latter depends strongly on the amount of training resources. Each will be examined in turn.

Intra-Language Similarity and Cross-Language Alignment Are Distinct Capabilities

An embedding model can reach 0.80 for intra-Lao retrieval because it has learned how semantically close Lao texts are to one another. This can be acquired from monolingual data alone, provided there is enough of it. Cross-lingual retrieval, on the other hand, requires an alignment capability that maps Lao concepts into the same subspace as the same concepts in other languages, and this demands a large volume of translation-like signals such as parallel corpora and cross-lingual co-occurrences. Even when intra-language similarity works correctly, cross-lingual retrieval breaks down if inter-language alignment is absent. What we observed with Lao was precisely this state in which only one side works. As an analogy: the map of Lao is drawn accurately within the Lao world, but the coordinate system needed to overlay that map onto the correct position on a world map has not been provided. The accuracy of the map itself (intra-language similarity) and the ability to place it on a shared coordinate system (inter-language alignment) are entirely separate problems.

Alignment Capability Depends on Training Resource Volume

Embedding models are trained on web-scale multilingual data, but that distribution is heavily skewed toward high- and medium-resource languages. Thai, as a medium-resource language, has sufficient training coverage and has acquired cross-lingual alignment with other languages, which is why Japanese documents surface for Thai queries. Lao, by contrast, is low-resource, and its cross-lingual alignment signal is sparse. As a result, Lao vectors cluster in an isolated region of the embedding space and produce no hits against other languages. The compression phenomenon, where even unrelated sentences yield relatively high similarity scores, is another symptom of the same underlying problem: the region Lao occupies is cramped because alignment with the rest of the space was never established.

Generalizing this: cross-lingual capability is the first thing to break down in low-resource languages. The skew cannot be corrected by carefully preprocessing low-resource language data, because the volume of training data sets the ceiling for cross-lingual alignment. That is also why swapping models does not help much: low-resource languages hit the same ceiling. The fact that the same symptom reproduced across two models, old and new, corroborates this structural explanation.

Preprocessing and Normalization Tuning Cannot Fix This

When cross-lingual retrieval is found to be broken, the first instinct is to reach for preprocessing fixes: the tokenizer, character normalization, stopword handling. But the failure observed here is not a preprocessing-layer problem. The fact that Lao-to-Lao retrieval works well at 0.80 demonstrates that preprocessing is functioning correctly. What is broken is the cross-lingual alignment that the embedding model learned — or failed to learn — during training. No amount of input refinement will bring Lao vectors closer to the space occupied by other languages. Before spending time on preprocessing improvements, the first step should be to determine whether the issue is "monolingual retrieval works but cross-lingual is broken" or "both are broken." If it is the former, the root cause is a lack of alignment capability, and the remedy lies in the retrieval architecture — not in preprocessing.
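The triage step described above can be sketched as a small decision function over the two margins (synonym similarity minus unrelated similarity). The `min_margin` cutoff of 0.1 and the wording of the returned labels are hypothetical choices, not values from our measurements.

```python
def diagnose(mono_margin, cross_margin, min_margin=0.1):
    """Classify a language from its synonym-minus-unrelated margins.

    min_margin is a hypothetical cutoff, not a value from the measurements.
    """
    mono_ok = mono_margin >= min_margin
    cross_ok = cross_margin >= min_margin
    if mono_ok and cross_ok:
        return "ok"
    if mono_ok:
        return "alignment gap: change the retrieval architecture"
    return "monolingual retrieval broken: check preprocessing and indexing first"

# Margins taken from the measurements in this article.
lao = diagnose(mono_margin=0.34, cross_margin=0.03 - 0.21)
thai = diagnose(mono_margin=0.64, cross_margin=0.45 - 0.21)
print(lao)
print(thai)
```

Lao falls into the "alignment gap" branch, which is the signal to stop tuning tokenizers and normalization and look at the retrieval design instead.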

Implications for Multilingual RAG Design

The assumption that "a multilingual embedding model means any query in any language will match documents in any other language" is an approximation that holds only for high- and medium-resource languages. For users of low-resource languages, cross-lingual retrieval must not be relied upon. What makes this particularly insidious is that the failure occurs silently — no exceptions, no log entries. Similarity scores are simply low, and the system quietly returns unrelated documents or, depending on the threshold, returns nothing at all. To users, it looks like "there are no answers." To operators, it looks like "search is working." If validation is only performed on high-resource languages and no issues surface, no one will notice that low-resource language users are being subjected to a broken search experience. For monitoring purposes, it is advisable to surface per-language hit rates and average similarity scores on a dashboard, and to watch for any language that consistently shows abnormally low values. Beyond monitoring, there are two design-level implications worth considering.

Generalizing to Low-Resource Languages Beyond Lao

The findings here are not specific to Lao. Since the root cause is insufficient training resources for cross-lingual alignment, the same risk applies to any language with sparse training data. When working with languages that have limited text volume on the web, such as Khmer, Burmese, and various regional languages, the same failure should be suspected. Conversely, for high-resource languages such as English, Chinese, and the major European languages, cross-lingual retrieval is generally reliable. The key is to be conscious of whether a given language falls into the high-, medium-, or low-resource category, and to invest proportionally more in individual validation for languages toward the low-resource end. Measuring cross-lingual separation performance each time a new language is added is one extra step, and it is the most reliable way to prevent silent retrieval failures.

How to Build Evaluation Data for Measuring Separation Performance

No specialized benchmark is required. For each language, prepare (1) queries and documents with the same meaning (synonym pairs) and (2) unrelated queries and documents (unrelated pairs), then compare the distributions of similarity scores. If synonym-pair similarities sufficiently exceed those of unrelated pairs, retrieval is functioning; if the distributions converge or invert, retrieval is broken. To evaluate cross-lingual performance, prepare pairs in which the query and document are in different languages. Too few samples produce unstable distributions, so aim for at least several dozen pairs per condition. Including vocabulary and phrases that appear frequently in production yields judgments closer to real-world behavior.

Thresholds for retrieval decisions should also be derived from these distributions. Knowing how high unrelated-pair similarity can reach lets you tune the cutoff for accepting search results per language. A single threshold applied uniformly across all languages risks admitting unrelated documents in languages whose embedding space is compressed. In the validation conducted here, this straightforward "synonym vs. unrelated" comparison alone was sufficient to clearly identify the breakdown in Lao.
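One way to turn the two distributions into a per-condition cutoff is sketched below. The midpoint rule is an illustrative heuristic rather than the method used in our validation, and the scores are made-up stand-ins loosely shaped like the averages reported earlier; a real run would use several dozen pairs per condition.

```python
def accept_threshold(synonym_sims, unrelated_sims):
    """Midpoint between the highest unrelated score and the lowest synonym score.

    Returns None when the two distributions overlap or invert, meaning no
    cutoff can separate them. The midpoint rule is an illustrative heuristic.
    """
    highest_unrelated = max(unrelated_sims)
    lowest_synonym = min(synonym_sims)
    if lowest_synonym <= highest_unrelated:
        return None
    return (highest_unrelated + lowest_synonym) / 2

# Made-up scores per condition: (synonym pairs, unrelated pairs).
conditions = {
    "th query -> ja docs": ([0.47, 0.43, 0.45], [0.20, 0.22, 0.21]),
    "lo query -> lo docs": ([0.82, 0.78, 0.80], [0.44, 0.48, 0.46]),
    "lo query -> ja docs": ([0.02, 0.04, 0.03], [0.20, 0.22, 0.21]),
}
thresholds = {
    name: accept_threshold(syn, unrel) for name, (syn, unrel) in conditions.items()
}
for name, value in thresholds.items():
    print(name, value)
```

The compressed Lao-to-Lao condition ends up with a higher cutoff than Thai-to-Japanese, which is the reason a single global threshold is risky. The inverted Lao-to-Japanese condition yields no usable cutoff at all.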

Solution: Language-Specific Indexes

Even when cross-lingual retrieval is broken, same-language comparison (Lao↔Lao = 0.80) works well. By leveraging this, knowledge is translated into the user's language, stored, and embedded — bringing retrieval into the realm of same-language comparison. This is the concept behind the language-specific index.

Mechanism: Translate and Reduce to Monolingual Comparison

The design comes down to three points. First, knowledge documents are translated into the user's language, then stored and embedded, so that each document has language-specific embedding rows. Second, at retrieval time, the query is matched against embeddings in the user's language (i.e., same-language comparison); if a translation is not yet available, the system falls back to the original-language embedding. Third, for translation, language models that perform well on low-resource language pairs are prioritized, with fallback to an alternative system upon failure. In short, the idea is to avoid broken cross-lingual comparisons wherever a translation exists and to rely on the same-language comparisons that actually function.

The fallback design pays off in both quality and reliability. Knowledge that has not yet been translated is temporarily searchable via its original-language embedding, and once a translation becomes available, the system automatically switches to same-language comparison. This keeps retrieval running continuously, regardless of translation batch progress. Maintaining multiple translation pipelines also prevents a sudden drop in retrieval quality when a specific model fails on a low-resource language pair.
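The three points can be sketched in a few functions. This is a simplified in-memory illustration, not our production implementation: `KnowledgeDoc`, `select_row`, `search`, and `translate_with_fallback` are hypothetical names, the vectors are toy values, and the translators are stand-in callables.

```python
import math
from dataclasses import dataclass

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

@dataclass
class KnowledgeDoc:
    doc_id: str
    source_lang: str
    embeddings: dict  # language code -> embedding vector

def select_row(doc, user_lang):
    """Prefer the row in the user's language; otherwise fall back to the original."""
    if user_lang in doc.embeddings:
        return user_lang, doc.embeddings[user_lang]
    return doc.source_lang, doc.embeddings[doc.source_lang]

def search(query_vec, user_lang, docs):
    """Score every document against the row chosen for this user's language."""
    scored = []
    for doc in docs:
        row_lang, vec = select_row(doc, user_lang)
        scored.append((cosine(query_vec, vec), doc.doc_id, row_lang))
    return sorted(scored, reverse=True)

def translate_with_fallback(text, translators):
    """Try each translator in priority order; return None if all of them fail."""
    for translate in translators:
        try:
            return translate(text)
        except Exception:
            continue
    return None

# Toy data: "tenant-setup" has a Lao row, "user-registration" is not translated yet.
docs = [
    KnowledgeDoc("tenant-setup", "ja", {"ja": [0.9, 0.1], "lo": [0.1, 0.9]}),
    KnowledgeDoc("user-registration", "ja", {"ja": [0.8, 0.3]}),
]
results = search([0.2, 0.8], "lo", docs)

def primary(text):
    raise RuntimeError("primary translator unavailable")

def backup(text):
    return f"[lo] {text}"

translated = translate_with_fallback("tenant creation", [primary, backup])
```

In this toy run, the translated document is matched through its Lao row and the untranslated one through its Japanese row, so both remain searchable while translation is still in progress.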

Validation Results: Translation Path 0.64 Outperforms Native Cross-Lingual 0.42

Lao-language queries (questions about LMS configuration, tenant creation, and user registration) were searched against the corresponding knowledge base, and similarity scores were compared across retrieval paths.

Conclusion: Retrieval via the language-specific index translated into Lao scored 0.64, clearly outperforming cross-lingual retrieval against the original Japanese at 0.42.

Retrieval Path	Similarity
Lao translation (language-specific index)	0.64 ✅
Japanese original (cross-lingual)	0.42

As a supplementary note, the original cross-lingual path still reached 0.42 in this example because shared Latin-script tokens in the query, such as "LMS" and "tenant," contributed to the score. For queries consisting purely of Lao text, similarity via the original-language path drops to approximately 0. For such queries, the gap between the language-specific index and the cross-lingual path would be even wider.

Cost Trade-offs and Phased Rollout

The language-specific index is not a silver bullet. It incurs translation costs (language model calls) and additional embedding costs, and storage grows in proportion to the number of languages maintained. However, it does not need to be applied uniformly across all languages. Applying it to languages where cross-lingual retrieval already works (e.g., English, Thai) is optional; prioritizing it for languages where cross-lingual retrieval breaks down (e.g., Lao) yields the best cost-effectiveness. Furthermore, by designing the system to fall back to the original-language embedding when no translation is available, a phased rollout — covering only some languages or some knowledge documents at first — will not break overall retrieval. Storage growth also remains manageable if the number of target languages is kept limited. If translation rows are maintained for only one or two languages where retrieval breaks down, there is no need to replicate all knowledge documents into all languages. The prudent approach is to start with the most frequently accessed knowledge and the language with the most severe retrieval failures, then expand the scope while measuring the impact. Rather than taking on all costs at once, investment should be made incrementally, within the bounds where retrieval quality improvements are clearly visible.
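A back-of-the-envelope calculation shows how much a limited scope reduces storage. The knowledge base size and document counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
def embedding_rows(total_docs, translated_docs_by_lang):
    """Original-language rows plus one extra row per translated document."""
    return total_docs + sum(translated_docs_by_lang.values())

TOTAL_DOCS = 10_000  # hypothetical knowledge base size

# Replicating every document into three additional languages.
full = embedding_rows(TOTAL_DOCS, {"en": 10_000, "th": 10_000, "lo": 10_000})

# Phased rollout: only the 2,000 most-accessed documents, and only for Lao.
phased = embedding_rows(TOTAL_DOCS, {"lo": 2_000})

print(full, phased)  # 40000 12000
```

Under these assumptions, the phased scope stores 12,000 rows instead of 40,000, and the number of translation calls shrinks in the same proportion.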

FAQ

This section compiles questions that frequently arise in practice when handling low-resource languages such as Lao in multilingual RAG systems.

How Can I Verify Whether Cross-Lingual Retrieval Is Working?

The basic approach is to measure "synonym similarity" and "unrelated sentence similarity" for each target language, and check whether the two are sufficiently separated. Languages where synonym similarity falls below or nearly equals that of unrelated sentences can be judged as failing to support cross-lingual search. Since no errors are thrown, verification through numerical values is essential. In practice, start by establishing baseline values using high-resource languages such as Japanese or English, then review a list comparing all languages against that baseline to quickly identify any that are critically underperforming. By continuously tracking average similarity scores and hit rates by language from search logs, you can also detect degradation in production early.
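The log-based tracking mentioned above can be sketched as a simple aggregation. The log format (query language plus the top result's similarity), the hit threshold, and the sample rows are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def per_language_stats(log_rows, hit_threshold):
    """Aggregate (query_language, top_similarity) rows into per-language metrics."""
    by_lang = defaultdict(list)
    for lang, top_sim in log_rows:
        by_lang[lang].append(top_sim)
    return {
        lang: {
            "queries": len(sims),
            "hit_rate": sum(s >= hit_threshold for s in sims) / len(sims),
            "avg_top_similarity": mean(sims),
        }
        for lang, sims in by_lang.items()
    }

def flag_languages(stats, min_hit_rate):
    """Languages whose hit rate falls below the acceptable minimum."""
    return sorted(lang for lang, s in stats.items() if s["hit_rate"] < min_hit_rate)

# Illustrative log rows, not real production data.
logs = [
    ("ja", 0.62), ("ja", 0.58), ("en", 0.55), ("en", 0.51),
    ("th", 0.47), ("th", 0.40), ("lo", 0.05), ("lo", 0.02), ("lo", 0.31),
]
stats = per_language_stats(logs, hit_threshold=0.30)
flagged = flag_languages(stats, min_hit_rate=0.5)
print(flagged)
```

Feeding these per-language figures into a dashboard makes a language with persistently low values visible even though no error is ever logged.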

Why Do Results Differ Between Thai and Lao Despite Their Similar Scripts?

The orthographic and lexical proximity of two languages is entirely separate from the degree of cross-lingual alignment a given embedding model has learned for each. Thai, as a medium-resource language, has sufficient training coverage to align well with other languages, whereas Lao, being low-resource, has had far less alignment training. Our measurements confirmed this clearly: against Japanese documents, Thai queries scored 0.45 while Lao queries scored only 0.03. Even closely related languages should be evaluated individually.

If Documents Are Stored as Translations, Won't Translation Quality Affect Retrieval Accuracy?

Yes, it does, which is why the design uses a language model with strong coverage of low-resource language pairs for translation and falls back to an alternative pipeline upon failure. That said, our evaluation showed the translation route (0.64) outperforming native cross-lingual retrieval (0.42), so the gain from same-language comparison is substantial even with real-world translation quality.

Summary

The cross-lingual capability of multilingual embedding models depends on the resources available for each language, and it breaks down first in low-resource languages. In our measurements, Thai functioned effectively in cross-lingual retrieval, while the closely related Lao produced similarity scores lower than those of unrelated sentences, making it effectively orthogonal to the other languages. Same-language retrieval, on the other hand, worked well. This is why it is essential to treat same-language search and cross-lingual search as distinct problems and to empirically measure separation performance for each target language. For languages where cross-lingual retrieval breaks down, the appropriate remedy is a language-specific index that stores translated knowledge so that queries are matched by same-language comparison; combining this with an original-text fallback enables safe, incremental rollout. If you are designing multilingual knowledge retrieval, start by measuring "synonym vs. unrelated" across every language you intend to support. We hope the insights from our multilingual knowledge infrastructure prove useful in designing search systems that include low-resource languages.

Author & Supervisor

Chi

Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, generative AI, and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.