
A BPE Tokenizer (Byte-Pair Encoding Tokenizer) is an algorithm that splits text into subword units based on frequently occurring patterns and converts it into a token sequence that an LLM can process. While BPE operates highly efficiently for English, it consumes several times more tokens for the same content in low-resource languages such as Lao, Burmese, and Khmer. This inefficiency not only increases API costs but also directly leads to translation system timeouts and processing delays.
This article is aimed at engineers and tech leads operating multilingual translation systems using LLMs. It explains the mechanism by which BPE tokenizers become inefficient for low-resource languages, and shares practical design countermeasures based on a real Lao translation timeout incident our team encountered.
The efficiency of a BPE tokenizer is strongly dependent on the frequency of a language's appearance in the training corpus; in low-resource languages, byte-level decomposition occurs frequently, causing token counts to balloon. This section digs into the operating principles of BPE and the mechanism by which disparities arise between languages.
BPE (Byte-Pair Encoding) is an algorithm originally developed for data compression, adapted for use in natural language processing. It operates through the following steps:
1. Initialize the vocabulary with individual bytes (or characters), so that any input text can be represented.
2. Count the frequency of every adjacent token pair in the training corpus.
3. Merge the most frequent pair into a single new token and add it to the vocabulary.
4. Repeat steps 2–3 until the target vocabulary size is reached.
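The merge loop can be sketched as a toy implementation. This is character-level for clarity; production BPE tokenizers operate on UTF-8 bytes after pre-tokenization, so treat this only as an illustration of the merge steps.

```typescript
// Toy BPE merge loop, character-level for clarity.
// Real tokenizers operate on UTF-8 bytes after space-based
// pre-tokenization; this sketch only illustrates the merge steps.
function bpeTrain(tokens: string[], numMerges: number): string[] {
  for (let m = 0; m < numMerges; m++) {
    // Count adjacent-pair frequencies.
    const counts = new Map<string, number>();
    for (let i = 0; i < tokens.length - 1; i++) {
      const key = tokens[i] + "\u0000" + tokens[i + 1];
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
    // Pick the most frequent pair (require at least 2 occurrences).
    let best: string | null = null;
    let bestCount = 1;
    for (const [key, count] of counts) {
      if (count > bestCount) { best = key; bestCount = count; }
    }
    if (best === null) break;
    const [a, b] = best.split("\u0000");
    // Merge every occurrence of the pair into one token.
    const merged: string[] = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
        merged.push(a + b);
        i++; // skip the second element of the merged pair
      } else {
        merged.push(tokens[i]);
      }
    }
    tokens = merged;
  }
  return tokens;
}

// Classic example: three merges fuse the repeated "aaab" into one token.
console.log(bpeTrain([..."aaabdaaabac"], 3)); // ["aaab", "d", "aaab", "a", "c"]
```

The key property for the rest of this article: a substring only earns a merged token if it recurs frequently in the training data, which is exactly what low-resource languages lack.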
In English, frequently occurring patterns such as "the," "ing," and "tion" are merged at an early stage, allowing one word to be represented with 1–2 tokens. Japanese hiragana and katakana also undergo a certain degree of merging. However, languages that appear infrequently in the training corpus have few opportunities for merging, and their UTF-8 byte sequences remain as-is.
For example, the English word "the" is 1 token, whereas the equivalent function word in Lao may be decomposed into 6–9 tokens. This difference translates directly into a difference in processing time and cost.
The low token efficiency of Lao stems from four overlapping structural factors.
1. 3 bytes per character in UTF-8
Lao script occupies Unicode positions U+0E80–U+0EFF, consuming 3 bytes per character in UTF-8. If BPE merging has not progressed, a single character can be decomposed into up to 3 tokens. This contrasts sharply with English ASCII characters, which occupy 1 byte each and are typically merged so that a single token covers several characters.
2. Extremely low frequency in the training corpus
BPE vocabulary is built from large-scale corpora such as Common Crawl, ordered by frequency. The volume of Lao text available on the internet is orders of magnitude smaller than English, meaning that dedicated merged tokens for Lao are likely nearly nonexistent. As a result, byte-level fallback decomposition becomes the norm.
3. No word boundaries marked by spaces
Like Thai, Lao does not use spaces to separate words within a sentence. Since BPE pre-tokenization (preprocessing) uses spaces as split points, an entire Lao sentence is treated as a single large pre-token, making efficient segmentation difficult.
4. Additional bytes from tone marks and combining characters
Lao has vowel signs and tone marks placed above, below, before, and after consonants, each of which has its own independent Unicode code point. Representing a single syllable requires multiple code points (i.e., multiple 3-byte characters), further inflating the token count.
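Two of the factors above can be observed directly in a few lines of code. The Lao strings below are illustrative examples chosen by the editor; `TextEncoder` is available in both Node.js and browsers.

```typescript
// Factor 1: each Lao character costs 3 UTF-8 bytes, vs. 1 for ASCII.
const encoder = new TextEncoder();

console.log(encoder.encode("the").length);   // 3 bytes (3 ASCII chars)
console.log(encoder.encode("ແມ່ນ").length);  // 12 bytes (4 code points × 3)

// Factor 3: space-based pre-tokenization leaves a space-free Lao
// sentence as one giant pre-token, while English splits into words.
const preTokenize = (text: string): string[] =>
  text.split(/\s+/).filter(Boolean);

console.log(preTokenize("the quick brown fox").length); // 4 pre-tokens
console.log(preTokenize("ຂ້ອຍກິນເຂົ້າ").length);           // 1 pre-token
```

Note that the 4-code-point Lao word above is a single short word: the combining tone mark alone adds 3 bytes, illustrating factor 4 as well.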
The following is a summary of estimated token consumption when expressing the same content in each language.
| Metric | English | Japanese | Thai | Lao |
|---|---|---|---|---|
| UTF-8 bytes/character | 1 | 3 | 3 | 3 |
| Dedicated tokens in BPE vocabulary | Abundant | Moderate | Few | Very few |
| Tokens/word (estimate) | ~1–2 | ~1–3 | ~4–8 | Significantly more than English |
| Estimated cost multiplier vs. English | 1x | ~1.5x | ~3–5x | Several times or more |
Studies such as Ahia et al. and Petrov et al. (2023), cited in the sections below, provide academic support by quantifying token efficiency disparities across languages.
Publicly available benchmarks specifically for Lao are limited, but given that Lao faces even more disadvantageous conditions than Thai (with even less training data), it is likely that the token consumption multiplier for Lao exceeds that of Thai.
When translating a 28-section SEO article into Lao using our multilingual CMS, processing failed after exceeding the 480-second timeout. The same article completed without issue in English and Thai, but could not finish within the time limit in Lao alone.
Our translation API operated with the following configuration.
```
Translation API (maxDuration: 480 seconds)
├── Metadata translation (title, description, keywords): 3 parallel calls
├── Heading translation: all headings processed in a single batch
└── Body translation: 5 sections × 6 batches → sequential processing
```
The body translation parameters were as follows.
| Item | Value |
|---|---|
| Batch size | 5 sections/call |
| Number of batches | ceil(28 / 5) = 6 |
| maxTokens per batch | min(5 × 3,000, 16,000) = 15,000 |
| Bedrock request timeout | 180 seconds |
In English and Thai, each batch completed in 30–60 seconds, leaving ample margin within the 480-second limit even across all 6 batches. In Lao, however, output token inflation significantly increased processing time per batch, causing the cumulative total across 6 batches to exceed the limit.
Note that as a quality improvement measure for low-resource languages, pivot translation via ja→en→lo (a two-stage translation using English as an intermediate language) had already been introduced. The first stage of the pivot (ja→en) completes quickly, but the second stage (en→lo) is affected by token inefficiency.
The timeout was not caused by a single factor, but by the combination of the following three factors.
Factor 1: Output token inflation
Because the BPE tokenizer does not have sufficient vocabulary for Lao, it generates significantly more tokens than English to express the same content. Since generation token count is roughly proportional to processing time, this is the primary source of delay.
Factor 2: Model generation efficiency
In addition to the increase in token count, the model's internal processing efficiency may also be affected when generating low-resource languages. However, this factor is difficult to isolate independently from the token count increase, and verification through measured logs is required.
Factor 3: Cumulative delay from sequential batch design
In sequential processing of 5 sections at a time, fixed overhead costs such as request initialization, context loading, and network round trips accumulate across all 6 batches. The longer the per-batch processing time for a given language, the more this structural weakness is exposed.
In our case, the combination of these three factors caused a process that takes 3–4 minutes total in English to balloon to over 8 minutes in Lao.
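As a back-of-envelope sketch of how the three factors compound: the per-batch figures below are assumptions chosen for illustration (only the English range of 30–60 s/batch comes from our logs), not measured values.

```typescript
// Rough model: total time = batches × (generation time + fixed overhead).
// Per-batch generation times are assumed for illustration only.
const batches = 6;      // ceil(28 sections / 5 per batch)
const overheadSec = 5;  // request init + network round trip (assumed)
const genSecPerBatch = { en: 35, lo: 80 }; // assumed per-batch generation time

for (const [lang, gen] of Object.entries(genSecPerBatch)) {
  const total = batches * (gen + overheadSec);
  console.log(`${lang}: ${total} s total`); // en: 240 s, lo: 510 s (> 480 s limit)
}
```

Even with identical fixed overhead, a roughly 2× increase in per-batch generation time is enough to push the cumulative total past the 480-second ceiling.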
The core of the solution lies in two approaches: "reducing the number of API calls" and "dynamic parameter design adapted to language characteristics." The following explains countermeasures in order of priority, from those with immediate effect to medium- and long-term improvements.
The most cost-effective approach is to increase the batch size for low-resource languages and reduce the number of API calls.
```typescript
// Definition of low-resource languages (also accommodates future language additions)
const LOW_RESOURCE_LANGS: Set<string> = new Set(["lo", "my", "km"]);

// Dynamic batch size per language
const BODY_BATCH_SIZE = LOW_RESOURCE_LANGS.has(targetLang) ? 14 : 5;
```

This change reduces the number of API calls for a 28-section article from 6 to 2. Continuous generation within a single request incurs less overhead than splitting across multiple requests, so a reduction in total processing time can be expected.
However, since the number of output tokens per request increases, attention must also be paid to the maxTokens ceiling. For low-resource languages, a practical approach is to fix maxTokens at the upper limit (16,000) and derive the optimal value based on measured data.
```typescript
// For low-resource languages, fix at the upper limit to err on the side of safety
const maxTokens = LOW_RESOURCE_LANGS.has(targetLang)
  ? 16000
  : Math.min(batch.length * 3000, 16000);
```

The translation system's timeout is composed of multiple layers, and all layers must be designed in a consistent manner.
| Layer | Before | After | Reason |
|---|---|---|---|
| Vercel Function (maxDuration) | 480 seconds | 800 seconds | Extended to the Pro plan's Fluid Compute upper limit |
| Bedrock HTTP request | 180 seconds | 300 seconds | Individual request time increases due to larger batch sizes |
Extending maxDuration is a temporary measure and carries the risk of recurring if the number of sections increases further in the future. Fundamentally, improving the batch design (reducing the number of API calls) is the primary countermeasure, and it is appropriate to position the timeout extension as a complementary safety net.
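As a configuration sketch, the two layers might look like this in a Next.js route handler with AWS SDK for JavaScript v3. The option names are version-dependent assumptions; verify them against the current Vercel and AWS SDK documentation before relying on them.

```typescript
// app/api/translate/route.ts
// Vercel layer: route segment config; 800 s assumes the Pro plan
// with Fluid Compute enabled.
export const maxDuration = 800;

// Bedrock layer: per-request HTTP timeout via the SDK's request handler.
import { BedrockRuntimeClient } from "@aws-sdk/client-bedrock-runtime";
import { NodeHttpHandler } from "@smithy/node-http-handler";

const client = new BedrockRuntimeClient({
  requestHandler: new NodeHttpHandler({
    requestTimeout: 300_000, // 300 seconds, in milliseconds
  }),
});
```

Keeping the inner (Bedrock) timeout well below the outer (Vercel) one ensures a hung request fails fast enough to leave room for a retry or fallback.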
Expanding the batch size and adjusting timeouts will resolve the immediate issue, but mid-to-long-term improvements should also be considered in preparation for very long articles reaching 40–50 sections, or the addition of languages with even lower token efficiency.
Streaming Translation (as a UX improvement)
Using AWS Bedrock's InvokeModelWithResponseStreamCommand, text can be received in chunks as it is being generated. However, on Vercel, elapsed time during streaming is also counted toward maxDuration, so this does not serve as a timeout workaround. Its proper role is strictly as a means of providing progress feedback to the client (e.g., displaying "Translating: 12/28 sections complete") and improving the user experience.
Full-batch Translation of All Sections
An approach that translates all 28 sections in a single API call, using section-delimiter markers (===SECTION_N===) and parsing the output. Since only one API call is made, fixed overhead is minimized; however, there is a risk of the output being cut off at the maxTokens limit (16,000). A fallback design is required that detects truncated output and translates the remaining sections in a second batch.
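A sketch of the marker-based parsing with fallback detection, following the ===SECTION_N=== convention above. The function name is hypothetical, and this only flags sections that are absent entirely; a final section cut off mid-sentence needs an additional check (e.g., inspecting the stop reason returned by the API).

```typescript
// Split the model output on ===SECTION_N=== markers and report which
// sections are missing entirely (e.g., because output hit the maxTokens
// ceiling), so a fallback call can re-translate just those sections.
function parseSections(
  output: string,
  expected: number
): { sections: string[]; missing: number[] } {
  const sections: string[] = new Array(expected).fill("");
  // Splitting on a capture group interleaves marker numbers and bodies:
  // ["preamble", "1", body1, "2", body2, ...]
  const parts = output.split(/===SECTION_(\d+)===/);
  for (let i = 1; i + 1 < parts.length; i += 2) {
    const idx = parseInt(parts[i], 10) - 1;
    if (idx >= 0 && idx < expected) sections[idx] = parts[i + 1].trim();
  }
  const missing = sections.flatMap((s, i) => (s === "" ? [i + 1] : []));
  return { sections, missing };
}
```

Any section numbers reported in `missing` can then be batched into a second, smaller translation call.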
The same issue as with Lao is highly likely to occur with Burmese, Khmer, and Tibetan as well. Thai is a medium-risk language, but the problem has not materialized under the current timeout settings.
| Language | Script | Risk | Basis |
|---|---|---|---|
| Lao (lo) | Lao script | High | Currently occurring |
| Burmese (my) | Myanmar script | High | Ahia et al. report 4–9× compared to English |
| Khmer (km) | Khmer script | High | Similar script system, insufficient training data |
| Tibetan (bo) | Tibetan script | High | Complex conjunct characters, extremely limited training data |
| Thai (th) | Thai script | Medium | Approximately 3.8× vs. English per the Typhoon report. Within current settings but with little margin |
When planning multilingual expansion, it is desirable to design the system so that simply adding a target language to the LOW_RESOURCE_LANGS set applies all countermeasures at once. Before adding a new language, token consumption should be measured empirically using test text, and appropriate values for batch size and maxTokens should be verified in advance.
It will not be fully resolved. Algorithms other than BPE exist, such as SentencePiece (Unigram) and WordPiece, but all of them share the same dependency on the frequency distribution of the training corpus. If a low-resource language is underrepresented in the training data, vocabulary bias will occur regardless of the algorithm.
An approach that shows promise for improvement is retraining a custom tokenizer with an additional corpus for the target language (as adopted by Typhoon for Thai), but this requires action on the part of the LLM provider and is not an area that API users can directly control. For API users, the practical approach is to design around the differences in token efficiency as a given — through batch tuning and timeout design.
Token efficiency on the input side will improve. In direct ja→lo translation, tokens are consumed reading the Japanese source text. With pivot translation via ja→en→lo, the input for the second stage becomes English (the most token-efficient language), reducing input token consumption.
However, the token inefficiency on the output side (generating Lao text) remains unchanged with pivot translation. Since output token count is the primary driver of processing time, pivot translation alone does not fundamentally resolve the timeout issue. From a quality standpoint, pivot translation offers significant advantages (translation into low-resource languages tends to be more stable in quality when routed through English rather than translated directly), so the recommended approach is to adopt pivot translation for the balance of quality and speed while controlling processing time through batch design.
Costs rise accordingly. Many LLM APIs use pay-as-you-go pricing based on input and output token counts, meaning that if the same content requires a different number of tokens depending on the language, costs scale proportionally.
Petrov et al. (2023) identify this issue as "cross-linguistic inequity." For example, a Lao-speaking user ends up paying several times more than an English-speaking user to process the same amount of information.
Options available to API users are limited, but the following are worth considering: measuring per-language token consumption in advance so cost multipliers can be budgeted explicitly, routing input through a token-efficient pivot language (as with ja→en→lo), and batch tuning to amortize fixed per-request overhead.
BPE tokenizers perform highly efficiently for high-resource languages centered on English, while structural inefficiencies are unavoidable for low-resource languages such as Lao. In LLM-based translation systems, these inefficiencies manifest as timeouts, increased costs, and processing delays.
There are three key countermeasures:
1. Expand the batch size for low-resource languages to reduce the number of API calls — the highest-impact change.
2. Design timeout settings consistently across every layer (Vercel maxDuration, Bedrock request timeout), treating extensions as a safety net rather than the primary fix.
3. Measure token consumption for each new language in advance and fold the result into a single switch (the LOW_RESOURCE_LANGS set) so all countermeasures apply uniformly.
When expanding multilingual support, it is recommended to verify the token efficiency of target languages in advance and incorporate an operational workflow for adding them to the parameter set for low-resource languages (LOW_RESOURCE_LANGS).
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).
Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.