
Combining an on-premises LLM with model distillation means compressing the knowledge held by a large language model (the teacher) into a smaller model (the student) so that it can run entirely on internal servers without relying on external cloud services. This article is aimed at IT system administrators and AI engineers who cannot send data to cloud APIs. It explains the process from environment setup to production deployment of distilled models, covering model selection, data construction, training, and failure avoidance, in that order. Technical terms are explained as they appear, so readers can get a clear overall picture even without a background in machine learning.
Model distillation is the solution that simultaneously addresses two challenges: the constraint of not being able to send internal data outside the organization, and the high cost of running large models on-premises. A student model that has been reduced in size through distillation can realistically run on a company's own GPUs, and any data entered never leaves the organization. This section begins by outlining the risks of cloud usage and the problems that distillation solves.
When using a cloud LLM API, internal documents and customer information included in prompts are transmitted to external servers. This entails multiple risks: the possibility that input data may be used by the provider for training or quality improvement (depending on the contract plan), the retention of logs for a certain period, the storage of data in overseas data centers, and the possibility that terms of service may change in the future.
For companies in industries such as finance, healthcare, and manufacturing that handle drawings, source code, and regulated personal data, even if an enterprise contract stipulates that the provider will "not use data for training," the very fact that "data physically leaves the organization" still creates accountability obligations for audits and regulatory compliance. With on-premises operation, inference can be physically isolated from the network, structurally guaranteeing that data remains within the organization. This is the fundamental reason why on-premises deployments are chosen in environments with stringent security requirements.
If one assumes that "running a large model on-premises is sufficient to keep everything in-house," the next obstacle encountered is cost. Running a state-of-the-art large-scale model on-premises as-is requires multiple expensive GPUs with hundreds of gigabytes of VRAM, which is not realistic for most companies.
Model distillation compresses the knowledge of this large model (teacher) into a smaller model (student) with a scale of several billion parameters, dramatically reducing the hardware requirements. The trade-off is that general-purpose performance falls short of the teacher model, but when narrowed down to the specific tasks a company uses, it is relatively easy to maintain practical accuracy. While the fixed cost of GPU procurement remains, per-token usage fees disappear, making the approach increasingly advantageous in terms of total cost of ownership (TCO) the more frequently it is used. Since the break-even point depends on the chosen model and usage volume, it is important to calculate estimates based on your organization's own usage.
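The break-even reasoning above can be sketched as a small calculation. This is a minimal illustration only: the function name, the cost model (a one-time GPU purchase plus a flat monthly operating cost, compared against per-token API fees), and every figure are hypothetical assumptions, not numbers from any provider.

```python
def break_even_months(gpu_capex, monthly_onprem_opex,
                      monthly_tokens, api_price_per_million_tokens):
    """Months until the GPU purchase is paid back by avoided API fees.

    Returns None when on-prem operating cost is not lower than the API bill,
    in which case the purchase never pays for itself under this simple model.
    """
    monthly_api_cost = monthly_tokens / 1_000_000 * api_price_per_million_tokens
    monthly_saving = monthly_api_cost - monthly_onprem_opex
    if monthly_saving <= 0:
        return None
    return gpu_capex / monthly_saving


# Hypothetical figures: replace with your own quotes and usage logs.
heavy_usage = break_even_months(
    gpu_capex=20_000.0,              # one-time hardware cost
    monthly_onprem_opex=500.0,       # power, hosting, maintenance
    monthly_tokens=500_000_000,      # tokens processed per month
    api_price_per_million_tokens=5.0,
)
light_usage = break_even_months(20_000.0, 500.0, 50_000_000, 5.0)

print("heavy usage:", heavy_usage)   # 10.0 months
print("light usage:", light_usage)   # None: API stays cheaper
```

The light-usage case shows why the estimate has to be run on your own volume: below a certain usage level the fixed cost is never recovered.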
Small Language Models (SLMs: models in the range of hundreds of millions to tens of billions of parameters) offer faster inference and lower response latency due to their smaller parameter count. They are well-suited for handling batch processing and concurrent requests in on-premises environments, and can be made even lighter when combined with quantization.
In terms of accuracy, for narrowly scoped tasks such as classification, extraction, summarization, and internal document QA, distilled small models can often achieve quality close to that of large models. On the other hand, for use cases requiring free-form reasoning or extended chains of thought, the gap between small and large models tends to persist. What matters here is not to chase high scores on general-purpose benchmarks, but to first define the "accuracy level required for your organization's tasks" and measure it against your own evaluation data. The extent to which speed or accuracy should be prioritized varies depending on the use case.
Before "just trying out" distillation, you need to solidify three things: hardware, licensing, and data quality. Leaving these vague tends to result in legal roadblocks after training, or having to start over due to poor accuracy.
Distillation training uses GPUs for both the teacher model's inference (soft label generation) and the student model's training. Production use (inference only) is less demanding than training, and quantization can reduce requirements further.
| Phase | Primary Load | Notes |
|---|---|---|
| Teacher inference (soft label generation) | Proportional to teacher size | Large teachers require substantial VRAM for inference |
| Student training | Student size + batch + gradients | Can be reduced with PEFT |
| Production inference | Student size only | Minimized through quantization |
The minimum requirements are determined by the student model's size; for models in the multi-billion parameter range, a single 24GB-class GPU is often a viable starting point. Memory and storage requirements scale with dataset size. Since the specific VRAM needed varies greatly depending on the model and batch size, the safe approach is to run a small-scale PoC (proof of concept) first, measure actual usage, and then scale from there. Avoid committing to a large-scale setup from the outset.
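Before running the PoC, a back-of-the-envelope estimate of the memory needed just to hold the model weights can help pick a starting GPU. The sketch below assumes the usual storage sizes per parameter for each data type. The 7-billion-parameter model is a hypothetical example, and the estimate deliberately ignores activations, the KV cache, optimizer state, and the extra metadata that real quantization formats store, so measured usage will be higher.

```python
# Approximate bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Memory for the weights alone, in GiB (excludes activations and KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


# Hypothetical 7-billion-parameter student model.
NUM_PARAMS = 7e9
for dtype in ("fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Treat the result as a lower bound for sizing the first experiment, then replace it with the figures measured during the PoC.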
The choice of teacher model alone can determine whether legal risk exists. The first pitfall to be aware of is using outputs from commercially available API-based models directly as teacher signals to train your own model. Major providers explicitly restrict the use of their outputs for "developing competing models" in their terms of service, and doing so may constitute a violation (both OpenAI and Anthropic prohibit using outputs to train competing or imitation models). Cases of large-scale terms of service violations have in fact been reported.
The practical way to avoid this risk is to use open-source models with licenses that permit commercial distillation as the teacher. However, even open-source models vary in their licensing. MIT and Apache 2.0-based models (such as DeepSeek, Qwen, Mistral, and Phi) are relatively permissive and commercially friendly, while Llama models use Meta's proprietary community license, which requires separate authorization for operators with extremely large monthly active user counts and includes regional restrictions. Gemma requires agreement to Google's terms of use. When making your selection, always verify the primary source—each model's license page on its official repository or distribution site—and run it through legal review. License terms are subject to updates, so do not rely on past information as-is.
Post-distillation accuracy is largely determined not by the quality of the teacher signal itself, but by the data design—that is, what you are training the student to learn. Preprocessing involves deduplication, removal of noise and obvious errors, assignment of sensitivity labels, and format normalization.
The two quality criteria to prioritize are: whether the data reflects the actual distribution of real-world tasks (representativeness), and whether labels and formatting are consistent. It is easy to focus on collecting volume, but a small amount of high-quality data frequently outperforms a large amount of mixed-quality data. In practice, you should never rely solely on automated processing—always include a step where samples are manually reviewed to confirm that the inputs genuinely reflect what comes up in actual operations. Anonymization of data containing personal information will be covered in detail in a later step.
Select the teacher as "the model with a usable license for your organization that performs well on the target task," and select the student by working backwards from "the size that can be run in production on your own GPUs." The axis of selection is optimization within constraints, not maximization of performance.
In short, the safest approach is to use license permissiveness as the primary gate, then narrow down by task suitability and model size.
| Model Family | License Tendency | Characteristics |
|---|---|---|
| Qwen series | Apache 2.0-based | Multilingual support, wide range of sizes |
| Mistral series | Apache 2.0 | Lightweight and highly efficient |
| Phi series | MIT | Small-model focused, low inference cost |
| Gemma series | Requires agreement to Google Terms of Use | Commercial use permitted after agreement |
| DeepSeek series | MIT, etc. | High performance, but license verification required |
| Llama series | Meta proprietary (restrictions for large-scale operators) | Broad ecosystem |
※ Licenses and terms are subject to change; always verify the latest primary sources at the time of selection.
The selection criteria are easiest to evaluate in the following order: ① License (whether commercial distillation is permitted), ② Accuracy on your own tasks (evaluated on your own data), ③ Size (whether it fits on your production GPU), ④ Japanese/multilingual support, and ⑤ Community activity (ease of accessing information and updates). Prioritize fit with your own constraints over absolute benchmark scores.
Student models are better suited to being built as small, purpose-specific models rather than as scaled-down versions of large general-purpose models. The following outlines typical use cases and their corresponding scale requirements. For document classification and key information extraction, models in the hundreds of millions to a few billion parameter range are often sufficient. For internal document QA, a practical configuration is to combine RAG (Retrieval-Augmented Generation) with a mid-sized student model handling the generation component. Summarization calls for a mid-sized model, while code completion is best built on a code-specialized pretrained model as its foundation.
"Specialized" here refers to an existing small model that has been distilled and fine-tuned on your own tasks. Rather than trying to replicate the full capabilities of a general-purpose chat model, focusing on the one or two tasks actually used in your operations makes it easier for even a small model to reach a practical level of performance. A design philosophy of not overreaching ultimately leads to achieving both lower operational costs and sufficient accuracy.
In short, the standard approach is to select the minimum model size that meets your requirements along two axes: "acceptable latency" and "required accuracy."
| Business Requirement | Recommended Size Range | Rationale |
|---|---|---|
| Real-time response (conversational) | Small to medium | Latency is the priority |
| Batch processing (e.g., overnight aggregation) | Medium to large | Accuracy is the priority; speed is secondary |
| Simple classification or extraction | Small | Narrow task scope |
| Complex reasoning or long-form generation | Large, or combined with cloud | Small models tend to hit their limits |
The mapping process follows these steps: ① enumerate the tasks your organization will use, ② quantify the accuracy and latency requirements for each task as concretely as possible, and ③ start with the smallest model size and scale up incrementally through PoC. Selecting a large model from the outset unnecessarily burdens you with both cost and latency. Stopping at the size that meets your requirements is what pays off in real-world operations.
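The "start small and stop at the first size that meets the requirements" rule can be expressed as a simple selection over PoC measurements. This is a sketch under stated assumptions: the `PocResult` structure, the model names, and all measurements are hypothetical placeholders for your own PoC results.

```python
from dataclasses import dataclass


@dataclass
class PocResult:
    name: str
    params_b: float        # model size in billions of parameters
    accuracy: float        # measured on in-house evaluation data
    p95_latency_ms: float  # measured on the production-class GPU


def smallest_model_meeting(results, min_accuracy, max_latency_ms):
    """Return the smallest model that satisfies both requirements, or None."""
    for result in sorted(results, key=lambda r: r.params_b):
        if result.accuracy >= min_accuracy and result.p95_latency_ms <= max_latency_ms:
            return result
    return None


# Hypothetical PoC measurements for three candidate student sizes.
poc_results = [
    PocResult("student-7b", 7.0, 0.93, 420.0),
    PocResult("student-1b", 1.0, 0.84, 90.0),
    PocResult("student-3b", 3.0, 0.91, 180.0),
]

chosen = smallest_model_meeting(poc_results, min_accuracy=0.90, max_latency_ms=300.0)
print(chosen.name if chosen else "no candidate meets the requirements")
```

A `None` result is itself useful: it signals that the requirements need to be relaxed, the task scope narrowed, or a larger model considered.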
Distillation data should be constructed in two layers: "teacher outputs (soft labels)" and "your own ground-truth data." For both, having the data reflect the distribution of inputs that actually arrive in production is the key to quality.
Soft labels refer to the probability distributions that a teacher model assigns to each class or token. Unlike hard labels, which assume a single correct answer, soft labels contain information about how confidently the teacher views each candidate option.
The generation procedure is as follows. ① Prepare representative input data. ② Run inference with the teacher model and save the output logits or probability distributions (at this stage, raise the temperature parameter described later to smooth out the distribution). ③ Use these as the learning targets for the student. Because soft labels contain information that approximates "why the teacher made that judgment," they tend to improve the student's generalization performance compared to training on hard labels alone. However, since generating soft labels incurs the inference cost of the teacher model, proceed with an eye toward balancing the required data volume against available computational resources.
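To make step ② concrete, the following dependency-free sketch turns teacher logits into a soft label with a temperature-scaled softmax. The logits are made-up values for a three-class example. In a real pipeline they would come from the teacher model's forward pass, typically as tensors rather than Python lists.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a flatter distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for one input over three classes.
teacher_logits = [4.0, 2.0, 0.5]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```

At the higher temperature the top class keeps its rank but gives up probability mass to the other classes, which is the extra signal the student learns from.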
To convert documents scattered across the organization (PDFs, Office files, internal wikis, support tickets, etc.) into training data, build a step-by-step pipeline. ① Extraction: Convert to text using parsers or OCR (for PDFs saved as images, using an LLM for text conversion is effective). ② Cleansing: Remove headers, footers, and duplicates. ③ Structuring: Format into QA format or "instruction–response" format. ④ Confidentiality labeling: Assign data classification labels. ⑤ Splitting: Divide into training and validation sets.
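The cleansing, structuring, and splitting steps can be sketched with the standard library alone. This sketch assumes that text extraction has already happened and that each item arrives as an (instruction, response) pair. The record fields, the exact-match deduplication, and the hash-based split are illustrative choices, and confidentiality labeling is left out.

```python
import hashlib
import re


def clean(text):
    """Collapse runs of whitespace left over from extraction."""
    return re.sub(r"\s+", " ", text).strip()


def build_records(raw_pairs):
    """Normalize pairs, drop empty ones, and remove exact duplicates."""
    seen = set()
    records = []
    for instruction, response in raw_pairs:
        instruction, response = clean(instruction), clean(response)
        if not instruction or not response:
            continue
        key = hashlib.sha256(f"{instruction}\x1f{response}".encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        records.append({"id": key, "instruction": instruction, "response": response})
    return records


def split(records, validation_percent=10):
    """Deterministic split: a record always lands in the same set across runs."""
    train, validation = [], []
    for record in records:
        bucket = int(record["id"][:8], 16) % 100
        (validation if bucket < validation_percent else train).append(record)
    return train, validation


# Hypothetical extracted pairs, including a near-identical duplicate and an empty item.
raw_pairs = [
    ("How do I reset  my password?", "Use the self-service portal."),
    ("How do I reset my password?", "Use the self-service portal."),
    ("", "Orphaned response with no question."),
]
records = build_records(raw_pairs)
train_set, validation_set = split(records)
print(len(records), len(train_set), len(validation_set))
```

Splitting by content hash rather than by random shuffle keeps a record from drifting between training and validation when the pipeline is rerun on updated data.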
For RAG use cases, the focus is on chunk splitting and embedding generation; for distillation use cases, it centers on formatting into instruction-response pairs. Even when the process is automated, always manually spot-check samples of the resulting data. Experience shows that the most common cause of accuracy degradation is not a flaw in a sophisticated algorithm but garbage data contaminating the training set.
Even in a fully on-premises setup, if training data contains personal information or confidential content, the risk remains that the model will memorize it and output it later. Address this in three stages. ① Detect and mask or pseudonymize PII (names, contact details, account numbers, etc.). ② Based on confidentiality classification, selectively decide whether to include data in training at all. ③ After training, conduct red-team-style testing to verify the model does not leak confidential information.
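As an illustration of stage ①, the sketch below masks two easy-to-pattern PII types with regular expressions. The patterns and placeholder tokens are simplified assumptions for demonstration. They will miss many real-world formats, and names or addresses generally need a dedicated detection tool or model plus human review.

```python
import re

# Simplified illustrative patterns; not exhaustive for real-world data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_pii(text):
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


# Hypothetical support-ticket sentence.
sample = "Contact somchai@example.com or +66 2 123 4567."
masked = mask_pii(sample)
print(masked)  # Contact [EMAIL] or [PHONE].
```

Masking of this kind covers only detection and replacement. The decision to exclude whole documents by confidentiality class and the post-training leak tests remain separate steps.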
Under regional data protection laws such as Thailand's PDPA, there are restrictions on using and storing personal data beyond its original purpose, so you need to clarify, together with the legal basis, whether using data for AI training falls within the purpose for which it was collected. Even anonymized data may allow individuals to be re-identified when fragments are combined, so it is advisable to include a re-identification risk assessment in your data handling process.
The core of training lies in two things: designing the distillation loss (the loss that makes the student resemble the teacher), and monitoring to prevent overfitting. More than the act of running commands, it is the design of these two elements that ultimately determines accuracy.
The distillation loss function is generally designed by combining two terms. The first is a term that brings the student's output closer to the teacher's soft labels, for which KL divergence—which measures the distance between probability distributions—is commonly used. The second is the standard cross-entropy that brings the student's output closer to the ground truth (hard labels).
This is where the temperature parameter T comes into play. T adjusts the smoothness of the softmax; the higher the value, the more the teacher's probability distribution is smoothed out, making it easier to transfer the "implicit knowledge" of inter-class relationships to the student. In practice, T and the weighting coefficients of the two terms (such as α) are tuned through hyperparameter search. If T is too high, information becomes diluted; if too low, it approaches hard labels—so the optimal point is found by monitoring accuracy on the validation set. Even without fully understanding the theory, knowing that "T and the weights are parameters to be tuned" is enough to get started with implementation.
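The two-term loss can be written out for a single example in plain Python. This is a sketch, not a reference implementation: multiplying the KL term by T² is a common convention for keeping gradient magnitudes comparable across temperatures and is an assumption here, as are the default values of `temperature` and `alpha` and the example logits. A real training loop would compute the same quantities on batched tensors.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    teacher_logp = log_softmax(teacher_logits, temperature)
    student_logp = log_softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(math.exp(t) * (t - s) for t, s in zip(teacher_logp, student_logp))
    # Standard cross-entropy against the hard label, at temperature 1.
    ce = -log_softmax(student_logits)[true_class]
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


# Hypothetical logits for one three-class example whose true class is 0.
teacher = [4.0, 2.0, 0.5]
good_student = [3.5, 2.2, 0.4]
poor_student = [0.5, 2.0, 4.0]

print("close to teacher:", round(distillation_loss(good_student, teacher, 0), 4))
print("far from teacher:", round(distillation_loss(poor_student, teacher, 0), 4))
```

Setting `alpha=1.0` trains on the teacher signal alone and `alpha=0.0` reduces to ordinary supervised training, which makes the two extremes convenient sanity checks during hyperparameter search.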
On-premises training is commonly built on PyTorch and the Hugging Face library ecosystem. Tools such as Accelerate and DeepSpeed are used in conjunction for distributed training. The core of the process is as follows:
1. Load the teacher model and student model
2. Pre-generate soft labels (or distill on-the-fly)
3. Define a custom distillation loss combining KL divergence + cross-entropy
4. Run the training loop and save checkpoints based on validation metrics
As a guide for configuration: start with a small learning rate, secure an effective batch size via gradient accumulation adjusted to VRAM, and save memory with mixed precision (fp16/bf16). To reduce the cost of fine-tuning the student, combine with PEFT methods such as LoRA. In a fully offline environment, models and dependency packages must be fetched in advance to an internal mirror. Since specific commands and arguments vary by framework version, always refer to the official documentation for the relevant version during implementation.
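The relationship between the per-step batch that fits in VRAM and the effective batch size comes down to simple arithmetic, sketched below. The function name and the example numbers are hypothetical, and the mapping onto a specific framework's configuration options is left to that framework's documentation.

```python
import math


def accumulation_steps(target_effective_batch, micro_batch_size, num_gpus=1):
    """Gradient accumulation steps needed to reach the target effective batch size.

    effective batch = micro_batch_size * num_gpus * accumulation steps
    """
    samples_per_step = micro_batch_size * num_gpus
    return max(1, math.ceil(target_effective_batch / samples_per_step))


# Hypothetical setup: only 4 samples fit per GPU, and the target effective batch is 64.
print(accumulation_steps(64, micro_batch_size=4, num_gpus=1))  # 16
print(accumulation_steps(64, micro_batch_size=4, num_gpus=2))  # 8
```

When the target is not an exact multiple, rounding up gives a slightly larger effective batch than requested, which is usually the safer direction.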
During training, monitor multiple metrics in parallel: the loss for both training and validation, a breakdown of distillation loss versus task loss, and task accuracy on your in-house evaluation data (such as accuracy or F1).
The basic rule for early stopping is to halt training once validation loss has bottomed out and started to rise, since a sustained rise is a sign of overfitting. That said, relying on loss alone is risky. Even while loss is decreasing, the quality of actual outputs can degrade, so make it a habit to check real task metrics and inspect output samples every epoch. Retain multiple checkpoints and ultimately adopt the one with the best validation metrics. It is easy to overlook that the last epoch is not necessarily the best, and shipping it by default is a surprisingly common mistake.
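A patience-based version of this rule can be sketched as follows. The `patience` value and the loss history are hypothetical, and the function looks at validation loss only. In practice the same logic can be applied to a task metric such as F1 (with the comparison reversed), alongside manual inspection of outputs.

```python
def pick_checkpoint(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a per-epoch validation loss history.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the checkpoint to keep is the best epoch, not the last.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation loss per epoch: improves, then starts to overfit.
history = [0.90, 0.70, 0.62, 0.65, 0.66, 0.71]
best, stopped = pick_checkpoint(history, patience=2)
print(f"keep checkpoint from epoch {best}, training stopped at epoch {stopped}")
```

Here training stops at epoch 4 but the checkpoint kept is from epoch 2, which is why checkpoints from several epochs need to be retained.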
The two typical stumbling blocks in distillation come down to "insufficient accuracy (the gap with the teacher)" and "overfitting to in-house data." Knowing about them in advance means most cases can be avoided at the design stage.
When the student is too small or the teacher is too large, the student may be unable to absorb the knowledge, causing a significant drop in accuracy. This is a phenomenon known as the "capacity gap." There are several options for addressing it: ① increase the student model size by one tier; ② narrow the task scope, sacrificing generality in favor of specialization; ③ use progressive distillation (teacher-assistant) by inserting an intermediate-sized model; ④ increase the volume or improve the quality of distillation data; ⑤ incorporate methods that align not only outputs but also intermediate-layer features.
The key is to use your evaluation data to identify which tasks are losing accuracy before choosing a countermeasure, rather than applying fixes indiscriminately. Whether all tasks degrade uniformly or only specific tasks are affected changes which countermeasures will be effective.
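One way to see where the drop is concentrated is to compute the teacher-student accuracy gap per task on the same evaluation set. The sketch below assumes each evaluation row records the task name and whether the teacher and the student answered correctly. The task names and results are hypothetical.

```python
from collections import defaultdict


def per_task_gap(rows):
    """Map each task to (teacher accuracy - student accuracy).

    rows: iterable of (task, teacher_correct, student_correct) tuples.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [examples, teacher hits, student hits]
    for task, teacher_ok, student_ok in rows:
        entry = counts[task]
        entry[0] += 1
        entry[1] += int(teacher_ok)
        entry[2] += int(student_ok)
    return {task: (t_hits - s_hits) / n for task, (n, t_hits, s_hits) in counts.items()}


# Hypothetical evaluation results on two tasks.
eval_rows = [
    ("classification", True, True),
    ("classification", True, True),
    ("summarization", True, False),
    ("summarization", True, True),
]
gaps = per_task_gap(eval_rows)
for task, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{task}: gap {gap:.2f}")
```

A gap confined to one task points toward narrowing scope or adding data for that task, while a uniform gap suggests a capacity problem.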
Overfitting to a limited set of in-house data produces a model that performs well on training data but breaks down with slightly different inputs. Signs of this include rising validation loss and accuracy degradation on phrasings not present in the training data.
Mitigation strategies include: ① ensuring data diversity (broadly covering the input distribution of real business operations); ② applying regularization (dropout, weight decay, early stopping); ③ mixing general-purpose data with in-house data during training to preserve fundamental language capabilities; and ④ conducting periodic re-evaluation and retraining as needed. Even after production deployment, monitor for "drift," meaning accuracy degradation caused by shifts in input patterns. One final operational point: a distilled model is not a one-time deliverable. Building a system that continuously cycles through evaluation data updates and re-distillation is the prerequisite for an on-premises AI that remains viable over the long term.
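Drift monitoring can start as a rolling accuracy check over reviewed production samples. This is a minimal sketch: the `DriftMonitor` class, the window size, and the tolerance are hypothetical choices, and it assumes some production outputs are periodically labeled as correct or incorrect.

```python
from collections import deque


class DriftMonitor:
    """Flags when rolling accuracy falls below the baseline minus a tolerance."""

    def __init__(self, baseline_accuracy, window=200, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)

    def record(self, correct):
        """Add one reviewed outcome; return True if re-evaluation should be triggered."""
        self.outcomes.append(1 if correct else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough samples yet
        rolling = sum(self.outcomes) / len(self.outcomes)
        return rolling < self.baseline - self.tolerance


# Hypothetical stream of reviewed outputs with a tiny window for illustration.
monitor = DriftMonitor(baseline_accuracy=0.90, window=4, tolerance=0.05)
alerts = [monitor.record(ok) for ok in (True, True, True, True, False, False)]
print(alerts)
```

An alert here is a trigger to refresh the evaluation data and consider re-distillation, not proof of drift by itself.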
Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.