
Context engineering is a technique for comprehensively designing the type, structure, order, and volume of information passed to an LLM. It is gaining attention as a way to break through accuracy ceilings that refining prompt wording alone cannot overcome, by treating the problem as one of information design.
The target audience includes AI product developers, prompt designers, and engineers looking to integrate LLMs into their workflows. It is particularly useful for those facing challenges such as "response accuracy doesn't improve even after refining prompts" or "context breaks down in complex tasks."
This article systematically covers the foundational concepts of context engineering, the design principles of information selection, compression, and placement, and practical implementation steps. By the end, readers should have a concrete understanding of where to improve information design in their own LLM deployments.
Conclusion: Context engineering is a technique for designing and optimizing the entire body of information passed to an LLM, enabling accuracy improvements that go beyond refining prompts in isolation.
This concept is organized around three perspectives: how it differs from prompt engineering, what range of information "context" refers to, and why it is attracting attention now.
It is easy to assume at first that "writing prompts more carefully will improve accuracy," but in practice, the overall information design—what to pass, in what order, and how much—tends to have a far greater impact on LLM output quality than the wording of the prompt itself.
This shift in perspective is the fundamental difference between prompt engineering and context engineering.
The distinction between the two can be summarized as follows:
Consider, for example, building an automated customer support system. No matter how polished the prompt wording becomes, if the customer's past order history and inquiry background are not included in the context, the LLM will continue to return off-target responses. The root of the problem lies not in the "quality of the instructions" but in the "absence of information."
In a blog post published in July 2025, the LangChain team categorized the primary strategies for context management into four types: "Write / Select / Compress / Isolate." This represents a design layer clearly distinct from prompt optimization.
Context refers to all information that an LLM can reference when performing inference. It encompasses not only the text written in the prompt, but a broader range of elements.
The main components that make up context are as follows:
For simple Q&A tasks, a system prompt and user input are often sufficient. For agentic tasks spanning multiple steps, however, conversation history, tool execution results, and external knowledge must be managed in combination. Which elements to include in the context depends on the complexity of the task and the required level of accuracy.
Crucially, all of these elements share a finite resource: the token window. According to official Google Cloud information, current models support windows of one to two million tokens, but this is not unlimited.
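The idea of a shared, finite budget can be illustrated with a minimal sketch. The names `ContextPart`, `estimate_tokens`, and `fit_to_budget` are hypothetical, and the word-count estimate is only a crude stand-in for a real tokenizer; the point is simply that every component competes for the same budget.

```python
from dataclasses import dataclass


@dataclass
class ContextPart:
    name: str      # e.g. "system_prompt", "history", "retrieved_docs"
    text: str
    priority: int  # hypothetical convention: lower number = more important


def estimate_tokens(text: str) -> int:
    # Assumption: counts whitespace-separated words, not real model tokens.
    return len(text.split())


def fit_to_budget(parts: list[ContextPart], budget: int) -> list[ContextPart]:
    """Keep the most important parts that fit inside one shared token budget."""
    kept, used = [], 0
    for part in sorted(parts, key=lambda p: p.priority):
        cost = estimate_tokens(part.text)
        if used + cost <= budget:
            kept.append(part)
            used += cost
    return kept
```

In this sketch a long conversation history is dropped first when the budget is tight, because it was assigned the lowest priority. A real system would more likely compress it than discard it.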
"No matter how much I refine my prompts, accuracy just won't improve"—many developers have had this experience. The growing attention to context engineering is driven by two converging trends: advances in LLM performance and the increasing complexity of real-world requirements.
The main factors that have drawn attention to the field are as follows:
The LangChain team's blog post from July 2025 also systematized context management into four strategies—"Write / Select / Compress / Isolate"—reflecting the formation of a shared vocabulary across the industry.
Conclusion: Simply refining prompt wording is insufficient for LLMs to properly receive the information they need, and there are structural limitations to improving accuracy this way.
Token window constraints, missing context, and inadequate handling of complex tasks: these three problems are difficult to resolve through prompt improvements alone. The subsections that follow dig into the specific reasons why.
As context windows expand, it's tempting to think you can simply pack in all available information. In practice, however, there are reported cases where indiscriminately increasing information actually degrades LLM response accuracy.
The crux of the problem lies in information density. The token window refers to the maximum number of tokens (the subword units into which a model splits text) that a model can process at one time. According to official Google Cloud information, some of today's leading models now feature vast windows of one million to two million tokens. However, having a large window and being able to accurately use the information within it are two different things.
In concrete terms, the following problems tend to arise:
In a blog post published in July 2025, the LangChain team presented four categories of context management strategy: "Write / Select / Compress / Isolate." The implication is that simply adding more information to the window is not enough. Saving information outside the window for later use (Write), selecting what to bring in (Select), compressing it (Compress), and isolating it across separate contexts (Isolate) are all essential operations.
It is more appropriate to think of the token window not as a "capacity" but as a "stage." The more unnecessary props you place on a stage, the more the lead actor's performance fades into the background.
When context is missing, the problem is that LLMs do not respond with "I don't know" — instead, they generate a "plausible-sounding answer" from incomplete information.
There are three main patterns in which incorrect responses tend to occur:
The impact depends on the type of task. For a one-off question-and-answer exchange, the impact of missing context can be kept small. For tasks that span multiple steps or involve judgment in specialized domains, however, missing context tends to amplify errors in a cascading fashion.
What these patterns have in common is that the root cause is not "insufficient model capability" but rather "poor information design." By properly designing the type, order, and granularity of the information provided, response quality can vary significantly even with the same model. Before adjusting the wording of a prompt, the faster path to improved accuracy is to first diagnose what is missing from the context.
Have you ever wondered, "I keep refining my prompts, yet the more complex the task gets, the more inconsistent the output becomes. Why?"
When you try to process a sequence like "requirements definition → design → code generation → test specification writing" as a continuous flow, the limitations of prompt design tend to become apparent. There are three main reasons. First, a single prompt cannot retain state (that is, which step is currently in progress), so as steps progress, prior context is lost and contradictory outputs become more likely. Second, there is no mechanism to dynamically pass the output of one step as the input to the next, so users tend to end up manually copying and pasting between steps. Third, when multiple constraints and roles are packed into a single prompt, the model struggles to determine which instruction to prioritize and tends to return ambiguous responses.
A technical article published by Anthropic also emphasizes the importance of context management in long-running tasks, and introduces a configuration in which sub-agents ultimately return a summary of approximately 1,000 to 2,000 tokens. This is a good example demonstrating that handling complex tasks requires designing the structure of information and how it is passed — not just the prompt itself.
Prompt design is, at its core, a technique for optimizing "a single query." What complex tasks require is a design approach focused on how to select, compress, and determine the timing for delivering information to the model — in other words, the perspective of context engineering.
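A minimal sketch of the alternative to manual copy and paste is shown below: each step receives the task, the name of the current step, and the outputs of all earlier steps. The function `run_pipeline`, the `STEPS` list, and the `call_llm` callable are hypothetical names introduced for illustration; `call_llm` stands in for whatever model client is actually used.

```python
from typing import Callable

# Step names mirror the example flow in the text (assumed labels).
STEPS = ["requirements", "design", "code", "test_spec"]


def run_pipeline(task: str, call_llm: Callable[[str], str]) -> dict[str, str]:
    """Run the steps in order, feeding each one the outputs of earlier steps."""
    outputs: dict[str, str] = {}
    for step in STEPS:
        # Explicit state: which step we are on, plus everything produced so far.
        prior = "\n".join(f"[{name}]\n{text}" for name, text in outputs.items())
        prompt = f"Task: {task}\nCurrent step: {step}\nPrior outputs:\n{prior}"
        outputs[step] = call_llm(prompt)
    return outputs
```

In a long-running task the prior outputs would themselves need to be compressed before being passed on; this sketch passes them verbatim to keep the state handling visible.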
Not just "what to pass," but "in what order" and "in what quantity" to pass it to the LLM — context engineering is the discipline of systematically designing all three.
In the sections that follow, we will walk through the full picture in sequence: organizing the components that make up context, the design work of selecting, compressing, and arranging information, and the relationship with RAG and memory management.
It's easy to think of context as "just the prompt body," but in reality, the sources of information that influence LLM output quality span a much wider range. Drawing on LangChain's framework for context design, the components can be classified into the following five categories.
These five elements are complementary to one another. For example, even if external knowledge is enriched, if the conversation history contains contradictory premises, the LLM cannot determine which to prioritize, leading to unstable outputs.
A key consideration in design is to always be mindful of the "freshness" and "relevance" of each element. Introducing outdated information or irrelevant data lowers information density and degrades accuracy.
The practical work of context design can be broken down into three tasks: "selection," "compression," and "placement." Each has its own independent axis of judgment, and neglecting any one of them will reduce accuracy.
Selection: What to include in the context
The more information unrelated to the task is included, the harder it becomes for the model to locate the essential information. The criterion for selection comes down to a single point: "Does this directly affect the answer to this task?"
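As a rough illustration of selection, the sketch below keeps only snippets that share words with the query. The functions `relevance` and `select` are hypothetical, and the word-overlap score is a deliberately naive assumption; a production system would more likely use embeddings or a ranking function such as BM25.

```python
def relevance(query: str, snippet: str) -> int:
    # Assumption: naive score = number of lowercase words shared with the query.
    query_words = set(query.lower().split())
    return len(query_words & set(snippet.lower().split()))


def select(query: str, snippets: list[str], top_k: int = 2) -> list[str]:
    """Drop snippets with no overlap, then keep the top_k highest-scoring ones."""
    scored = [(relevance(query, s), s) for s in snippets]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in relevant[:top_k]]
```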
Compression: How to condense information
In LangChain's framework, "Compress" is positioned as a distinct design task within context management strategies. The goal is to save tokens by summarizing or converting long conversation histories and documents into bullet points, while preserving semantic density. For one-off question-answering tasks, a simple summary is sufficient, but for long-running agent processes, incremental compression—such as the compaction approach (context compression via conversation summarization) outlined by Anthropic—is effective.
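One way to picture incremental compression is the sketch below, which replaces older conversation turns with a single summary and keeps recent turns verbatim. It is not Anthropic's or LangChain's implementation; `compact_history` and the `summarize` callable are hypothetical names, and `summarize` stands in for a model call that produces the summary.

```python
from typing import Callable


def compact_history(
    turns: list[str],
    summarize: Callable[[str], str],
    keep_recent: int = 4,
) -> list[str]:
    """Replace older turns with one summary line; keep recent turns as-is."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    summary = summarize("\n".join(older))
    return [f"Summary of earlier conversation: {summary}"] + recent
```

Calling this each time the history grows past a threshold gives a simple form of incremental compaction, at the cost of losing detail from older turns.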
Placement: In what order to pass information
Even with the same information, the order in which it is passed affects how the model directs its attention. In general, the most important information tends to be referenced more readily when placed at the beginning or end.
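If the beginning and end of the context are the positions most readily referenced, one possible ordering heuristic is to push the least important items toward the middle. The sketch below assumes the input is already sorted from most to least important; `edge_order` is a hypothetical helper, not an established API.

```python
def edge_order(items_by_importance: list[str]) -> list[str]:
    """Place items alternately at the front and back, least important in the middle.

    Assumes the input is sorted most-important-first.
    """
    front: list[str] = []
    back: list[str] = []
    for index, item in enumerate(items_by_importance):
        (front if index % 2 == 0 else back).append(item)
    return front + back[::-1]


# Example: "A" is most important, "E" least.
print(edge_order(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Whether this helps depends on the model and the task, so it is best treated as something to test rather than a rule.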
Have you ever thought, "We introduced RAG, but somehow the quality of responses just isn't consistent"? In many cases, the cause is not RAG itself but a design issue in how the retrieved information is incorporated into the context.
RAG (Retrieval-Augmented Generation) and memory management are positioned as the primary implementation methods in context engineering. Their relationship can be summarized as follows:
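The LangChain team organizes the primary strategies for context management along four axes: "Write / Select / Compress / Isolate." Viewed through this framework, RAG is a representative example of the Select strategy, since it pulls relevant external knowledge into the context window at inference time. Memory management combines Write (saving information outside the window), Select (retrieving it when needed), and Compress (summarizing what is kept).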
In Anthropic's technical articles on agent design for long-running tasks, "compaction (context compression via conversation summarization)" and "structured note-taking (memory management using structured notes)" are cited as important techniques. A configuration in which sub-agents ultimately return a summary of approximately 1,000–2,000 tokens is also introduced, which is precisely an implementation example of context compression.
Conclusion: Leaving misconceptions about context design unaddressed causes improvement efforts to miss the mark.
Context engineering is often accompanied by misconceptions such as "making the prompt longer is enough" or "fine-tuning can serve as a substitute." Each subsection below addresses one of these misconceptions and provides guidance for sound design decisions.
It's natural to initially think that making a prompt longer will improve accuracy. In practice, however, many cases have been reported where designing "what to pass, in what order, and how much" is more effective than simply increasing the amount of information.
The main reasons why longer prompts can be counterproductive are the following three:
Google Cloud's official page notes that current models (e.g., Gemini 3.1) support context windows of 1 million to 2 million tokens, and also introduces context caching that can reduce costs by up to 90%. As context windows expand, it is easy to assume that "stuffing in more will solve the problem," but from a cost optimization perspective as well, designing to strip away unnecessary information is essential.
In the "Write / Select / Compress / Isolate" framework advocated by the LangChain team, Select (choosing only the necessary information) and Compress (increasing density through compression) are positioned as independent steps. This demonstrates that the core of context design lies in optimizing quality, not increasing quantity.
Prompt length is a means, not an end.
Fine-tuning is a technique for "updating a model's knowledge and behavior," and its purpose is fundamentally different from that of context engineering. Proceeding with the vague assumption that "fine-tuning should improve accuracy" often results in significant cost and time investment without achieving the expected outcome.
When the roles of each approach are clarified, they break down as follows:
When the root cause of incorrect responses is that "necessary information is absent from the context" or "information is presented in an inappropriate order," fine-tuning does not address the underlying problem. No matter how much the model is trained, it is difficult for it to accurately fill in information that was not provided at inference time.
As a decision-making guideline: when a task requires up-to-date information or external data, address it through context design; when the goal is to establish a consistent output style or specialized vocabulary in the model, fine-tuning is effective. Since most practical challenges fall into the former category, it is rational to first attempt improvements to context design.
Fine-tuning is a heavyweight measure that requires both cost and the preparation of training data. The practical approach for maximizing cost-effectiveness in real-world settings is to first exhaust all improvements possible through context engineering, and only then consider fine-tuning for issues that remain unresolved.
"Context design is an engineer's task — it has nothing to do with me." Many business stakeholders and product owners likely hold this view. In reality, however, many of the decisions that determine the quality of context design can only be made by non-engineers who possess domain knowledge.
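Context design involves two broad categories of decision-making: technical decisions about how information is passed to the model, and business decisions about which information should be passed.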
The latter cannot be determined without an understanding of business processes and customer interaction contexts. For example, when building an AI for customer support, decisions such as "which categories of frequently asked questions should be prioritized" and "what caveats should be included in responses" are areas that should be led by frontline staff and business planners.
Even if engineers create a state where "anything can be passed," if the selection of what information to pass is flawed, the AI's output quality will not improve. Errors in information design can become a fundamental problem that cannot be compensated for by refining prompt wording alone.
In practice, the following division of roles tends to function well:
Reframing context engineering as a design activity for the entire team is the first step toward raising the overall accuracy of LLM utilization.
Conclusion: Improving LLM accuracy depends on design principles governing "what to place, where, and how" within the context.
The three primary design variables that determine response quality are the ordering of information, its density, and dynamic switching. Each principle is explained in detail in the subsections that follow.
It is tempting to think that "packing in as much information as possible will improve accuracy," but in practice, the placement order of context has a significant impact on response quality.
LLMs do not reference the entire context window uniformly. They tend to weight information placed in the earlier part of the input more heavily, a tendency often described as a "primacy effect," and experimental reports indicate that information buried in the middle of a long context is prone to being overlooked.
With this characteristic in mind, the basic design principles follow naturally. First, place the task definition, constraints, and goals at the very beginning so the model can grasp what it needs to do from the outset. Next, place few-shot examples immediately after the task definition, so the model sees the expected output alongside the instructions it illustrates. Then follow with the most relevant documents and facts; reference documents retrieved via RAG should generally be positioned toward the beginning. Finally, consolidate supplementary information and background knowledge toward the end, because non-essential information placed early introduces noise.
For example, when designing a chatbot that answers questions by referencing internal documents, it is considered effective to explicitly state "scope of responses, tone, and prohibited content" at the beginning of the system prompt, followed immediately by the relevant chunks retrieved through search. Placing the user's question after this structure tends to improve the consistency of responses.
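The layout described for the internal-document chatbot can be sketched as a small assembly function. The name `build_prompt` and the section labels are assumptions chosen for illustration; only the ordering (rules, then retrieved chunks, then the question) comes from the text.

```python
def build_prompt(rules: str, chunks: list[str], question: str) -> str:
    """Assemble context in the order: response rules, retrieved chunks, question."""
    docs = "\n\n".join(
        f"[Document {number}]\n{chunk}" for number, chunk in enumerate(chunks, 1)
    )
    return (
        f"## Rules\n{rules}\n\n"
        f"## Reference documents\n{docs}\n\n"
        f"## Question\n{question}"
    )
```

For example, `build_prompt("Answer only from the documents.", retrieved_chunks, user_question)` would produce a prompt where scope and tone are fixed before any reference material appears, with `retrieved_chunks` and `user_question` supplied by the surrounding application.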
The notion that including more information in the context always improves accuracy is not necessarily correct. When irrelevant information or redundant expressions are mixed in, LLMs become more likely to overlook the information they should actually be focusing on.
The foundation of noise reduction is to "physically remove information unrelated to the task." Specifically, organize content with the following considerations:
An effective strategy for increasing information density is the "Compress" approach advocated by LangChain. By summarizing long documents before inserting them into the context, only the essential information can be packed into the limited token budget.
As a decision-making guideline: when a task can be completed by referencing a single document, inserting it in full is unlikely to cause problems; however, when combining multiple sources, consider compressing each source before passing it. In the latter case, consuming tokens without compression risks making the later sources effectively inaccessible to the model.
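This guideline can be expressed as a very small branching rule. In the sketch below, `prepare_sources` is a hypothetical helper and `summarize` is an assumed callable standing in for a model-based summarizer.

```python
from typing import Callable


def prepare_sources(
    sources: list[str],
    summarize: Callable[[str], str],
) -> list[str]:
    """Pass a single source in full; compress each source when combining several."""
    if len(sources) <= 1:
        return list(sources)
    return [summarize(source) for source in sources]
```

A refinement would be to compress only the sources that exceed a per-source token limit, so that short sources are not summarized needlessly.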
Additionally, structured formatting using bullet points and headings also contributes to higher information density. Structured text tends to be easier for the model to reference than prose.
"Why does accuracy drop for specific questions when I'm handling all tasks with the same system prompt?" — this is a question shared across many development teams. In most cases, the root cause lies in the context being statically fixed.
Dynamic context generation is a design approach in which the information passed to an LLM is reorganized at runtime based on the user's input and the type of task. Rather than using a fixed prompt, only the information relevant to the current situation is selected to construct the context.
Concretely, this includes the following types of switching:
The "Write / Select / Compress / Isolate" classification proposed by LangChain is a systematization of this dynamic information management. In particular, the combination of "Select" and "Compress" forms the core of task-based switching.
As an implementation note, the more complex the switching logic becomes, the higher the maintenance cost tends to be. A practical approach is to first check whether the task types can be narrowed down to two or three, and to start with simple conditional branching.
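Starting with simple conditional branching might look like the sketch below, which routes input to one of three task types and returns only the context sources for that type. The task names, keywords, and source labels in `TASK_CONTEXT` and `classify_task` are all hypothetical examples, not a recommended taxonomy.

```python
# Hypothetical mapping from task type to the context sources it needs.
TASK_CONTEXT = {
    "billing": ["refund policy", "recent invoices"],
    "technical": ["troubleshooting guide", "recent error logs"],
    "general": ["FAQ overview"],
}


def classify_task(user_input: str) -> str:
    """Keyword-based routing; deliberately simple so it stays easy to maintain."""
    text = user_input.lower()
    if any(word in text for word in ("refund", "invoice", "charge")):
        return "billing"
    if any(word in text for word in ("error", "crash", "bug")):
        return "technical"
    return "general"


def build_context(user_input: str) -> list[str]:
    """Return only the context sources relevant to the detected task type."""
    return TASK_CONTEXT[classify_task(user_input)]
```

Keyword matching will misroute some inputs, so it is worth logging the chosen task type and reviewing failures before moving to a more elaborate classifier.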
Now that the concept is understood, the next step is to translate it into implementation. This section walks through a concrete approach in two steps, from diagnosing the problem to designing and implementing the context structure.
It's easy to initially think that "writing the prompt more carefully will improve accuracy," but when the root cause of the problem actually lies in context design, repeatedly rewriting the prompt will hit a ceiling with no real improvement. Identifying "what the problem is" through diagnosis first is the shortest path to the next step.
In the diagnosis, take stock of the current state of your prompt design from the following perspectives:
As a concrete diagnostic procedure, start by collecting a certain number of logs from cases where incorrect answers or accuracy degradation actually occurred. Then, for each case, articulate the gap between "the context the model had" and "the context that would have been needed for the correct answer." If this gap repeatedly appears in the same pattern, that is the design bottleneck.
Organizing the diagnostic results in a simple table format makes it easier to hand off to the next design phase.
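As a sketch of how the diagnostic results might be tabulated, the snippet below counts how often each kind of gap appears across failure cases. The record shape (a dict with a `"gap"` label assigned by a human reviewer) and the name `summarize_gaps` are assumptions made for illustration.

```python
from collections import Counter


def summarize_gaps(cases: list[dict[str, str]]) -> list[tuple[str, int]]:
    """Count gap labels across failure cases, most frequent first."""
    counts = Counter(case["gap"] for case in cases)
    return counts.most_common()


# Hypothetical failure log entries labeled during review.
cases = [
    {"id": "1", "gap": "missing order history"},
    {"id": "2", "gap": "stale document"},
    {"id": "3", "gap": "missing order history"},
]
for gap, count in summarize_gaps(cases):
    print(f"{gap}: {count}")
```

The gap that appears most often at the top of this table is the candidate design bottleneck to address first.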
Once the issues have been clarified through the Step 1 diagnosis, the next stage is to concretely design and implement the context structure.
The foundation of the design is to think around the four operations — "Write / Select / Compress / Isolate" — proposed by the LangChain team.
It is important to shift the emphasis depending on the nature of the task. For one-off Q&A tasks, prioritize Select and Compress; for long-running agent-type tasks, center the design around Write and Isolate.
During implementation, it is recommended to start small.
Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.