
AgentOps best practices refer to the operational patterns and practical know-how for continuously running the observe–evaluate–improve loop in order to keep AI agents running stably in production.
This article is aimed at DX promotion managers and AI operations leads who are working to bring AI agents into production. It explains, from a hands-on perspective, a "reference model" for stabilizing operations, a day-to-day improvement cycle, and concrete best practices. The definition of AgentOps, its overall picture, and the design of operations organizations are covered in What is AgentOps — A Design Guide for AI Agent Operations Organizations, so this article focuses specifically on "how to run operations." By the end, you will have a clear path for embedding an observe–evaluate–improve framework into your own agent operations and steadily raising quality while avoiding failures.
The core of AgentOps operations is a single loop that keeps cycling through observe → evaluate → improve. Observe the outputs, evaluate their quality, and feed the results back into configuration and operations — having this cycle as a built-in mechanism is the prerequisite for stable operations.
With AI agents, it is difficult to see from the outside "what was input, which tools were called, what intermediate decisions were made, and what was output." That is why the first step in observability is to record this process in a way that can be traced after the fact. The minimum you want to have in place is: (1) the input prompt and final output, (2) the tools and APIs called and their results, (3) the time taken and token consumption at each step, and (4) where errors or interruptions occurred.
Particularly unique to agents is the concept of "tracing" — a record that allows you to follow, in a tree structure, how a single request branches into multiple steps during execution. With only single-shot input/output logs, it is impossible to tell at which step a wrong decision was made. With tracing, it becomes possible to pinpoint causes such as "an incorrect argument was passed on the third tool call."
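As a minimal sketch of what such a trace might look like, the snippet below models one request as a tree of steps and walks it to find the first failing step. The `Span` class, its field names, and the tool names are hypothetical and chosen only for illustration; dedicated observability tools define their own schemas.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step in a request: a model call, a tool call, or a sub-task."""
    name: str
    duration_ms: float = 0.0
    tokens: int = 0
    error: Optional[str] = None
    children: list["Span"] = field(default_factory=list)


def first_error_path(span: Span, path: tuple[str, ...] = ()) -> Optional[tuple[str, ...]]:
    """Depth-first search; returns the path of step names to the first failing step."""
    here = path + (span.name,)
    if span.error is not None:
        return here
    for child in span.children:
        found = first_error_path(child, here)
        if found is not None:
            return found
    return None


def total_tokens(span: Span) -> int:
    """Sum token consumption over the whole tree."""
    return span.tokens + sum(total_tokens(c) for c in span.children)


# Hypothetical request in which the third tool call received a bad argument.
trace = Span("request", children=[
    Span("plan", duration_ms=420, tokens=310),
    Span("tool:search_orders", duration_ms=95),
    Span("tool:lookup_customer", duration_ms=80),
    Span("tool:issue_refund", duration_ms=12, error="invalid argument: amount"),
])

print(first_error_path(trace))  # ('request', 'tool:issue_refund')
print(total_tokens(trace))      # 310
```

With only the request's input and final output logged, the failing step would be invisible; the tree makes it a simple lookup.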
A few practical notes on getting started with observability: since logs tend to contain personal information and confidential data, set up a masking mechanism before storing them. If storing all requests in full would drive up costs, a tiered approach works well — full retention for errors and low-scoring cases, sampling for normal cases. Recording can be started either with a dedicated observability tool or simply by storing structured logs in a database. What matters is not the sophistication of the tooling, but achieving a state where "any single request can be traced and reproduced after the fact." Operating without observability is like navigating without instruments.
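The masking and tiered-retention ideas above can be sketched roughly as follows. The regular expressions are deliberately crude stand-ins for a real masking policy, and the `low_score` and `sample_rate` defaults are illustrative assumptions, not recommended values.

```python
import random
import re
from typing import Optional

# Crude illustrative patterns; a real policy would cover far more data types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\d{8,}")  # stand-in for phone, card, or account numbers


def mask(text: str) -> str:
    """Replace e-mail addresses and long digit runs before the log is stored."""
    return LONG_DIGITS.sub("[NUMBER]", EMAIL.sub("[EMAIL]", text))


def should_retain(error: bool, score: float, *, low_score: float = 0.6,
                  sample_rate: float = 0.05,
                  rng: Optional[random.Random] = None) -> bool:
    """Keep every error and low-scoring case in full; sample the normal ones."""
    if error or score < low_score:
        return True
    return (rng or random).random() < sample_rate


print(mask("Contact a.b@example.com, card 4111111111111111"))
# Contact [EMAIL], card [NUMBER]
```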
Evaluation is what turns the data collected through observation into a judgment of quality. There are two types of evaluation. Offline evaluation involves running the agent against a pre-prepared set of inputs and expected outputs (an evaluation dataset) to verify correct behavior in advance; it is used as a quality gate before release. Online evaluation continuously measures quality against actual inputs and outputs that occur in production.
The two serve different roles. Offline evaluation verifies "is it broken?" before release, while online evaluation monitors "is it behaving as expected on real-world inputs?" after release. For tasks where correct answers can be determined mechanically — such as classification, extraction, or code generation — automated evaluation is effective. For tasks like summarization or dialogue, where there is no single correct answer, it is practical to combine rubric-based human evaluation with methods that use a separate model as a scorer.
The key to evaluation is deciding on pass/fail thresholds in advance. Without criteria such as "do not release if accuracy falls below a certain level," scores are produced but cannot be used for decision-making. Furthermore, since it is not realistic to have humans review every production case in online evaluation, a practical approach is to prioritize low-scoring cases and errors when sampling for human review. The design of agent evaluation builds on the foundational concepts of accuracy evaluation for standalone LLMs. For more detail, see An Evaluation Framework for Measuring the Accuracy of Lao-Language LLMs.
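To make the idea of a pre-agreed threshold concrete, here is a minimal sketch of an offline evaluation used as a release gate. The `toy_agent` function, the four-case dataset, and the 0.9 threshold are all hypothetical; exact-match scoring only suits tasks with a mechanically checkable answer, such as classification.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (input, expected output)


def accuracy(agent: Callable[[str], str], dataset: list[EvalCase]) -> float:
    """Fraction of cases where the agent's output exactly matches the expectation."""
    if not dataset:
        raise ValueError("evaluation dataset is empty")
    passed = sum(1 for prompt, expected in dataset if agent(prompt).strip() == expected)
    return passed / len(dataset)


def release_gate(score: float, threshold: float) -> bool:
    """The threshold is fixed before the run, so the score leads to a decision."""
    return score >= threshold


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent: a trivial intent classifier.
    return "refund" if "money back" in prompt else "other"


dataset = [
    ("I want my money back", "refund"),
    ("money back please", "refund"),
    ("Where is my parcel?", "shipping"),
    ("What are your opening hours?", "other"),
]

score = accuracy(toy_agent, dataset)
print(score)                       # 0.75
print(release_gate(score, 0.9))    # False: do not release
```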
The improvement step is about feeding the issues identified through evaluation back into operations. Without systematizing this step, you end up in a state where "the problem is understood but never fixed." The primary destinations for improvement are: prompt revisions, addition or modification of tools, review of the agent's procedure (workflow), and expansion of the evaluation dataset itself.
The key is not to treat improvements as one-off fixes. Failure cases discovered in production should always be added to the evaluation dataset. This way, if the same failure recurs, it will be caught in the next pre-release evaluation, and improvements accumulate over time. Observe → evaluate → improve is not something that ends after one cycle; think of it as a loop that increases in accuracy the more it is run. Starting with manual processes is fine, but as operations mature, embedding this feedback process into a regular operational workflow is the most direct path to stable performance.
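One simple way to make "always add the failure case" a mechanical step is sketched below, assuming the evaluation dataset is kept as a JSON Lines file. The file name, the field names (`input`, `expected`, `source`), and the ticket identifier are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def add_failure_case(path: Path, prompt: str, expected: str, source: str) -> bool:
    """Append one case as a JSON line; return False if the input is already covered."""
    existing: set[str] = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["input"] for line in f if line.strip()}
    if prompt in existing:
        return False
    record = {"input": prompt, "expected": expected, "source": source}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "eval_cases.jsonl"
    first = add_failure_case(dataset_path, "Cancel order 1001 and 1002", "cancel_two_orders", "ticket-A")
    second = add_failure_case(dataset_path, "Cancel order 1001 and 1002", "cancel_two_orders", "ticket-A")
    line_count = len(dataset_path.read_text(encoding="utf-8").splitlines())

print(first, second, line_count)  # True False 1
```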

Autonomously acting agents can have a significant impact when they malfunction. Build into your operations from the start a mechanism that keeps humans involved in critical decisions and allows for quick recovery when problems occur.
Agents call tools on their own and take actions that affect the outside world. That is precisely why boundaries must be established for "what is permissible" — i.e., guardrails. The basics include input/output filters (blocking inappropriate content or confidential information), an allowlist for tool execution, and pre-transmission checks before sending data externally.
On top of that, incorporate HITL (Human-in-the-Loop) for high-impact decisions. Rather than running everything fully automated, design the system so that irreversible operations — such as contracts, fund transfers, data deletion, and external publication — require human approval before execution. The key is to draw the line during the operational design phase, determining which operations remain under human control.
In addition, keep the permissions granted to tools to a minimum. Permission design, such as not granting write access to tasks that only require read access and avoiding direct modifications to production databases, limits the damage in the event of a malfunction. The key point in building guardrails is not to "trust the intelligent model," but to "structurally restrict the scope of permissible actions." Note also that "Shadow AI," where staff begin using agents on their own outside official frameworks, undermines reliability as well. Since usage that falls outside the governance net cannot be observed or evaluated, it is advisable to make such usage visible and to handle it as part of your Shadow AI risk and governance measures.
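The combination of a tool allowlist and human approval for irreversible operations can be sketched as a single authorization check that runs before any tool call. The tool names and the `approver` callback are hypothetical; in practice the approval step would be a ticket, a chat prompt, or a review screen.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolPolicy:
    needs_approval: bool = False  # True for irreversible or high-impact tools


# The allowlist: any tool not listed here is blocked outright.
POLICIES: dict[str, ToolPolicy] = {
    "search_docs": ToolPolicy(),
    "send_external_email": ToolPolicy(needs_approval=True),
    "delete_records": ToolPolicy(needs_approval=True),
}


def authorize(tool: str, approver: Optional[Callable[[str], bool]] = None) -> str:
    """Decide what happens to a tool call before it is executed."""
    policy = POLICIES.get(tool)
    if policy is None:
        return "blocked"            # not on the allowlist
    if not policy.needs_approval:
        return "allowed"
    if approver is None:
        return "pending_approval"   # wait for a human instead of executing
    return "approved" if approver(tool) else "rejected"


print(authorize("search_docs"))                       # allowed
print(authorize("drop_database"))                     # blocked
print(authorize("delete_records"))                    # pending_approval
print(authorize("delete_records", lambda t: False))   # rejected
```

The restriction lives in the structure of the check, not in the model's judgment: an unlisted tool cannot run no matter what the agent decides.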
No matter how thoroughly you evaluate in advance, there will always be issues that only surface in production. That is why "not deploying everything at once" and "being able to roll back quickly" should be built into the system. Staged releases involve initially limiting deployment to a subset of users or a subset of operations, then expanding the scope once no issues are found. This approach allows you to verify real-world behavior while containing the impact. The "start small, then expand" approach is especially effective for new agents or significant changes.
In addition, prepare rollback procedures in advance. Whenever prompts, tools, or model configurations are changed, record each version so that you can immediately revert to the previous version if a problem arises. Because small configuration changes in agents can lead to significant behavioral shifts, maintaining a state where "changes can be undone" serves as a safety net. To avoid hesitation when deciding to roll back, it is also advisable to define criteria in advance — specifying which metrics, if they fall below a certain threshold, will trigger a rollback. Designing new feature releases and rollbacks as a single package is a fundamental practice for maintaining reliability.
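A staged release and a pre-agreed rollback trigger might look like the sketch below. Hashing the user ID into one of 100 buckets is one common way to keep assignment stable as the percentage grows; the metric names and floor values are illustrative assumptions.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets; expose the first `percent`."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def breached_floors(metrics: dict[str, float], floors: dict[str, float]) -> list[str]:
    """Names of metrics below their agreed floor; a missing metric counts as breached."""
    return [name for name, floor in floors.items() if metrics.get(name, 0.0) < floor]


# Floors are agreed before the release, so the rollback decision needs no debate.
floors = {"task_completion": 0.85, "review_pass_rate": 0.90}
observed = {"task_completion": 0.81, "review_pass_rate": 0.95}

breached = breached_floors(observed, floors)
if breached:
    print("roll back to previous version:", breached)
# roll back to previous version: ['task_completion']
```

Because the bucket depends only on the user ID, a user included at 10% stays included when the rollout widens to 50%.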

Quality is not something you build once and leave alone — it must be continuously improved. Cultivating evaluation datasets, detecting regressions, and version-controlling changes: these three points are what raise the quality floor.
The heart of quality improvement is the evaluation dataset. This is a collection of representative examples of "given this input, here is how we want the agent to behave," and it serves as a measuring stick for agent quality. Starting with a small number of examples is fine, but by continuously adding failure cases and edge cases discovered in production, it grows into a practical benchmark tailored to your organization's specific operations.
Having this dataset makes regression detection possible. When prompts or models are changed, you can automatically check before release whether inputs that were previously handled correctly have broken. Regression detection is particularly important for agents, as they are prone to "fixing one problem only to have another resurface." Running the evaluation dataset with every change and checking whether any scores have declined — this habit is what keeps improvements moving "forward."
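Regression detection reduces to comparing per-case results between the current version and the candidate, rather than comparing only the aggregate score. A minimal sketch, with hypothetical case IDs:

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Cases that passed on the baseline but fail, or are missing, on the candidate."""
    return sorted(case for case, ok in baseline.items()
                  if ok and not candidate.get(case, False))


def find_fixes(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Cases that failed on the baseline and now pass."""
    return sorted(case for case, ok in candidate.items()
                  if ok and not baseline.get(case, False))


baseline = {"refund-01": True, "refund-02": False, "shipping-01": True}
candidate = {"refund-01": True, "refund-02": True, "shipping-01": False}

print(find_fixes(baseline, candidate))        # ['refund-02']
print(find_regressions(baseline, candidate))  # ['shipping-01']
```

In this example both versions pass two of three cases, so the overall score is unchanged, yet one previously working input has broken. Only the per-case comparison reveals it.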
Agent behavior is determined by a combination of prompts, tool definitions, model settings, and workflows. Version-controlling these "just like code" is a prerequisite for reproducibility and improvement. Without a record of what changed in which version, it becomes impossible to trace why something that "worked fine last week is performing poorly today."
In practice, this means managing prompts in a repository alongside code and maintaining a change history. Specification changes to tools and model switches should also be logged. This way, when quality shifts, you can pinpoint exactly "when and what changed," and quickly roll back to a previous version. Since models are continuously updated, avoiding over-tuning to the quirks of a specific version—and instead keeping things swappable through version control—pays dividends over the long run.
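Alongside repository history, it can help to stamp every trace and evaluation run with a short fingerprint of the exact configuration that produced it. The sketch below derives one from a canonical JSON serialization; the config keys and the model name are hypothetical.

```python
import hashlib
import json


def config_version(config: dict) -> str:
    """Short, stable fingerprint of the prompt, tool, and model configuration."""
    canonical = json.dumps(config, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = {
    "model": "example-model-2024",  # hypothetical model name
    "temperature": 0.2,
    "system_prompt": "You are a support agent. Answer concisely.",
    "tools": ["search_docs", "lookup_order"],
}
# Same settings, written in a different key order.
v1_reordered = dict(reversed(list(v1.items())))
# A one-word prompt change.
v2 = {**v1, "system_prompt": "You are a support agent. Answer thoroughly."}

print(config_version(v1) == config_version(v1_reordered))  # True
print(config_version(v1) == config_version(v2))            # False
```

Logging this fingerprint with each request makes it possible to answer "when and what changed" by grouping quality metrics by version.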

Operations cannot be sustained on a vague sense that things have "gotten better." By tracking quality, cost, and latency as concrete numbers and ultimately tying them to business metrics, improvement priorities become clear.
Operational health is best assessed along three axes. Quality is measured by metrics such as accuracy on evaluation datasets, task completion rates, and pass rates in human review. Cost covers token consumption and charges per request, as well as total monthly cost. Latency refers to response time, and in cases where an agent takes multiple steps, the total time to completion.
These three axes exist in a trade-off relationship. Pursuing higher quality by using more powerful models or multi-step reasoning drives up cost and latency, while cutting costs can sometimes degrade quality. That is precisely why all three axes should be visualized simultaneously—to find the balance point that "maximizes quality within acceptable cost and speed constraints."
In practice, it is important to look at distributions, not just averages. Even if average latency is within acceptable bounds, a subset of extremely slow requests can make the system unusable for certain workflows. Tracking tail latency (for example, the 95th and 99th percentiles) and the characteristics of high-cost requests helps narrow down what needs to be improved. Focusing on only one axis at a time makes it easy to end up with accidents such as a runaway cost spike while trying to improve quality. Monitoring all three axes together forms the foundation of sound operations.
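The gap between the average and the tail is easy to see with a small calculation. The sketch below uses the nearest-rank definition of a percentile (one of several in common use) on ten made-up latency samples.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of data at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds; one request got stuck in a long tool loop.
latencies_s = [1.2, 1.4, 1.1, 1.3, 1.5, 1.2, 1.6, 1.3, 1.4, 28.0]

mean_s = sum(latencies_s) / len(latencies_s)
p50_s = percentile(latencies_s, 50)
p95_s = percentile(latencies_s, 95)

print(f"mean={mean_s:.1f}s p50={p50_s}s p95={p95_s}s")
# mean=4.0s p50=1.3s p95=28.0s
```

Here the mean describes no real request: the typical one finishes in about 1.3 seconds, and the outlier takes 28. The same function can be applied to per-request cost.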
Technical metrics alone are insufficient for explaining the value of operations to leadership. Ultimately, the goal is to show how agent activity translates into business outcomes. For example, in customer support, this might mean first-contact resolution rates or reductions in handling time; for internal operations, it could be the number of cases processed or a decrease in rework—linking results to on-the-ground performance indicators.
What makes this effective is placing technical metrics and business metrics side by side on a single dashboard. This allows you to see, on the same timeline, whether resolution rates rose during periods when quality scores improved, and whether the results justify any increase in cost. Being able to demonstrate that technical improvements are moving business numbers provides a basis for deciding whether to continue investing in operations. Conversely, if technical metrics look good but business metrics remain flat, it is a signal to revisit the very definition of the problem being solved.

Understanding which stage your organization's AgentOps is at clarifies where to invest next. The right actions differ entirely between a stage where there is no observability at all and one where improvement cycles run automatically.
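The maturity of AgentOps is easiest to understand when broken down into roughly four stages: Stage 1, where there is no observability and issues cannot be reproduced from recorded data; Stage 2, where logs and traces exist but evaluation still relies on human intuition; Stage 3, where an evaluation dataset is run before each release; and Stage 4, where feeding production failures back into evaluation data is built into operational processes.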
Most organizations are at Stage 1 or 2. What matters is not leaping straight for a higher stage, but honestly assessing where your organization stands today. Aiming for sophisticated automated improvement without observability in place will not work—there is no foundation to build on. To gauge your current stage, ask yourself: "Were we able to reproduce and explain the cause of our most recent issue from recorded data?" If not, you are at Stage 1. If you could, but evaluation still relies on human intuition, you are at Stage 2. Grounding your assessment in actual events like this will give you a more accurate picture.
The key to moving up one stage is satisfying the "foundation" of that stage. Going from Stage 1 to Stage 2 means first putting in place a minimal mechanism for retaining logs and traces—without this, you cannot move forward. Going from Stage 2 to Stage 3 means building an evaluation dataset, even a small one, and establishing the habit of running it before each release. Going from Stage 3 to Stage 4 means embedding the flow of feeding production failures back into evaluation data into your operational processes, rather than relying on individual goodwill.
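The self-assessment described above can be written down as a short decision function. This is only a sketch of the reasoning in the text: the three yes/no questions are paraphrased from it, and the parameter names are hypothetical.

```python
def maturity_stage(can_reproduce_incidents: bool,
                   runs_eval_before_release: bool,
                   failures_feed_eval_by_process: bool) -> int:
    """Return the highest stage whose foundation, and every one below it, is in place."""
    if not can_reproduce_incidents:
        return 1  # no usable logs or traces yet
    if not runs_eval_before_release:
        return 2  # observable, but evaluation relies on intuition
    if not failures_feed_eval_by_process:
        return 3  # evaluated, but feedback depends on individual goodwill
    return 4


# An organization that bought an "automated improvement" tool but cannot yet
# reproduce its last incident is still at Stage 1.
print(maturity_stage(False, True, True))   # 1
print(maturity_stage(True, True, False))   # 3
```

The checks run in order, so a later capability never compensates for a missing earlier one.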
Rushing ahead and skipping stages will leave you with a system that looks right on the surface but does not actually work. For example, introducing "automated improvement" without an evaluation dataset means you have no way to judge whether improvements are good or bad, and quality becomes more unstable as a result. Solidifying the foundation of your current stage before moving to the next—following this order—is ultimately the fastest path.

Organizations that struggle with AgentOps tend to share common patterns. Here are two representative examples, along with ways to avoid them.
The most frequent failure is shipping to production without an evaluation mechanism in place. Teams assume that an agent that worked in a demo will be fine, deploy it, and then find unexpected errors arising from the diversity of real-world inputs. Worse, without an evaluation dataset, there is no way to verify whether a fix actually resolved the issue, so improvement efforts spin their wheels.
The mitigation is straightforward. There is no need to wait for a perfect evaluation infrastructure. Start by preparing a few dozen representative inputs paired with their expected behaviors, and make it a rule to run them before every release. This alone is enough to prevent "obvious regressions." Each time a problem surfaces in production, add that case to the dataset, and your benchmark will grow naturally. Simply moving from "zero evaluation" to "minimal evaluation" makes a significant difference in operational stability.
Another common pitfall is focusing exclusively on quality while deferring cost and latency to later. Because agents call models and use tools multiple times per request, they tend to cost more and respond more slowly than single-turn chat interactions. Teams often do not notice this during validation, only to discover after usage scales up that "costs are several times higher than expected" or that "it's too slow for anyone to use."
The mitigation is to include cost and latency as observables from the very beginning. Visualize consumption per request and design with an eye toward balancing quality against these constraints. For example, routing simple requests to lightweight processing while reserving heavy inference only for complex requests—rather than applying maximum resources to everything—is an effective approach. Quality, cost, and speed should be treated as things to design for simultaneously, right from the start.
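A routing rule of the kind described above can start out very simple. In the sketch below, the tier names, the word-count heuristic, and the per-request cost figures are all illustrative assumptions; a real router would use signals that fit your own workload.

```python
def choose_tier(request: str, needs_tools: bool, *, max_simple_words: int = 30) -> str:
    """Send short, tool-free requests to the light tier; everything else to the heavy tier."""
    if needs_tools or len(request.split()) > max_simple_words:
        return "heavy"
    return "light"


# Hypothetical per-request costs, in arbitrary units.
COST = {"light": 1.0, "heavy": 12.0}

requests = [
    ("What are your opening hours?", False),
    ("Reset my password", False),
    ("Compare last quarter's refunds by region and draft a summary", True),
]

tiers = [choose_tier(text, tools) for text, tools in requests]
routed_cost = sum(COST[t] for t in tiers)
all_heavy_cost = COST["heavy"] * len(requests)

print(tiers)                        # ['light', 'light', 'heavy']
print(routed_cost, all_heavy_cost)  # 14.0 36.0
```

Any routing rule should itself be checked against the evaluation dataset, since sending a request to the light tier by mistake trades cost for quality.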

This section compiles frequently asked questions from the field about AgentOps operations.
AgentOps is characterized by its handling of operational challenges unique to AI agents that make autonomous decisions and take actions — including multi-step execution, tool use, and non-deterministic outputs. MLOps refers to lifecycle management for training, deploying, and monitoring machine learning models, while AIOps refers to the practice of using AI to streamline IT operations themselves. AgentOps builds on the principles of MLOps while going deeper into agent-specific observability and evaluation. For an overview of AgentOps, see What is AgentOps — A Design Guide for AI Agent Operations Organizations.
The standard approach is to start with observability. Once you have a way to log inputs, outputs, and traces, you can begin to see what is actually happening. Next, prepare a dozen or so representative inputs and introduce a minimal evaluation process to run before each release. These two steps alone move you from a trial-and-error phase to one with observability and evaluation in place. There is no need to introduce advanced tooling until that foundation is established. Rather than aiming for perfection from the start, the key to success is to begin running small, tight loops.

AgentOps best practices ultimately come down to one thing: "continuously running the observe → evaluate → improve loop as a structured process." Observe outputs, measure quality against an evaluation dataset, and feed production failures back into improvements. By combining this cycle with reliability and safety guardrails, HITL, staged releases and rollbacks, KPIs for quality, cost, and latency, and version control, an agent can grow from a "working demo" into a "stable operational foundation for business use."
The starting point is an honest assessment of your organization's current maturity level. If observability is lacking, begin with logging; if evaluation is lacking, start with a minimal dataset. Solidify the foundation at your current stage before moving on — this sequence is ultimately the fastest path forward. The definition of AgentOps and guidance on designing an operations organization are covered in detail in What is AgentOps — A Design Guide for AI Agent Operations Organizations. We are also actively helping organizations build production operations and improvement frameworks for AI agents. If you are struggling with operational stability, please feel free to reach out to us.
Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.