Alpha Notice: These docs cover the v1-alpha release. Content is incomplete and subject to change. For the latest stable version, see the v0 LangChain Python or LangChain JavaScript docs.
Agentic applications let an LLM decide its own next steps to solve a problem. That flexibility is powerful, but the model's black-box nature makes it hard to predict how a tweak in one part of your agent will affect the rest. To build production-ready agents, thorough testing is essential.

There are a few approaches to testing your agents:
- **Unit tests** exercise small, deterministic pieces of your agent in isolation, using in-memory fakes so you can assert exact behavior quickly and repeatably (see the sketch after this list).
- **Integration tests** run the agent with real network calls to confirm that components work together, credentials and schemas line up, and latency is acceptable.
Agentic applications tend to lean more heavily on integration tests because they chain multiple components together and must cope with the flakiness that comes from the nondeterministic nature of LLMs.
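For example, a unit test can exercise a single tool with no model and no network involved. The following is a minimal sketch that assumes a Jest- or Vitest-style `test`/`expect` runner and reuses the `get_weather` tool defined in the examples later on this page:

```typescript
import { tool } from "@langchain/core/tools";
import { z } from "zod";

// A small, deterministic piece of the agent: a single tool.
const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

// No LLM and no network calls: the tool's exact behavior can be asserted.
test("get_weather formats its response", async () => {
  const result = await getWeather.invoke({ city: "San Francisco" });
  expect(result).toBe("It's 75 degrees and sunny in San Francisco.");
});
```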
Many agent behaviors only emerge when using a real LLM, such as which tool the agent decides to call, how it formats responses, or whether a prompt modification affects the entire execution trajectory. LangChain's agentevals package provides evaluators specifically designed for testing agent trajectories with live models.

AgentEvals lets you easily evaluate the trajectory of your agent (the exact sequence of messages, including tool calls) by performing a trajectory match or by using an LLM judge:
AgentEvals offers the createTrajectoryMatchEvaluator function to match your agent’s trajectory against a reference trajectory. There are four modes to choose from:
| Mode | Description | Use Case |
| --- | --- | --- |
| strict | Exact match of messages and tool calls in the same order | Testing specific sequences (e.g., policy lookup before authorization) |
| unordered | Same tool calls allowed in any order | Verifying information retrieval when order doesn't matter |
| subset | Agent calls only tools from reference (no extras) | Ensuring agent doesn't exceed expected scope |
| superset | Agent calls at least the reference tools (extras allowed) | Verifying minimum required actions are taken |
Strict match
The strict mode ensures trajectories contain identical messages in the same order with the same tool calls, though it allows for differences in message content. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before authorizing an action.
```typescript
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import { z } from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({
      city: z.string(),
    }),
  }
);

const agent = createAgent({
  llm: "openai:gpt-4o",
  tools: [getWeather],
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "strict",
});

async function testWeatherToolCalledStrict() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in San Francisco?")],
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in San Francisco?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "San Francisco" } },
      ],
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in San Francisco.",
      tool_call_id: "call_1",
    }),
    new AIMessage("The weather in San Francisco is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //   'key': 'trajectory_strict_match',
  //   'score': true,
  //   'comment': null,
  // }
  expect(evaluation.score).toBe(true);
}
```
Unordered match
The unordered mode allows the same tool calls in any order, which is helpful when you want to verify that specific information was retrieved but don’t care about the sequence. For example, an agent might need to check both weather and events for a city, but the order doesn’t matter.
```typescript
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import { z } from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const getEvents = tool(
  async ({ city }: { city: string }) => {
    return `Concert at the park in ${city} tonight.`;
  },
  {
    name: "get_events",
    description: "Get events happening in a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  llm: "openai:gpt-4o",
  tools: [getWeather, getEvents],
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "unordered",
});

async function testMultipleToolsAnyOrder() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's happening in SF today?")],
  });

  // Reference shows tools called in different order than actual execution
  const referenceTrajectory = [
    new HumanMessage("What's happening in SF today?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_events", args: { city: "SF" } },
        { id: "call_2", name: "get_weather", args: { city: "SF" } },
      ],
    }),
    new ToolMessage({
      content: "Concert at the park in SF tonight.",
      tool_call_id: "call_1",
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in SF.",
      tool_call_id: "call_2",
    }),
    new AIMessage("Today in SF: 75 degrees and sunny with a concert at the park tonight."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //   'key': 'trajectory_unordered_match',
  //   'score': true,
  // }
  expect(evaluation.score).toBe(true);
}
```
Subset and superset match
The superset and subset modes match partial trajectories. The superset mode verifies that the agent called at least the tools in the reference trajectory, allowing additional tool calls. The subset mode ensures the agent did not call any tools beyond those in the reference.
```typescript
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import { z } from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const getDetailedForecast = tool(
  async ({ city }: { city: string }) => {
    return `Detailed forecast for ${city}: sunny all week.`;
  },
  {
    name: "get_detailed_forecast",
    description: "Get detailed weather forecast for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  llm: "openai:gpt-4o",
  tools: [getWeather, getDetailedForecast],
});

const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "superset",
});

async function testAgentCallsRequiredToolsPlusExtra() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Boston?")],
  });

  // Reference only requires getWeather, but agent may call additional tools
  const referenceTrajectory = [
    new HumanMessage("What's the weather in Boston?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "Boston" } },
      ],
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in Boston.",
      tool_call_id: "call_1",
    }),
    new AIMessage("The weather in Boston is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //   'key': 'trajectory_superset_match',
  //   'score': true,
  //   'comment': null,
  // }
  expect(evaluation.score).toBe(true);
}
```
You can also set the toolArgsMatchMode property and/or toolArgsMatchOverrides to customize how the evaluator compares tool calls in the actual trajectory against those in the reference. By default, tool calls are considered equal only when they invoke the same tool with the same arguments. Visit the repository for more details.
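As a rough sketch of what that can look like (the exact option values and the shape of toolArgsMatchOverrides here are assumptions based on the repository README, so confirm them there):

```typescript
import { createTrajectoryMatchEvaluator } from "agentevals";

// Assumed options: "exact" compares tool arguments verbatim, while a per-tool
// override supplies a custom comparator so superficial differences (such as
// casing) don't fail the match.
const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "strict",
  toolArgsMatchMode: "exact",
  toolArgsMatchOverrides: {
    get_weather: (actualArgs: Record<string, any>, referenceArgs: Record<string, any>) =>
      String(actualArgs.city).toLowerCase() === String(referenceArgs.city).toLowerCase(),
  },
});
```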
You can also use an LLM to evaluate the agent’s execution path with the createTrajectoryLLMAsJudge function. Unlike the trajectory match evaluators, it doesn’t require a reference trajectory, but one can be provided if available.
Without reference trajectory
```typescript
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";
import { z } from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  llm: "openai:gpt-4o",
  tools: [getWeather],
});

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

async function testTrajectoryQuality() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Seattle?")],
  });

  const evaluation = await evaluator({
    outputs: result.messages,
  });
  // {
  //   'key': 'trajectory_accuracy',
  //   'score': true,
  //   'comment': 'The provided agent trajectory is reasonable...'
  // }
  expect(evaluation.score).toBe(true);
}
```
With reference trajectory
If you have a reference trajectory, you can add an extra variable to your prompt and pass in the reference trajectory. Below, we use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt and configure the reference_outputs variable:
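Here is a minimal sketch, mirroring the examples above; it assumes the judge evaluator accepts a referenceOutputs field that fills the prompt's reference_outputs variable:

```typescript
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import {
  createTrajectoryLLMAsJudge,
  TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
} from "agentevals";
import { z } from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  llm: "openai:gpt-4o",
  tools: [getWeather],
});

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  // This prebuilt prompt contains a reference_outputs variable.
  prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
});

async function testTrajectoryQualityWithReference() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Seattle?")],
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in Seattle?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "Seattle" } },
      ],
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in Seattle.",
      tool_call_id: "call_1",
    }),
    new AIMessage("The weather in Seattle is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    // Passed through to the prompt's reference_outputs variable.
    referenceOutputs: referenceTrajectory,
  });

  expect(evaluation.score).toBe(true);
}
```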
For tracking experiments over time, you can log evaluator results to LangSmith, a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tools.

First, set up LangSmith by setting the required environment variables:
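For example (a minimal setup sketch; in CI you would normally export these variables in the shell rather than setting them in code):

```typescript
// test-setup.ts — placeholder values; generate an API key in your LangSmith settings.
process.env.LANGSMITH_API_KEY = "<your-langsmith-api-key>";
process.env.LANGSMITH_TRACING = "true"; // enable tracing of test runs
```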
Integration tests that call real LLM APIs can be slow and expensive, especially when run frequently in CI/CD pipelines. We recommend using a library that records HTTP requests and responses and replays them in subsequent runs without making actual network calls.

You can use nock for this. Requests and responses are recorded in cassettes, which are then used to mock the real network calls in subsequent runs.
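For example, nock's nock.back fixtures API can wrap a test so that the first run records a cassette and later runs replay it. This is a minimal sketch assuming a Jest- or Vitest-style runner and a CommonJS test environment (for __dirname):

```typescript
import nock from "nock";
import path from "node:path";
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage } from "@langchain/core/messages";
import { z } from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => `It's 75 degrees and sunny in ${city}.`,
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({ llm: "openai:gpt-4o", tools: [getWeather] });

// Store recorded request/response cassettes in a local directory.
nock.back.fixtures = path.join(__dirname, "cassettes");
// "record" mode records a cassette when none exists and replays it otherwise.
nock.back.setMode("record");

test("agent answers a weather question (recorded)", async () => {
  const { nockDone } = await nock.back("weather-agent.json");

  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in San Francisco?")],
  });
  // On replay, the recorded LLM responses make this run deterministic.
  expect(result.messages.length).toBeGreaterThan(1);

  // Writes the cassette on the first (recording) run; afterwards it replays.
  nockDone();
});
```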
When you modify prompts, add new tools, or change expected trajectories, delete the corresponding cassette files and rerun the tests to record fresh interactions.