from langsmith import evaluate, wrappers
from langsmith.schemas import Run, Example
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# We can still pass in Run and Example objects if we'd like
def correct_old_signature(run: Run, example: Example) -> dict:
    """Check if the answer exactly matches the expected answer."""
    return {"key": "correct", "score": run.outputs["answer"] == example.outputs["answer"]}

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1
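# e.g. a 250-character answer scores 1 (most concise); anything 4,000 characters or longer caps at 5.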

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""
    instructions = """\
Given the following question, answer, and reasoning, determine if the reasoning for the \
answer is logically valid and consistent with the question and the answer."""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions}, {"role": "user", "content": msg}],
        response_format=Response,
    )
    return response.choices[0].message.parsed.reasoning_is_valid
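
# Quick local smoke test of the judge, if you want one (hypothetical example inputs,
# not tied to any dataset; makes a real OpenAI call):
# import asyncio
# asyncio.run(valid_reasoning(
#     {"question": "What is 2 + 2?"},
#     {"answer": "4", "reasoning": "2 + 2 equals 4"},
# ))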

# The target function being evaluated; here just a stub that returns fixed outputs.
def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct_old_signature, concision, valid_reasoning],
)
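
# To inspect per-example scores locally (assumes a recent langsmith version where
# ExperimentResults exposes to_pandas(), and that pandas is installed):
# df = results.to_pandas()
# print(df.head())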