Agents improve when you close the loop: define expectations (tests / evals), run workflows in production or staging, collect human or automated feedback, and use that signal to change prompts, tools, graph structure, or routing.

Agent test examples

Store input/output pairs (and optional quality grades) on an agent so you can regression-check behavior as you edit the graph. In Python, use client.tests.create_tests with TestExampleItem rows (input/output types and content, graded quality, and a short reason):
import asyncio
from agentserviceapi import AgentServiceAPIClient
from agentserviceapi.models.agent_tests import TestExampleItem

async def seed_tests():
    client = AgentServiceAPIClient()
    client.set_api_key("sk_...")

    examples = [
        TestExampleItem(
            input_type="text",
            input_content="Summarize: Q3 revenue was up 12%.",
            input_description="",
            output_type="text",
            output_content="Q3 revenue increased about 12% year over year.",
            output_description="",
            graded_quality="high",
            quality_reason="Accurate, concise summary aligned with input.",
        ),
    ]
    await client.tests.create_tests("YOUR_AGENT_UUID", examples)

# asyncio.run(seed_tests())
Use client.tests.get_tests(agent_id) (and related helpers) to read back what is stored. The fields returned match what the product's test UX shows, so the same data can back any CI you run against the API.
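The regression-check idea above can be sketched in plain Python. This is a minimal, hypothetical example: `regression_check` and `echo_agent` are illustrative names, and the exact-match comparison is a stand-in for whatever comparison (or grader) you actually use.

```python
# Hypothetical sketch: regression-check stored examples against fresh outputs.
# `run_agent` is a placeholder for however you execute the agent.

def regression_check(stored_examples, run_agent):
    """Split stored examples into passed/failed by comparing fresh outputs."""
    passed, failed = [], []
    for ex in stored_examples:
        actual = run_agent(ex["input_content"])
        # Naive exact-match check; real evals usually use a grader instead.
        if actual.strip() == ex["output_content"].strip():
            passed.append(ex)
        else:
            failed.append((ex, actual))
    return passed, failed

examples = [
    {"input_content": "Summarize: Q3 revenue was up 12%.",
     "output_content": "Q3 revenue increased about 12% year over year."},
]
def echo_agent(text):
    return "Q3 revenue increased about 12% year over year."

ok, bad = regression_check(examples, echo_agent)
print(len(ok), len(bad))  # → 1 0
```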

Automated graders

Default path: On Sudoiq, platform analyst agents normally design and attach graders for you (rubric text, versions, proposed graders for comparison, promotion in the product UI). You do not need to call the API to get a useful eval loop. This section is reference for engineers: exact SDK calls, the standard grader JSON shape, and optional post-hoc or pipeline-integrated grading for when you build custom automation or debug outside the guided flow.
A grader is a dedicated grader-type workflow linked to an agent. You give it a natural-language rubric; the service builds a small graph that scores a completed run (original prompt plus node output) and returns structured JSON. You can iterate on graders over time—adjust the rubric, create proposed graders for side-by-side comparison before promoting one, and treat grading like any other workflow you refine. For post-hoc scoring you call the client after the fact (see below). For tighter integration, graph nodes can reference grader workflows as blocking or non-blocking exit validation so evaluation runs in the execution pipeline (for example validating a specific node before the run continues or finishes).

Create a grader

Use AgentServiceAPIClient.graders.create_grader: pass rubric (task-specific criteria), rubric_name, and optionally proposed=True when the agent already has a primary grader and you want a non-primary grader for A/B comparison. The response includes grader_agent_id (the new grader workflow), agent_id, version, and message. The platform appends a fixed “standard rubric” block to your text so model outputs stay in the JSON shape below—you only maintain the custom part of the rubric.
from agentserviceapi import AgentServiceAPIClient

client = AgentServiceAPIClient()
client.set_api_key("sk_...")
# Inside an async function (as in the earlier example):
resp = await client.graders.create_grader(
    agent_id="YOUR_AGENT_UUID",
    rubric="The answer must cite the quarter and percentage change. Penalize hallucinated figures.",
    rubric_name="Q&A accuracy grader",
    proposed=False,  # True: add a proposed grader if a primary already exists
)
# resp.grader_agent_id — grader workflow id

Standard grader output format

The grader model is instructed to return JSON with exactly these top-level fields:
  • pass (boolean) — Whether the graded run satisfied the task to your rubric’s satisfaction.
  • quality_score (string) — One of very-low, low, medium, high, very-high.
  • quality_attributes (array) — Each element is an object with a single key: the same tier string as quality_score, and the value is a reason string. Multiple entries are allowed, including several objects with the same key.
Example (matches the platform’s reference shape in GraderOutputExample):
{
  "pass": true,
  "quality_score": "high",
  "quality_attributes": [
    { "high": "The agent interpreted the task and produced accurate results." },
    { "high": "Output structure matched expectations." },
    { "medium": "Slightly verbose but complete." }
  ]
}
For parsing in Python, agentserviceapi.models.grader.GraderOutputResult validates this shape. To inspect the example_output shape before you create a grader, use await client.graders.get_agent_rubric(agent_id) — the response includes it even when has_grader is false.
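If you prefer not to depend on GraderOutputResult, the same shape can be checked with the standard library. This is a stdlib-only sketch; the field names and tier strings come from the spec above, while `parse_grader_output` is a hypothetical helper name.

```python
# Stdlib-only validation of the standard grader JSON shape.
import json

TIERS = {"very-low", "low", "medium", "high", "very-high"}

def parse_grader_output(raw: str) -> dict:
    """Parse grader JSON and verify the three required top-level fields."""
    data = json.loads(raw)
    assert isinstance(data["pass"], bool)
    assert data["quality_score"] in TIERS
    for attr in data["quality_attributes"]:
        # Each attribute is a single-key object: {tier: reason}.
        (tier, reason), = attr.items()
        assert tier in TIERS and isinstance(reason, str)
    return data

result = parse_grader_output(
    '{"pass": true, "quality_score": "high", '
    '"quality_attributes": [{"high": "Accurate."}, {"medium": "Verbose."}]}'
)
print(result["quality_score"])  # → high
```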

Run a grader on a finished run

After a run completes, await client.graders.execute_grader(agent_id=..., task_id=..., node_id=..., tenant_id=...) starts the grader workflow against that run (optional node_id grades a specific node’s output, for example from non-blocking exit validation). The call returns an ExecuteAgentResponse with the grader’s task_id when a grader run was started, or None if no compatible grader was found.
from agentserviceapi import ExecuteAgentResponse

# After `client = AgentServiceAPIClient()` and `set_api_key`:
started: ExecuteAgentResponse | None = await client.graders.execute_grader(
    agent_id="YOUR_AGENT_UUID",
    task_id=12345,
    # node_id="some_node",  # optional
    # tenant_id="...",      # optional, for tenant-scoped runs
)
Use await client.graders.get_grader_runs_for_agent(agent_id) to list grader runs and read stored grader_result payloads.
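Once you can list grader runs, trending quality is a matter of tallying the stored results. A minimal sketch, assuming each `grader_result` payload follows the standard format above (the `tier_counts` helper and the sample data are illustrative):

```python
# Hypothetical sketch: tally quality_score tiers and pass rate across grader runs.
from collections import Counter

def tier_counts(grader_results):
    """Return (tier counter, pass rate) for a list of grader_result payloads."""
    tiers = Counter(r["quality_score"] for r in grader_results)
    passed = sum(1 for r in grader_results if r["pass"])
    return tiers, passed / len(grader_results)

runs = [
    {"pass": True, "quality_score": "high"},
    {"pass": True, "quality_score": "medium"},
    {"pass": False, "quality_score": "low"},
]
tiers, pass_rate = tier_counts(runs)
print(tiers["high"], round(pass_rate, 2))  # → 1 0.67
```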

Execution feedback

After a run completes (or while reviewing it), record structured notes or simple pass/fail for optimization and analytics. The agentserviceapi namespaces client.execution_feedback and client.agent_feedback (aliases) expose methods such as:
  • list_feedback — page through feedback for an agent’s runs (cursor parameters vary by release; see your installed client).
  • save_feedback — attach detailed ExecutionFeedbackData for a task_id (admin vs user source).
  • save_simple_feedback / get_simple_feedback — binary passed / failed for a run, with optional source.
# After you know task_id from a run:
await client.agent_feedback.save_simple_feedback(
    task_id=12345,
    passed=True,
    source="user",
)
On Sudoiq, analyst agents continuously review the same signals—stored tests, grader results, human or simple feedback, and run telemetry—as they arrive. They use that picture to recommend changes (prompts, graph structure, tools) and to try improvements in the product (for example proposing rubrics, running comparisons, or staging tweaks) so you are not only collecting data but driving an eval loop inside the platform.

How this improves workflows

  1. Tests define “what good looks like” on fixed examples—use them before/after edits to catch regressions.
  2. Live runs surface real user or operational inputs.
  3. Feedback (simple or rich) marks which runs succeeded or failed expectations.
  4. Graders score outputs against your rubric so you can trend quality and compare workflow versions.
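Step 4's version comparison can be sketched by mapping quality tiers to numbers and averaging per version. This is an illustrative scheme, not a platform feature; the tier names come from the grader spec above, and the numeric weights are an assumption:

```python
# Hypothetical sketch: compare two workflow versions by average quality tier.
TIER_VALUE = {"very-low": 0, "low": 1, "medium": 2, "high": 3, "very-high": 4}

def mean_quality(scores):
    """Average a list of quality_score tier strings on a 0-4 scale."""
    return sum(TIER_VALUE[s] for s in scores) / len(scores)

v1 = ["medium", "high", "medium"]      # grader scores for version 1's runs
v2 = ["high", "high", "very-high"]     # grader scores for version 2's runs
print(mean_quality(v1) < mean_quality(v2))  # → True
```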
Together, these feeds are what analyst agents and your own processes use to steer continuous improvement (see above). This page documents the main Python client entry points for that data: client.tests, client.graders, and client.agent_feedback on AgentServiceAPIClient (see Python client). Your workspace may also offer UI or batch eval APIs on top of the same signals.