Agent test examples
Store input/output pairs (and optional quality grades) on an agent so you can regression-check behavior as you edit the graph. In Python, use client.tests.create_tests with TestExampleItem rows (input/output types and content, graded quality, and a short reason).
Use client.tests.get_tests(agent_id) (and related helpers) to read back what is stored. Exact fields match your product’s test UX and any CI you run against the API.
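The flow above can be sketched as follows. The row field names here are assumptions for illustration, not the exact TestExampleItem schema; check the models in your installed agentserviceapi client before relying on them.

```python
# Quality tiers matching the grader tiers documented below; whether test
# grades use the same tier strings is an assumption.
VALID_TIERS = {"very-low", "low", "medium", "high", "very-high"}

def make_test_rows(pairs):
    """Build TestExampleItem-like dicts from (input, output, quality, reason)
    tuples. Field names are illustrative -- verify against your client."""
    rows = []
    for input_content, output_content, quality, reason in pairs:
        if quality not in VALID_TIERS:
            raise ValueError(f"unknown quality tier: {quality!r}")
        rows.append({
            "input_type": "text",
            "input_content": input_content,
            "output_type": "text",
            "output_content": output_content,
            "quality": quality,
            "reason": reason,
        })
    return rows

# With a live client you would then store and read back the examples:
# await client.tests.create_tests(agent_id=agent_id, tests=make_test_rows([...]))
# stored = await client.tests.get_tests(agent_id)
```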
Automated graders
Default path: on Sudoiq, platform analyst agents normally design and attach graders for you (rubric text, versions, proposed graders for comparison, promotion in the product UI). You do not need to call the API to get a useful eval loop. This section is reference for engineers: exact SDK calls, the standard grader JSON shape, and optional post-hoc or pipeline-integrated grading when you build custom automation or debug outside the guided flow.
Create a grader
Use AgentServiceAPIClient.graders.create_grader: pass rubric (task-specific criteria), rubric_name, and optionally proposed=True when the agent already has a primary grader and you want a non-primary grader for A/B comparison. The response includes grader_agent_id (the new grader workflow), agent_id, version, and message.
The platform appends a fixed “standard rubric” block to your text so model outputs stay in the JSON shape below—you only maintain the custom part of the rubric.
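A minimal sketch of that call, assuming the parameters named above; the rubric content and the build_rubric helper are illustrative, and the SDK call is shown in comments because it needs a live client.

```python
def build_rubric(criteria):
    """Join task-specific criteria into a rubric body. Only the custom part
    goes here -- the platform appends its standard rubric block itself."""
    return "\n".join(f"- {c}" for c in criteria)

rubric = build_rubric([
    "Every factual claim cites the source document.",
    "The answer stays under 200 words.",
])

# With a live client (proposed=True creates a non-primary grader for A/B
# comparison when a primary grader already exists):
# resp = await client.graders.create_grader(
#     rubric=rubric,
#     rubric_name="citation-and-length-v1",
#     proposed=True,
# )
# resp.grader_agent_id, resp.agent_id, resp.version, resp.message
```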
Standard grader output format
The grader model is instructed to return JSON with exactly these top-level fields:

| Field | Type | Meaning |
|---|---|---|
| pass | boolean | Whether the graded run satisfied the task to your rubric’s satisfaction. |
| quality_score | string | One of very-low, low, medium, high, very-high. |
| quality_attributes | array | Each element is an object with a single key: the same tier string as quality_score, and the value is a reason string. Multiple entries are allowed, including several objects with the same key. |
agentserviceapi.models.grader.GraderOutputResult validates this shape (see also GraderOutputExample). To inspect the example_output shape before you create a grader, use await client.graders.get_agent_rubric(agent_id) — the response includes it even when has_grader is false.
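For illustration, a payload in the documented shape might look like the dict below. The checker function is a rough sketch that mirrors the table above; the SDK’s GraderOutputResult model is the authoritative validator.

```python
TIERS = {"very-low", "low", "medium", "high", "very-high"}

# Illustrative grader output matching the documented top-level fields.
example_output = {
    "pass": True,
    "quality_score": "high",
    "quality_attributes": [
        {"high": "Cited the source for every claim"},
        {"high": "Stayed under the length limit"},
    ],
}

def looks_like_grader_output(payload):
    """Rough shape check; use the SDK's GraderOutputResult for real validation."""
    if not isinstance(payload.get("pass"), bool):
        return False
    if payload.get("quality_score") not in TIERS:
        return False
    attrs = payload.get("quality_attributes")
    if not isinstance(attrs, list):
        return False
    for attr in attrs:
        # Each element: a single-key object mapping a tier to a reason string.
        if not (isinstance(attr, dict) and len(attr) == 1):
            return False
        (tier, reason), = attr.items()
        if tier not in TIERS or not isinstance(reason, str):
            return False
    return True
```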
Run a grader on a finished run
After a run completes, await client.graders.execute_grader(agent_id=..., task_id=..., node_id=..., tenant_id=...) starts the grader workflow against that run (optional node_id grades a specific node’s output, for example from non-blocking exit validation). The call returns an ExecuteAgentResponse with the grader’s task_id when a grader run was started, or None if no compatible grader was found.
Use await client.graders.get_grader_runs_for_agent(agent_id) to list grader runs and read stored grader_result payloads.
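A sketch of this loop, assuming the calls documented above. The failing_runs helper treats each stored run as a dict with a grader_result payload in the standard shape; that access pattern is an assumption, so adapt it to your client’s actual run models.

```python
def failing_runs(grader_runs):
    """Keep grader runs whose stored grader_result reports pass == False
    (or has no result yet). Dict-style access is illustrative."""
    return [
        run for run in grader_runs
        if not (run.get("grader_result") or {}).get("pass", False)
    ]

# With a live client:
# resp = await client.graders.execute_grader(
#     agent_id=agent_id, task_id=task_id, tenant_id=tenant_id,
#     # node_id=...  e.g. to grade one node's non-blocking exit validation
# )
# if resp is not None:  # None means no compatible grader was found
#     runs = await client.graders.get_grader_runs_for_agent(agent_id)
#     needs_attention = failing_runs(runs)
```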
Execution feedback
After a run completes (or while reviewing it), record structured notes or simple pass/fail for optimization and analytics. The agentserviceapi namespaces client.execution_feedback and client.agent_feedback (aliases) expose methods such as:
- list_feedback — page through feedback for an agent’s runs (cursor parameters vary by release; see your installed client).
- save_feedback — attach detailed ExecutionFeedbackData for a task_id (admin vs. user source).
- save_simple_feedback / get_simple_feedback — binary passed/failed for a run, with optional source.
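The simple pass/fail signal lends itself to quick aggregation. A sketch, assuming each feedback item carries a boolean passed field; that field name is an assumption, so check the shapes returned by your installed client.

```python
def simple_pass_rate(items):
    """Fraction of runs marked passed in simple feedback.

    The 'passed' field name is an assumption -- verify against the payloads
    your client actually returns.
    """
    if not items:
        return 0.0
    return sum(1 for item in items if item.get("passed")) / len(items)

# With a live client (signatures vary by release):
# await client.agent_feedback.save_simple_feedback(task_id=task_id, passed=True)
# feedback = await client.agent_feedback.list_feedback(agent_id)
```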
How this improves workflows
- Tests define “what good looks like” on fixed examples—use them before/after edits to catch regressions.
- Live runs surface real user or operational inputs.
- Feedback (simple or rich) marks which runs succeeded or failed expectations.
- Graders score outputs against your rubric so you can trend quality and compare workflow versions.
All of these signals are exposed through client.tests, client.graders, and client.agent_feedback on AgentServiceAPIClient (see Python client). Your workspace may also offer UI or batch eval APIs on top of the same signals.
Related
- Python client
- Output validators
- CLI — agentservice demo for a sample tool/validator/handler loop