Agent runs — cancellation
A run is one user-turn → assistant-turn cycle on a deployed agent. Cancellation kills an in-flight run mid-stream. This doc covers the contract, the three surfaces (chat shell, API, webhook), the server-side lifecycle, and how cancellations show up in observability.
Why cancel
Three cases: runaway model loops (the agent keeps iterating toward
MAX_ITERATIONS because the prompt is ambiguous), accidental expensive
tool calls (gh pr list --json everything --limit 10000), and the
trivial case where the user closed the tab and the runtime is now
generating into the void. Without a kill switch, every one of those
burns tokens until natural termination.
What cancellation is and isn't
The contract, in five lines:
- Best-effort within ~one model-step boundary. In-flight Anthropic / MCP / CLI subprocess calls finish naturally; we do not abort the fetch or kill -9 the tool process. The runtime polls between iterations and exits cleanly when it observes the signal.
- Tokens already spent are billed. Same as a runaway-then-completed run. The model already produced the bytes; we don't re-issue credit.
- Idempotent. A second cancel of the same (org, runId) during the cleanup window is a no-op that returns the original timestamps.
- Worst case ~30s of additional generation between cancel-request and runtime-observed-and-exited. The bound is one model step plus the next iteration boundary's poll.
- Bills for the partial run. The agent_run.cancelled webhook event carries the actual usage.input / usage.output tally of what the model produced before exit.
How to cancel
Three surfaces, same backend.
Chat shell
Open /agents/[id], send a prompt, and a cancel button appears next
to the streaming assistant message while the SSE stream is live. Click
it and the shell sends POST .../cancel for the runId observed on the
SSE started frame. The button disappears once the stream's done
frame lands.
API
POST /v1/orgs/:slug/agents/:agentId/runs/:runId/cancel

Auth: bearer token (CLI / scripts) or session cookie (dashboard). Any member of the org can cancel; a contractor on a free-tier project shouldn't have to wait for an admin to be online while tokens burn.

curl -fsSL -X POST \
"https://api.stech.com/v1/orgs/$ORG/agents/$AGENT/runs/$RUN_ID/cancel" \
-H "Authorization: Bearer $STECH_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "reason": "model kept calling gh tool with bad args" }'
# → 202 Accepted
# {
# "cancelled": true,
# "requestedAt": "2026-05-08T14:09:51.103Z",
# "acknowledgedAt": null,
# "stopReason": null
# }

The response is 202 Accepted: the cancel is recorded; the actual
runtime exit is observed asynchronously over the SSE done frame and
then fanned out as the agent_run.cancelled webhook.
Webhook
Subscribe a webhook endpoint to agent_run.cancelled to receive an
event when a cancel actually lands (not just when one is requested).
Useful for audit pipelines and "who killed what" alerting.
curl -fsSL https://api.stech.com/v1/orgs/$ORG/webhook-endpoints \
-H "Authorization: Bearer $STECH_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://hooks.acme.example/stech-cancels",
"description": "log agent_run cancellations",
"events": ["agent_run.cancelled"]
}'

Wildcard (["*"]) subscriptions match agent_run.cancelled
automatically.
API reference
POST /v1/orgs/:slug/agents/:agentId/runs/:runId/cancel
Request the cancellation of a live run.
| Path param | Notes |
|---|---|
| slug | Org slug; caller must be a member |
| agentId | Deployment id |
| runId | Per-run id surfaced by the SSE started frame at run-stream open time |
Body (optional):

{ "reason": "string, ≤ 500 chars, captured for audit",
"conversationId": "optional, captured for dashboard-mode cancels" }

Both fields are optional and the body itself is optional. A malformed
JSON body is tolerated as empty (no 400). reason is trimmed and
truncated to 500 chars.
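
The body rules above can be sketched as a small normalizer. This is an illustration, not the server's code: the function name normalize_cancel_body, the trim-then-truncate order, and mapping an empty reason to None are all assumptions.

```python
import json

MAX_REASON_CHARS = 500  # limit stated in the body description


def normalize_cancel_body(raw):
    """Parse an optional cancel body; never raises on bad input."""
    try:
        parsed = json.loads(raw) if raw else {}
    except ValueError:
        # Covers malformed JSON and undecodable bytes: treated as an
        # empty body rather than a 400.
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    reason = parsed.get("reason")
    if isinstance(reason, str):
        # Assumed order: trim whitespace first, then truncate.
        reason = reason.strip()[:MAX_REASON_CHARS] or None
    else:
        reason = None

    conversation_id = parsed.get("conversationId")
    if not isinstance(conversation_id, str):
        conversation_id = None

    return {"reason": reason, "conversationId": conversation_id}
```

Under these assumptions an absent body, an empty body, and a syntactically broken body all normalize to the same result, so none of them blocks the cancel.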
| Status | Meaning |
|---|---|
| 202 Accepted | Cancel recorded (or already recorded, idempotently). Body carries requestedAt + acknowledgedAt. |
| 401 unauthorized | No / invalid bearer / session |
| 403 forbidden | Caller is not a member of the resolved org |
| 404 not_found | Agent doesn't exist, or exists in another org (leak-safely indistinguishable) |
Response body on 202:
{
"cancelled": true,
"requestedAt": "2026-05-08T14:09:51.103Z",
"acknowledgedAt": null,
"stopReason": null
}

acknowledgedAt is null until the runtime observes the signal and
posts the internal cancel-acknowledge. stopReason is always null
on this response; the eventual SSE done frame carries the actual
stop_reason ('cancelled' once the runtime exits). Re-polling this
endpoint to learn of the ack isn't necessary; the run-stream's own
done frame is the source of truth, and the
agent_run.cancelled webhook fires with the final timestamps.
A second call for the same (organization_id, run_id) returns the
original requestedAt and whatever acknowledgedAt is in the DB —
idempotency is enforced by a unique index on
agent_run_cancellations(organization_id, run_id).
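
The idempotency behaviour can be reproduced with an in-memory SQLite table standing in for the real database, which this doc does not name. In this sketch the index name, the reduced column set, and the request_cancel helper are hypothetical; the ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE agent_run_cancellations (
    organization_id TEXT NOT NULL,
    run_id          TEXT NOT NULL,
    requested_at    TEXT NOT NULL,
    acknowledged_at TEXT
);
CREATE UNIQUE INDEX agent_run_cancellations_org_run
    ON agent_run_cancellations (organization_id, run_id);
"""


def request_cancel(db, org_id, run_id, now):
    """Record a cancel; repeat calls return the first call's timestamps."""
    # The unique index turns a second insert for the same run into a no-op.
    db.execute(
        "INSERT INTO agent_run_cancellations "
        "(organization_id, run_id, requested_at) VALUES (?, ?, ?) "
        "ON CONFLICT (organization_id, run_id) DO NOTHING",
        (org_id, run_id, now),
    )
    requested_at, acknowledged_at = db.execute(
        "SELECT requested_at, acknowledged_at FROM agent_run_cancellations "
        "WHERE organization_id = ? AND run_id = ?",
        (org_id, run_id),
    ).fetchone()
    return {
        "cancelled": True,
        "requestedAt": requested_at,
        "acknowledgedAt": acknowledged_at,
        "stopReason": None,
    }


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
first = request_cancel(db, "org_1", "run_1", "2026-05-08T14:09:51.103Z")
second = request_cancel(db, "org_1", "run_1", "2026-05-08T14:09:55.000Z")
```

Here second carries the requestedAt of the first call, which is the behaviour a retrying client relies on.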
Where the runId comes from
The streaming run endpoint emits a started frame as the first SSE
message:
event: message
data: {"type":"started","runId":"run_8w3..."}

Capture that runId and feed it into the cancel POST. The chat shell
does this client-side; programmatic callers consuming the SSE stream
should do the same.
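
A programmatic caller might pull the runId out of the stream along these lines. The sketch assumes each data: payload fits on a single line, as in the frame shown above; find_run_id is an illustrative name.

```python
import json


def find_run_id(sse_lines):
    """Return the runId from the first 'started' frame, or None."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip 'event:' lines, comments and blank separators
        try:
            frame = json.loads(line[len("data:"):])
        except ValueError:
            continue  # ignore payloads that are not JSON
        if isinstance(frame, dict) and frame.get("type") == "started":
            return frame.get("runId")
    return None
```

Any iterable of decoded text lines works as input, for example the lines of a streaming HTTP response.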
Cancellation lifecycle
What happens server-side once the POST lands:
- API records the request. A row is inserted into agent_run_cancellations with organization_id, agent_id, run_id, requested_by_user_id, and (if supplied) reason / conversation_id. Insert is ON CONFLICT (organization_id, run_id) DO NOTHING, so it is idempotent across retries.
- Audit row written. An admin_actions row is appended with verb agent_runs.cancel_requested, the actor user, the agent + conversation, and the supplied reason. Visible in /settings/audit?tab=admin&action=agent_runs.cancel_requested.
- Runtime polls. Between agent-loop iterations the runtime hits GET /v1/internal/runs/:runId/cancel-status; the api returns { cancelled: true } when an agent_run_cancellations row exists. Worst case the in-flight model step + tool call finishes before the next poll observes the signal.
- Runtime acknowledges. On observed cancel, the runtime posts POST /v1/internal/runs/:runId/cancel-acknowledge; the api stamps acknowledged_at (idempotent, via a WHERE acknowledged_at IS NULL guard). The gap between requested_at and acknowledged_at is the "how long did it take to actually stop" metric.
- Runtime exits cleanly. The agent loop returns with stop_reason='cancelled', the SSE stream emits its done frame carrying the partial finalText, the run iteration count, and the token usage tally up to cancellation.
- API persists + fans out. The api proxy sees the done frame, persists the conversation turn with stop_reason='cancelled', and fires the agent_run.cancelled webhook with the cancellation context (looked up from the agent_run_cancellations row at fan-out time).
- Run shows in history. The run lands in /agents/[id]/runs with a gray cancelled pill (distinct from green completed and red failed). Failure-rate metrics exclude the cancellation from both numerator and denominator, so a spree of user cancellations doesn't trigger the watchdog (see Observability semantics below).
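
The runtime side of this lifecycle (poll between iterations, acknowledge, exit cleanly) can be sketched as a toy loop. Everything here is a stand-in: is_cancelled and acknowledge represent the two internal endpoints, the step callables represent model steps, and the "completed" stop reason for a normal exit is an assumption.

```python
def run_agent_loop(steps, is_cancelled, acknowledge, max_iterations=10):
    """Run model steps, polling for a cancel between iterations.

    steps: callables, each returning (text, output_tokens) for one step.
    is_cancelled / acknowledge: stand-ins for the cancel-status and
    cancel-acknowledge calls.
    """
    text_parts, output_tokens, iterations = [], 0, 0
    stop_reason = "completed"  # assumed name for a normal exit
    for step in steps[:max_iterations]:
        # Poll only at an iteration boundary; a step that is already
        # in flight is never interrupted.
        if iterations > 0 and is_cancelled():
            acknowledge()
            stop_reason = "cancelled"
            break
        text, tokens = step()
        text_parts.append(text)
        output_tokens += tokens
        iterations += 1
    return {
        "stop_reason": stop_reason,
        "finalText": "".join(text_parts),
        "iterations": iterations,
        "usage": {"output": output_tokens},
    }


state = {"cancelled": False, "acked": False}


def second_step():
    state["cancelled"] = True  # the cancel POST lands mid-step
    return ("world", 4)


result = run_agent_loop(
    steps=[lambda: ("hello ", 3), second_step, lambda: ("never runs", 99)],
    is_cancelled=lambda: state["cancelled"],
    acknowledge=lambda: state.update(acked=True),
)
```

In this run the second step finishes even though the cancel arrived while it was in flight, and the third step never starts, which mirrors the best-effort contract.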
Webhook event payload
agent_run.cancelled uses the same envelope as every other webhook
(see webhooks.md for id / type /
createdAt / organizationId, signing scheme, retry policy, dedupe
on id). An example event, envelope fields included:
{
"id": "9b2e1c44-15a3-4f0e-91d4-7c8a3f1d2e10",
"type": "agent_run.cancelled",
"createdAt": "2026-05-08T14:10:21.554Z",
"organizationId": "org_2t4b...",
"data": {
"runId": "run_8w3...",
"deploymentId": "dep_4kp...",
"agentName": "support-triage",
"conversationId": "cnv_7m1...",
"stopReason": "cancelled",
"iterations": 2,
"usage": { "input": 4218, "output": 612 },
"cancelledByUserId": "usr_9q1...",
"cancellationReason": "model kept calling gh tool with bad args",
"requestedAt": "2026-05-08T14:09:51.103Z",
"acknowledgedAt": "2026-05-08T14:10:18.220Z"
}
}

| Field | Meaning |
|---|---|
| runId | Per-run id (matches the SSE started frame's runId and the cancel-POST path param). |
| deploymentId | Deployment that ran (same as agentId on the cancel route). |
| agentName | Display name, joined from deployments.agent_name. |
| conversationId | Conversation the run belongs to, or null for non-persisted runs. |
| stopReason | The runtime's exit reason; always "cancelled" for this event type. |
| iterations | Tool-call iteration count at exit, or null if the runtime didn't supply one. |
| usage.input / usage.output | Tokens billed for the partial run; either may be null if the runtime didn't tally. |
| cancelledByUserId | The user who POSTed the cancel. null on system-cancels (future feature; today every cancel has a human actor). |
| cancellationReason | The optional reason from the cancel POST body. null when no reason was supplied. |
| requestedAt | When the cancel POST landed. |
| acknowledgedAt | When the runtime observed the signal and stamped the ack. null in the rare race where the runtime exited via stop_reason='cancelled' without going through the ack path. |

finalText is not on this event. The model output up to cancel
lives on the persisted conversation row; the cancellation event is
about the cancel, not the partial reply. Subscribers that want the
text can join conversationId + runId against the conversations
read endpoint.
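
A subscriber that wants the time-to-stop figure could derive it from the two timestamps on the event. This is a sketch of a receiver-side helper, with signature verification and dedupe left out; seconds_to_stop and parse_ts are illustrative names.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO-8601 UTC timestamp such as 2026-05-08T14:09:51.103Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def seconds_to_stop(event):
    """Seconds between cancel request and runtime ack, or None."""
    if event.get("type") != "agent_run.cancelled":
        return None
    data = event.get("data") or {}
    requested = data.get("requestedAt")
    acked = data.get("acknowledgedAt")
    if not requested or not acked:
        return None  # acknowledgedAt can be null in the rare race
    return (parse_ts(acked) - parse_ts(requested)).total_seconds()
```

For the example event above, this yields a little over 27 seconds.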
Observability semantics
Cancellation is its own bucket — disjoint from completed (a normal
termination) and failed (an unexpected one). It propagates through
every observability surface:
| Surface | How cancellations appear |
|---|---|
| Run history pill (/agents/[id]/runs) | Gray cancelled (vs green completed / red failed) |
| Run history filter (?status=...) | cancelled is a valid token alongside completed and failed; combinable (?status=cancelled,failed) |
| Aggregations (totalRuns, failedRuns, cancelledRuns) | cancelledRuns is its own counter; failedRuns does NOT double-count cancellations |
| Org metrics totals + runs/day chart | Same: cancelledRuns per day, separate from failedRuns |
| Top agents by error rate | Denominator is runs - cancelledRuns; an agent that's mostly being cancelled by users doesn't show as high error rate |
| Failure-rate watchdog (audit.flagged with kind: 'agent_failure_rate') | Cancellations excluded from numerator AND denominator; window-unfilled gate also uses the effective non-cancelled count (a 50-run window with 40 cancellations is effectively a 10-run sample, below noise floor) |

The watchdog payload now includes cancelledCount and
effectiveWindowSize (= windowSize - cancelledCount) alongside the
familiar failedCount / failureRate / windowSize — receivers
that want to surface "10 fails out of 20 consequential runs (50
sampled)" have the numbers.
See observability.md for the full metrics
catalog, percentile semantics, and the failure-rate watchdog's
audit.flagged event shape.
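
The watchdog arithmetic can be illustrated with a short function. It is a sketch of the exclusion rule only: the min_effective_runs threshold and the windowFilled key are hypothetical, since the actual noise floor is not stated here.

```python
def failure_rate_summary(statuses, min_effective_runs=20):
    """Summarise a window of run statuses, excluding cancellations.

    min_effective_runs is a hypothetical noise-floor threshold.
    """
    window_size = len(statuses)
    cancelled = sum(1 for s in statuses if s == "cancelled")
    failed = sum(1 for s in statuses if s == "failed")
    # Cancellations leave both the numerator and the denominator.
    effective = window_size - cancelled
    return {
        "windowSize": window_size,
        "cancelledCount": cancelled,
        "effectiveWindowSize": effective,
        "failedCount": failed,
        "failureRate": failed / effective if effective else None,
        "windowFilled": effective >= min_effective_runs,
    }


window = ["failed"] * 10 + ["completed"] * 10 + ["cancelled"] * 30
summary = failure_rate_summary(window)
```

With 50 sampled runs of which 30 were cancelled, the summary reports 10 failures out of 20 effective runs, a failure rate of 0.5.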
Edge cases
- Cancel a run that already finished. Returns 202 (idempotent), but has no actual runtime effect because the run already terminated. The cancellation row is still inserted for audit, and the audit admin_actions row is still written.
- Cancel during a long tool call. The tool call (gh / mcp / Anthropic API call) completes; we don't kill subprocesses or abort the fetch. The cancel is observed at the next iteration boundary, before the model would have started its next step.
- Network blip during the cancel POST. Safe to retry. The (organization_id, run_id) unique constraint makes the second insert a no-op; the response carries the original timestamps from the first attempt.
- Cancel from a different user than the one who triggered the run. Allowed by design. The cancel route is org-scoped, not run-owner-scoped; a teammate observing a misbehaving run shouldn't be blocked from killing it.
- conversationId from another user's conversation. The cancel still proceeds; the field is dropped silently rather than 404'd. A bad-faith caller can't enumerate conversation ids by observing acceptance / rejection. The runtime gets less audit context but the cancel itself lands.
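
Because a repeated cancel is a no-op, a client can wrap the POST in a plain retry loop. The sketch below takes the sender as a callable so it stays independent of any HTTP library; the exception types it catches and the backoff values are assumptions that depend on the client in use.

```python
import time


def cancel_with_retry(send_cancel, attempts=3, backoff_seconds=0.5,
                      sleep=time.sleep):
    """Call send_cancel() until it succeeds or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return send_cancel()
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff between tries: 0.5s, 1s, 2s, ...
                sleep(backoff_seconds * (2 ** attempt))
    raise last_error
```

Retrying is safe here only because the server returns the original timestamps on a duplicate; the same wrapper would not be appropriate for a non-idempotent call.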
Out of scope (deferred)
- Hard-kill: sending SIGTERM mid-tool-call. Too disruptive (half-applied filesystem changes, half-finished MCP transactions); deferred until customers ask.
- Per-tool-call cancel: "stop running this gh command, but keep the agent loop going". Finer-grained surface; deferred.
- Bulk cancel: "cancel all runs for agent X". Useful during an outage; deferred.
- Cost-cap enforcement: auto-cancel when a run exceeds $X spent. Separate epic; the cancel API is the primitive it would build on.
Troubleshooting
I clicked cancel and the run kept running for 20 seconds.
Expected. The in-flight model step (or tool subprocess) finishes
before the runtime observes the signal between iterations. Worst case
~30s. The first cancel response returns acknowledgedAt as null; the
value on the agent_run.cancelled webhook (or on a repeat cancel call
once the ack has been stamped) tells you when the runtime actually
observed the cancel.
Cancel button doesn't appear in the chat shell. It's only visible while the SSE stream is live. If you opened the page after the run completed, the button won't show — there's nothing to cancel.
I see stop_reason='cancelled' but no audit row. The audit row
lives in admin_actions with verb agent_runs.cancel_requested. Check
/settings/audit?tab=admin&action=agent_runs.cancel_requested. The
verb is on the cancel-request side; the runtime's stop_reason is the
cancel-observed side. They're separate writes and either could lag.
Webhook never fired. Check, in order: (a) the endpoint subscribes
to agent_run.cancelled (or *); (b) the endpoint is enabled = true; (c) the delivery log at /settings/audit?tab=webhooks shows
the attempt (it'll be there even on receiver-side failure). The
event fires off the SSE done frame, not off the cancel POST — if
the runtime didn't reach a clean exit (rare race), the event won't
fire.
Failure-rate alert fired even though the recent runs were all
cancellations. Bug — cancellations should be excluded from both
numerator and denominator. Check the audit.flagged event's data:
failedCount should NOT include cancellations, and cancelledCount
should be populated. If cancelledCount is missing or failedCount
includes runs whose stop_reason='cancelled', file an issue with the
event id.
cancelledByUserId is null on the webhook. Either the cancel was
issued by a system path (future feature; not yet shipped — should not
happen today), or the cancellation row was deleted between the SSE
done frame and the webhook fan-out (extremely rare; the row is
written before the runtime is ever told). Check the audit log for the
matching agent_runs.cancel_requested row.
Related
- Observability: full metrics catalog, the cancelled status pill in the run history, the failure-rate watchdog (now cancellation-aware).
- Webhooks: agent_run.cancelled envelope, signing scheme, retry policy, dedupe on event.id.
- Audit log: every cancel-request lands in admin_actions with verb agent_runs.cancel_requested; visible at /settings/audit?tab=admin&action=agent_runs.cancel_requested.
- Policy and guardrails: the parallel blocked:<kind> terminal status bucket. Same disjoint-from-failed posture and same dedicated webhook channel (agent_run.blocked).