Agent runs — cancellation

A run is one user-turn → assistant-turn cycle on a deployed agent. Cancellation kills an in-flight run mid-stream. This doc covers the contract, the three surfaces (chat shell, API, webhook), the server-side lifecycle, and how cancellations show up in observability.

Why cancel

Three situations call for it: runaway model loops (the agent keeps iterating until it hits MAX_ITERATIONS because the prompt is ambiguous), accidental expensive tool calls (gh pr list --json everything --limit 10000), and the trivial case where the user closed the tab and the runtime is now generating into the void. Without a kill switch, every one of those burns tokens until natural termination.

What cancellation is and isn't

The contract, in five lines:

  • Best-effort within ~one model-step boundary. In-flight Anthropic / MCP / CLI subprocess calls finish naturally; we do not abort the fetch or kill -9 the tool process. The runtime polls between iterations and exits cleanly when it observes the signal.
  • Tokens already spent are billed. Same as a runaway-then-completed run. The model already produced the bytes; we don't re-issue credit.
  • Idempotent. A second cancel of the same (org, runId) during the cleanup window is a no-op that returns the original timestamps.
  • Worst case ~30s of additional generation between cancel-request and runtime-observed-and-exited. The bound is one model step plus the next iteration boundary's poll.
  • Bills for the partial run. The agent_run.cancelled webhook event carries the actual usage.input / usage.output tally of what the model produced before exit.

How to cancel

Three surfaces, same backend.

Chat shell

Open /agents/[id], send a prompt, and a cancel button appears next to the streaming assistant message while the SSE stream is live. Click it and the shell sends POST .../cancel with the runId it observed on the SSE started frame. The button disappears once the stream's done frame lands.

API

POST /v1/orgs/:slug/agents/:agentId/runs/:runId/cancel

Auth: bearer token (CLI / scripts) or session cookie (dashboard). Any member of the org can cancel — a contractor on a free-tier project shouldn't have to wait for an admin to be online while tokens burn.

curl -fsSL -X POST \
  "https://api.stech.com/v1/orgs/$ORG/agents/$AGENT/runs/$RUN_ID/cancel" \
  -H "Authorization: Bearer $STECH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "reason": "model kept calling gh tool with bad args" }'
# → 202 Accepted
# {
#   "cancelled": true,
#   "requestedAt": "2026-05-08T14:09:51.103Z",
#   "acknowledgedAt": null,
#   "stopReason": null
# }

The response is 202 Accepted — the cancel is recorded; the actual runtime exit is observed asynchronously over the SSE done frame and then fanned out as the agent_run.cancelled webhook.
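
For scripts that prefer Python over curl, the same request might look like the sketch below. It uses only the standard library; the helper names build_cancel_request and send_cancel are hypothetical, while the host, path, headers, and body field come from the curl example above.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.stech.com"  # host taken from the curl example above


def build_cancel_request(org, agent_id, run_id, api_key, reason=None):
    """Build (but do not send) the cancel POST for one run."""
    # Quote each path segment so unusual ids cannot alter the route.
    org, agent_id, run_id = (
        urllib.parse.quote(part, safe="") for part in (org, agent_id, run_id)
    )
    url = f"{API_BASE}/v1/orgs/{org}/agents/{agent_id}/runs/{run_id}/cancel"
    body = {"reason": reason} if reason else {}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def send_cancel(request, timeout=10):
    """Send the request and return (status, parsed JSON body)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))
```

Splitting construction from sending keeps the request easy to inspect or log before it goes out. A caller would expect status 202 and a body with acknowledgedAt still null.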

Webhook

Subscribe a webhook endpoint to agent_run.cancelled to receive an event when a cancel actually lands (not just when one is requested). Useful for audit pipelines and "who killed what" alerting.

curl -fsSL https://api.stech.com/v1/orgs/$ORG/webhook-endpoints \
  -H "Authorization: Bearer $STECH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://hooks.acme.example/stech-cancels",
    "description": "log agent_run cancellations",
    "events": ["agent_run.cancelled"]
  }'

Wildcard (["*"]) subscriptions match agent_run.cancelled automatically.

API reference

POST /v1/orgs/:slug/agents/:agentId/runs/:runId/cancel

Request the cancellation of a live run.

| Path param | Notes |
| --- | --- |
| slug | Org slug; caller must be a member |
| agentId | Deployment id |
| runId | Per-run id surfaced by the SSE started frame at run-stream open time |

Body (optional):

{ "reason": "string, ≤ 500 chars, captured for audit",
  "conversationId": "optional — captured for dashboard-mode cancels" }

Both fields are optional and the body itself is optional. A malformed JSON body is tolerated as empty (no 400). reason is trimmed and truncated to 500 chars.
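
The tolerant parsing described above can be sketched as follows. This is an illustration of the stated rules, not the server's actual code: normalize_cancel_body is a hypothetical name, and treating a whitespace-only reason as absent is an assumption.

```python
import json

MAX_REASON_CHARS = 500  # limit stated in the body description above


def normalize_cancel_body(raw):
    """Parse an optional cancel body, treating malformed JSON as empty."""
    try:
        parsed = json.loads(raw) if raw else {}
    except (ValueError, TypeError):
        parsed = {}  # malformed body: tolerated, no 400
    if not isinstance(parsed, dict):
        parsed = {}  # valid JSON but not an object, e.g. a list

    reason = parsed.get("reason")
    if isinstance(reason, str):
        # Trim first, then truncate; an empty result is treated as absent (assumption).
        reason = reason.strip()[:MAX_REASON_CHARS] or None
    else:
        reason = None

    conversation_id = parsed.get("conversationId")
    if not isinstance(conversation_id, str):
        conversation_id = None

    return {"reason": reason, "conversationId": conversation_id}
```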

| Status | Meaning |
| --- | --- |
| 202 Accepted | Cancel recorded (or already recorded, idempotently). Body carries requestedAt + acknowledgedAt. |
| 401 unauthorized | No / invalid bearer / session |
| 403 forbidden | Caller is not a member of the resolved org |
| 404 not_found | Agent doesn't exist, or exists in another org (leak-safely indistinguishable) |

Response body on 202:

{
  "cancelled": true,
  "requestedAt": "2026-05-08T14:09:51.103Z",
  "acknowledgedAt": null,
  "stopReason": null
}

acknowledgedAt is null until the runtime observes the signal and posts the internal cancel-acknowledge. stopReason is always null on this response — the eventual SSE done frame carries the actual stop_reason ('cancelled' once the runtime exits). Re-polling this endpoint to learn of the ack isn't necessary; the run-stream's own done frame is the source of truth, and the agent_run.cancelled webhook fires with the final timestamps.

A second call for the same (organization_id, run_id) returns the original requestedAt and whatever acknowledgedAt is in the DB — idempotency is enforced by a unique index on agent_run_cancellations(organization_id, run_id).
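
To see why a unique index makes the retry return the original timestamps, here is a small sketch using SQLite in place of the production database. The table and column names follow the text; the index name, the reduced column set, and the request_cancel helper are assumptions for illustration.

```python
import sqlite3

# Reduced schema: only the columns needed to show the idempotency behaviour.
SCHEMA = """
CREATE TABLE agent_run_cancellations (
    organization_id TEXT NOT NULL,
    run_id          TEXT NOT NULL,
    requested_at    TEXT NOT NULL,
    acknowledged_at TEXT
);
CREATE UNIQUE INDEX agent_run_cancellations_org_run
    ON agent_run_cancellations (organization_id, run_id);
"""


def request_cancel(db, org_id, run_id, now):
    """Record a cancel request; a repeat call returns the original timestamps."""
    db.execute(
        "INSERT INTO agent_run_cancellations (organization_id, run_id, requested_at) "
        "VALUES (?, ?, ?) ON CONFLICT (organization_id, run_id) DO NOTHING",
        (org_id, run_id, now),
    )
    # Read back whatever is stored, which is the first request's row on a retry.
    requested_at, acknowledged_at = db.execute(
        "SELECT requested_at, acknowledged_at FROM agent_run_cancellations "
        "WHERE organization_id = ? AND run_id = ?",
        (org_id, run_id),
    ).fetchone()
    return {
        "cancelled": True,
        "requestedAt": requested_at,
        "acknowledgedAt": acknowledged_at,
        "stopReason": None,
    }
```

Calling request_cancel twice for the same (org, run) pair with different clock values yields identical responses, because the second insert is skipped and the read returns the first row.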

Where the runId comes from

The streaming run endpoint emits a started frame as the first SSE message:

event: message
data: {"type":"started","runId":"run_8w3..."}

Capture that runId and feed it into the cancel POST. The chat shell does this client-side; programmatic callers consuming the SSE stream should do the same.
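
A programmatic caller could pull the runId out of the stream with something like the sketch below. It assumes each frame arrives as a single data: line holding one JSON object, as in the example above; find_run_id is a hypothetical helper name.

```python
import json


def find_run_id(sse_lines):
    """Return the runId from the first `started` frame, or None if absent."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip `event:` lines, comments, and blank separators
        try:
            frame = json.loads(line[len("data:"):])
        except ValueError:
            continue  # ignore payloads that are not JSON
        if isinstance(frame, dict) and frame.get("type") == "started":
            return frame.get("runId")
    return None
```

Because the started frame is the first message, a caller can stop scanning as soon as this returns a value and keep the id around for a later cancel.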

Cancellation lifecycle

What happens server-side once the POST lands:

  1. API records the request. A row is inserted into agent_run_cancellations with organization_id, agent_id, run_id, requested_by_user_id, and (if supplied) reason / conversation_id. Insert is ON CONFLICT (organization_id, run_id) DO NOTHING — idempotent across retries.
  2. Audit row written. An admin_actions row is appended with verb agent_runs.cancel_requested, the actor user, the agent + conversation, and the supplied reason. Visible in /settings/audit?tab=admin&action=agent_runs.cancel_requested.
  3. Runtime polls. Between agent-loop iterations the runtime hits GET /v1/internal/runs/:runId/cancel-status; the api returns { cancelled: true } when an agent_run_cancellations row exists. Worst case the in-flight model step + tool call finishes before the next poll observes the signal.
  4. Runtime acknowledges. On observed cancel, the runtime calls POST /v1/internal/runs/:runId/cancel-acknowledge; the api stamps acknowledged_at (idempotent, via a WHERE acknowledged_at IS NULL guard). The gap between requested_at and acknowledged_at is the "how long did it take to actually stop" metric.
  5. Runtime exits cleanly. The agent loop returns with stop_reason='cancelled', and the SSE stream emits its done frame carrying the partial finalText, the run's iteration count, and the token usage tally up to cancellation.
  6. API persists + fans out. The api proxy sees the done frame, persists the conversation turn with stop_reason='cancelled', and fires the agent_run.cancelled webhook with the cancellation context (looked up from the agent_run_cancellations row at fan-out time).
  7. Run shows in history. The run lands in /agents/[id]/runs with a gray cancelled pill (distinct from green completed and red failed). Failure-rate metrics exclude the cancellation from both numerator and denominator — a spree of user cancellations doesn't trigger the watchdog (see Observability semantics below).
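
The runtime side of steps 3 to 5 can be sketched as a loop that checks for the signal only at iteration boundaries. This is a simplified model, not the runtime's code: the three callables are hypothetical stand-ins for the model step, the cancel-status poll, and the cancel-acknowledge call, and the non-cancelled stop_reason values are placeholders.

```python
def run_agent_loop(run_step, is_cancelled, acknowledge, max_iterations=10):
    """Run steps until completion, the iteration cap, or an observed cancel.

    All three callables are hypothetical stand-ins:
      run_step()      -> True when the agent has produced its final answer
      is_cancelled()  -> the cancel-status poll (step 3)
      acknowledge()   -> the cancel-acknowledge call (step 4)
    """
    iterations = 0
    while iterations < max_iterations:
        # Poll only at the iteration boundary; an in-flight step is never interrupted.
        if is_cancelled():
            acknowledge()
            return {"stop_reason": "cancelled", "iterations": iterations}
        finished = run_step()
        iterations += 1
        if finished:
            # "completed" and "max_iterations" are placeholder values (assumption).
            return {"stop_reason": "completed", "iterations": iterations}
    return {"stop_reason": "max_iterations", "iterations": iterations}
```

The shape of the loop is what produces the latency bound: a cancel that arrives just after a poll waits for one full step before the next poll sees it.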

Webhook event payload

agent_run.cancelled uses the same envelope as every other webhook (see webhooks.md for id / type / createdAt / organizationId, signing scheme, retry policy, dedupe on id). A full example event, with the event-specific fields under data:

{
  "id": "9b2e1c44-15a3-4f0e-91d4-7c8a3f1d2e10",
  "type": "agent_run.cancelled",
  "createdAt": "2026-05-08T14:10:21.554Z",
  "organizationId": "org_2t4b...",
  "data": {
    "runId": "run_8w3...",
    "deploymentId": "dep_4kp...",
    "agentName": "support-triage",
    "conversationId": "cnv_7m1...",
    "stopReason": "cancelled",
    "iterations": 2,
    "usage": { "input": 4218, "output": 612 },
    "cancelledByUserId": "usr_9q1...",
    "cancellationReason": "model kept calling gh tool with bad args",
    "requestedAt": "2026-05-08T14:09:51.103Z",
    "acknowledgedAt": "2026-05-08T14:10:18.220Z"
  }
}

| Field | Meaning |
| --- | --- |
| runId | Per-run id (matches the SSE started frame's runId and the cancel-POST path param). |
| deploymentId | Deployment that ran (same as agentId on the cancel route). |
| agentName | Display name, joined from deployments.agent_name. |
| conversationId | Conversation the run belongs to, or null for non-persisted runs. |
| stopReason | The runtime's exit reason; always "cancelled" for this event type. |
| iterations | Tool-call iteration count at exit, or null if the runtime didn't supply one. |
| usage.input / usage.output | Tokens billed for the partial run; either may be null if the runtime didn't tally. |
| cancelledByUserId | The user who POSTed the cancel. null on system-cancels (future feature; today every cancel has a human actor). |
| cancellationReason | The optional reason from the cancel POST body. null when no reason was supplied. |
| requestedAt | When the cancel POST landed. |
| acknowledgedAt | When the runtime observed the signal and stamped the ack. null in the rare race where the runtime exited via stop_reason='cancelled' without going through the ack path. |

finalText is not on this event. The model output up to cancel lives on the persisted conversation row; the cancellation event is about the cancel, not the partial reply. Subscribers that want the text can join conversationId + runId against the conversations read endpoint.
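
A receiver for audit or alerting might reduce the event to a few fields, as in the sketch below. It assumes the payload has already passed signature verification (covered in webhooks.md and not shown here); summarize_cancellation and its output keys are hypothetical names.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO-8601 UTC timestamp such as 2026-05-08T14:09:51.103Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_cancellation(event):
    """Pull audit-relevant fields out of an agent_run.cancelled event."""
    if event.get("type") != "agent_run.cancelled":
        raise ValueError("not an agent_run.cancelled event")
    data = event["data"]
    usage = data.get("usage") or {}
    acked = data.get("acknowledgedAt")
    latency = None  # stays None in the rare race where no ack was stamped
    if acked is not None:
        latency = (parse_ts(acked) - parse_ts(data["requestedAt"])).total_seconds()
    return {
        "runId": data["runId"],
        "cancelledBy": data.get("cancelledByUserId"),
        "secondsToStop": latency,
        # Either usage field may be null, so missing values count as zero here.
        "totalTokens": (usage.get("input") or 0) + (usage.get("output") or 0),
    }
```

For the example event above, this gives a stop latency of about 27.1 seconds and 4830 total tokens.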

Observability semantics

Cancellation is its own bucket — disjoint from completed (a normal termination) and failed (an unexpected one). It propagates through every observability surface:

| Surface | How cancellations appear |
| --- | --- |
| Run history pill (/agents/[id]/runs) | Gray cancelled (vs green completed / red failed) |
| Run history filter (?status=...) | cancelled is a valid token alongside completed and failed; combinable (?status=cancelled,failed) |
| Aggregations (totalRuns, failedRuns, cancelledRuns) | cancelledRuns is its own counter; failedRuns does NOT double-count cancellations |
| Org metrics totals + runs/day chart | Same: cancelledRuns per day, separate from failedRuns |
| Top agents by error rate | Denominator is runs - cancelledRuns; an agent that's mostly being cancelled by users doesn't show as high error rate |
| Failure-rate watchdog (audit.flagged with kind: 'agent_failure_rate') | Cancellations excluded from numerator AND denominator; window-unfilled gate also uses the effective non-cancelled count (a 50-run window with 40 cancellations is effectively a 10-run sample, below noise floor) |

The watchdog payload now includes cancelledCount and effectiveWindowSize (= windowSize - cancelledCount) alongside the familiar failedCount / failureRate / windowSize — receivers that want to surface "10 fails out of 20 consequential runs (50 sampled)" have the numbers.
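
The arithmetic behind those fields can be sketched as below. The field names match the payload described above; the min_effective threshold and the belowNoiseFloor flag are hypothetical, since the actual noise-floor value is not stated here.

```python
def failure_rate_sample(window_size, failed_count, cancelled_count, min_effective=20):
    """Compute the watchdog's effective sample with cancellations excluded.

    min_effective is a hypothetical noise-floor threshold, not a documented value.
    """
    effective = window_size - cancelled_count
    # Cancellations are in neither the numerator nor the denominator.
    rate = failed_count / effective if effective > 0 else None
    return {
        "windowSize": window_size,
        "failedCount": failed_count,
        "cancelledCount": cancelled_count,
        "effectiveWindowSize": effective,
        "failureRate": rate,
        "belowNoiseFloor": effective < min_effective,  # hypothetical flag
    }
```

With 50 sampled runs, 30 cancellations, and 10 failures, the effective window is 20 and the rate is 0.5, which is the "10 fails out of 20 consequential runs (50 sampled)" case. A window of 50 with 40 cancellations leaves an effective sample of 10.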

See observability.md for the full metrics catalog, percentile semantics, and the failure-rate watchdog's audit.flagged event shape.

Edge cases

  • Cancel a run that already finished. Returns 202 (idempotent), but has no actual runtime effect — the run already terminated. The cancellation row is still inserted for audit, and the audit admin_actions row is still written.
  • Cancel during a long tool call. The tool call (gh / mcp / Anthropic API call) completes — we don't kill subprocesses or abort the fetch. The cancel is observed at the next iteration boundary, before the model would have started its next step.
  • Network blip during the cancel POST. Safe to retry. The (organization_id, run_id) unique constraint makes the second insert a no-op; the response carries the original timestamps from the first attempt.
  • Cancel from a different user than the one who triggered the run. Allowed by design. The cancel route is org-scoped, not run-owner-scoped — a teammate observing a misbehaving run shouldn't be blocked from killing it.
  • conversationId from another user's conversation. The cancel still proceeds; the field is dropped silently rather than 404'd. A bad-faith caller can't enumerate conversation ids by observing acceptance / rejection. The runtime gets less audit context but the cancel itself lands.

Out of scope (deferred)

  • Hard-kill — sending SIGTERM mid-tool-call. Too disruptive (half-applied filesystem changes, half-finished MCP transactions); deferred until customers ask.
  • Per-tool-call cancel — "stop running this gh command, but keep the agent loop going". Finer-grained surface; deferred.
  • Bulk cancel — "cancel all runs for agent X". Useful during an outage; deferred.
  • Cost-cap enforcement — auto-cancel when a run exceeds $X spent. Separate epic; the cancel API is the primitive it would build on.

Troubleshooting

I clicked cancel and the run kept running for 20 seconds. Expected. The in-flight model step (or tool subprocess) finishes before the runtime observes the signal between iterations. Worst case ~30s. The first cancel response carries acknowledgedAt: null; the acknowledgedAt on the agent_run.cancelled webhook (or on a repeated cancel call made after the runtime has acked) tells you when the runtime actually observed the cancel.

Cancel button doesn't appear in the chat shell. It's only visible while the SSE stream is live. If you opened the page after the run completed, the button won't show — there's nothing to cancel.

I see stop_reason='cancelled' but no audit row. The audit row lives in admin_actions with verb agent_runs.cancel_requested. Check /settings/audit?tab=admin&action=agent_runs.cancel_requested. The verb is on the cancel-request side; the runtime's stop_reason is the cancel-observed side. They're separate writes and either could lag.

Webhook never fired. Check, in order: (a) the endpoint subscribes to agent_run.cancelled (or *); (b) the endpoint has enabled = true; (c) the delivery log at /settings/audit?tab=webhooks shows the attempt (it'll be there even on receiver-side failure). The event fires off the SSE done frame, not off the cancel POST, so if the runtime didn't reach a clean exit (rare race), the event won't fire.

Failure-rate alert fired even though the recent runs were all cancellations. Bug — cancellations should be excluded from both numerator and denominator. Check the audit.flagged event's data: failedCount should NOT include cancellations, and cancelledCount should be populated. If cancelledCount is missing or failedCount includes runs whose stop_reason='cancelled', file an issue with the event id.

cancelledByUserId is null on the webhook. Either the cancel was issued by a system path (future feature; not yet shipped — should not happen today), or the cancellation row was deleted between the SSE done frame and the webhook fan-out (extremely rare; the row is written before the runtime is ever told). Check the audit log for the matching agent_runs.cancel_requested row.

  • Observability: full metrics catalog, the cancelled status pill in the run history, the failure-rate watchdog (now cancellation-aware).
  • Webhooks: agent_run.cancelled envelope, signing scheme, retry policy, dedupe on event.id.
  • Audit log: every cancel-request lands in admin_actions with verb agent_runs.cancel_requested; visible at /settings/audit?tab=admin&action=agent_runs.cancel_requested.
  • Policy and guardrails: the parallel blocked:<kind> terminal status bucket. Same disjoint-from-failed posture and same dedicated webhook channel (agent_run.blocked).