Policy and guardrails
Agent authors declare guardrails as an array of strings on
defineAgent(). The platform parses them at deploy time, refuses
unknown shapes with a 400, and enforces every parsed entry in both
the dev runtime and the cloud runtime. A run that fires a guardrail
terminates with a blocked:<kind> stop reason, a structured blocked
envelope on the response, a row in the guardrail_violations audit
table, an audit.flagged webhook, and a parallel agent_run.blocked
webhook on the dedicated SIEM channel.
This page is the catalog: what each guardrail does, where it fires, the wire shape of a block, and how to subscribe to or query the audit trail.
Authoring #
Guardrails live on the AgentConfig object passed to defineAgent():
import { defineAgent, tool } from "@stech/sdk";
export default defineAgent({
  name: "support-triage",
  model: "claude-sonnet-4-5",
  instructions: "Help users triage open tickets.",
  tools: [
    tool("ticket.lookup", ticketSchema),
    tool("crm.lookup", crmSchema),
  ],
  guardrails: [
    "pii.redact",
    "rate:10/min",
    "max_tokens=4096",
    "input_max_chars=8000",
    "require_tool_allowlist=ticket.lookup,crm.lookup",
  ],
});
The strings parse into a closed union (see cli/src/sdk/guardrails.ts).
Adding a new kind is a four-step coordinated change: extend the union,
add a parser case, add checker functions in both runtimes, and
document it here. Custom custom:<name> opaque shapes are
deliberately refused for v1 — letting an author smuggle a policy
claim that the runtime can't actually enforce is worse than refusing
the unknown string outright.
Validation #
Two gates catch malformed guardrail strings before they reach the cloud runtime:
- stech dev boot. AgentRunner construction calls installGuardrails() and hard-fails on bad-config strings (e.g. rate:10/foobar, max_tokens=-1). It warns and continues on unknown-kind strings (e.g. pii.shred) so a future SDK shape added in cloud doesn't brick a stale local checkout.
- stech deploy. The api validates the guardrails form field with the SDK-mirrored parser and refuses the deploy with 400 invalid_guardrails plus a detail listing every bad string and the full menu of accepted shapes.
The mirror tests (api/tests/guardrails.test.ts,
runtime/tests/guardrails-mirror.test.ts) pin the api + runtime
parsers to the SDK parser on a representative happy + sad input set
so the three sides cannot drift.
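The bad-config versus unknown-kind distinction can be sketched as a small parser over a discriminated union. This is an illustration only: the names Guardrail, ParseResult, and parseGuardrail are hypothetical and are not taken from cli/src/sdk/guardrails.ts, and only three of the eight kinds are covered.

```typescript
// Hypothetical sketch; type and function names are assumptions, not the SDK's.
type Unit = "sec" | "min" | "hour";

type Guardrail =
  | { kind: "pii.redact" }
  | { kind: "rate"; n: number; unit: Unit }
  | { kind: "max_tokens"; n: number };

type ParseResult =
  | { ok: true; guardrail: Guardrail }
  | { ok: false; reason: "bad_config" | "unknown_kind" };

function parseGuardrail(s: string): ParseResult {
  if (s === "pii.redact") {
    return { ok: true, guardrail: { kind: "pii.redact" } };
  }
  if (s.startsWith("rate:")) {
    // Known kind: anything that fails the strict shape is a bad config.
    const m = /^rate:([1-9]\d*)\/(sec|min|hour)$/.exec(s);
    if (!m) return { ok: false, reason: "bad_config" };
    return { ok: true, guardrail: { kind: "rate", n: Number(m[1]), unit: m[2] as Unit } };
  }
  if (s.startsWith("max_tokens=")) {
    const m = /^max_tokens=([1-9]\d*)$/.exec(s);
    if (!m) return { ok: false, reason: "bad_config" };
    return { ok: true, guardrail: { kind: "max_tokens", n: Number(m[1]) } };
  }
  // No prefix matched: the kind itself is not in the closed union.
  return { ok: false, reason: "unknown_kind" };
}
```

Under this sketch, a dev-style caller would throw on bad_config and log a warning on unknown_kind, while a deploy-style caller would reject both.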
Catalog #
Eight guardrail kinds in v1. Every one has a string syntax, a
runtime seam, a BlockedDetails shape, and a worked example.
pii.redact #
Strip emails, US SSNs, and US/E.164 phone numbers from user prompts before they reach the model. Regex-based — fast, deterministic, with a documented false-positive cost (matches inside code samples or API docs the agent is summarizing).
| Field | Value |
|---|---|
| String form | pii.redact |
| Seam | user-prompt entry, before message append |
| Action | rewrites the prompt in place; does not block the run |
| BlockedDetails | n/a (this guardrail observes, it doesn't terminate) |
| Audit signal | runtime emits a guardrail SSE event (no audit row, no webhook) |
guardrails: ["pii.redact"]
A prompt such as email me at <email address> becomes
email me at [REDACTED:email] before the model sees it. The original
text is not persisted anywhere — once redacted, the cleaned prompt
is what lands in agent_messages.user_text.
LLM-judged redaction (more accurate, more expensive) is a future opt-in shape; the regex catches the common cases and is cheap enough to run on every turn.
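A minimal sketch of regex-based redaction follows. The patterns are illustrative guesses, not the runtime's actual expressions, and the [REDACTED:ssn] and [REDACTED:phone] tokens are assumed by analogy with the [REDACTED:email] token shown above; redactPii is a hypothetical name.

```typescript
// Illustrative patterns only; the real runtime regexes are not shown on this page.
const PII_PATTERNS: Array<[string, RegExp]> = [
  ["email", /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g],
  // US SSN in the common 3-2-4 dashed form.
  ["ssn", /\b\d{3}-\d{2}-\d{4}\b/g],
  // Either an E.164 number (+ and 7 to 15 digits) or a US 3-3-4 number.
  ["phone", /(?:\+[1-9]\d{6,14}|\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4})\b/g],
];

// Applies each pattern in order and replaces every match with a labelled token.
function redactPii(prompt: string): string {
  return PII_PATTERNS.reduce(
    (text, [label, pattern]) => text.replace(pattern, `[REDACTED:${label}]`),
    prompt,
  );
}
```

Because the match is purely lexical, a string that merely looks like an SSN or phone number inside a code sample is redacted too, which is the false-positive cost described above.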
rate:N/UNIT #
Cap ask() calls to N per rolling UNIT window. Per
AgentRunner instance in dev; per runtime machine in cloud.
| Field | Value |
|---|---|
| String form | rate:N/{sec,min,hour} |
| Seam | per-iteration boundary, before the first provider.generate() |
| Action | terminates with stopReason='blocked:rate' |
| limit | configured N |
| observed | actual in-window request count at trip time |
guardrails: ["rate:10/min", "rate:100/hour"]
Multiple rate: entries stack; the strictest one trips first.
Distributed limitation. A customer running multiple Fly machines for one deployment gets N times the configured budget — each machine maintains its own bucket. Redis-backed distributed rate limiting is filed for a follow-up; the in-memory v1 is a per-machine ceiling, not a per-org one.
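The per-machine bucket can be pictured with a small in-memory rolling-window limiter. This is a sketch under assumptions: createLimiter is a hypothetical name, timestamps are passed in explicitly to keep it deterministic, and whether observed counts the refused call is not stated here, so the sketch reports only the calls already inside the window.

```typescript
// Hypothetical sketch of a per-process rolling-window limiter.
const UNIT_MS = { sec: 1_000, min: 60_000, hour: 3_600_000 } as const;
type RateUnit = keyof typeof UNIT_MS;

function createLimiter(limit: number, unit: RateUnit) {
  let stamps: number[] = [];
  // Returns null when the call is admitted, or { limit, observed } when it trips.
  return (nowMs: number): { limit: number; observed: number } | null => {
    // Keep only the calls that are still inside the rolling window.
    stamps = stamps.filter((t) => t > nowMs - UNIT_MS[unit]);
    if (stamps.length >= limit) {
      return { limit, observed: stamps.length };
    }
    stamps.push(nowMs);
    return null;
  };
}
```

Since the state lives in a closure, two processes each calling createLimiter(10, "min") admit up to 20 calls per minute between them, which is the multi-machine behavior described above.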
max_tokens=N #
Per-run cumulative output-token ceiling. Two checks:
- Pre-iteration: clamp the provider's max_tokens request to min(provider_default, N - cumulative_output).
- Post-iteration: recordUsage() checks cumulative_output > N and terminates if a provider returned more than the clamped request.
| Field | Value |
|---|---|
| String form | max_tokens=N (positive integer) |
| Seam | both pre-call (clamp) and post-call (verify) |
| Action | terminates with stopReason='blocked:max_tokens' |
| limit | configured N |
| observed | cumulative output tokens at trip time |
guardrails: ["max_tokens=4096"]
The post-iteration check is defense in depth: Anthropic clamps internally, but a future provider may not.
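The two checks reduce to a pair of small functions. The sketch below uses hypothetical names (clampMaxTokens, exceedsMaxTokens) and simply restates the formulas from the list above; it is not the runtime's recordUsage() implementation.

```typescript
// Pre-iteration: the max_tokens value to request from the provider for the next call.
function clampMaxTokens(providerDefault: number, limit: number, cumulativeOutput: number): number {
  return Math.min(providerDefault, limit - cumulativeOutput);
}

// Post-iteration: true when cumulative output has gone past the configured ceiling.
function exceedsMaxTokens(limit: number, cumulativeOutput: number): boolean {
  return cumulativeOutput > limit;
}
```

With max_tokens=4096, a provider default of 8192, and 3000 output tokens already used, the next request is clamped to 1096. If a provider ignored the clamp and returned 1521 tokens, the cumulative total of 4521 would fail the post-iteration check.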
max_cost=N #
Per-run USD spend ceiling, in micro-cents (1_000_000 micros ==
1 cent == $0.01). Tracked via the runtime's recordCost() seam.
Cost is computed by the caller from usage + a model-pricing table —
the runtime doesn't know prices.
| Field | Value |
|---|---|
| String form | max_cost=N (positive integer micros) |
| Seam | per-iteration recordCost() |
| Action | terminates with stopReason='blocked:max_cost' |
| limit | configured N (micros) |
| observed | cumulative micros at trip time |
guardrails: ["max_cost=10000"]
// → 10000 micros == 0.01 cent == cap of $0.0001 per run
The micro-cent unit avoids floating-point drift on multi-iteration
runs and matches the resolution of the billing aggregator's
usage_records table.
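The unit arithmetic and the cumulative check can be sketched as follows. The names usdToMicros and makeCostTracker are hypothetical, and the strict greater-than comparison is an assumption made by analogy with the max_tokens check; the runtime's recordCost() may differ.

```typescript
const MICROS_PER_CENT = 1_000_000;
const MICROS_PER_USD = 100 * MICROS_PER_CENT;

// Converts a dollar amount to integer micros, rounding once at the boundary.
function usdToMicros(usd: number): number {
  return Math.round(usd * MICROS_PER_USD);
}

// Accumulates integer micros per iteration, so no float error builds up across a run.
function makeCostTracker(limitMicros: number) {
  let cumulative = 0;
  return (iterationMicros: number): { limit: number; observed: number } | null => {
    cumulative += iterationMicros;
    // Assumption: the guardrail trips only once the ceiling is strictly exceeded.
    return cumulative > limitMicros ? { limit: limitMicros, observed: cumulative } : null;
  };
}
```

A caller that wants a cap of one US cent per run would therefore write max_cost=1000000, not max_cost=1.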
block_models=PAT,... #
Refuse to even start the run if agent.model matches any pattern.
Patterns are simple anchored globs — only * is supported as a
wildcard, and the match is against the full model string.
claude-* does NOT match my-claude-3 because both ends are
anchored; metachars like . are escaped before compilation so a
stray gpt-4.0 cannot widen to gpt-410.
| Field | Value |
|---|---|
| String form | block_models=pat1,pat2,... |
| Seam | run boot, before provider construction |
| Action | terminates with stopReason='blocked:block_models' |
| limit | null (the trigger is a match, not a numeric ceiling) |
| observed | the offending model string |
guardrails: ["block_models=gpt-3.5*,claude-2*"]
Use case: an org policy restricts which model families the agent can
target. The shape is forward-compatible with PR-3 of the epic, where the same
block_models= shape is the surface an admin uses to ban model
families across every agent in the org.
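The anchored-glob rule described above can be sketched by escaping every regex metacharacter, turning each * into .*, and anchoring both ends. globToRegExp and isBlockedModel are hypothetical names; the runtime's own compiler is not shown on this page.

```typescript
// Compiles a glob where * is the only wildcard into a fully anchored RegExp.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    // Escape regex metacharacters in the literal parts so "." matches only a dot.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// True when the model string matches at least one blocked pattern.
function isBlockedModel(model: string, patterns: string[]): boolean {
  return patterns.some((p) => globToRegExp(p).test(model));
}
```

For example, isBlockedModel("my-claude-3", ["claude-*"]) is false because of the leading anchor, and the pattern gpt-4.0 does not match gpt-410 because the dot is escaped.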
require_tool_allowlist=tool_a,tool_b #
Only allow tool calls whose name appears in the comma-separated
allowlist. Other tool calls return an is_error tool_result block
with a "blocked by policy" message and the model can recover —
a refused tool does NOT terminate the run, matching how unknown-
tool errors are handled today.
| Field | Value |
|---|---|
| String form | require_tool_allowlist=tool_a,tool_b,... |
| Seam | tool dispatch, per call |
| Action | refused tool returns an is_error tool_result; run continues |
| Audit signal | runtime emits a guardrail SSE event per refusal |
guardrails: ["require_tool_allowlist=ticket.lookup,crm.lookup"]
The model sees the refusal in the next turn's tool-result block and can choose to continue without the blocked tool, retry with a different name, or surrender by returning text.
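A dispatch-time gate of this kind might look like the sketch below. The function names are hypothetical, the result fields follow the general shape of an Anthropic-style tool_result block (an assumption about what the runtime emits), and the exact refusal wording is illustrative.

```typescript
interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  is_error: boolean;
  content: string;
}

// Parses "require_tool_allowlist=a,b" into a set of permitted tool names.
function parseAllowlist(guardrail: string): Set<string> {
  const value = guardrail.slice("require_tool_allowlist=".length);
  return new Set(value.split(",").map((name) => name.trim()).filter((name) => name.length > 0));
}

// Returns null when the call may be dispatched, or an error block to hand back to the model.
function gateToolCall(allowlist: Set<string>, call: { id: string; name: string }): ToolResultBlock | null {
  if (allowlist.has(call.name)) return null;
  return {
    type: "tool_result",
    tool_use_id: call.id,
    is_error: true,
    content: `tool "${call.name}" blocked by policy`,
  };
}
```

The gate returns a value instead of throwing, which mirrors the point above: a refused tool is reported to the model and the run continues.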
input_max_chars=N #
Refuse user prompts longer than N characters. Cheap defense
against accidental prompt-bombing. The check runs before the
pii.redact regex pass, so a 50KB prompt doesn't waste a regex pass
before being rejected anyway.
| Field | Value |
|---|---|
| String form | input_max_chars=N (positive integer) |
| Seam | user-prompt entry, before message append |
| Action | terminates with stopReason='blocked:input_max_chars' |
| limit | configured N |
| observed | actual prompt length |
guardrails: ["input_max_chars=8000"]
output_max_chars=N #
Abort runs whose final assistant text exceeds N
characters. Checked at the iteration end, after the model produced
its text but before the loop returns the RunResult.
| Field | Value |
|---|---|
| String form | output_max_chars=N (positive integer) |
| Seam | iteration end, after extractFinalText |
| Action | terminates with stopReason='blocked:output_max_chars' |
| limit | configured N |
| observed | actual final-text length |
guardrails: ["output_max_chars=12000"]
Wire shape of a block #
When a guardrail fires, the run terminates cleanly with a structured
envelope on both the synchronous /run JSON response and the
streaming /run-stream done SSE frame.
Synchronous /run:
{
  "runId": "run_8w3...",
  "stopReason": "blocked:max_tokens",
  "finalText": "",
  "iterations": 2,
  "usage": { "input": 4218, "output": 4521 },
  "blocked": {
    "guardrail": "max_tokens",
    "limit": 4096,
    "observed": 4521,
    "source": "agent",
    "message": "cumulative output 4521 tokens > guardrail max_tokens=4096"
  }
}
Streaming /run-stream done frame:
{
  "type": "done",
  "runId": "run_8w3...",
  "stopReason": "blocked:max_tokens",
  "finalText": "",
  "iterations": 2,
  "usage": { "input": 4218, "output": 4521 },
  "messages": [ /* ... */ ],
  "blocked": {
    "guardrail": "max_tokens",
    "limit": 4096,
    "observed": 4521,
    "source": "agent",
    "message": "cumulative output 4521 tokens > guardrail max_tokens=4096"
  }
}
The blocked key is omitted when the run is not blocked, so the wire
shape stays stable for the normal path. Receivers that want to detect a
block should branch on blocked != null, not on a stop-reason prefix.
source is "agent" for v1 — every guardrail in v1 originates from
the agent's declared guardrails array. The field exists today so a
future org-policy override surface widens to "org_policy" without
a wire-shape change.
limit is null for guardrails where the trigger is a match, not a
numeric ceiling (pii.redact, block_models).
observed is heterogeneous: number for size/count checks (rate,
max_tokens, input_max_chars, output_max_chars, max_cost), string
for name-match checks (block_models, require_tool_allowlist), null
for boolean-style triggers.
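For receivers written in TypeScript, the envelope can be modelled roughly as below. The interface names and the describeRun helper are hypothetical; the field types are inferred from the examples and the field notes above, not copied from the SDK.

```typescript
// Inferred from the wire examples above; not the SDK's published types.
interface BlockedDetails {
  guardrail: string;
  limit: number | null;
  observed: number | string | null;
  source: "agent" | "org_policy";
  message: string;
}

interface RunResponse {
  runId: string;
  stopReason: string;
  finalText: string;
  iterations: number;
  usage: { input: number; output: number };
  blocked?: BlockedDetails; // omitted entirely on the normal path
}

// Branches on the presence of the envelope, not on the stopReason prefix.
function describeRun(result: RunResponse): string {
  if (result.blocked != null) {
    return `blocked by ${result.blocked.guardrail}: ${result.blocked.message}`;
  }
  return `finished: ${result.stopReason}`;
}
```

The loose != null comparison covers both an omitted key and an explicit null, so the same branch keeps working if a serializer ever emits the key with a null value.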
Dashboard #
/settings/guardrails (org-admin link in the settings sidebar).
Server-rendered table of every blocked run in the org, with a stat
strip + per-agent + per-guardrail filters + cursor-paginated
"load more". Same chrome as the audit page.
The table sources from the guardrail_violations audit table — one
row per terminal block — written by the api proxy in the same code
path that fans out the agent_run.completed/failed/cancelled
webhooks. The writer is fire-and-forget: a write failure logs and
swallows; the run is unaffected.
API #
GET /v1/orgs/:slug/guardrail-violations #
Paginated org-scoped list of guardrail violations. Same auth posture as the audit-log viewer (any org member; product policy is full audit transparency).
| Query param | Type | Default | Meaning |
|---|---|---|---|
| limit | int | 50 | Page size, clamped to [1, 200] |
| cursor | base64 string | none | Opaque cursor from a prior nextCursor |
| agentId | string | none | Narrow to one deployment |
| guardrail | string | none | Narrow to one guardrail kind (max_tokens, rate, …) |
Response:
{
  "violations": [
    {
      "id": "gv_2k3...",
      "agentId": "dep_4kp...",
      "agentName": "support-triage",
      "runId": "run_8w3...",
      "guardrailKind": "max_tokens",
      "source": "agent",
      "limit": "4096",
      "observed": 4521,
      "message": "cumulative output 4521 tokens > guardrail max_tokens=4096",
      "stopReason": "blocked:max_tokens",
      "occurredAt": "2026-05-08T14:09:51.103Z"
    }
    // ...
  ],
  "nextCursor": "eyJ0cyI6IjIwMjYtMDUtMDhUMTQ6MDk6NTEuMTAzWiIsImlkIjoiZ3ZfMmszIn0",
  "aggregations": {
    "total": 47,
    "byGuardrail": [
      { "guardrail": "max_tokens", "count": 31 },
      { "guardrail": "rate", "count": 12 },
      { "guardrail": "block_models", "count": 4 }
    ]
  }
}
nextCursor is null when there are no more rows. limit is a
string on the wire (the audit table stores it as text to keep
heterogeneous values in one column); coerce with parseInt if you
need a number. observed is number | string | null.
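A client that walks every page and normalizes limit might be sketched like this. The page fetcher is injected so the loop stays independent of any HTTP library; the type names, listAllViolations, and numericLimit are hypothetical, and treating limit as nullable on the row is an assumption based on the wire-shape notes.

```typescript
interface ViolationRow {
  id: string;
  guardrailKind: string;
  limit: string | null; // text on this route; assumed null for match-style guardrails
  observed: number | string | null;
}

interface ViolationsPage {
  violations: ViolationRow[];
  nextCursor: string | null;
}

type FetchPage = (cursor: string | null) => Promise<ViolationsPage>;

// Follows nextCursor until the server reports that no rows remain.
async function listAllViolations(fetchPage: FetchPage): Promise<ViolationRow[]> {
  const all: ViolationRow[] = [];
  let cursor: string | null = null;
  do {
    const page: ViolationsPage = await fetchPage(cursor);
    all.push(...page.violations);
    cursor = page.nextCursor;
  } while (cursor !== null);
  return all;
}

// Coerces the text limit to a number, or null when it is absent or non-numeric.
function numericLimit(row: ViolationRow): number | null {
  if (row.limit === null) return null;
  const n = parseInt(row.limit, 10);
  return Number.isNaN(n) ? null : n;
}
```

In practice fetchPage would wrap the GET request shown in the curl example and pass cursor through as the cursor query parameter.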
curl -fsSL "https://api.stech.com/v1/orgs/$ORG/guardrail-violations?guardrail=max_tokens&limit=100" \
  -H "Authorization: Bearer $STECH_API_KEY"
GET /v1/orgs/:slug/agent-runs/metrics #
The org-metrics endpoint (see observability.md)
adds a blockedRuns total + a topGuardrailsByBlocks breakdown so
admins can answer "which policy fires the most this week" without
hitting the violations route directly.
{
  "totals": {
    "runs": 4218,
    "failedRuns": 47,
    "cancelledRuns": 12,
    "blockedRuns": 31,
    "inputTokens": 18223451,
    "outputTokens": 2811042
  },
  "topGuardrailsByBlocks": [
    { "guardrail": "max_tokens", "blocks": 18 },
    { "guardrail": "rate", "blocks": 8 },
    { "guardrail": "block_models", "blocks": 5 }
  ]
}
?status=blocked on the run history #
/v1/orgs/:slug/agents/:id/runs?status=blocked filters to blocked
runs only. The status field on each row is one of completed,
failed, cancelled, blocked — blocked is disjoint from
the other three, and the failedRuns aggregation excludes blocks.
Webhook events #
Two events fire per blocked run, with identical data payloads.
Pick whichever channel matches your subscriber:
- audit.flagged with data.kind = "agent_run_blocked": aggregate alerting channel. Subscribers already filter on audit.flagged for the failure-rate watchdog (see observability.md) and the SCIM admin alerts; blocked runs slot in behind the data.kind discriminator.
- agent_run.blocked: dedicated SIEM channel. For consumers that don't want every audit.flagged (which mixes blocked runs with admin actions + watchdog alerts).
A subscriber to both channels gets two deliveries per blocked
run with identical data bytes, identical createdAt, distinct
event_id. That is the documented design — belt-and-braces SIEM
feeds are intentional, not a bug.
The example below is the full agent_run.blocked event; its data object is identical on both events:
{
  "id": "1f2e3d4c-5b6a-7889-99aa-bbccddeeff00",
  "type": "agent_run.blocked",
  "createdAt": "2026-05-08T14:09:51.103Z",
  "organizationId": "org_2t4b...",
  "data": {
    "kind": "agent_run_blocked",
    "agentId": "dep_4kp...",
    "agentName": "support-triage",
    "runId": "run_8w3...",
    "conversationId": "cnv_7m1...",
    "guardrail": "max_tokens",
    "source": "agent",
    "limit": 4096,
    "observed": 4521,
    "message": "cumulative output 4521 tokens > guardrail max_tokens=4096",
    "stopReason": "blocked:max_tokens"
  }
}
limit is a number on the webhook payload (the api projects
it from the runtime envelope's typed BlockedDetails); observed
is number | string | null.
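A receiver subscribed to both channels sees two events per blocked run with distinct ids, so deduplicating on the event id alone does not collapse the pair. One possible approach, sketched below with hypothetical names and an in-memory set, is to key on data.runId instead, on the assumption that a run is blocked at most once.

```typescript
interface BlockEvent {
  id: string;
  type: string;
  data: { kind: string; runId?: string };
}

// In-memory only; a real receiver would likely use a store with expiry.
const seenBlockedRuns = new Set<string>();

function handleEvent(event: BlockEvent): "handled" | "duplicate" | "ignored" {
  const isBlock =
    event.type === "agent_run.blocked" ||
    (event.type === "audit.flagged" && event.data.kind === "agent_run_blocked");
  const runId = event.data.runId;
  // Other audit.flagged kinds (watchdog alerts, admin actions) are not blocks.
  if (!isBlock || runId === undefined) return "ignored";
  if (seenBlockedRuns.has(runId)) return "duplicate";
  seenBlockedRuns.add(runId);
  return "handled";
}
```

Retries of the same delivery still share an event id, so id-based dedupe as described in webhooks.md remains useful alongside this check.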
Subscribing #
Two endpoint configs — one for each channel:
# Dedicated SIEM channel: only blocked runs.
curl -fsSL https://api.stech.com/v1/orgs/$ORG/webhook-endpoints \
  -H "Authorization: Bearer $STECH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://hooks.acme.example/stech-blocks",
    "description": "agent_run.blocked → siem",
    "events": ["agent_run.blocked"]
  }'
# Aggregate channel: failure-rate watchdog + admin actions + blocks.
# Branch on data.kind to route blocks vs other audit.flagged kinds.
curl -fsSL https://api.stech.com/v1/orgs/$ORG/webhook-endpoints \
  -H "Authorization: Bearer $STECH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://hooks.acme.example/stech-audit",
    "description": "audit.flagged → slack",
    "events": ["audit.flagged"]
  }'
Wildcard (["*"]) subscriptions match agent_run.blocked and
audit.flagged automatically.
The full create / verify / rotate flow is in
webhooks.md — signing scheme, signed-body
verification, retry policy, dedupe on event.id.
Debugging a blocked run #
Three surfaces, three answers to "why did this run terminate".
The run itself. The synchronous /run JSON or the streaming
done SSE frame carries the full blocked envelope inline. The
message field is the human-readable explanation; guardrail +
limit + observed are the structured fields a dashboard renders.
The audit table. GET /v1/orgs/:slug/guardrail-violations lists
every block in the org, filterable by agent + guardrail kind. Useful
when the runtime caller already disconnected and you want to know
"what blocked yesterday afternoon's runs".
The dashboard. /settings/guardrails is the same data with
chrome — stat strip, agent filter, load-more.
If the run does not appear in any of those:
- Did the runtime fire done? A runtime crash mid-stream never terminates with a blocked: reason; it just dies. Check the api logs for [agents] stream persist failed lines.
- Did the parser refuse the guardrail at deploy time? A 400 invalid_guardrails from stech deploy means the runtime never received the policy. Re-run the deploy and read the detail array.
- Is the dev runtime catching it but the cloud isn't? The dev runtime hard-fails on bad-config strings at boot; the cloud warns and drops them so that a stale config that slips through doesn't brick a user-visible agent. Grep the runtime logs for [guardrails] dropping bad-config string.
Failure-rate exclusion #
A blocked run is a policy success, not an agent quality failure.
The failure-rate watchdog in api/src/lib/agent-failure-alert.ts
excludes blocked runs from both numerator and denominator (same
shape as cancelled-run exclusion in #289 PR-3). The
AgentFailureAlertPayload includes a blockedCount field so the
alert receiver can see "the agent is at 22% failure rate AND has 31
blocks this window" if both signals are firing simultaneously.
failedExpr in api/src/routes/agent-runs.ts classifies any
stop_reason starting with blocked: as not failed. Any new
guardrail kind added to the catalog is automatically excluded — the
prefix match is over-conservative on purpose so the runtime can ship
new kinds before the api knows about them.
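The exclusion rule can be illustrated with a small calculation over run statuses. This is a sketch with hypothetical names (isBlockedStopReason, failureRate), not the code in agent-failure-alert.ts or agent-runs.ts; it only shows blocked and cancelled runs dropping out of both the numerator and the denominator.

```typescript
type RunStatus = "completed" | "failed" | "cancelled" | "blocked";

// Prefix match, so guardrail kinds the api has never heard of are still excluded.
function isBlockedStopReason(stopReason: string): boolean {
  return stopReason.startsWith("blocked:");
}

function failureRate(statuses: RunStatus[]): { rate: number; blockedCount: number } {
  // Only completed and failed runs count towards the rate.
  const counted = statuses.filter((s) => s === "completed" || s === "failed");
  const failed = counted.filter((s) => s === "failed").length;
  const blockedCount = statuses.filter((s) => s === "blocked").length;
  return {
    rate: counted.length === 0 ? 0 : failed / counted.length,
    blockedCount,
  };
}
```

With three completed runs, one failed run, two blocked runs, and one cancelled run, the rate is 1 / 4 and blockedCount is 2; counting the blocks as failures would instead have reported 3 / 6.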
v1 limitations #
- In-memory rate limiting. A customer running multiple Fly machines for one deployment gets N times the configured rate budget — each machine maintains its own bucket. Redis-backed distributed rate limiting is filed for a follow-up.
- Regex-based PII redaction. Catches the common shapes (email, US SSN, US/E.164 phone) but has false positives inside code samples or API docs the agent is summarizing. LLM-judged redaction is a future opt-in shape.
- No org-policy override. Guardrails today are agent-author-declared only. An org-admin override surface that clamps what a less-trusted author can opt out of is filed for a follow-up; the source: "agent" | "org_policy" field on the wire shape is forward-compat for it.
- No LLM-judged guardrails. Using a small model to detect prompt injection / jailbreak attempts is a separate epic: real value, but expensive (per-call inference cost) and complicated to make deterministic.
- No per-tool guardrails. "block specific tool calls (no gh repo delete)" is a tool-policy concern; the seam is dispatchTool, not the policy engine. Filed separately.
- No custom user-authored guardrail functions. An extension surface where customers ship code that runs in our runtime is a v2 concern after we know what customers actually want to extend.
- max_cost requires caller-side pricing. The runtime doesn't know model prices. The recordCost() seam is wired but no callers feed it yet; a future PR adds a model-pricing table and wires the per-iteration cost computation.
Related #
- Agent runs (cancellation): the parallel status-bucket pattern for cancellations. Same exclusion shape on the failure-rate watchdog; same agent_run.cancelled/agent_run.blocked dual-event posture.
- Observability: topGuardrailsByBlocks aggregation, ?status=blocked filter, the failure-rate watchdog's blocked-exclusion behavior.
- Webhooks: audit.flagged envelope, signing scheme, retry policy. The agent_run_blocked payload above is one of the curated audit.flagged data.kind values.
- Audit log: the broader audit surface; the guardrail_violations table is a per-block audit trail that cross-references with the audit log via agentId + runId.