Observability
Customers running agents in production need to answer three questions:
is it working, how fast is it, how much is it spending. The
observability surface is a read-side view over agent_messages — every
assistant turn the runtime persisted — aggregated three ways: per-org,
per-agent, and per-run.
No new tables, no event stream to subscribe to for metrics; the data is rolled up at query time from the same rows the chat shell already shows.
Where to look #
Three dashboard surfaces, three matching API endpoints.
| Dashboard | API | What it shows |
|---|---|---|
| /agents | GET /v1/orgs/:slug/agent-runs/metrics | Org-wide last-30d runs/day chart, top 5 agents by activity, top 5 by error rate, totals |
| /agents/[id] | GET /v1/orgs/:slug/agents/:id/metrics | Per-agent last-7d run count, failure rate, p95 latency strip above the chat |
| /agents/[id]/runs | GET /v1/orgs/:slug/agents/:id/runs | Every run for the agent, filterable + paginated, click-through to the conversation |
Each row of the run history table links to the conversation in the chat shell so you can read the actual transcript that produced the stop_reason / token count.
The metrics, defined #
Every metric below is computed against agent_messages — the table the
runtime writes to inside the same transaction as the run-stream's done
frame. We picked it over usage_records (the billing surface) for run
history because it's the most consistent view of what the user actually
saw; usage_records may diverge on retry double-emits or failed-persist
edge cases.
| Metric | Definition | Source | Caveat |
|---|---|---|---|
| runs | Assistant completions in the window, normally one per user-turn → assistant-completion pair (orphan assistant rows also count) | agent_messages (assistant rows; preceding user row identified via LAG()) | One user turn that re-asked mid-stream is one assistant row, so one run, not two |
| failedRuns | Runs whose stop_reason is not in ('end_turn', 'stop_sequence') (or is null); cancelled and blocked:<kind> are separate buckets | agent_messages.stop_reason | A runtime crash mid-stream that never persisted the assistant row → no row, no count, no alert. Cross-check with the /events tail when run counts disagree |
| inputTokens / outputTokens | Sum of agent_messages.input_tokens / output_tokens over the window | Persisted by the runtime at done-frame time | Excludes orphan assistant rows the runtime wrote before crashing without token counts |
| durationMs | assistant.created_at - preceding_user.created_at for the same conversation | agent_messages.created_at, paired via LAG() OVER (PARTITION BY conversation_id ORDER BY created_at, id) | Excludes time the agent spent waiting on a CLI tool the runtime didn't persist; null when the assistant row is an orphan (no preceding user row) |
| p50 / p95 / p99 | Exact percentile via PERCENTILE_CONT over the same window | Derived from durationMs | Excludes runs where durationMs is null. The denominator differs from runs: orphan rows contribute to runs but not to the percentiles |
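To make the durationMs and percentile rows concrete, here is a minimal in-memory sketch of the same pairing and interpolation. It is not the production SQL: the Message shape, the helper names, and the use of epoch milliseconds are assumptions for illustration, and it treats an assistant row as paired only when the row immediately before it in the conversation is a user row.

```typescript
// Hypothetical row shape; only the fields the sketch needs.
type Message = {
  id: string;
  conversationId: string;
  role: "user" | "assistant";
  createdAt: number; // epoch milliseconds
};

// Mimics LAG() OVER (PARTITION BY conversation_id ORDER BY created_at, id):
// each assistant row looks at the row just before it in its own conversation.
function pairedDurations(messages: Message[]): (number | null)[] {
  const sorted = [...messages].sort(
    (a, b) =>
      a.conversationId.localeCompare(b.conversationId) ||
      a.createdAt - b.createdAt ||
      a.id.localeCompare(b.id),
  );
  const out: (number | null)[] = [];
  sorted.forEach((m, i) => {
    if (m.role !== "assistant") return;
    const prev = sorted[i - 1];
    if (prev && prev.conversationId === m.conversationId && prev.role === "user") {
      out.push(m.createdAt - prev.createdAt);
    } else {
      out.push(null); // orphan: counts as a run, has no duration
    }
  });
  return out;
}

// Linear-interpolation percentile over the non-null values, which is how
// PERCENTILE_CONT behaves; returns null when no duration is known.
function percentileCont(values: (number | null)[], p: number): number | null {
  const xs = values.filter((v): v is number => v !== null).sort((a, b) => a - b);
  if (xs.length === 0) return null;
  const rank = p * (xs.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo);
}
```

With two paired runs and one orphan, pairedDurations returns three entries (three runs) while percentileCont interpolates over only two values, which is the denominator mismatch the Caveat column describes.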
stop_reason classification is deliberately conservative: anything not
in COMPLETED_STOP_REASONS = ('end_turn', 'stop_sequence') — including
null, error, tool_error, max_tokens, max_iterations,
timeout, and any future runtime-emitted value — counts as failed.
Two exclusions carve out non-failure terminations:
- cancelled: the operator killed the run (see agent-runs.md)
- blocked:<kind>: a guardrail terminated the run (see policy-and-guardrails.md)
Both buckets are disjoint from failed, and the failure-rate watchdog
excludes them from both numerator and denominator — a policy block is
a policy success, not an agent quality failure.
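The classification rule can be sketched as a small pure function. This is an illustration rather than the runtime's code: it assumes cancelled and blocked runs are recognizable from the stored stop_reason as "cancelled" and "blocked:<kind>", and the function names are hypothetical.

```typescript
type RunStatus = "completed" | "failed" | "cancelled" | "blocked";

const COMPLETED_STOP_REASONS: ReadonlySet<string> = new Set(["end_turn", "stop_sequence"]);

// Assumption: the two exclusions are encoded directly in stop_reason.
function classify(stopReason: string | null): RunStatus {
  if (stopReason === "cancelled") return "cancelled";
  if (stopReason !== null && stopReason.startsWith("blocked:")) return "blocked";
  if (stopReason !== null && COMPLETED_STOP_REASONS.has(stopReason)) return "completed";
  // null, error, max_tokens, timeout, and any unknown future value land here.
  return "failed";
}

// Failure rate the way the watchdog counts it: cancelled and blocked runs
// are dropped from both numerator and denominator.
function failureRate(stopReasons: (string | null)[]): number | null {
  const counted = stopReasons
    .map(classify)
    .filter((s) => s === "completed" || s === "failed");
  if (counted.length === 0) return null;
  return counted.filter((s) => s === "failed").length / counted.length;
}
```

The default branch is what makes the rule conservative: a stop reason nobody has seen before is counted as a failure until someone explicitly adds it to the completed set.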
Why agent_messages, not usage_records? #
usage_records is the billing surface — what we feed Stripe. It's
correct for invoicing but can double-emit on retried persists, and the
historical migration in #255 left a small gap of failed-persist rows
the billing reconciler papered over. For a dashboard that says "did
this user-facing turn succeed?", agent_messages is the more honest
source. PR-1 of #270 locked this choice.
API reference #
All four endpoints share an auth gate — bearer token (CLI / scripts) or
session cookie (dashboard) — and resolve the caller's org from
:slug. An API key bound to org A cannot read org B even with raw
cursor manipulation; every query is filtered by organization_id at
the join.
Every error response is { "error": "<code>", "detail"?: "<message>" }.
A malformed from or to returns 400 invalid_from / 400 invalid_to
rather than silently returning everything.
GET /v1/orgs/:slug/agent-runs/metrics #
Org-level aggregation for the /agents dashboard. Default window: last
30 days when neither from nor to is supplied.
| Query param | Type | Default | Meaning |
|---|---|---|---|
| from | ISO 8601 datetime | to - 30d | Lower bound on assistant.created_at (inclusive) |
| to | ISO 8601 datetime | now | Upper bound on assistant.created_at (exclusive) |
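The defaulting and validation rules for from and to can be sketched as follows. The function and type names are hypothetical, and the date parsing is deliberately simpler (and more lenient) than a strict ISO 8601 validator; the point is the default of to minus N days, the half-open bounds, and the invalid_from / invalid_to error codes.

```typescript
type TimeWindow = { from: Date; to: Date };
type WindowError = { error: "invalid_from" | "invalid_to" };

const DAY_MS = 24 * 60 * 60 * 1000;

// Lenient parse: anything the Date constructor rejects is treated as malformed.
function parseInstant(raw: string): Date | null {
  const d = new Date(raw);
  return Number.isNaN(d.getTime()) ? null : d;
}

function resolveWindow(
  query: { from?: string; to?: string },
  defaultDays: number,
  now: Date = new Date(),
): TimeWindow | WindowError {
  const to = query.to === undefined ? now : parseInstant(query.to);
  if (to === null) return { error: "invalid_to" };
  const from =
    query.from === undefined
      ? new Date(to.getTime() - defaultDays * DAY_MS)
      : parseInstant(query.from);
  if (from === null) return { error: "invalid_from" };
  return { from, to };
}

// Half-open membership: from is inclusive, to is exclusive.
function inWindow(w: TimeWindow, createdAt: Date): boolean {
  const t = createdAt.getTime();
  return t >= w.from.getTime() && t < w.to.getTime();
}
```

Calling resolveWindow({}, 30) reproduces the org endpoint's default; the per-agent endpoint would pass 7 instead.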
Response:
```jsonc
{
  "window": { "from": "2026-04-08T12:00:00.000Z", "to": "2026-05-08T12:00:00.000Z", "days": 30 },
  "totals": {
    "runs": 4218,
    "failedRuns": 47,
    "inputTokens": 18223451,
    "outputTokens": 2811042
  },
  "runsByDay": [
    { "date": "2026-04-08", "runs": 142, "failedRuns": 2 },
    { "date": "2026-04-09", "runs": 0, "failedRuns": 0 },
    // ... one row per UTC day in the window, zero-filled
  ],
  "topAgentsByActivity": [
    { "agentId": "dep_4kp...", "agentName": "support-triage", "runs": 1408, "failedRuns": 12 },
    // ... up to 5
  ],
  "topAgentsByErrorRate": [
    // agents with < 10 runs in the window are excluded: a 1-run / 1-failure
    // agent shouldn't outrank a real workload at 8% over hundreds.
    { "agentId": "dep_zz9...", "agentName": "release-notes", "runs": 21, "failedRuns": 5, "errorRate": 0.238 }
    // ... up to 5
  ]
}
```

runsByDay.length === window.days by construction: the SQL
zero-fills via generate_series, so a day with no runs is runs: 0,
not a missing entry.
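For clients that post-process the response, or want to reason about what the zero-fill guarantees, the same behaviour can be sketched in memory. This is an analogue of the generate_series approach and not the server's SQL; the function name and the sparse input shape are assumptions.

```typescript
type DayBucket = { date: string; runs: number; failedRuns: number };

// Emits exactly `days` buckets starting at fromDay (a UTC date, YYYY-MM-DD),
// substituting zeros for any day that has no counted bucket.
function zeroFillByDay(fromDay: string, days: number, counted: DayBucket[]): DayBucket[] {
  const byDate = new Map<string, DayBucket>();
  for (const b of counted) byDate.set(b.date, b);

  const start = Date.parse(`${fromDay}T00:00:00Z`);
  const out: DayBucket[] = [];
  for (let i = 0; i < days; i++) {
    // UTC days are always 86400000 ms long, so plain addition is safe here.
    const date = new Date(start + i * 86400000).toISOString().slice(0, 10);
    out.push(byDate.get(date) ?? { date, runs: 0, failedRuns: 0 });
  }
  return out;
}
```

Because the output length depends only on the days argument, a chart can index into the array by day offset without checking for gaps.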
```shell
curl -fsSL "https://api.stech.com/v1/orgs/$ORG/agent-runs/metrics" \
  -H "Authorization: Bearer $STECH_API_KEY"
```

GET /v1/orgs/:slug/agents/:id/metrics #
Per-agent metrics for the chat-shell strip. Default window: last 7 days (narrower than the org default because the strip lives above an active chat — recent context, not historical sweep).
| Query param | Type | Default | Meaning |
|---|---|---|---|
| from | ISO 8601 datetime | to - 7d | Lower bound (inclusive) |
| to | ISO 8601 datetime | now | Upper bound (exclusive) |
Response:
```jsonc
{
  "window": { "from": "...", "to": "...", "days": 7 },
  "totals": { "runs": 411, "failedRuns": 6, "inputTokens": 1820354, "outputTokens": 281193 },
  "runsByDay": [ /* 7 rows, zero-filled */ ],
  "p95DurationMs": 4218,
  "p99DurationMs": 9104
}
```

p95DurationMs and p99DurationMs are null when the window has no
runs with a paired user turn. PERCENTILE_CONT skips null durations
per Postgres semantics, so the percentiles' denominator is "runs with a
known duration", not the same as totals.runs, which counts all
assistant rows including orphans.
```shell
curl -fsSL "https://api.stech.com/v1/orgs/$ORG/agents/$AGENT_ID/metrics" \
  -H "Authorization: Bearer $STECH_API_KEY"
```

GET /v1/orgs/:slug/agents/:id/runs #
Paginated run history. One row per assistant turn. Aggregations are summed over the same filter window — narrowing by date / status / conversation narrows the totals too.
| Query param | Type | Default | Meaning |
|---|---|---|---|
| limit | int | 50 | Max rows per page (capped at 200) |
| cursor | base64 string | none | Opaque cursor from a prior nextCursor |
| from | ISO 8601 datetime | none | Lower bound (inclusive) |
| to | ISO 8601 datetime | none | Upper bound (exclusive) |
| status | comma-separated | all | Subset of completed,failed,cancelled,blocked (see agent-runs.md for cancelled, policy-and-guardrails.md for blocked) |
| conversationId | string | none | Narrow to a single conversation |
Response:
```jsonc
{
  "runs": [
    {
      "id": "msg_8w3...",
      "conversationId": "cnv_7m1...",
      "agentId": "dep_4kp...",
      "createdAt": "2026-05-08T14:09:51.103Z",
      "durationMs": 3421,
      "stopReason": "end_turn",
      "status": "completed",
      "iterations": 3,
      "inputTokens": 4218,
      "outputTokens": 612,
      "userPrompt": "what's the status of ticket 42?"
    }
    // ...
  ],
  "aggregations": {
    "totalRuns": 411,
    "failedRuns": 6,
    "totalInputTokens": 1820354,
    "totalOutputTokens": 281193,
    "p50DurationMs": 1820,
    "p95DurationMs": 4218,
    "p99DurationMs": 9104
  },
  "nextCursor": "eyJ0cyI6IjIwMjYtMDUtMDhUMTQ6MDk6NTEuMTAzWiIsImlkIjoibXNnXzh3MyJ9"
}
```

nextCursor is null when there are no more rows. userPrompt is
truncated to 120 characters with a trailing …; the full prompt is
on the conversation row (and in the CSV export below).
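A client that wants every matching run has to follow nextCursor until it comes back null. The loop below is a minimal sketch of that: collectRuns, runsFetcher, and the trimmed Run type are hypothetical names, the page fetcher is injected so the loop can be tested without a network, and the example fetcher assumes a runtime with global fetch (Node 18+).

```typescript
type Run = { id: string; status: string }; // trimmed; the real rows carry more fields
type RunsPage = { runs: Run[]; nextCursor: string | null };

// Follows nextCursor until it is null. maxPages is a safety valve so a
// misbehaving server cannot keep the loop alive forever.
async function collectRuns(
  fetchPage: (cursor: string | null) => Promise<RunsPage>,
  maxPages = 100,
): Promise<Run[]> {
  const all: Run[] = [];
  let cursor: string | null = null;
  for (let i = 0; i < maxPages; i++) {
    const res: RunsPage = await fetchPage(cursor);
    all.push(...res.runs);
    if (res.nextCursor === null) break; // no more rows
    cursor = res.nextCursor; // pass it back verbatim; never decode or edit it
  }
  return all;
}

// Example fetcher against the /runs endpoint. runsUrl is the full endpoint
// URL, including any status/from/to filters already in its query string.
function runsFetcher(runsUrl: string, apiKey: string) {
  return async (cursor: string | null): Promise<RunsPage> => {
    const url = new URL(runsUrl);
    url.searchParams.set("limit", "200");
    if (cursor !== null) url.searchParams.set("cursor", cursor);
    const res = await fetch(url.toString(), {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) throw new Error(`runs request failed with HTTP ${res.status}`);
    return (await res.json()) as RunsPage;
  };
}
```

Typical use would be collectRuns(runsFetcher(url, key)), where url already carries the filters; for bulk exports the CSV endpoint is usually the simpler route.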
```shell
curl -fsSL "https://api.stech.com/v1/orgs/$ORG/agents/$AGENT_ID/runs?status=failed&limit=100" \
  -H "Authorization: Bearer $STECH_API_KEY"
```

GET /v1/orgs/:slug/agents/:id/runs.csv #
CSV export of the same row set as /runs. RFC 4180 escaping, CRLF line
terminators, full (untruncated) user_prompt. Single-shot — no cursor
pagination.
| Query param | Type | Default | Meaning |
|---|---|---|---|
| limit | int | 1000 | Max rows in the export (hard cap: 50000) |
| from, to, status, conversationId | same as /runs | same as /runs | Same semantics as the JSON endpoint |
Response headers:
- Content-Type: text/csv; charset=utf-8
- Content-Disposition: attachment; filename="runs-<agent-slug>-<from-iso>.csv"
Body: a header row followed by N data rows, all CRLF-terminated.
| Column | Notes |
|---|---|
| run_id | agent_messages.id |
| conversation_id | agent_conversations.id |
| created_at_iso | ISO 8601 UTC |
| duration_ms | empty when null (orphan assistant) |
| status | completed or failed |
| stop_reason | raw value from agent_messages.stop_reason, possibly empty |
| iterations | tool-call iteration count |
| input_tokens / output_tokens | as persisted |
| user_prompt | the full preceding user-turn text, RFC-4180 escaped |
A row set over the 50000-row cap returns 400 csv_export_too_large;
narrow the date range and retry. An agent with no runs returns the
header row only (still a valid file).
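If you generate or re-emit rows in the same format, the RFC 4180 rules the export follows are small enough to sketch. The helper names below are hypothetical and this is not the server's serializer; it shows the quoting rule, the empty field for null values, and the CRLF terminator.

```typescript
// Quote a field only when it contains a comma, a double quote, CR, or LF;
// embedded double quotes are escaped by doubling them.
function csvField(value: string | number | null): string {
  if (value === null) return ""; // e.g. duration_ms for an orphan assistant row
  const s = String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// One record per call, terminated with CRLF as the export does.
function csvRow(fields: (string | number | null)[]): string {
  return fields.map(csvField).join(",") + "\r\n";
}
```

A user_prompt containing commas, quotes, or newlines therefore stays in a single field, so parse the file with a real CSV reader and not a split on commas.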
```shell
# -O writes the file; -J honors the server-provided filename
curl -fsSL -OJ "https://api.stech.com/v1/orgs/$ORG/agents/$AGENT_ID/runs.csv?from=2026-05-01T00:00:00Z" \
  -H "Authorization: Bearer $STECH_API_KEY"
# → runs-support-triage-2026-05-01.csv saved in cwd
```

Failure-rate alerts #
When an agent's failure rate over the rolling last 50 runs crosses
20%, the api fires an audit.flagged webhook event. One event per
(agent, calendar-day-UTC) — a sustained outage doesn't flood
subscribers.
| Constant | Value | Source |
|---|---|---|
| DEFAULT_WINDOW_SIZE | 50 | api/src/lib/agent-failure-alert.ts |
| DEFAULT_FAILURE_RATE_THRESHOLD | 0.20 | api/src/lib/agent-failure-alert.ts |
| Dedupe key | (agent_id, kind='agent_failure_rate', event_date::date) | db/src/schema/agent-alerts.ts unique index |
The watchdog runs fire-and-forget after every successful run
persist (both the synchronous /run and streaming /run-stream paths),
so the check is bounded — one indexed COUNT + one indexed lookup +
on emit-day one INSERT and one webhook fan-out. Errors anywhere in the
hook are logged and swallowed; a failed alert can never fault the
upstream run.
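The decision the watchdog makes on each persist can be sketched as a pure function. This is an illustration, not the contents of agent-failure-alert.ts: the function name, the input shape, the order of the checks, and treating exactly 20% as crossing the threshold are all assumptions.

```typescript
const DEFAULT_WINDOW_SIZE = 50;
const DEFAULT_FAILURE_RATE_THRESHOLD = 0.2;

type Decision =
  | { emit: true; failedCount: number; failureRate: number }
  | { emit: false; reason: "window_unfilled" | "below_threshold" | "already_emitted_today" };

// recentFailed: one boolean per counted run, newest first, with cancelled and
// blocked runs already removed. alreadyEmittedToday stands in for the
// indexed agent_alerts lookup on (agent_id, kind, UTC date).
function evaluateFailureRate(recentFailed: boolean[], alreadyEmittedToday: boolean): Decision {
  if (recentFailed.length < DEFAULT_WINDOW_SIZE) {
    return { emit: false, reason: "window_unfilled" };
  }
  const lastRuns = recentFailed.slice(0, DEFAULT_WINDOW_SIZE);
  const failedCount = lastRuns.filter(Boolean).length;
  const failureRate = failedCount / DEFAULT_WINDOW_SIZE;
  if (failureRate < DEFAULT_FAILURE_RATE_THRESHOLD) {
    return { emit: false, reason: "below_threshold" };
  }
  if (alreadyEmittedToday) {
    return { emit: false, reason: "already_emitted_today" };
  }
  return { emit: true, failedCount, failureRate };
}
```

Keeping the decision free of I/O makes it cheap to unit test; the caller would wrap it in a try/catch so that a failure in the lookup or the webhook fan-out is logged and swallowed, as described above.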
Event shape #
The event uses the existing audit.flagged type so subscribers don't
have to add a new event to their endpoint config; see the
webhooks.md event catalog for the
envelope, signing scheme, and retry policy.
A complete failure-rate alert event; the alert-specific fields live under data:

```json
{
  "id": "1f2e3d4c-5b6a-7889-99aa-bbccddeeff00",
  "type": "audit.flagged",
  "createdAt": "2026-05-08T14:22:08.554Z",
  "organizationId": "org_2t4b...",
  "data": {
    "kind": "agent_failure_rate",
    "agentId": "dep_4kp...",
    "agentName": "support-triage",
    "windowSize": 50,
    "failedCount": 12,
    "failureRate": 0.24,
    "threshold": 0.2
  }
}
```

kind discriminates this event from the other curated audit.flagged
events (token revocation, OAuth disconnect, source deletion, SSO
updates, webhook secret rotation, plan changes, deployment supersede;
all use the same envelope). Receivers should branch on
data.kind === "agent_failure_rate" before touching the failure-rate
specific fields.
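One way to do that branching safely is a narrowing helper that returns the typed payload only when both type and data.kind match. The helper and type names are hypothetical; the field list mirrors the example event above and is checked at runtime because other audit.flagged kinds share the envelope.

```typescript
type FailureRateData = {
  kind: "agent_failure_rate";
  agentId: string;
  agentName: string;
  windowSize: number;
  failedCount: number;
  failureRate: number;
  threshold: number;
};

// Returns the alert payload, or null for any other event or malformed body.
function asFailureRateAlert(event: unknown): FailureRateData | null {
  if (typeof event !== "object" || event === null) return null;
  const e = event as { type?: unknown; data?: unknown };
  if (e.type !== "audit.flagged") return null;
  if (typeof e.data !== "object" || e.data === null) return null;

  const d = e.data as Record<string, unknown>;
  if (d.kind !== "agent_failure_rate") return null; // some other curated kind

  const wellFormed =
    typeof d.agentId === "string" &&
    typeof d.agentName === "string" &&
    typeof d.windowSize === "number" &&
    typeof d.failedCount === "number" &&
    typeof d.failureRate === "number" &&
    typeof d.threshold === "number";
  return wellFormed ? (d as unknown as FailureRateData) : null;
}
```

Returning null instead of throwing lets a receiver acknowledge unrelated audit.flagged events with a 200 and move on.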
Subscribing #
Register a webhook endpoint subscribed to audit.flagged (or * for
every event type):
```shell
curl -fsSL "https://api.stech.com/v1/orgs/$ORG/webhook-endpoints" \
  -H "Authorization: Bearer $STECH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://hooks.acme.example/stech-failures",
    "description": "agent failure-rate alerts → slack",
    "events": ["audit.flagged"]
  }'
```

The full create / verify / rotate flow is in
webhooks.md (signing scheme, signed-body
verification, 7-attempt exponential backoff, auto-disable on 50
consecutive failures).
Slack bridge — copy-paste-runnable #
A short Express receiver that re-posts failure-rate alerts to a Slack incoming webhook. It verifies the Stech HMAC before fanning out, so a forged POST can't trick the bridge into spamming Slack.

```typescript
import express from "express";
import { createHmac, timingSafeEqual } from "node:crypto";

const app = express();
const STECH_SECRET = process.env.STECH_WEBHOOK_SECRET!;
const SLACK_URL = process.env.SLACK_INCOMING_WEBHOOK_URL!;

// express.raw keeps the body as a Buffer so the HMAC covers the exact bytes sent.
app.post("/stech", express.raw({ type: "application/json" }), async (req, res) => {
  const ts = req.get("X-Stech-Timestamp"), sig = req.get("X-Stech-Signature");
  if (!ts || !sig) return res.status(400).send("missing headers");

  // Reject timestamps more than 5 minutes off; !(x <= 300) also rejects NaN.
  const skewSeconds = Math.abs(Date.now() / 1000 - Number(ts));
  if (!(skewSeconds <= 300)) return res.status(400).send("replay");

  // Recompute the signature over "<timestamp>.<raw body>" and compare in constant time.
  const expected = "sha256=" + createHmac("sha256", STECH_SECRET)
    .update(`${ts}.`).update(req.body).digest("hex");
  const a = Buffer.from(expected), b = Buffer.from(sig);
  if (a.length !== b.length || !timingSafeEqual(a, b)) return res.status(400).send("bad sig");

  const event = JSON.parse(req.body.toString());
  // Other audit.flagged kinds share the envelope, so branch on data.kind.
  if (event.type === "audit.flagged" && event.data?.kind === "agent_failure_rate") {
    const { agentName, failureRate, failedCount, windowSize, threshold } = event.data;
    const pct = (n: number) => `${(n * 100).toFixed(1)}%`;
    await fetch(SLACK_URL, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({
        text: `:rotating_light: agent *${agentName}* failure rate ${pct(failureRate)} ` +
          `(${failedCount}/${windowSize} runs, threshold ${pct(threshold)})`,
      }),
    });
  }
  res.status(200).send("ok");
});

app.listen(8080);
```

The snippet is TypeScript, so compile it to bridge.js first, then run with STECH_WEBHOOK_SECRET=whsec_… SLACK_INCOMING_WEBHOOK_URL=https://hooks.slack.com/… node bridge.js.
Operating tips #
Time-range convention. from is inclusive, to is
exclusive — from=2026-05-01T00:00:00Z&to=2026-05-08T00:00:00Z
covers seven full UTC days and does not double-count the boundary
midnight. Most reporting tools prefer this convention; it's the same
shape the org metrics' default window uses.
Pagination. /runs uses cursor-based pagination on a
(created_at, id) tuple, not offset. Two reasons: stability under
concurrent inserts (a new run during paging won't shift rows or
duplicate them) and consistent latency on agents with millions of
historical runs (offset N requires the planner to scan N rows; the
tuple comparison hits the index directly). Cursors are opaque base64
JSON — don't parse or manipulate them.
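The stability argument is easiest to see with an in-memory sketch of keyset paging. It assumes newest-first ordering and fixed-width ISO 8601 UTC timestamps (which sort lexicographically); the names are hypothetical, and the cursor is shown as a plain row only because this is the server-side view.

```typescript
type Row = { createdAt: string; id: string };

// Descending comparison on the (createdAt, id) tuple; id breaks timestamp ties.
function compareDesc(a: Row, b: Row): number {
  if (a.createdAt !== b.createdAt) return a.createdAt < b.createdAt ? 1 : -1;
  if (a.id !== b.id) return a.id < b.id ? 1 : -1;
  return 0;
}

// Returns the rows strictly after the cursor position, the in-memory
// counterpart of WHERE (created_at, id) < (cursor.created_at, cursor.id).
function pageAfter(
  rows: Row[],
  cursor: Row | null,
  limit: number,
): { page: Row[]; next: Row | null } {
  const sorted = [...rows].sort(compareDesc);
  const rest = cursor === null ? sorted : sorted.filter((r) => compareDesc(cursor, r) < 0);
  const page = rest.slice(0, limit);
  const next = rest.length > limit ? page[page.length - 1] : null;
  return { page, next };
}
```

If a new run is inserted between the first and second call, it sorts ahead of the cursor and is simply not part of the remaining pages; with offset paging the same insert would push an already-seen row onto the next page.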
Cross-org isolation. Every endpoint binds the query to the
:slug's organization at the join. An API key for org A cannot read
org B's data even with hand-crafted cursors or known agent ids: the
existence check returns 404 not_found for cross-org ids, so a foreign
id is indistinguishable from one that doesn't exist.
Top-by-error-rate noise floor. topAgentsByErrorRate excludes
agents with fewer than 10 runs in the window
(ERROR_RATE_MIN_RUNS = 10). Otherwise a brand-new agent with one run
and one failure would always be the "worst", drowning out a real
workload sitting at 8% over hundreds of runs.
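The noise floor amounts to a filter applied before ranking. The sketch below is illustrative only: the function name is hypothetical, and the tie-break on run count is an assumption the source does not specify.

```typescript
const ERROR_RATE_MIN_RUNS = 10;

type AgentTotals = { agentId: string; agentName: string; runs: number; failedRuns: number };

// Drop agents below the noise floor, then rank the rest by error rate.
function topAgentsByErrorRate(agents: AgentTotals[], limit = 5) {
  return agents
    .filter((a) => a.runs >= ERROR_RATE_MIN_RUNS)
    .map((a) => ({ ...a, errorRate: a.failedRuns / a.runs }))
    // Assumed tie-break: the busier agent ranks first at equal error rates.
    .sort((a, b) => b.errorRate - a.errorRate || b.runs - a.runs)
    .slice(0, limit);
}
```

Without the filter, a one-run agent with one failure would rank first at 100%; with it, that agent is dropped and the 21-run agent at roughly 24% leads the list.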
Limitations #
- No real-time streaming of metrics. Dashboard charts re-read on page load. For the live event tail use GET /v1/orgs/:slug/agents/:id/events (separate, SSE).
- No per-tool-call latency breakdown. durationMs is the user → assistant gap; we don't currently persist per-tool-call timings inside the assistant row. Filed for a future epic if customers ask.
- Failure-rate threshold + window are global constants. v1 hardcodes windowSize = 50 and threshold = 0.20 in api/src/lib/agent-failure-alert.ts. Per-agent customization is deferred.
- The /runs route's date-bound parser + the dashboard's failedExpr SQL share a small refactor backlog tracked in #272.
Troubleshooting #
My runs/day chart shows zeros for days I know had runs. The
runtime persists agent_messages post-done-frame, inside the same
transaction. Runs that ended in a runtime crash mid-stream never
persisted, so they don't show in any of the metrics surfaces.
Cross-check with the api logs ([agents] stream persist failed for conv …)
or the live events tail at /agents/[id]/events.
p95DurationMs doesn't match what I see in the run history.
PERCENTILE_CONT skips null durations. The run history table shows
orphan assistants (no preceding user turn) with durationMs: null,
and the strip's totals count them as runs — so the percentile
denominator is smaller than the run count. By design: a "how long does
a run take?" stat shouldn't be skewed by rows that have no
well-defined start.
I never get my failure-rate alert. Three causes, returned by the
watchdog as a reason:
- window_unfilled: the agent has fewer than 50 runs in total. A brand-new agent's first failure won't alert.
- below_threshold: failure rate over the last 50 is < 20%.
- already_emitted_today: there's already an agent_alerts row for (agent_id, 'agent_failure_rate', today's UTC date). We dedupe to one alert per day; check agent_alerts to confirm.

The watchdog logs the reason every time it evaluates a run and decides
not to emit; grep the api logs for [agent-failure-alert] to see the
return value. The agent_alerts table is queryable directly, with one
row per emit-day-per-agent.
CSV export is empty / 400 too-many-rows. Either no runs match your
filter (most often a from / to window the agent didn't fire in),
which yields a header-only file, or the row set exceeds 50000, which
returns 400 csv_export_too_large. For the latter, narrow the date range
and retry. The endpoint accepts the same status=failed filter as the
JSON sibling if you only want the failures.
Related #
- Agent runs (cancellation): the cancelled status pill, the cancel API + webhook (agent_run.cancelled), and why cancellations are excluded from the failure-rate watchdog's numerator AND denominator.
- Policy and guardrails: the blocked:<kind> status bucket, the topGuardrailsByBlocks aggregation, the dedicated agent_run.blocked webhook channel, and why blocks share the cancel-style failure-rate exclusion shape.
- Audit log: once the failure-rate watchdog fires audit.flagged, this is where you go to ask "who triggered the run that crossed the threshold?" The flagged event is for live alerting; the audit log is the reviewable retrospective.
- Webhooks: audit.flagged envelope, signing scheme, retry policy, dedupe on event.id. The failure-rate alert above is one of the curated audit.flagged payloads.
- CLI tool sources: agent runs that fork CLI binaries fire agent_run.completed / agent_run.failed; the failure-rate watchdog observes those persists like any other run.
- Billing and usage: the metrics here are drawn from agent_messages (most-honest view of the user-facing turn); usage_records is the parallel billing surface that drives the cost-control caps and Stripe overage.