Observability

Customers running agents in production need to answer three questions: is it working, how fast is it, how much is it spending. The observability surface is a read-side view over agent_messages — every assistant turn the runtime persisted — aggregated three ways: per-org, per-agent, and per-run.

No new tables, no event stream to subscribe to for metrics; the data is rolled up at query time from the same rows the chat shell already shows.

Where to look #

Three dashboard surfaces, three matching API endpoints.

Dashboard	API	What it shows
`/agents`	`GET /v1/orgs/:slug/agent-runs/metrics`	Org-wide last-30d runs/day chart, top 5 agents by activity, top 5 by error rate, totals
`/agents/[id]`	`GET /v1/orgs/:slug/agents/:id/metrics`	Per-agent last-7d run count, failure rate, p95 latency strip above the chat
`/agents/[id]/runs`	`GET /v1/orgs/:slug/agents/:id/runs`	Every run for the agent, filterable + paginated, click-through to the conversation

Each row of the run history table links to the conversation in the chat shell so you can read the actual transcript that produced the stop_reason / token count.

The metrics, defined #

Every metric below is computed against agent_messages — the table the runtime writes to inside the same transaction as the run-stream's done frame. We picked it over usage_records (the billing surface) for run history because it's the most consistent view of what the user actually saw; usage_records may diverge on retry double-emits or failed-persist edge cases.

Metric	Definition	Source	Caveat
`runs`	Distinct user-turn → assistant-completion pairs in the window	`agent_messages` (assistant rows; preceding user row identified via `LAG()`)	One user turn that re-asked mid-stream is one assistant row, so one run — not two
`failedRuns`	Runs whose `stop_reason` is not in `('end_turn', 'stop_sequence')` (or is null)	`agent_messages.stop_reason`	A runtime crash mid-stream that never persisted the assistant row → no row, no count, no alert. Cross-check with the `/events` tail when run counts disagree
`inputTokens` / `outputTokens`	Sum of `agent_messages.input_tokens` / `output_tokens` over the window	Persisted by the runtime at done-frame time	Excludes orphan assistant rows the runtime wrote before crashing without token counts
`durationMs`	`assistant.created_at - preceding_user.created_at` for the same conversation	`agent_messages.created_at`, paired via `LAG() OVER (PARTITION BY conversation_id ORDER BY created_at, id)`	Excludes time the agent spent waiting on a CLI tool the runtime didn't persist; null when the assistant row is an orphan (no preceding user row)
`p50` / `p95` / `p99`	Exact percentile via `PERCENTILE_CONT` over the same window	Derived from `durationMs`	Excludes runs where `durationMs` is null. The denominator differs from `runs` — orphan rows contribute to `runs` but not to the percentiles
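The `LAG()` pairing behind `durationMs` can be sketched in memory as follows. This is a minimal illustration, not the actual query: the `MessageRow` shape, the `role` field, and the `pairDurations` helper are hypothetical, and it assumes a duration is only defined when the row immediately preceding the assistant row in the same conversation is a user turn.

```typescript
// Hypothetical row shape for illustration; the real agent_messages columns may differ.
interface MessageRow {
  id: string;
  conversationId: string;
  role: "user" | "assistant";
  createdAt: string; // ISO 8601 UTC
}

interface RunDuration {
  runId: string;
  durationMs: number | null; // null marks an orphan assistant row
}

function pairDurations(rows: MessageRow[]): RunDuration[] {
  // Same ordering as PARTITION BY conversation_id ORDER BY created_at, id.
  const sorted = rows.slice().sort((a, b) => {
    if (a.conversationId !== b.conversationId) {
      return a.conversationId < b.conversationId ? -1 : 1;
    }
    const dt = Date.parse(a.createdAt) - Date.parse(b.createdAt);
    if (dt !== 0) return dt;
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
  });

  const out: RunDuration[] = [];
  sorted.forEach((row, i) => {
    if (row.role !== "assistant") return;
    // LAG(): the row immediately before this one, if it is in the same conversation.
    const prev =
      i > 0 && sorted[i - 1].conversationId === row.conversationId
        ? sorted[i - 1]
        : null;
    // Assumption: a duration exists only when that preceding row is a user turn.
    const durationMs =
      prev !== null && prev.role === "user"
        ? Date.parse(row.createdAt) - Date.parse(prev.createdAt)
        : null;
    out.push({ runId: row.id, durationMs });
  });
  return out;
}
```

An assistant row with no preceding user row comes back with `durationMs: null`, which is the orphan case that counts toward `runs` but not toward the percentiles.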

stop_reason classification is deliberately conservative: anything not in COMPLETED_STOP_REASONS = ('end_turn', 'stop_sequence') — including null, error, tool_error, max_tokens, max_iterations, timeout, and any future runtime-emitted value — counts as failed. Two exclusions carve out non-failure terminations:

cancelled — operator killed the run (see agent-runs.md)
blocked:<kind> — a guardrail terminated the run (see policy-and-guardrails.md)

Both buckets are disjoint from failed, and the failure-rate watchdog excludes them from both numerator and denominator — a policy block is a policy success, not an agent quality failure.
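The classification rules above can be sketched as a small function. This is an illustration under assumptions, not the runtime's code: it assumes `cancelled` and `blocked:<kind>` arrive as `stop_reason` values, and the names `classifyRun` and `failureRate` are hypothetical.

```typescript
type RunStatus = "completed" | "failed" | "cancelled" | "blocked";

const COMPLETED_STOP_REASONS = ["end_turn", "stop_sequence"];

// Assumption: cancelled and blocked runs are recognizable from stop_reason alone.
function classifyRun(stopReason: string | null): RunStatus {
  if (stopReason === "cancelled") return "cancelled";
  if (stopReason !== null && stopReason.startsWith("blocked:")) return "blocked";
  if (stopReason !== null && COMPLETED_STOP_REASONS.includes(stopReason)) {
    return "completed";
  }
  // Conservative default: null and any unknown or future value count as failed.
  return "failed";
}

// Failure rate with cancelled and blocked runs removed from both the
// numerator and the denominator. Returns null when nothing is left to count.
function failureRate(stopReasons: Array<string | null>): number | null {
  let failed = 0;
  let counted = 0;
  for (const reason of stopReasons) {
    const status = classifyRun(reason);
    if (status === "cancelled" || status === "blocked") continue;
    counted++;
    if (status === "failed") failed++;
  }
  return counted === 0 ? null : failed / counted;
}
```

Because unknown values fall through to `failed`, a new runtime-emitted stop reason shows up as a failure until it is explicitly added to the completed set.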

Why `agent_messages`, not `usage_records`? #

usage_records is the billing surface — what we feed Stripe. It's correct for invoicing but can double-emit on retried persists, and the historical migration in #255 left a small gap of failed-persist rows the billing reconciler papered over. For a dashboard that says "did this user-facing turn succeed?", agent_messages is the more honest source. PR-1 of #270 locked this choice.

API reference #

All four endpoints share an auth gate — bearer token (CLI / scripts) or session cookie (dashboard) — and resolve the caller's org from :slug. An API key bound to org A cannot read org B even with raw cursor manipulation; every query is filtered by organization_id at the join.

Every error response is { "error": "<code>", "detail"?: "<message>" }. A malformed from or to returns 400 invalid_from / 400 invalid_to rather than silently returning everything.

`GET /v1/orgs/:slug/agent-runs/metrics` #

Org-level aggregation for the /agents dashboard. Default window: last 30 days when neither from nor to is supplied.

Query param	Type	Default	Meaning
`from`	ISO 8601 datetime	`to - 30d`	Lower bound on `assistant.created_at` (inclusive)
`to`	ISO 8601 datetime	now	Upper bound on `assistant.created_at` (exclusive)

Response:

{
  "window": { "from": "2026-04-08T12:00:00.000Z", "to": "2026-05-08T12:00:00.000Z", "days": 30 },
  "totals": {
    "runs": 4218,
    "failedRuns": 47,
    "inputTokens": 18223451,
    "outputTokens": 2811042
  },
  "runsByDay": [
    { "date": "2026-04-08", "runs": 142, "failedRuns": 2 },
    { "date": "2026-04-09", "runs": 0,   "failedRuns": 0 },
    // ... one row per UTC day in the window, zero-filled
  ],
  "topAgentsByActivity": [
    { "agentId": "dep_4kp...", "agentName": "support-triage", "runs": 1408, "failedRuns": 12 },
    // ... up to 5
  ],
  "topAgentsByErrorRate": [
    // agents with < 10 runs in the window are excluded — a 1-run / 1-failure
    // agent shouldn't outrank a real workload at 8% over hundreds.
    { "agentId": "dep_zz9...", "agentName": "release-notes", "runs": 21, "failedRuns": 5, "errorRate": 0.238 }
    // ... up to 5
  ]
}

runsByDay.length === window.days by construction — the SQL zero-fills via generate_series, so a day with no runs is runs: 0, not a missing entry.

curl -fsSL "https://api.stech.com/v1/orgs/$ORG/agent-runs/metrics" \
  -H "Authorization: Bearer $STECH_API_KEY"
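The zero-fill guarantee on `runsByDay` can be mimicked client-side, for example when merging results into a chart that expects one point per day. The sketch below is an in-memory stand-in for the `generate_series` join, not the server's SQL; `zeroFillByDay` is a hypothetical helper, and it assumes the first bucket is the UTC date of `from`.

```typescript
interface DayCount {
  date: string; // YYYY-MM-DD, UTC
  runs: number;
  failedRuns: number;
}

const DAY_MS = 86_400_000;

// Returns exactly `days` entries starting at the UTC date of `from`,
// inserting zero rows for dates that have no counts.
function zeroFillByDay(from: Date, days: number, counts: DayCount[]): DayCount[] {
  const byDate = new Map<string, DayCount>();
  counts.forEach((c) => byDate.set(c.date, c));

  // Truncate `from` to midnight UTC so every bucket is a whole UTC day.
  const start = Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate());
  const out: DayCount[] = [];
  for (let i = 0; i < days; i++) {
    const date = new Date(start + i * DAY_MS).toISOString().slice(0, 10);
    out.push(byDate.get(date) ?? { date, runs: 0, failedRuns: 0 });
  }
  return out;
}
```

The output length always equals `days`, which matches the `runsByDay.length === window.days` invariant described above.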

`GET /v1/orgs/:slug/agents/:id/metrics` #

Per-agent metrics for the chat-shell strip. Default window: last 7 days (narrower than the org default because the strip lives above an active chat — recent context, not historical sweep).

Query param	Type	Default	Meaning
`from`	ISO 8601 datetime	`to - 7d`	Lower bound (inclusive)
`to`	ISO 8601 datetime	now	Upper bound (exclusive)

Response:

{
  "window": { "from": "...", "to": "...", "days": 7 },
  "totals": { "runs": 411, "failedRuns": 6, "inputTokens": 1820354, "outputTokens": 281193 },
  "runsByDay": [ /* 7 rows, zero-filled */ ],
  "p95DurationMs": 4218,
  "p99DurationMs": 9104
}

p95DurationMs and p99DurationMs are null when the window has no runs with a paired user turn. PERCENTILE_CONT skips null durations per Postgres semantics, so the percentiles' denominator is "runs with a known duration" — not the same as totals.runs, which counts all assistant rows including orphans.

curl -fsSL "https://api.stech.com/v1/orgs/$ORG/agents/$AGENT_ID/metrics" \
  -H "Authorization: Bearer $STECH_API_KEY"
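To make the denominator point concrete, here is a small TypeScript sketch of continuous-percentile interpolation that skips null durations, in the same way `PERCENTILE_CONT` ignores nulls. It is an in-memory illustration only; `percentileCont` is a hypothetical helper, not part of the API.

```typescript
// Linear-interpolated percentile over the non-null values, p in [0, 1].
// Returns null when no value has a known duration.
function percentileCont(values: Array<number | null>, p: number): number | null {
  const known = values
    .filter((v): v is number => v !== null)
    .sort((a, b) => a - b);
  if (known.length === 0) return null;

  // Fractional rank into the sorted array (0-based).
  const rank = p * (known.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  // Interpolate between the two neighbouring values.
  return known[lo] + (rank - lo) * (known[hi] - known[lo]);
}

// Five runs, one of them an orphan with no duration.
const durations = [1000, null, 2000, 3000, 4000];
const median = percentileCont(durations, 0.5); // 2500, computed over 4 values, not 5
```

The orphan row still counts toward `totals.runs` (5 here) while the percentile is computed over the 4 rows with a known duration.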

`GET /v1/orgs/:slug/agents/:id/runs` #

Paginated run history. One row per assistant turn. Aggregations are summed over the same filter window — narrowing by date / status / conversation narrows the totals too.

Query param	Type	Default	Meaning
`limit`	int	50	Max rows per page (capped at 200)
`cursor`	base64 string	none	Opaque cursor from a prior `nextCursor`
`from`	ISO 8601 datetime	none	Lower bound (inclusive)
`to`	ISO 8601 datetime	none	Upper bound (exclusive)
`status`	comma-separated	all	Subset of `completed,failed,cancelled,blocked` (see agent-runs.md for cancelled, policy-and-guardrails.md for blocked)
`conversationId`	string	none	Narrow to a single conversation

Response:

{
  "runs": [
    {
      "id": "msg_8w3...",
      "conversationId": "cnv_7m1...",
      "agentId": "dep_4kp...",
      "createdAt": "2026-05-08T14:09:51.103Z",
      "durationMs": 3421,
      "stopReason": "end_turn",
      "status": "completed",
      "iterations": 3,
      "inputTokens": 4218,
      "outputTokens": 612,
      "userPrompt": "what's the status of ticket 42?"
    }
    // ...
  ],
  "aggregations": {
    "totalRuns": 411,
    "failedRuns": 6,
    "totalInputTokens": 1820354,
    "totalOutputTokens": 281193,
    "p50DurationMs": 1820,
    "p95DurationMs": 4218,
    "p99DurationMs": 9104
  },
  "nextCursor": "eyJ0cyI6IjIwMjYtMDUtMDhUMTQ6MDk6NTEuMTAzWiIsImlkIjoibXNnXzh3MyJ9"
}

nextCursor is null when there are no more rows. userPrompt is truncated to 120 characters with a trailing …; the full prompt is on the conversation row (and in the CSV export below).

curl -fsSL "https://api.stech.com/v1/orgs/$ORG/agents/$AGENT_ID/runs?status=failed&limit=100" \
  -H "Authorization: Bearer $STECH_API_KEY"
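A client that needs every matching run follows `nextCursor` until it is null. The sketch below shows that loop with the HTTP call injected, so it can be exercised without a network; `buildRunsQuery`, `collectRuns`, and the `FetchPage` type are hypothetical names, and the row type lists only two of the response fields.

```typescript
interface RunRow {
  id: string;
  status: string;
}

interface RunsPage {
  runs: RunRow[];
  nextCursor: string | null;
}

// Injected transport: takes the query string, returns one parsed page.
type FetchPage = (query: string) => Promise<RunsPage>;

function buildRunsQuery(params: {
  status?: string[];
  limit?: number;
  cursor?: string | null;
}): string {
  const parts: string[] = [];
  if (params.status && params.status.length > 0) {
    parts.push("status=" + encodeURIComponent(params.status.join(",")));
  }
  // The endpoint caps limit at 200, so clamp locally as well.
  if (params.limit !== undefined) parts.push("limit=" + Math.min(params.limit, 200));
  // The cursor is opaque: pass it back unchanged, only URL-encoded.
  if (params.cursor) parts.push("cursor=" + encodeURIComponent(params.cursor));
  return parts.length > 0 ? "?" + parts.join("&") : "";
}

async function collectRuns(
  fetchPage: FetchPage,
  status: string[],
  limit = 100,
): Promise<RunRow[]> {
  const all: RunRow[] = [];
  let cursor: string | null = null;
  do {
    const page: RunsPage = await fetchPage(buildRunsQuery({ status, limit, cursor }));
    all.push(...page.runs);
    cursor = page.nextCursor;
  } while (cursor !== null);
  return all;
}
```

In real use, `fetchPage` would issue the `GET .../runs` request with the bearer token and parse the JSON body. For very large histories the CSV export may be a better fit than accumulating every page in memory.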

`GET /v1/orgs/:slug/agents/:id/runs.csv` #

CSV export of the same row set as /runs. RFC 4180 escaping, CRLF line terminators, full (untruncated) user_prompt. Single-shot — no cursor pagination.

Query param	Type	Default	Meaning
`limit`	int	1000	Max rows in the export (hard cap: 50000)
`from`, `to`, `status`, `conversationId`	same as `/runs`	—	Same semantics as the JSON endpoint

Response headers:

Content-Type: text/csv; charset=utf-8
Content-Disposition: attachment; filename="runs-<agent-slug>-<from-iso>.csv"

Body: a header row followed by N data rows, all CRLF-terminated.

Column	Notes
`run_id`	`agent_messages.id`
`conversation_id`	`agent_conversations.id`
`created_at_iso`	ISO 8601 UTC
`duration_ms`	empty when null (orphan assistant)
`status`	`completed`, `failed`, `cancelled`, or `blocked` (the same buckets as the `status` filter)
`stop_reason`	raw value from `agent_messages.stop_reason`, possibly empty
`iterations`	tool-call iteration count
`input_tokens` / `output_tokens`	as persisted
`user_prompt`	the full preceding user-turn text, RFC-4180 escaped

A row set over the 50000-row cap returns 400 csv_export_too_large; narrow the date range and retry. An agent with no matching runs returns the header row only (still a valid file).

# -O writes the file; -J honors the server-provided filename
curl -fsSL -OJ "https://api.stech.com/v1/orgs/$ORG/agents/$AGENT_ID/runs.csv?from=2026-05-01T00:00:00Z" \
  -H "Authorization: Bearer $STECH_API_KEY"
# → runs-support-triage-2026-05-01.csv saved in cwd
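Because `user_prompt` is exported untruncated, a field can contain commas, doubled quotes, and embedded line breaks, so splitting on newlines and commas is not enough. Below is a minimal RFC 4180 reader as a sketch; `parseCsv` is a hypothetical helper, and in practice an established CSV library is likely the safer choice.

```typescript
// Minimal RFC 4180 parser: quoted fields, "" as an escaped quote,
// CRLF record terminators. Returns an array of records.
function parseCsv(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') {
          field += '"'; // doubled quote inside a quoted field is a literal quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        field += ch; // commas and line breaks are literal inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\r" && text[i + 1] === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
      i++; // skip the \n of the CRLF pair
    } else {
      field += ch;
    }
  }
  // Flush a final record that was not CRLF-terminated.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

The first record returned is the header row, so an export for an agent with no matching runs parses to a single record.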

Failure-rate alerts #

When an agent's failure rate over its rolling last 50 runs reaches the 20% threshold, the API fires an audit.flagged webhook event. At most one event fires per (agent, UTC calendar day), so a sustained outage doesn't flood subscribers.

Constant	Value	Source
`DEFAULT_WINDOW_SIZE`	50	`api/src/lib/agent-failure-alert.ts`
`DEFAULT_FAILURE_RATE_THRESHOLD`	0.20	same
Dedupe key	`(agent_id, kind='agent_failure_rate', event_date::date)`	`db/src/schema/agent-alerts.ts` unique index

The watchdog runs fire-and-forget after every successful run persist, on both the synchronous /run and streaming /run-stream paths. Each check is bounded: one indexed COUNT, one indexed lookup, and, on an emit day, one INSERT plus one webhook fan-out. Errors anywhere in the hook are logged and swallowed; a failed alert can never fault the upstream run.

Event shape #

The event uses the existing audit.flagged type so subscribers don't have to add a new event to their endpoint config. See the webhooks.md event catalog for the envelope, signing scheme, and retry policy.

A complete failure-rate alert event, with the alert-specific fields under data:

{
  "id": "1f2e3d4c-5b6a-7889-99aa-bbccddeeff00",
  "type": "audit.flagged",
  "createdAt": "2026-05-08T14:22:08.554Z",
  "organizationId": "org_2t4b...",
  "data": {
    "kind": "agent_failure_rate",
    "agentId": "dep_4kp...",
    "agentName": "support-triage",
    "windowSize": 50,
    "failedCount": 12,
    "failureRate": 0.24,
    "threshold": 0.2
  }
}

kind discriminates this event from the other curated audit.flagged events (token revocation, OAuth disconnect, source deletion, SSO updates, webhook secret rotation, plan changes, deployment supersede — all use the same envelope). Receivers should branch on data.kind === "agent_failure_rate" before touching the failure-rate specific fields.
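In TypeScript, the branch on `data.kind` can be expressed as a type guard so the failure-rate fields are only accessible after the check. This is a sketch based on the sample payload above; the interface and function names are hypothetical, and the guard checks only `type` and `kind`, not every field.

```typescript
interface FailureRateAlertData {
  kind: "agent_failure_rate";
  agentId: string;
  agentName: string;
  windowSize: number;
  failedCount: number;
  failureRate: number;
  threshold: number;
}

interface FailureRateAlertEvent {
  id: string;
  type: "audit.flagged";
  createdAt: string;
  organizationId: string;
  data: FailureRateAlertData;
}

// Narrows an arbitrary parsed payload to a failure-rate alert.
function isFailureRateAlert(event: unknown): event is FailureRateAlertEvent {
  if (typeof event !== "object" || event === null) return false;
  const e = event as { type?: unknown; data?: unknown };
  if (e.type !== "audit.flagged") return false;
  if (typeof e.data !== "object" || e.data === null) return false;
  return (e.data as { kind?: unknown }).kind === "agent_failure_rate";
}

// Other audit.flagged kinds return null and can be routed elsewhere.
function summarize(event: unknown): string | null {
  if (!isFailureRateAlert(event)) return null;
  const { agentName, failedCount, windowSize } = event.data;
  return `${agentName}: ${failedCount}/${windowSize} runs failed`;
}
```

A receiver that handles several `audit.flagged` kinds could add one guard per kind and dispatch on whichever matches first.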

Subscribing #

curl -fsSL https://api.stech.com/v1/orgs/$ORG/webhook-endpoints \
  -H "Authorization: Bearer $STECH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://hooks.acme.example/stech-failures",
    "description": "agent failure-rate alerts → slack",
    "events": ["audit.flagged"]
  }'

The full create / verify / rotate flow is in webhooks.md (signing scheme, signed-body verification, 7-attempt exponential backoff, auto-disable on 50 consecutive failures).

Slack bridge — copy-paste-runnable #

A short Express receiver that re-posts failure-rate alerts to a Slack incoming webhook. It verifies the Stech HMAC before fanning out so a forged POST can't trick the bridge into spamming Slack.

import express from "express";
import { createHmac, timingSafeEqual } from "node:crypto";

const app = express();
const STECH_SECRET = process.env.STECH_WEBHOOK_SECRET!;
const SLACK_URL = process.env.SLACK_INCOMING_WEBHOOK_URL!;

app.post("/stech", express.raw({ type: "application/json" }), async (req, res) => {
  // express.raw keeps req.body as a Buffer, so the HMAC is computed over the raw bytes received.
  const ts = req.get("X-Stech-Timestamp"), sig = req.get("X-Stech-Signature");
  if (!ts || !sig) return res.status(400).send("missing headers");
  // Reject timestamps more than 5 minutes away from now to limit replay of captured requests.
  if (Math.abs(Date.now() / 1000 - Number(ts)) > 300) return res.status(400).send("replay");

  // Recompute the signature over "<timestamp>.<raw body>" and compare in constant time.
  const expected = "sha256=" + createHmac("sha256", STECH_SECRET)
    .update(`${ts}.`).update(req.body).digest("hex");
  const a = Buffer.from(expected), b = Buffer.from(sig);
  // timingSafeEqual throws on unequal lengths, so check the length first.
  if (a.length !== b.length || !timingSafeEqual(a, b)) return res.status(400).send("bad sig");

  // Only failure-rate alerts go to Slack; other audit.flagged kinds are acknowledged and ignored.
  const event = JSON.parse(req.body.toString());
  if (event.type === "audit.flagged" && event.data?.kind === "agent_failure_rate") {
    const { agentName, failureRate, failedCount, windowSize, threshold } = event.data;
    const pct = (n: number) => `${(n * 100).toFixed(1)}%`;
    await fetch(SLACK_URL, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({
        text: `:rotating_light: agent *${agentName}* failure rate ${pct(failureRate)} ` +
              `(${failedCount}/${windowSize} runs, threshold ${pct(threshold)})`,
      }),
    });
  }
  res.status(200).send("ok");
});

app.listen(8080);

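The snippet is TypeScript (non-null assertions, a typed arrow function), so save it as bridge.ts and compile it to bridge.js first, or use a TypeScript-aware runner. Then run with STECH_WEBHOOK_SECRET=whsec_… SLACK_INCOMING_WEBHOOK_URL=https://hooks.slack.com/… node bridge.js. The global fetch call requires Node 18 or later.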

Operating tips #

Time-range convention. from is inclusive, to is exclusive — from=2026-05-01T00:00:00Z&to=2026-05-08T00:00:00Z covers seven full UTC days and does not double-count the boundary midnight. Most reporting tools prefer this convention; it's the same shape the org metrics' default window uses.
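The practical benefit of the half-open convention is that adjacent windows can share a boundary timestamp without overlapping. A small sketch, with hypothetical helper names `weeklyWindows` and `inWindow`:

```typescript
interface TimeWindow {
  from: string; // inclusive
  to: string; // exclusive
}

const WEEK_MS = 7 * 86_400_000;

// Consecutive 7-day windows: each window's `to` is the next window's `from`.
function weeklyWindows(start: string, weeks: number): TimeWindow[] {
  const t0 = Date.parse(start);
  const out: TimeWindow[] = [];
  for (let i = 0; i < weeks; i++) {
    out.push({
      from: new Date(t0 + i * WEEK_MS).toISOString(),
      to: new Date(t0 + (i + 1) * WEEK_MS).toISOString(),
    });
  }
  return out;
}

// Half-open membership test: from <= ts < to.
function inWindow(ts: string, w: TimeWindow): boolean {
  const t = Date.parse(ts);
  return t >= Date.parse(w.from) && t < Date.parse(w.to);
}
```

A run stamped exactly at a shared boundary falls into the later window only, so summing `runs` across consecutive windows counts each run once.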

Pagination. /runs uses cursor-based pagination on a (created_at, id) tuple, not offset. Two reasons: stability under concurrent inserts (a new run during paging won't shift rows or duplicate them) and consistent latency on agents with millions of historical runs (offset N requires the planner to scan N rows; the tuple comparison hits the index directly). Cursors are opaque base64 JSON — don't parse or manipulate them.
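The stability argument can be illustrated with an in-memory keyset comparison. This sketch assumes newest-first ordering and a decoded `(ts, id)` tuple purely for demonstration; the `Keyset` shape and `pageAfter` helper are hypothetical, and real cursors should still be treated as opaque.

```typescript
interface RunRow {
  id: string;
  createdAt: string; // ISO 8601 UTC, fixed width, so string order equals time order
}

// Hypothetical decoded cursor, for illustration only.
interface Keyset {
  ts: string;
  id: string;
}

function pageAfter(
  rows: RunRow[],
  cursor: Keyset | null,
  limit: number,
): { page: RunRow[]; next: Keyset | null } {
  // Newest first, id as the tiebreaker for identical timestamps.
  const sorted = rows.slice().sort((a, b) =>
    a.createdAt !== b.createdAt
      ? (a.createdAt < b.createdAt ? 1 : -1)
      : (a.id < b.id ? 1 : a.id > b.id ? -1 : 0));

  const c = cursor;
  // Tuple comparison: keep only rows strictly "before" the cursor row.
  const eligible = c === null
    ? sorted
    : sorted.filter((r) =>
        r.createdAt < c.ts || (r.createdAt === c.ts && r.id < c.id));

  const page = eligible.slice(0, limit);
  const last = page[page.length - 1];
  const next = eligible.length > limit ? { ts: last.createdAt, id: last.id } : null;
  return { page, next };
}
```

Because the second page is defined by the last row already seen rather than by a row count, a run inserted between the two requests neither shifts nor duplicates rows, which offset-based paging cannot guarantee.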

Cross-org isolation. Every endpoint binds the query to the :slug's organization at the join. An API key for org A cannot read org B's data even with hand-crafted cursors or known agent ids: the existence check returns 404 not_found for cross-org ids, so the response never reveals whether an id exists in another org.

Top-by-error-rate noise floor. topAgentsByErrorRate excludes agents with fewer than 10 runs in the window (ERROR_RATE_MIN_RUNS = 10). Otherwise a brand-new agent with one run and one failure would always be the "worst", drowning out a real workload sitting at 8% over hundreds of runs.
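The effect of the noise floor is easy to see in a small ranking sketch. The function below is an in-memory illustration, not the server's SQL; `topByErrorRate` and the `AgentTotals` shape are hypothetical names.

```typescript
interface AgentTotals {
  agentId: string;
  runs: number;
  failedRuns: number;
}

const ERROR_RATE_MIN_RUNS = 10;

// Rank agents by error rate, ignoring agents below the minimum run count.
function topByErrorRate(agents: AgentTotals[], limit = 5) {
  return agents
    .filter((a) => a.runs >= ERROR_RATE_MIN_RUNS)
    .map((a) => ({ ...a, errorRate: a.failedRuns / a.runs }))
    .sort((x, y) => y.errorRate - x.errorRate)
    .slice(0, limit);
}

const ranking = topByErrorRate([
  { agentId: "brand-new", runs: 1, failedRuns: 1 }, // 100% but only one run
  { agentId: "release-notes", runs: 21, failedRuns: 5 },
  { agentId: "support-triage", runs: 500, failedRuns: 40 }, // 8% over hundreds
]);
// ranking holds release-notes first, then support-triage; brand-new is filtered out.
```

Without the filter, the single-run agent would sort first at a 100% error rate.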

Limitations #

No real-time streaming of metrics. Dashboard charts re-read on page load. For the live event tail use GET /v1/orgs/:slug/agents/:id/events (separate, SSE).
No per-tool-call latency breakdown. durationMs is the user → assistant gap; we don't currently persist per-tool-call timings inside the assistant row. Filed for a future epic if customers ask.
Failure-rate threshold + window are global constants. v1 hardcodes windowSize = 50 and threshold = 0.20 in api/src/lib/agent-failure-alert.ts. Per-agent customization is deferred.
The /runs route's date-bound parser + the dashboard's failedExpr SQL share a small refactor backlog tracked in #272.

Troubleshooting #

My runs/day chart shows zeros for days I know had runs. The runtime persists agent_messages post-done-frame, inside the same transaction. Runs that ended in a runtime crash mid-stream never persisted, so they don't show in any of the metrics surfaces. Cross-check with the API logs ([agents] stream persist failed for conv …) or the live events tail at /agents/[id]/events.

p95DurationMs doesn't match what I see in the run history. PERCENTILE_CONT skips null durations. The run history table shows orphan assistants (no preceding user turn) with durationMs: null, and the strip's totals count them as runs — so the percentile denominator is smaller than the run count. By design: a "how long does a run take?" stat shouldn't be skewed by rows that have no well-defined start.

I never get my failure-rate alert. Three causes, returned by the watchdog as a reason:

window_unfilled — the agent has fewer than 50 runs in total. A brand-new agent's first failure won't alert.
below_threshold — failure rate over the last 50 is < 20%.
already_emitted_today — there's already an agent_alerts row for (agent_id, 'agent_failure_rate', today's UTC date). We dedupe to one alert per day; check agent_alerts to confirm.

The watchdog logs the reason every time a check completes without emitting; grep the API logs for [agent-failure-alert] to see the return value. The agent_alerts table is also queryable directly, with one row per agent per emit day.
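The three no-emit reasons can be read as a short decision function. The sketch below is not the code in api/src/lib/agent-failure-alert.ts: the `decideAlert` signature, the order of the checks, and the treatment of a rate exactly at the threshold as an emit are all assumptions made for illustration.

```typescript
type AlertDecision =
  | { emit: true; failedCount: number; failureRate: number }
  | { emit: false; reason: "window_unfilled" | "below_threshold" | "already_emitted_today" };

const DEFAULT_WINDOW_SIZE = 50;
const DEFAULT_FAILURE_RATE_THRESHOLD = 0.2;

// runFailed: one flag per run, oldest first, cancelled and blocked runs already removed.
function decideAlert(runFailed: boolean[], alreadyEmittedToday: boolean): AlertDecision {
  if (runFailed.length < DEFAULT_WINDOW_SIZE) {
    return { emit: false, reason: "window_unfilled" };
  }
  // Only the most recent 50 runs count.
  const recent = runFailed.slice(-DEFAULT_WINDOW_SIZE);
  const failedCount = recent.filter((f) => f).length;
  const failureRate = failedCount / DEFAULT_WINDOW_SIZE;

  if (failureRate < DEFAULT_FAILURE_RATE_THRESHOLD) {
    return { emit: false, reason: "below_threshold" };
  }
  // Dedupe: at most one alert per agent per UTC day.
  if (alreadyEmittedToday) {
    return { emit: false, reason: "already_emitted_today" };
  }
  return { emit: true, failedCount, failureRate };
}
```

With 12 failures in the last 50 runs and no alert yet today, this returns an emit decision with `failureRate` 0.24, matching the sample payload in the event shape section.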

CSV export is empty / 400 csv_export_too_large. Either no runs match your filter (most often a from / to window in which the agent had no runs), or the row set exceeds 50000. For the 400, narrow the date range, then retry. The endpoint accepts the same status=failed filter as the JSON sibling if you only want the failures.

Related #

Agent runs — cancellation — the cancelled status pill, the cancel API + webhook (agent_run.cancelled), and why cancellations are excluded from the failure-rate watchdog's numerator AND denominator.
Policy and guardrails — the blocked:<kind> status bucket, the topGuardrailsByBlocks aggregation, the dedicated agent_run.blocked webhook channel, and why blocks share the cancel-style failure-rate exclusion shape.
Audit log — once the failure-rate watchdog fires audit.flagged, this is where you go to ask "who triggered the run that crossed the threshold?" The flagged event is for live alerting; the audit log is the reviewable retrospective.
Webhooks — audit.flagged envelope, signing scheme, retry policy, dedupe on event.id. The failure-rate alert above is one of the curated audit.flagged payloads.
CLI tool sources — agent runs that fork CLI binaries fire agent_run.completed / agent_run.failed; the failure-rate watchdog observes those persists like any other run.
Billing and usage — the metrics here are drawn from agent_messages (most-honest view of the user-facing turn); usage_records is the parallel billing surface that drives the cost-control caps and Stripe overage.
