Agent versioning

The versioning surface lets operators promote, canary, and roll back agent deployments without re-provisioning the underlying Fly machine. A stech deploy mints a new immutable version; channel pointers (stable / canary) decide which version takes traffic. Rollback is a row-update on the channel pointer, not a re-deploy — the previous version's machine keeps running and traffic flips back instantly.

This page is the catalog: the model, the schema, the API surface, the CLI surface, the dashboard, the lifecycle webhooks, and the operator runbook for the sticky-by-conversation invariant that governs in-flight runs.

Why

Before this epic, stech deploy did a hard swap: the new deployment flipped to live and every prior live row for the same (org, agent) was marked superseded and its Fly machine destroyed (epic #178). A bad deploy meant 100% of traffic instantly hit the broken version, and the previous-known-good was gone. There was no canary, no rollback.

Versioning (#297) adds the missing lever: deployments stay immutable and only channel pointers move. A bad version doesn't have to be rebuilt from source; stech rollback flips stable back to the previous agent_versions row in one transaction.

The model

Three concepts, layered on top of the existing deployments table:

  • Deployment: the immutable artefact (tarball + Fly machine config) at a specific agent.ts point-in-time. Already exists as deployments rows; identity is id (cuid2). Versioning does not touch this table.
  • Version: a human-friendly label (v1, v2, v3.5-canary) attached to a deployment for memorability and rollback. Auto-assigned v<N> per agent at deploy time; manual overlay via stech version label. Lives in agent_versions.
  • Channel: a logical pointer (stable / canary) that maps an agent to a deployment plus a traffic_weight (0..1). The "live deployment for this agent" concept becomes a derived view of agent_channels rows where channel_name='stable'. Lives in agent_channels.

Critically: deployments stay immutable; only channel pointers move. Rolling back is a row-update on agent_channels, not a re-provision. The Fly machine for the previous version stays running (subject to retention) so traffic can flip back instantly.

A fresh stech deploy always lands as a new version, never auto-attached to a channel. The operator promotes explicitly via stech canary set / stech canary promote. This makes deploys safe ("just deploy, traffic doesn't move") and channel changes deliberate ("traffic moves only when I move it").

Schema

Two tables in db/src/schema/ — see agent-versions.ts and agent-channels.ts.

agent_versions

| Column | Type | Notes |
| --- | --- | --- |
| id | cuid2 PK | |
| organization_id | FK → organizations.id (cascade) | denormalised so (org, agent_name) is single-table |
| agent_name | text | matches deployments.agent_name |
| deployment_id | FK → deployments.id (cascade) | the version is meaningless without the deployment |
| label | text NOT NULL | auto-assigned v<N> per agent; manual overlay via stech version label |
| notes | text NULL | free-form, surfaced in dashboard list |
| created_by_user_id | FK → users.id (set null) | audit history survives operator churn |
| created_at, updated_at | timestamptz | |

UNIQUE on (organization_id, agent_name, label) — two versions of the same agent can't collide on label. Index on (organization_id, agent_name, created_at DESC) for the version-list query.

agent_channels

| Column | Type | Notes |
| --- | --- | --- |
| id | cuid2 PK | |
| organization_id | FK → organizations.id (cascade) | |
| agent_name | text | |
| channel_name | text NOT NULL | CHECK-locked to 'stable' or 'canary' |
| deployment_id | FK → deployments.id (set null) | nullable so a cleared canary keeps its row |
| traffic_weight | numeric(4,3) NOT NULL | CHECK 0.0 ≤ w ≤ 1.0; resolution down to 0.001 (0.1% canary) |
| updated_by_user_id | FK → users.id (set null) | audit history |
| created_at, updated_at | timestamptz | |

UNIQUE on (organization_id, agent_name, channel_name). Index on (organization_id, agent_name) for the run-stream router's hot-path lookup.

Invariants

  • channel_name vocabulary is closed at write time. The CHECK refuses typos like 'staging' so a misnamed channel can't silently never-route. Widen the IN-list (no row migration) if more channels land.
  • traffic_weight ∈ [0, 1]. The CHECK refuses 1.5 so a single channel can never claim more than 100% of traffic. The API also enforces canary.weight ≤ 0.5 ("canary stays small") at the route layer.
  • deployment_id nullable on channels. A cleared canary keeps the row (audit + future re-set without losing created_at); the router treats NULL as "channel inactive, do not consider for routing".
  • No auto-attach on deploy. mintAgentVersionOnDeploy writes to agent_versions only; agent_channels is operator-driven.

Backfill

The schema migration (db/drizzle/0027_agent_versioning.sql) is DDL-only. The data backfill lives in api/scripts/backfill-agent-channels.ts: for each existing live deployment, it mints a v<N> versions row plus a stable channel at weight 1.0. It is idempotent and skips agents that already have a stable channel. Run it as bun --filter '@stech/api' run backfill-agent-channels.
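The idempotency rule can be pictured as a pure planning step. This is only a sketch under assumptions, not the contents of the backfill script: the LiveDeployment, ChannelKey and BackfillStep shapes and the planBackfill name are hypothetical, and label assignment is left out because the source does not say which v<N> the backfill picks.

```typescript
// Hypothetical shapes for illustration; the real script reads rows from the database.
interface LiveDeployment {
  organizationId: string;
  agentName: string;
  deploymentId: string;
}

interface ChannelKey {
  organizationId: string;
  agentName: string;
  channelName: string;
}

interface BackfillStep extends LiveDeployment {
  channelName: "stable";
  trafficWeight: 1;
}

// Returns the stable channels that still need to be created.
// Agents that already have a stable channel are skipped, so a second run plans nothing.
function planBackfill(live: LiveDeployment[], existing: ChannelKey[]): BackfillStep[] {
  const keyOf = (org: string, agent: string) => JSON.stringify([org, agent]);
  const hasStable = new Set(
    existing
      .filter((c) => c.channelName === "stable")
      .map((c) => keyOf(c.organizationId, c.agentName)),
  );

  const steps: BackfillStep[] = [];
  for (const d of live) {
    const key = keyOf(d.organizationId, d.agentName);
    if (hasStable.has(key)) continue;
    hasStable.add(key); // assumption: at most one stable channel is planned per agent
    steps.push({ ...d, channelName: "stable", trafficWeight: 1 });
  }
  return steps;
}
```

Feeding the planned steps back in as existing channels yields an empty plan, which is the property that makes a re-run safe.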

Routing

Run-stream entry resolves (org, agent_name) from the requested deployment id, then consults agent_channels to pick a target by weighted hash. See api/src/lib/agent-router.ts.

  • Bucketing. bucketFor(key) = SHA-256 first-4-byte uint32 mod 10_000. Deterministic, distributes uniformly. pickChannel walks routable channels in lexicographic name order (canary < stable), accumulates weights, picks the first cumulative-mass-passing channel.
  • Sticky-by-conversation. The hash key is the conversationId alone — once a conversation lands on a channel, all subsequent runs in that conversation hit the same channel for the lifetime of the conversation, even if weights change later. One-shot mode (no conversationId) hashes a fresh crypto.randomUUID() so each run picks independently and the weight ratio holds over many calls.
  • Cache. In-process Map, 5s TTL per (org, agent). Mutation endpoints call invalidateRoutingCacheFor(org, agent) so a canary set propagates faster than the TTL. Multi-replica divergence is bounded by the TTL — eventually consistent.
  • Fallback. No channels → fall through to the requested deployment id. Legacy / un-backfilled / cleared-channels states don't go dark.
  • Conversation history. Persisted-thread mode reads/writes agent_messages against the conversation, not the deployment, so prior messages survive a channel flip. A canary version receiving a conversation that was previously on stable inherits whatever message history the previous version produced.
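The bucketing and the cumulative walk described above can be sketched as follows. This is a minimal illustration, not the contents of api/src/lib/agent-router.ts: the ChannelRow shape, the big-endian byte order, the rounding of weights to whole buckets, skipping zero-weight channels, and falling back to the last routable channel when the weights do not cover every bucket are all assumptions. It only models the deterministic pick for a fixed set of weights; whatever keeps a conversation pinned across later weight changes is outside what it shows.

```typescript
import { createHash, randomUUID } from "node:crypto";

// Hypothetical row shape for illustration.
interface ChannelRow {
  channelName: "stable" | "canary";
  deploymentId: string | null; // null means "channel inactive"
  trafficWeight: number; // fraction in 0..1
}

const BUCKETS = 10_000;

// First 4 bytes of SHA-256(key) read as a uint32 (big-endian assumed), mod 10_000.
function bucketFor(key: string): number {
  const digest = createHash("sha256").update(key).digest();
  return digest.readUInt32BE(0) % BUCKETS;
}

// Walks routable channels in lexicographic name order (canary < stable),
// accumulates weight, and returns the first channel whose cumulative mass
// passes the bucket. Returns null when nothing is routable so the caller
// can fall through to the requested deployment id.
function pickChannel(channels: ChannelRow[], conversationId?: string): ChannelRow | null {
  const routable = channels
    .filter((c) => c.deploymentId !== null && c.trafficWeight > 0)
    .sort((a, b) => (a.channelName < b.channelName ? -1 : a.channelName > b.channelName ? 1 : 0));
  if (routable.length === 0) return null;

  // One-shot mode hashes a fresh UUID so each run picks independently.
  const bucket = bucketFor(conversationId ?? randomUUID());

  let cumulative = 0;
  for (const channel of routable) {
    cumulative += Math.round(channel.trafficWeight * BUCKETS);
    if (bucket < cumulative) return channel;
  }
  // Assumption: leftover mass goes to the last channel in name order.
  return routable[routable.length - 1];
}
```

Because the key is the conversationId alone, calling pickChannel twice with the same id and the same channel rows returns the same channel, while calls without an id spread across channels in proportion to the weights.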

API

All routes under /v1/orgs/:slug/agents/:agentName/.... GETs are any-member; mutations require an elevated role (owner / admin), same gate as stech deploy. See api/src/routes/agent-versions.ts.

| Method | Path | Auth | Purpose |
| --- | --- | --- | --- |
| GET | /versions | any-member | list versions + active channels for an agent |
| POST | /versions/:label | elevated | overlay a manual label on a deployment |
| GET | /channels | any-member | list channels (stable, canary, …) |
| PUT | /channels/:name | elevated | set deployment + weight on a channel |
| DELETE | /channels/:name | elevated | clear a channel (canary only; stable refused) |
| POST | /rollback | elevated | flip stable to a previous version atomically |

Validation:

  • name ∈ {stable, canary} (matches the DB CHECK so a typo surfaces as 400, not a Postgres 500).
  • weight ∈ [0, 1]; for name='canary', weight ≤ 0.5.
  • DELETE /channels/stable is refused: stable is the implicit fall-through, so clearing it would leave the agent unreachable.
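The three rules above could be expressed as a pair of small validators. This is a sketch under assumptions rather than the route code in api/src/routes/agent-versions.ts: the function names, the error codes, and the use of 400 for the stable-delete refusal are hypothetical (the source only states that a bad channel name surfaces as 400).

```typescript
type Validation = { ok: true } | { ok: false; status: 400; error: string };

const CHANNEL_NAMES: ReadonlySet<string> = new Set(["stable", "canary"]);
const CANARY_MAX_WEIGHT = 0.5;

// Error codes below are placeholders for illustration.
function reject(error: string): Validation {
  return { ok: false, status: 400, error };
}

function validateChannelPut(name: string, weight: number): Validation {
  // Mirrors the DB CHECK so a typo fails here instead of in Postgres.
  if (!CHANNEL_NAMES.has(name)) return reject("invalid_channel_name");
  if (!Number.isFinite(weight) || weight < 0 || weight > 1) return reject("invalid_weight");
  // "Canary stays small": enforced at the route layer, not in the schema.
  if (name === "canary" && weight > CANARY_MAX_WEIGHT) return reject("canary_weight_too_large");
  return { ok: true };
}

function validateChannelDelete(name: string): Validation {
  if (!CHANNEL_NAMES.has(name)) return reject("invalid_channel_name");
  // Clearing stable would leave the agent without a fall-through target.
  if (name === "stable") return reject("stable_not_clearable");
  return { ok: true };
}
```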

PUT /channels/:name accepts either versionLabel (preferred) or deploymentId in the body and resolves to a deployment_id server-side. Each mutation calls invalidateRoutingCacheFor and fires a lifecycle webhook (see Webhooks).

POST /rollback accepts { to?: string } — omit for "one step back by created_at", supply a label for arbitrary. Stable upsert + canary clear run in a single transaction. Refuses with 404 no_rollback_target if there's no previous version (or the named label doesn't exist).
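Target selection for the rollback might look like the sketch below. It is illustrative only: the VersionRow shape and the pickRollbackTarget name are hypothetical, and "one step back by created_at" is read here as "the newest version created before the one stable currently points at", which is an interpretation rather than something the source spells out.

```typescript
// Hypothetical row shape for illustration.
interface VersionRow {
  label: string;
  deploymentId: string;
  createdAt: Date;
}

type RollbackResult =
  | { ok: true; target: VersionRow }
  | { ok: false; status: 404; error: "no_rollback_target" };

function pickRollbackTarget(
  versions: VersionRow[],
  currentStableDeploymentId: string | null,
  to?: string,
): RollbackResult {
  const miss: RollbackResult = { ok: false, status: 404, error: "no_rollback_target" };

  // Explicit label: arbitrary rollback.
  if (to !== undefined) {
    const named = versions.find((v) => v.label === to);
    return named ? { ok: true, target: named } : miss;
  }

  // No label: one step back from whatever stable points at today.
  const current = versions.find((v) => v.deploymentId === currentStableDeploymentId);
  if (!current) return miss;

  const previous: VersionRow | undefined = versions
    .filter((v) => v.createdAt.getTime() < current.createdAt.getTime())
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime())[0];

  return previous ? { ok: true, target: previous } : miss;
}
```

The stable upsert and the canary clear that follow the selection would then run inside the single transaction the route describes.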

Auto-mint on deploy

POST /v1/orgs/:slug/agents (the upload route) calls mintAgentVersionOnDeploy after the deployments INSERT. Sequential v<N> per agent; manual labels (v3.5-canary) are skipped when computing the next ordinal so the auto sequence stays predictable. A UNIQUE collision retries once on the next ordinal; a persistent collision logs and the deploy still succeeds — versions are metadata, the deployment row is the source of truth for "is the agent live?".
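A sketch of the ordinal computation and the single retry, under stated assumptions: the helper names are hypothetical, "next ordinal" is taken to mean the highest existing auto label plus one, and tryInsert stands in for the real INSERT, returning false on a UNIQUE collision.

```typescript
// Only plain v<N> labels count as auto-assigned; v3.5-canary and friends are skipped.
const AUTO_LABEL = /^v(\d+)$/;

function nextAutoLabel(existingLabels: string[]): string {
  let max = 0;
  for (const label of existingLabels) {
    const match = AUTO_LABEL.exec(label);
    if (match) max = Math.max(max, Number(match[1]));
  }
  return `v${max + 1}`;
}

// Tries the next ordinal, retries once on the ordinal after it, then gives up.
// A null result means "log and carry on": the deploy itself still succeeds.
function mintLabel(
  existingLabels: string[],
  tryInsert: (label: string) => boolean,
): string | null {
  const first = nextAutoLabel(existingLabels);
  if (tryInsert(first)) return first;

  const second = nextAutoLabel([...existingLabels, first]);
  if (tryInsert(second)) return second;

  return null;
}
```

With labels v1, v2 and v3.5-canary already present, the next auto label under these assumptions is v3: the manual label does not advance the sequence.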

The mint fires agent_version.deployed; it does not touch agent_channels.

CLI

Three top-level command families, defined in cli/src/runtime/versioning.ts. Each mutating command re-fetches channels post-mutation and prints the stable: v3 (90%) · canary: v4 (10%) one-liner so the operator sees the resulting state without a follow-up stech version list.

stech version

stech version list <agent> [--json]
        Show every version of <agent>; marks active channels.

stech version label <agent> <deployment-id> <label>
        Overlay a memorable label (e.g. 'v3.5-canary') on a
        deployment. Replaces the auto-assigned 'v<N>' label.

stech version list prints a table with columns label, deployment (short id), status (live / failed / superseded / destroyed), channel (which channel(s) point here), created. The active-channel header line above the table mirrors the post-mutation confirmation.

stech canary

stech canary set <agent> <version> --weight=<N>
        Point canary at <version> with N% of new traffic.
        <version> may be a label (v3) or a deployment id.
        --weight is 1..50 (canary stays small; use 'promote' to swap).

stech canary promote <agent>
        Atomic-ish: canary becomes stable; canary channel is cleared.

stech canary remove <agent>
        Clear the canary channel; all NEW conversations route to stable.

--weight is an integer percent (1..50) in the CLI for ergonomics; the CLI converts it to a 0..1 fraction before calling the API. Values > 50 are refused: that's not a canary, that's a swap-in-disguise; use canary promote instead.
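The percent-to-fraction conversion is small enough to show in full. The function name and error messages are hypothetical; only the 1..50 range and the 0..1 output come from the text.

```typescript
// Parses the raw --weight flag value and returns the fraction sent to the API.
function canaryWeightFromPercent(raw: string): number {
  if (!/^\d+$/.test(raw)) {
    throw new Error(`--weight must be an integer percent, got "${raw}"`);
  }
  const percent = Number(raw);
  if (percent < 1 || percent > 50) {
    throw new Error("--weight must be between 1 and 50; use 'canary promote' to swap");
  }
  return percent / 100;
}
```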

canary promote composes two existing primitives client-side: GET channels → PUT /channels/stable with the canary's deployment at weight 1.0 → DELETE /channels/canary. The PUT runs first so a failure between the two steps leaves stable correctly pointing at the formerly-canary deployment (the worse failure mode would be canary-cleared-but-stable-still-old).
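The ordering argument is easier to see in code. The sketch below assumes a hypothetical ChannelsClient wrapper around the three HTTP calls and a body field named weight; neither is confirmed by the source, which only states that the body carries versionLabel or deploymentId.

```typescript
// Hypothetical shapes for illustration; the CLI's real HTTP layer is not shown here.
interface ChannelState {
  channelName: string;
  deploymentId: string | null;
  trafficWeight: number;
}

interface ChannelsClient {
  listChannels(): Promise<ChannelState[]>;
  putChannel(name: string, body: { deploymentId: string; weight: number }): Promise<void>;
  deleteChannel(name: string): Promise<void>;
}

// GET channels, then PUT stable, then DELETE canary, in that order.
// If the PUT throws, the DELETE never runs and nothing has changed.
// If the DELETE throws, stable already points at the promoted deployment.
async function promoteCanary(client: ChannelsClient): Promise<string> {
  const channels = await client.listChannels();
  const canary = channels.find((c) => c.channelName === "canary");
  if (!canary || canary.deploymentId === null) {
    throw new Error("no active canary to promote");
  }

  await client.putChannel("stable", { deploymentId: canary.deploymentId, weight: 1.0 });
  await client.deleteChannel("canary");
  return canary.deploymentId;
}
```

Reversing the two writes would make the failure window worse: a cleared canary with stable still on the old version silently discards the promotion.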

stech rollback

stech rollback <agent>
        Revert stable to the previous version (one step back).

stech rollback <agent> --to=<version>
        Revert stable to an arbitrary previous version (label).

The API runs the rollback in one transaction (stable upsert + canary clear). Existing conversations pinned to the previously-stable version finish on it per the sticky-by-conversation invariant; only NEW conversations route to the rolled-back version.

Dashboard

/agents/[id]/versions — newest-first list of versions for an agent. Source files in app/app/agents/[id]/versions/.

Per-row columns:

  • Label (v3, v4, etc.) and created at / created by.
  • Channel badges: inline stable · 100% / canary · 10% on the row that channel currently points at, with weight tags.
  • Deployment-status pill: live / failed / superseded / destroyed. An operator can see at a glance whether a version's machine is still around to roll back to.
  • Runs cross-link: to /agents/<deployment-id>/runs so error rates can be compared across versions before promoting (the runs page is keyed on deployment id, so each version is its own runs window).
  • Per-row actions: promote to canary, promote to stable, rollback to this. Gated by actionGateFor so each version sees only the actions that make sense for its current role.

Confirm dialog on every blast-radius action — the dialog copy repeats the sticky-by-conversation invariant ("rollback does NOT yank in-flight runs"; "pinned-to-canary conversations finish on their channel") exactly when the operator is about to commit. No silent commits.

Channel-state banner (stable: v3 (90%) · canary: v4 (10%)) on the chat page (/agents/[id]), Suspense-wrapped so a slow / 5xx versions endpoint doesn't block the chat shell.

Auth gate (defense in depth). Client-side: callerCanMutate = ELEVATED_ROLES.has(callerRole) reads the membership role from the session DTO; non-mutating callers see "owner / admin only" instead of buttons. Server-side: every mutation in api/src/routes/agent-versions.ts is gated to ROLES_ELEVATED, so a tampered client gets 403 from the proxy. The dashboard gate is purely a UX courtesy.

Webhooks

Four event types fire from the channel mutation seam. Same envelope shape as every other agent_run.* / deployment.* event in webhooks.md. Fire-and-forget — a webhook delivery failure cannot fault the upstream channel mutation.

| Event type | When it fires | Notes |
| --- | --- | --- |
| agent_version.deployed | upload-time, after mintAgentVersionOnDeploy | pair with deployment.created for live-machine reachability |
| agent_version.promoted_to_canary | PUT /channels/canary | first-attach has previousDeploymentId: null |
| agent_version.promoted_to_stable | PUT /channels/stable | green-metrics canary graduation |
| agent_version.rolled_back | POST /rollback | distinct from promoted_to_stable so SIEM tooling can branch ("production was reverted" vs "canary graduated") |

DELETE /channels/canary deliberately does not emit an event — clearing isn't a promotion.

Payload shape (agent_version.promoted_to_canary / _to_stable):

{
  "id": "1a7f3e22-8c91-4a02-b7d4-3e1f8c0a2b54",
  "type": "agent_version.promoted_to_canary",
  "createdAt": "2026-05-09T08:25:42.310Z",
  "organizationId": "org_2t4b...",
  "data": {
    "agentName": "support-triage",
    "channelName": "canary",
    "versionLabel": "v4",
    "deploymentId": "dep_4kp...",
    "trafficWeight": 0.1,
    "actorUserId": "usr_9q1...",
    "previousDeploymentId": null
  }
}

previousDeploymentId is captured before the upsert so receivers see the real diff, not the post-state. The full set of example bodies (plus the agent_version.deployed and agent_version.rolled_back shapes) lives in webhooks.md.

A subscriber can wire CI/CD: auto-promote on green metrics, post a Slack message on rollback, pipe rollback events to PagerDuty as a "production was reverted" signal. Subscribe with ["agent_version.deployed", "agent_version.promoted_to_canary", "agent_version.promoted_to_stable", "agent_version.rolled_back"] or the wildcard ["*"].
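A receiver that branches on the four event types might start from something like this. It is a sketch, not a reference client: the type and function names are hypothetical, signature verification is omitted (see webhooks.md for the signing scheme), and the data fields marked optional are guesses for the deployed and rolled_back shapes, which are documented in webhooks.md rather than here.

```typescript
type AgentVersionEventType =
  | "agent_version.deployed"
  | "agent_version.promoted_to_canary"
  | "agent_version.promoted_to_stable"
  | "agent_version.rolled_back";

interface AgentVersionEvent {
  id: string;
  type: AgentVersionEventType;
  createdAt: string;
  organizationId: string;
  data: {
    agentName: string;
    versionLabel: string;
    deploymentId: string;
    // Present on the promotion payload shown above; assumed optional elsewhere.
    channelName?: string;
    trafficWeight?: number;
    actorUserId?: string | null;
    previousDeploymentId?: string | null;
  };
}

// Turns an event into a one-line message, e.g. for a Slack post or a pager note.
function describeEvent(event: AgentVersionEvent): string {
  const { agentName, versionLabel } = event.data;
  switch (event.type) {
    case "agent_version.deployed":
      return `${agentName}: ${versionLabel} deployed (no traffic yet)`;
    case "agent_version.promoted_to_canary": {
      const percent = Math.round((event.data.trafficWeight ?? 0) * 100);
      const first = event.data.previousDeploymentId == null ? " (first canary)" : "";
      return `${agentName}: ${versionLabel} on canary at ${percent}%${first}`;
    }
    case "agent_version.promoted_to_stable":
      return `${agentName}: ${versionLabel} promoted to stable`;
    case "agent_version.rolled_back":
      return `${agentName}: production was reverted to ${versionLabel}`;
  }
}
```

Keeping rolled_back as its own branch is what lets a receiver page on reverts while staying quiet on routine graduations.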

Operator runbook

Sticky-by-conversation in plain terms

A conversation that started on stable keeps hitting stable for the rest of its life — even after you canary-promote, even after you rollback. The same is true for canary: a conversation pinned to v4 finishes on v4 even after you canary remove or canary promote, until the conversation ends or the deployment's Fly machine is destroyed.

This is desirable, not a bug. Mid-conversation state shouldn't get destroyed by an admin's rollback. The dashboard's rollback confirm dialog calls this out: "N active conversations on the previous version will continue there until they end."

If you need to forcibly drain a version, destroy its deployment (the Fly machine goes away; the next agent_messages write fails; the client surfaces an error). There's no soft-drain primitive in v1.

When to use canary vs direct stable

  • Canary — when you want to validate a new version against real traffic before committing. Set canary at 10% (the default), watch the per-version metrics strip on the dashboard for a few hours, then canary promote if green or canary remove if not. Sticky-by-conv means a small slice of users sees the new behaviour consistently.
  • Direct stable (PUT /channels/stable or "promote to stable" on the dashboard) — when you've already validated the version externally (e.g. extensive stech dev testing) and the canary step is overkill. Skips the validation window; all NEW conversations route to the new version immediately.
  • Rollback — when something is on fire. One step back is stech rollback <agent>; arbitrary is --to=v2. Faster than a re-deploy because the Fly machine is already running.

The "deploy a new version while canary is at 10%" case

A fresh stech deploy always lands as a new version, never auto-attached to a channel. The previous canary version becomes a non-pointed version (kept per retention policy below). To shift traffic to the new version, run stech canary set <agent> <new-label> --weight=10 explicitly. This makes deploys safe and channel changes deliberate.

Retention

Per the epic, v1 keeps all versions with traffic + the last 10 idle versions per agent. The cap is hardcoded today; a future org_settings.version_retention_idle knob will make it per-org configurable. When a version ages out, its Fly machine is destroyed via the same createFlyDestroyFn path the supersede-and-destroy flow uses today.
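The retention rule reduces to "never touch a version with traffic, keep the newest 10 idle ones, destroy the rest". A sketch under assumptions: the RetentionCandidate shape and function name are hypothetical, "with traffic" is collapsed to a boolean the caller computes, and "last 10" is read as the 10 most recent by created_at.

```typescript
// Hypothetical shape for illustration.
interface RetentionCandidate {
  label: string;
  createdAt: Date;
  hasTraffic: boolean; // assumption: the caller decides what counts as "with traffic"
}

const IDLE_VERSION_CAP = 10; // hardcoded in v1 per the text above

// Returns the idle versions that have aged out, newest first.
function versionsToDestroy(
  versions: RetentionCandidate[],
  idleCap: number = IDLE_VERSION_CAP,
): RetentionCandidate[] {
  const idleNewestFirst = versions
    .filter((v) => !v.hasTraffic)
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime());
  return idleNewestFirst.slice(idleCap);
}
```

Each returned version would then go through the existing destroy path for its Fly machine.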

Diagnosing a stuck conversation

If a user reports they're seeing the old behaviour after a rollback:

  1. Confirm their conversationId from the run history (/agents/[id]/runs).
  2. The router pinned them to whichever channel was active when the conversation started. They will continue there until the conversation ends.
  3. Have them start a new conversation; the new bucket pick will land on the rolled-back version per the current weights.

If a webhook receiver claims it didn't see a promotion:

  1. Check the event subscription includes the right agent_version.* types (or ["*"]).
  2. Check the webhook_deliveries table for the event_id — see webhooks.md.
  3. DELETE /channels/canary does NOT emit — if you're listening for "canary cleared", listen for agent_version.promoted_to_stable (which clears canary as part of canary promote) or agent_version.rolled_back (which clears canary as part of the rollback transaction) instead.

Limitations

  • No auto-rollback on metric breach. A "canary error rate >5% for 10 min → revert" trigger is filed as a future epic; needs a metrics-driven control loop, which is observability + alerts territory.
  • No A/B testing within one version. Running two configs concurrently against the same conversation, measuring, and picking a winner is a separate epic.
  • No rolling deploys. Gradually shifting weight automatically over a window (e.g. 0% → 10% → 50% → 100% over an hour) is filed separately. Today: operator sets weight, watches, sets again.
  • No multi-region routing. Canary in one region while stable serves the others is filed separately.
  • No per-customer routing. "Customer X always lands on canary" is filed separately.
  • No weight slider on the dashboard. First-set lands at the locked 10% default; rebalancing requires a CLI roundtrip (stech canary set <agent> <label> --weight=20) today.
  • Cache eventual-consistency. The 5s TTL on the per-(org, agent) channels cache means a cross-replica view of a fresh canary set can lag by up to 5s. Documented; same posture as the rate-limiter and oauth-state-store caches.
  • Fly machine count grows. Each version = a machine (or N machines per channel). Cost scales with retention. The default cap of 10 idle versions plus all-versions-with-traffic is the v1 ceiling.

Cross-references

  • agent-runs.md — the run-history surface. deploymentId filters on the runs page key the per-version metrics cross-link.
  • observability.md — the per-deployment metrics the dashboard's runs cross-link points at; useful for comparing error rates across versions before promoting.
  • webhooks.md — envelope, signing scheme, retry policy. The agent_version.* events use the same delivery path as every other event.
  • billing-and-usage.md — caps apply per-org regardless of version; a runaway canary still counts against the org's cap_runs / cap_input_tokens.
  • policy-and-guardrails.md — guardrails are per-deployment (declared on defineAgent()), so a new version with relaxed guardrails inherits its own policy. A rollback to a stricter guardrail set takes effect for the next conversation that lands on the rolled-back version.