Agent versioning
The versioning surface lets operators promote, canary, and roll back
agent deployments without re-provisioning the underlying Fly machine.
A stech deploy mints a new immutable version; channel pointers
(stable / canary) decide which version takes traffic. Rollback is
a row-update on the channel pointer, not a re-deploy — the previous
version's machine keeps running and traffic flips back instantly.
This page is the catalog: the model, the schema, the API surface, the CLI surface, the dashboard, the lifecycle webhooks, and the operator runbook for the sticky-by-conversation invariant that governs in-flight runs.
Why
Before this epic, stech deploy did a hard swap: the new deployment
flipped to live and every prior live row for the same (org, agent) was marked superseded and its Fly machine destroyed (epic
#178).
A bad deploy meant 100% of traffic instantly hit the broken version,
and the previous-known-good was gone. There was no canary, no
rollback.
Versioning (#297)
adds the missing lever: deployments stay immutable and
only channel pointers move. A bad version doesn't have to be rebuilt
from source; stech rollback flips stable back to the previous
agent_versions row in one transaction.
The model
Three concepts, layered on top of the existing deployments table:
- Deployment: the immutable artefact (tarball + Fly machine config) at a specific agent.ts point-in-time. Already exists as deployments rows; identity is id (cuid2). Versioning does not touch this table.
- Version: a human-friendly label (v1, v2, v3.5-canary) attached to a deployment for memorability and rollback. Auto-assigned v<N> per agent at deploy time; manual overlay via stech version label. Lives in agent_versions.
- Channel: a logical pointer (stable / canary) that maps an agent to a deployment plus a traffic_weight (0..1). The "live deployment for this agent" concept becomes a derived view of agent_channels rows where channel_name='stable'. Lives in agent_channels.
Critically: deployments stay immutable; only channel pointers
move. Rolling back is a row-update on agent_channels, not a
re-provision. The Fly machine for the previous version stays running
(subject to retention) so traffic can flip back instantly.
A fresh stech deploy always lands as a new version, never
auto-attached to a channel. The operator promotes explicitly via
stech canary set / stech canary promote. This makes deploys safe
("just deploy, traffic doesn't move") and channel changes deliberate
("traffic moves only when I move it").
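The layering above can be sketched as plain TypeScript types. This is an illustrative model under stated assumptions, not the project's schema code: the interface names and the two helpers (liveDeploymentId, pointStableAt) are hypothetical, and only the table and column names come from this page.

```typescript
// Illustrative types; field names mirror the columns described on this page.
type ChannelName = "stable" | "canary";

interface AgentVersion {
  agentName: string;
  deploymentId: string; // the immutable deployment this label is attached to
  label: string; // e.g. "v3" or "v3.5-canary"
}

interface AgentChannel {
  agentName: string;
  channelName: ChannelName;
  deploymentId: string | null; // null = channel cleared
  trafficWeight: number; // 0..1
}

// Hypothetical helper: the "live deployment" is derived from the stable channel row.
function liveDeploymentId(channels: AgentChannel[]): string | null {
  const stable = channels.find((c) => c.channelName === "stable");
  return stable?.deploymentId ?? null;
}

// Hypothetical helper: moving traffic is a pointer update on the channel row.
// No deployment or version row is created, changed, or removed.
function pointStableAt(channels: AgentChannel[], version: AgentVersion): AgentChannel[] {
  return channels.map((c) =>
    c.channelName === "stable" ? { ...c, deploymentId: version.deploymentId } : c,
  );
}
```

The point of the sketch is that pointStableAt never touches the version list: the same function serves a promotion and a rollback, and only the target row differs.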
Schema
Two tables in db/src/schema/ — see
agent-versions.ts and
agent-channels.ts.
agent_versions
| Column | Type | Notes |
|---|---|---|
| id | cuid2 PK | |
| organization_id | FK → organizations.id (cascade) | denormalised so (org, agent_name) is single-table |
| agent_name | text | matches deployments.agent_name |
| deployment_id | FK → deployments.id (cascade) | the version is meaningless without the deployment |
| label | text NOT NULL | auto-assigned v<N> per agent; manual overlay via stech version label |
| notes | text NULL | free-form, surfaced in dashboard list |
| created_by_user_id | FK → users.id (set null) | audit history survives operator churn |
| created_at, updated_at | timestamptz | |

UNIQUE on (organization_id, agent_name, label) — two versions of the
same agent can't collide on label. Index on (organization_id, agent_name, created_at DESC) for the version-list query.
agent_channels
| Column | Type | Notes |
|---|---|---|
| id | cuid2 PK | |
| organization_id | FK → organizations.id (cascade) | |
| agent_name | text | |
| channel_name | text NOT NULL | CHECK-locked to 'stable' or 'canary' |
| deployment_id | FK → deployments.id (set null) | nullable so a cleared canary keeps its row |
| traffic_weight | numeric(4,3) NOT NULL | CHECK 0.0 ≤ w ≤ 1.0; resolution down to 0.001 (0.1% canary) |
| updated_by_user_id | FK → users.id (set null) | audit history |
| created_at, updated_at | timestamptz | |

UNIQUE on (organization_id, agent_name, channel_name). Index on
(organization_id, agent_name) for the run-stream router's hot-path
lookup.
Invariants
- channel_name vocabulary is closed at write time. The CHECK refuses typos like 'staging' so a misnamed channel can't silently never-route. Widen the IN-list (no row migration) if more channels land.
- traffic_weight ∈ [0, 1]. The CHECK refuses 1.5 so 150% of routing dice can't go to one channel. The api also enforces canary.weight ≤ 0.5 ("canary stays small") at the route layer.
- deployment_id nullable on channels. A cleared canary keeps the row (audit + future re-set without losing created_at); the router treats NULL as "channel inactive, do not consider for routing".
- No auto-attach on deploy. mintAgentVersionOnDeploy writes to agent_versions only; agent_channels is operator-driven.
Backfill
The schema migration (db/drizzle/0027_agent_versioning.sql) is
DDL-only. The data backfill lives in
api/scripts/backfill-agent-channels.ts: for each existing live
deployment, it mints a v<N> versions row plus a stable channel at weight 1.0.
Idempotent: skips agents that already have a stable channel. Run as
bun --filter '@stech/api' run backfill-agent-channels.
Routing
Run-stream entry resolves (org, agent_name) from the requested
deployment id, then consults agent_channels to pick a target by
weighted hash. See
api/src/lib/agent-router.ts.
- Bucketing. bucketFor(key) = SHA-256 first-4-byte uint32 mod 10_000. Deterministic, distributes uniformly. pickChannel walks routable channels in lexicographic name order (canary < stable), accumulates weights, picks the first cumulative-mass-passing channel.
- Sticky-by-conversation. The hash key is the conversationId alone: once a conversation lands on a channel, all subsequent runs in that conversation hit the same channel for the lifetime of the conversation, even if weights change later. One-shot mode (no conversationId) hashes a fresh crypto.randomUUID() so each run picks independently and the weight ratio holds over many calls.
- Cache. In-process Map, 5s TTL per (org, agent). Mutation endpoints call invalidateRoutingCacheFor(org, agent) so a canary set propagates faster than the TTL. Multi-replica divergence is bounded by the TTL (eventually consistent).
- Fallback. No channels → fall through to the requested deployment id. Legacy / un-backfilled / cleared-channels states don't go dark.
- Conversation history. Persisted-thread mode reads/writes agent_messages against the conversation, not the deployment, so prior messages survive a channel flip. A canary version receiving a conversation that was previously on stable inherits whatever message history the previous version produced.
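The bucketing and channel-pick rules above can be sketched as follows. This is a minimal illustration, not the contents of agent-router.ts: the ChannelRow shape and routingKey helper are hypothetical, and three details are assumptions the page does not state (big-endian byte order for the uint32, zero-weight channels treated as non-routable, and the last routable channel absorbing any remainder if weights sum to less than 1). The sketch only shows that a given key resolves deterministically for a fixed set of weights; how a conversation stays pinned when weights change later is not modeled here.

```typescript
import { createHash, randomUUID } from "node:crypto";

interface ChannelRow {
  channelName: "stable" | "canary";
  deploymentId: string | null; // null = inactive, skipped by the router
  trafficWeight: number; // 0..1
}

const BUCKETS = 10_000;

// First 4 bytes of SHA-256(key) as an unsigned 32-bit integer, mod 10_000.
// Big-endian byte order is an assumption.
function bucketFor(key: string): number {
  const digest = createHash("sha256").update(key).digest();
  return digest.readUInt32BE(0) % BUCKETS;
}

// One-shot mode has no conversationId, so each run hashes a fresh UUID.
function routingKey(conversationId?: string): string {
  return conversationId ?? randomUUID();
}

// Walk routable channels in lexicographic name order (canary < stable),
// accumulate weight, and return the first channel whose cumulative mass
// exceeds the bucket.
function pickChannel(channels: ChannelRow[], key: string): ChannelRow | null {
  const routable = channels
    .filter((c) => c.deploymentId !== null && c.trafficWeight > 0)
    .sort((a, b) => (a.channelName < b.channelName ? -1 : a.channelName > b.channelName ? 1 : 0));
  if (routable.length === 0) return null; // caller falls through to the requested deployment id

  const bucket = bucketFor(key);
  let cumulative = 0;
  for (const channel of routable) {
    cumulative += Math.round(channel.trafficWeight * BUCKETS);
    if (bucket < cumulative) return channel;
  }
  // Assumption: the last routable channel absorbs any remainder.
  return routable[routable.length - 1] ?? null;
}
```

Rounding each weight to whole buckets keeps the thresholds integral, which matches the 0.001 resolution of traffic_weight with room to spare.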
API
All routes under /v1/orgs/:slug/agents/:agentName/.... GETs are
any-member; mutations require an elevated role (owner / admin), same
gate as stech deploy. See
api/src/routes/agent-versions.ts.
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | /versions | any-member | list versions + active channels for an agent |
| POST | /versions/:label | elevated | overlay a manual label on a deployment |
| GET | /channels | any-member | list channels (stable, canary, …) |
| PUT | /channels/:name | elevated | set deployment + weight on a channel |
| DELETE | /channels/:name | elevated | clear a channel (canary only; stable refused) |
| POST | /rollback | elevated | flip stable to a previous version atomically |

Validation:
- name ∈ {stable, canary} (matches the DB CHECK so a typo surfaces as 400, not a Postgres 500).
- weight ∈ [0, 1]; for name='canary', weight ≤ 0.5.
- DELETE /channels/stable is refused: stable is the implicit fall-through; clearing it would leave the agent unreachable.

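The validation rules above can be sketched as two pure functions. This is a hedged illustration rather than the route code: the function names and error strings are hypothetical, and the 400 status for the refused stable DELETE is an assumption (the page only says the request is refused).

```typescript
type ValidationResult =
  | { ok: true }
  | { ok: false; status: 400; error: string };

const CHANNEL_NAMES = new Set<string>(["stable", "canary"]);
const MAX_CANARY_WEIGHT = 0.5;

// Mirrors the rules for PUT /channels/:name.
function validateChannelPut(name: string, weight: number): ValidationResult {
  if (!CHANNEL_NAMES.has(name)) {
    // Rejecting here keeps a typo from reaching the DB CHECK and surfacing as a 500.
    return { ok: false, status: 400, error: "invalid_channel_name" };
  }
  if (!Number.isFinite(weight) || weight < 0 || weight > 1) {
    return { ok: false, status: 400, error: "invalid_weight" };
  }
  if (name === "canary" && weight > MAX_CANARY_WEIGHT) {
    return { ok: false, status: 400, error: "canary_weight_too_large" };
  }
  return { ok: true };
}

// Mirrors the rules for DELETE /channels/:name.
function validateChannelDelete(name: string): ValidationResult {
  if (!CHANNEL_NAMES.has(name)) {
    return { ok: false, status: 400, error: "invalid_channel_name" };
  }
  if (name === "stable") {
    // Status code is an assumption; clearing stable would leave the agent unreachable.
    return { ok: false, status: 400, error: "stable_not_clearable" };
  }
  return { ok: true };
}
```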
PUT /channels/:name accepts either versionLabel (preferred) or
deploymentId in the body and resolves to a deployment_id
server-side. Each mutation calls invalidateRoutingCacheFor and fires a
lifecycle webhook (see Webhooks).
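A minimal sketch of the cache-plus-invalidation pattern that the mutation routes rely on, assuming an in-process Map with a 5s TTL keyed by (org, agent) as described under Routing. The RoutingCache class and its method names are hypothetical; invalidate plays the role that invalidateRoutingCacheFor plays in the real router, and the injectable clock exists only to make the sketch testable.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

const TTL_MS = 5_000;

class RoutingCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  private keyFor(org: string, agent: string): string {
    return JSON.stringify([org, agent]);
  }

  get(org: string, agent: string): T | undefined {
    const key = this.keyFor(org, agent);
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: force a fresh read
      return undefined;
    }
    return hit.value;
  }

  set(org: string, agent: string, value: T): void {
    this.entries.set(this.keyFor(org, agent), { value, expiresAt: this.now() + TTL_MS });
  }

  // Called after every channel mutation so the local replica sees it immediately.
  invalidate(org: string, agent: string): void {
    this.entries.delete(this.keyFor(org, agent));
  }
}
```

Invalidation only affects the replica that served the mutation; other replicas still wait out the TTL, which is the eventual-consistency window the page describes.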
POST /rollback accepts { to?: string }: omit to for "one step back
by created_at", or supply a label to target an arbitrary earlier version. The stable upsert and canary
clear run in a single transaction. The route refuses with 404
no_rollback_target if there's no previous version (or the named
label doesn't exist).
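Target resolution for a rollback might look like the sketch below. It is an illustration under assumptions: the VersionRow shape and resolveRollbackTarget name are hypothetical, and "one step back by created_at" is read here as "the newest version created before the one stable currently points at", which the page does not spell out.

```typescript
interface VersionRow {
  label: string;
  deploymentId: string;
  createdAt: Date;
}

type RollbackResolution =
  | { ok: true; target: VersionRow }
  | { ok: false; status: 404; error: "no_rollback_target" };

function resolveRollbackTarget(
  versions: VersionRow[],
  stableDeploymentId: string,
  to?: string,
): RollbackResolution {
  const notFound = { ok: false, status: 404, error: "no_rollback_target" } as const;

  // Explicit label: look it up directly.
  if (to !== undefined) {
    const named = versions.find((v) => v.label === to);
    return named ? { ok: true, target: named } : notFound;
  }

  // Default: one step back from whatever stable points at today.
  const current = versions.find((v) => v.deploymentId === stableDeploymentId);
  if (!current) return notFound;
  const older = versions
    .filter((v) => v.createdAt.getTime() < current.createdAt.getTime())
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime());
  const previous = older[0];
  return previous ? { ok: true, target: previous } : notFound;
}
```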
Auto-mint on deploy
POST /v1/orgs/:slug/agents (the upload route) calls
mintAgentVersionOnDeploy after the deployments INSERT. Sequential
v<N> per agent; manual labels (v3.5-canary) are skipped when
computing the next ordinal so the auto sequence stays predictable. A
UNIQUE collision retries once on the next ordinal; a persistent
collision logs and the deploy still succeeds — versions are metadata,
the deployment row is the source of truth for "is the agent live?".
The mint fires agent_version.deployed; it does not touch
agent_channels.
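The ordinal rule can be sketched as below. This is not the body of mintAgentVersionOnDeploy: the helper names are hypothetical, the tryInsert callback stands in for an INSERT that returns false on a UNIQUE collision, and "next ordinal" is assumed to mean one more than the highest existing auto label.

```typescript
// Only labels of the exact form v<N> count towards the auto sequence.
const AUTO_LABEL = /^v(\d+)$/;

function nextAutoLabel(existingLabels: string[]): string {
  let max = 0;
  for (const label of existingLabels) {
    const match = AUTO_LABEL.exec(label);
    if (match === null) continue; // manual labels such as "v3.5-canary" are skipped
    max = Math.max(max, Number(match[1]));
  }
  return `v${max + 1}`;
}

// Try the next ordinal, retry once on the following ordinal, then give up.
// A null result means "log and carry on": the deploy itself still succeeds.
function mintAutoLabel(
  existingLabels: string[],
  tryInsert: (label: string) => boolean,
): string | null {
  const first = nextAutoLabel(existingLabels);
  if (tryInsert(first)) return first;
  const second = `v${Number(first.slice(1)) + 1}`;
  if (tryInsert(second)) return second;
  return null;
}
```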
CLI
Three top-level command families, defined in
cli/src/runtime/versioning.ts.
Each mutating command re-fetches channels post-mutation and prints the
stable: v3 (90%) · canary: v4 (10%) one-liner so the operator sees
the resulting state without a follow-up stech version list.
stech version
    stech version list <agent> [--json]
        Show every version of <agent>; marks active channels.
    stech version label <agent> <deployment-id> <label>
        Overlay a memorable label (e.g. 'v3.5-canary') on a
        deployment. Replaces the auto-assigned 'v<N>' label.

stech version list prints a table with columns label, deployment
(short id), status (live / failed / superseded / destroyed),
channel (which channel(s) point here), created. The active-channel
header line above the table mirrors the post-mutation confirmation.
stech canary
    stech canary set <agent> <version> --weight=<N>
        Point canary at <version> with N% of new traffic.
        <version> may be a label (v3) or a deployment id.
        --weight is 1..50 (canary stays small; use 'promote' to swap).
    stech canary promote <agent>
        Atomic-ish: canary becomes stable; canary channel is cleared.
    stech canary remove <agent>
        Clear the canary channel; all NEW conversations route to stable.

--weight is integer percent (1..50) in the CLI for ergonomics;
the CLI converts to a 0..1 fraction before calling the API. Values
above 50 are refused: that's not a canary, that's a swap-in-disguise; use
canary promote instead.
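The percent-to-fraction conversion can be sketched as a small parser. The function name and error messages are hypothetical; the 1..50 integer range and the 0..1 output come from the text above.

```typescript
// Convert the CLI's integer percent (1..50) into the 0..1 fraction the API expects.
function canaryWeightFromPercent(raw: string): number {
  if (!/^\d+$/.test(raw)) {
    throw new Error(`--weight must be an integer percent, got "${raw}"`);
  }
  const percent = Number(raw);
  if (percent < 1 || percent > 50) {
    throw new Error(`--weight must be between 1 and 50, got ${percent}`);
  }
  return percent / 100;
}
```

Rejecting out-of-range values in the CLI gives the operator a readable error before any request is sent; the API still enforces its own limit at the route layer.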
canary promote composes two existing primitives client-side: GET
channels → PUT /channels/stable with the canary's deployment at
weight 1.0 → DELETE /channels/canary. The PUT runs first so a
failure between the two steps leaves stable correctly pointing at the
formerly-canary deployment (the worse failure mode would be
canary-cleared-but-stable-still-old).
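The ordering argument is easier to see in code. The sketch below assumes a hypothetical ChannelsApi client interface (the method names are illustrative, not the CLI's internals) and shows only the sequence: read channels, PUT stable, then DELETE canary.

```typescript
interface ChannelState {
  name: "stable" | "canary";
  deploymentId: string | null;
  weight: number;
}

// Hypothetical client; each method maps to one of the routes in the API table.
interface ChannelsApi {
  listChannels(): Promise<ChannelState[]>;
  putChannel(name: "stable" | "canary", body: { deploymentId: string; weight: number }): Promise<void>;
  deleteChannel(name: "canary"): Promise<void>;
}

async function promoteCanary(api: ChannelsApi): Promise<string> {
  const channels = await api.listChannels();
  const canary = channels.find((c) => c.name === "canary");
  if (!canary || canary.deploymentId === null) {
    throw new Error("no active canary to promote");
  }
  const deploymentId = canary.deploymentId;

  // Step 1: stable takes the canary's deployment at full weight.
  await api.putChannel("stable", { deploymentId, weight: 1 });
  // Step 2: clear canary. If this throws, stable already points at the new version.
  await api.deleteChannel("canary");
  return deploymentId;
}
```

Because the two calls are separate requests, the operation is not transactional; the ordering only guarantees that a partial failure leaves stable on the promoted deployment.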
stech rollback
    stech rollback <agent>
        Revert stable to the previous version (one step back).
    stech rollback <agent> --to=<version>
        Revert stable to an arbitrary previous version (label).

The api runs the rollback in one transaction (stable upsert + canary clear). Existing conversations pinned to the previously-stable version finish on it per the sticky-by-conversation invariant; only NEW conversations route to the rolled-back version.
Dashboard
/agents/[id]/versions — newest-first list of versions for an agent.
Source files in
app/app/agents/[id]/versions/.
Per-row columns:
- Label (v3, v4, etc.) and created at / created by.
- Channel badges: inline stable · 100% / canary · 10% on the row that channel currently points at, with weight tags.
- Deployment-status pill: live / failed / superseded / destroyed. An operator can see at a glance whether a version's machine is still around to roll back to.
- Runs cross-link: to /agents/<deployment-id>/runs so error rates can be compared across versions before promoting (the runs page is keyed on deployment id, so each version is its own runs window).
- Per-row actions: promote to canary, promote to stable, rollback to this. Gated by actionGateFor so each version sees only the actions that make sense for its current role.

Confirm dialog on every blast-radius action — the dialog copy repeats the sticky-by-conversation invariant ("rollback does NOT yank in-flight runs"; "pinned-to-canary conversations finish on their channel") exactly when the operator is about to commit. No silent commits.
Channel-state banner (stable: v3 (90%) · canary: v4 (10%)) on
the chat page (/agents/[id]), Suspense-wrapped so a slow / 5xx
versions endpoint doesn't block the chat shell.
Auth gate (defense in depth). Client-side: callerCanMutate = ELEVATED_ROLES.has(callerRole) reads the membership role from the
session DTO; non-mutating callers see "owner / admin only" instead of
buttons. Server-side: every mutation in
api/src/routes/agent-versions.ts
is gated to ROLES_ELEVATED, so a tampered client gets 403 from the
proxy. The dashboard gate is purely a UX courtesy.
Webhooks
Four event types fire from the channel mutation seam. Same envelope
shape as every other agent_run.* / deployment.* event in
webhooks.md. Fire-and-forget — a webhook delivery
failure cannot fault the upstream channel mutation.
| Event type | When it fires | Notes |
|---|---|---|
| agent_version.deployed | upload-time, after mintAgentVersionOnDeploy | pair with deployment.created for live-machine reachability |
| agent_version.promoted_to_canary | PUT /channels/canary | first-attach has previousDeploymentId: null |
| agent_version.promoted_to_stable | PUT /channels/stable | green-metrics canary graduation |
| agent_version.rolled_back | POST /rollback | distinct from promoted_to_stable so SIEM tooling can branch ("production was reverted" vs "canary graduated") |

DELETE /channels/canary deliberately does not emit an event —
clearing isn't a promotion.
Payload shape (agent_version.promoted_to_canary / _to_stable):
{
"id": "1a7f3e22-8c91-4a02-b7d4-3e1f8c0a2b54",
"type": "agent_version.promoted_to_canary",
"createdAt": "2026-05-09T08:25:42.310Z",
"organizationId": "org_2t4b...",
"data": {
"agentName": "support-triage",
"channelName": "canary",
"versionLabel": "v4",
"deploymentId": "dep_4kp...",
"trafficWeight": 0.1,
"actorUserId": "usr_9q1...",
"previousDeploymentId": null
}
}

previousDeploymentId is captured before the upsert so receivers
see the real diff, not the post-state. The full set of example bodies
(plus the agent_version.deployed and agent_version.rolled_back
shapes) lives in webhooks.md.
A subscriber can wire CI/CD: auto-promote on green metrics, post a
Slack message on rollback, pipe rollback events to PagerDuty as a
"production was reverted" signal. Subscribe with
["agent_version.deployed", "agent_version.promoted_to_canary", "agent_version.promoted_to_stable", "agent_version.rolled_back"] or
the wildcard ["*"].
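A subscriber that branches on event type might be sketched as follows. Everything here is illustrative: the WebhookEnvelope interface covers only the three fields the sketch reads, the Action names are hypothetical, and signature verification is left out because the signing scheme is documented in webhooks.md rather than on this page.

```typescript
interface WebhookEnvelope {
  id: string;
  type: string;
  data: Record<string, unknown>;
}

type Action = "page-oncall" | "notify-chat" | "record-only";

// Parse a raw request body and check the fields this sketch depends on.
function parseEnvelope(body: string): WebhookEnvelope {
  const parsed: unknown = JSON.parse(body);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("envelope must be a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const id = record["id"];
  const type = record["type"];
  const data = record["data"];
  if (typeof id !== "string" || typeof type !== "string") {
    throw new Error("envelope is missing id or type");
  }
  if (typeof data !== "object" || data === null) {
    throw new Error("envelope is missing data");
  }
  return { id, type, data: data as Record<string, unknown> };
}

// "Production was reverted" and "canary graduated" arrive as different types,
// so the receiver can route them differently.
function actionFor(type: string): Action {
  switch (type) {
    case "agent_version.rolled_back":
      return "page-oncall";
    case "agent_version.promoted_to_stable":
    case "agent_version.promoted_to_canary":
      return "notify-chat";
    default:
      return "record-only";
  }
}
```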
Operator runbook
Sticky-by-conversation in plain terms
A conversation that started on stable keeps hitting stable for the
rest of its life — even after you canary-promote, even after you
rollback. The same is true for canary: a conversation pinned to v4
finishes on v4 even after you canary remove or canary promote,
until the conversation ends or the deployment's Fly machine is
destroyed.
This is desirable, not a bug. Mid-conversation state shouldn't get destroyed by an admin's rollback. The dashboard's rollback confirm dialog calls this out: "N active conversations on the previous version will continue there until they end."
If you need to forcibly drain a version, destroy its deployment (the
Fly machine goes away; the next agent_messages write fails; the
client surfaces an error). There's no soft-drain primitive in v1.
When to use canary vs direct stable
- Canary: when you want to validate a new version against real traffic before committing. Set canary at 10% (the default), watch the per-version metrics strip on the dashboard for a few hours, then canary promote if green or canary remove if not. Sticky-by-conv means a small slice of users sees the new behaviour consistently.
- Direct stable (PUT /channels/stable or "promote to stable" on the dashboard): when you've already validated the version externally (e.g. extensive stech dev testing) and the canary step is overkill. Skips the validation window; all NEW conversations route to the new version immediately.
- Rollback: when something is on fire. One step back is stech rollback <agent>; arbitrary is --to=v2. Faster than a re-deploy because the Fly machine is already running.

The "deploy a new version while canary is at 10%" case
A fresh stech deploy always lands as a new version, never
auto-attached to a channel, so the canary keeps pointing at its current
version. To shift traffic to the new version, run
stech canary set <agent> <new-label> --weight=10 explicitly; the previous
canary version then becomes a non-pointed version (kept per the retention
policy below). This makes deploys safe and channel changes
deliberate.
Retention
Per the epic, v1 keeps all versions with traffic + the last 10 idle
versions per agent. The cap is hardcoded today; a future
org_settings.version_retention_idle knob will make it per-org
configurable. When a version ages out, its Fly machine is destroyed
via the same createFlyDestroyFn path the supersede-and-destroy flow
uses today.
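The retention rule can be sketched as a pure selection function. This is a reading of the rule under assumptions, not the retention job itself: "with traffic" is taken to mean "currently pointed at by a channel", "last 10" is taken to mean the 10 newest idle versions by created_at, and the names below are hypothetical.

```typescript
interface VersionRow {
  label: string;
  deploymentId: string;
  createdAt: Date;
}

const IDLE_RETENTION_CAP = 10; // hardcoded in v1

// Returns the versions whose machines would be destroyed: idle versions
// beyond the newest `idleCap`. Versions in `activeDeploymentIds` are never returned.
function versionsToRetire(
  versions: VersionRow[],
  activeDeploymentIds: Set<string>,
  idleCap: number = IDLE_RETENTION_CAP,
): VersionRow[] {
  const idleNewestFirst = versions
    .filter((v) => !activeDeploymentIds.has(v.deploymentId))
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime());
  return idleNewestFirst.slice(idleCap);
}
```

With 13 versions of which two carry traffic, 11 are idle, so exactly one (the oldest idle version) ages out.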
Diagnosing a stuck conversation
If a user reports they're seeing the old behaviour after a rollback:
- Confirm their conversationId from the run history (/agents/[id]/runs).
- The router pinned them to whichever channel was active when the conversation started. They will continue there until the conversation ends.
- Have them start a new conversation; the new bucket pick will land on the rolled-back version per the current weights.

If a webhook receiver claims it didn't see a promotion:
- Check the event subscription includes the right agent_version.* types (or ["*"]).
- Check the webhook_deliveries table for the event_id (see webhooks.md).
- DELETE /channels/canary does NOT emit. If you're listening for "canary cleared", listen for agent_version.promoted_to_stable (which clears canary as part of canary promote) or agent_version.rolled_back (which clears canary as part of the rollback transaction) instead.

Limitations
- No auto-rollback on metric breach. A "canary error rate >5% for 10 min → revert" trigger is filed as a future epic; it needs a metrics-driven control loop, which is observability + alerts territory.
- No A/B testing within one version. Running two configs concurrently against the same conversation, measuring, and picking a winner is a separate epic.
- No rolling deploys. Gradually shifting weight automatically over a window (e.g. 0% → 10% → 50% → 100% over an hour) is filed separately. Today: operator sets weight, watches, sets again.
- No multi-region routing. Canary in one region while stable serves the others is filed separately.
- No per-customer routing. "Customer X always lands on canary" is filed separately.
- No weight slider on the dashboard. First-set lands at the locked 10% default; rebalancing requires a CLI roundtrip (stech canary set <agent> <label> --weight=20) today.
- Cache eventual-consistency. The 5s TTL on the per-(org, agent) channels cache means a cross-replica view of a fresh canary set can lag by up to 5s. Documented; same posture as the rate-limiter and oauth-state-store caches.
- Fly machine count grows. Each version = a machine (or N machines per channel). Cost scales with retention. The default cap of 10 idle versions plus all-versions-with-traffic is the v1 ceiling.
Cross-references
- agent-runs.md: the run-history surface. deploymentId filters on the runs page key the per-version metrics cross-link.
- observability.md: the per-deployment metrics the dashboard's runs cross-link points at; useful for comparing error rates across versions before promoting.
- webhooks.md: envelope, signing scheme, retry policy. The agent_version.* events use the same delivery path as every other event.
- billing-and-usage.md: caps apply per-org regardless of version; a runaway canary still counts against the org's cap_runs / cap_input_tokens.
- policy-and-guardrails.md: guardrails are per-deployment (declared on defineAgent()), so a new version with relaxed guardrails carries its own policy. A rollback to a stricter guardrail set takes effect for the next conversation that lands on the rolled-back version.