Epic 9 — In-App Observability

Covers per-run token usage, per-stage latency, the MCP cache-hit ratio (the Valkey amtp:mcp:tree:* and amtp:mcp:blob:* keyspaces), and deep-links from the run timeline into Grafana/Tempo panels. This is a read-only surface; no mutations occur.

Personas: BU (consumer), OP (consumer + Grafana access)

Shared modules: CorrelationChip, LastSyncedBadge, EnvProvenance

Story 9.1 — Per-Run Token Usage

As a BU,
I want to see the total LLM token consumption for a run, broken down by stage,
So that I can understand the cost profile of each pipeline execution.
Scenario: Token usage panel renders on run detail
Given the user is viewing /runs/{run_id} for a completed or in-progress run
When GET /runs/{run_id}/metrics resolves
Then a Token Usage panel is rendered (collapsed by default, behind a "Show token usage" toggle) showing: total tokens (prompt + completion) and a breakdown table with one row per stage: stage name, prompt tokens, completion tokens, total tokens
Scenario: Token data unavailable for running stage
Given a stage's status is still running
Then the row for that stage shows a placeholder in place of token counts and a note reads "Stage in progress — token usage will appear when complete"
| Endpoint / DB | Purpose |
| --- | --- |
| GET /runs/{run_id}/metrics | Token usage, latency, and cache metrics |
| DB stages.status | Guard for in-progress rows |
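The panel's roll-up can be sketched as below. The response shape (a `stages` array with nullable token counters while a stage is running) is an assumption for illustration, not a confirmed contract of GET /runs/{run_id}/metrics:

```typescript
// Assumed per-stage token fields from GET /runs/{run_id}/metrics.
// Counters are null while the stage is still running (Scenario 2).
interface StageTokenUsage {
  stage: string;
  status: "running" | "succeeded" | "failed";
  prompt_tokens: number | null;
  completion_tokens: number | null;
}

interface RunMetrics {
  stages: StageTokenUsage[];
}

// Total tokens (prompt + completion) across all stages,
// treating in-progress rows as contributing nothing yet.
function totalTokens(metrics: RunMetrics): number {
  return metrics.stages.reduce(
    (sum, s) => sum + (s.prompt_tokens ?? 0) + (s.completion_tokens ?? 0),
    0,
  );
}
```

A running stage therefore lowers nothing and hides nothing: its row renders a placeholder while the total reflects only completed stages.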

Story 9.2 — Per-Stage Latency

As a BU,
I want to see the elapsed time for each stage in a run,
So that I can identify slow stages and estimate future run durations.
Scenario: Latency breakdown renders
Given the run has at least one completed stage
When GET /runs/{run_id}/metrics resolves
Then a Latency panel displays one row per stage: stage name, elapsed time (finished_at - started_at), attempt count; the longest-elapsed stage is visually highlighted
Scenario: Stage still running — live elapsed counter
Given a stage has status = 'running'
Then the row shows a live elapsed counter incrementing every second; the final elapsed time replaces the live counter when the stage transitions to a terminal state
| Endpoint / DB | Purpose |
| --- | --- |
| GET /runs/{run_id}/metrics | Includes per-stage latency |
| DB stages.started_at, stages.finished_at, stages.attempt | Latency source |
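The two latency cases above can be sketched as one function: terminal stages use finished_at - started_at, a running stage uses the current time (which the UI re-evaluates once per second for the live counter). ISO-8601 timestamp strings are assumed; the `now` parameter is only there to make the sketch testable:

```typescript
// Elapsed milliseconds for a latency row. finished_at is null while the
// stage is still running, so the counter keeps advancing with `now`.
function elapsedMs(
  startedAt: string,
  finishedAt: string | null,
  now: () => number = Date.now,
): number {
  const end = finishedAt !== null ? Date.parse(finishedAt) : now();
  return end - Date.parse(startedAt);
}

// Index of the longest-elapsed stage, i.e. the row to highlight.
function slowestStage(
  rows: { started_at: string; finished_at: string | null }[],
  now: () => number = Date.now,
): number {
  let worst = -1;
  let worstMs = -Infinity;
  rows.forEach((r, i) => {
    const ms = elapsedMs(r.started_at, r.finished_at, now);
    if (ms > worstMs) {
      worstMs = ms;
      worst = i;
    }
  });
  return worst;
}
```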

Story 9.3 — MCP Cache-Hit Ratio

As a BU,
I want to see the Valkey MCP cache-hit ratio for each run,
So that I can understand how effectively the platform avoids redundant GitHub API calls.
Scenario: Cache-hit ratio panel renders
Given the user is viewing a run that used the GitHub MCP tool
When GET /runs/{run_id}/metrics resolves with an mcp_cache block
Then an MCP Cache panel shows: amtp:mcp:tree:* hits/misses/ratio, amtp:mcp:blob:* hits/misses/ratio, and the overall cache-hit percentage
Scenario: Cache metrics unavailable — no MCP calls in this run
Given the run's stages did not invoke the GitHub MCP tool
Then the MCP Cache panel shows "No MCP calls in this run" rather than zero-filled metrics
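The panel math is simple but the zero-call guard matters. A minimal sketch, assuming the mcp_cache block carries hit/miss counters per keyspace (the exact field names are illustrative):

```typescript
// Assumed shape of the mcp_cache block: one counter pair per keyspace,
// `tree` for amtp:mcp:tree:* and `blob` for amtp:mcp:blob:*.
interface KeyspaceStats {
  hits: number;
  misses: number;
}

interface McpCache {
  tree: KeyspaceStats;
  blob: KeyspaceStats;
}

// Hit ratio in [0, 1], or null when there were no lookups at all —
// null is what lets the UI render "No MCP calls in this run"
// instead of a misleading 0%.
function hitRatio(s: KeyspaceStats): number | null {
  const total = s.hits + s.misses;
  return total === 0 ? null : s.hits / total;
}

// Overall ratio across both keyspaces, with the same null guard.
function overallRatio(c: McpCache): number | null {
  return hitRatio({
    hits: c.tree.hits + c.blob.hits,
    misses: c.tree.misses + c.blob.misses,
  });
}
```

Returning null (rather than 0) from the empty case is the design choice that keeps Scenario 2 honest: zero lookups and zero hits are different facts.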

Story 9.4 — Grafana and Tempo Deep-Links

As an OP,
I want to open the relevant Grafana panel or Tempo trace for a run directly from the run detail page,
So that I can pivot from the AMTP UI to the observability stack without manually constructing URLs.
Scenario: Grafana run panel link renders for OP
Given the user has the OP role and GET /config provides the configured Grafana base URL
When the run detail page renders
Then an "Open in Grafana" link is shown in the observability section; the link opens the pre-scoped Grafana panel for run_id={run_id} in a new tab
Scenario: Tempo trace link for a specific stage
Given the user has the OP role and GET /runs/{run_id}/metrics provides a trace_id for a stage
Then a "View trace" link is shown beside the stage row, opening the Tempo trace URL scoped to that trace_id in a new tab
Scenario: Grafana links absent for BU and AU roles
Given the user has the BU or Auditor role
Then the Grafana and Tempo deep-links are absent from the DOM
| Endpoint / DB | Purpose |
| --- | --- |
| GET /config | Grafana base URL, Tempo base URL |
| GET /runs/{run_id}/metrics | Includes trace_id per stage |
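Link construction and role gating can be sketched as below. The Grafana dashboard path (`/d/amtp-run/run-detail`), the `var-run_id` template variable, and the Tempo `/trace/{trace_id}` route are hypothetical placeholders — only the base URLs come from GET /config, and each deployment would supply its own paths:

```typescript
// Hypothetical Grafana deep-link: base URL from GET /config, dashboard
// path and var-run_id variable name are illustrative assumptions.
function grafanaRunLink(grafanaBaseUrl: string, runId: string): string {
  const url = new URL("/d/amtp-run/run-detail", grafanaBaseUrl);
  url.searchParams.set("var-run_id", runId);
  return url.toString();
}

// Hypothetical Tempo trace link scoped to one stage's trace_id.
function tempoTraceLink(tempoBaseUrl: string, traceId: string): string {
  return new URL(`/trace/${encodeURIComponent(traceId)}`, tempoBaseUrl).toString();
}

// Scenario 3: only OP sees the deep-links; for BU and AU (Auditor)
// the links are not rendered at all, so they never reach the DOM.
function showObservabilityLinks(role: "BU" | "OP" | "AU"): boolean {
  return role === "OP";
}
```

Building the URLs with the URL API (rather than string concatenation) keeps run and trace IDs safely encoded, and gating at render time — not via CSS — satisfies the "absent from the DOM" requirement.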