Epic 14 — Admin · System Health & Diagnostics
Covers the service map (Postgres, Valkey, Temporal, GitHub MCP, OTel Collector), `/healthz` summaries, the Temporal task-queue backlog monitor, degraded-state banners for OTel/Tempo/Prometheus, Grafana/Tempo deep-links, and the Operator panic-button for force-terminating stalled workflows.
All stories require the OP role.
Personas: OP (exclusive)
Shared modules: `CorrelationChip`, `LastSyncedBadge`, `EnvProvenance`
Story 14.1 — Service Map Overview
- As an OP
- I want to see a real-time service map showing the health of each AMTP dependency
- So that I can identify which component is causing a system degradation at a glance
Scenario: All services healthy — service map renders green
Given the Operator navigates to `/admin/health`
When `GET /healthz` resolves with all components healthy
Then the service map renders nodes for Postgres, Valkey, Temporal (cluster + workers), GitHub MCP (`amtp-github-mcp`), and OTel Collector; each node shows a green "Healthy" badge with a last-checked timestamp; `LastSyncedBadge` reads "Synced at: <time> · Snapshot"
Scenario: A component is degraded — amber badge
Given `GET /healthz` returns Valkey with `status: "degraded"`
Then the Valkey node shows an amber "Degraded" badge with a description of the degraded condition; a "View metrics" link opens the Grafana Valkey panel
Scenario: A component is unreachable — red badge
Given `GET /healthz` returns Postgres with `status: "unreachable"`
Then the Postgres node shows a red "Unreachable" badge; `CorrelationChip` shows the `request_id` for the health poll
| Endpoint / DB | Purpose |
|---|---|
| `GET /healthz` | Full health summary for all components |
| `GET /config` | Grafana base URL for deep-links |
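A minimal sketch of the health poll these scenarios imply. The payload shape is an assumption: the spec fixes only the component set, the three status values, and that a `request_id` and last-checked time are surfaced; field names such as `checked_at` and `components` are illustrative.

```typescript
// Illustrative response shape for GET /healthz; field names are assumed,
// not confirmed by the spec.
type ComponentStatus = "healthy" | "degraded" | "unreachable";

type ComponentName =
  | "postgres"
  | "valkey"
  | "temporal"
  | "github_mcp"
  | "otel_collector";

interface HealthzResponse {
  request_id: string; // shown via CorrelationChip when a poll reports failures
  checked_at: string; // ISO timestamp behind LastSyncedBadge / node timestamps
  components: Record<ComponentName, { status: ComponentStatus; detail?: string }>;
}

// Status-to-badge mapping pinned down by the three scenarios above.
const BADGE: Record<ComponentStatus, "green" | "amber" | "red"> = {
  healthy: "green",
  degraded: "amber",
  unreachable: "red",
};

async function pollHealth(): Promise<HealthzResponse> {
  const res = await fetch("/healthz");
  if (!res.ok) throw new Error(`healthz poll failed: ${res.status}`);
  return (await res.json()) as HealthzResponse;
}
```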
Story 14.2 — Temporal Task-Queue Backlog Monitor
- As an OP
- I want to see the queue depth and schedule-to-start P95 latency for the `amtp-activities` Temporal task queue
- So that I can distinguish "workers are online" from "the system is actually making progress"
Scenario: Task queue healthy — Temporal node renders green
Given the report shows workers reachable, queue depth ≤ 50, and schedule-to-start P95 ≤ 30 s
Then the Temporal node is green with a "Healthy" badge; the queue depth and P95 values are shown as sub-metrics beneath the node
Scenario: High task latency — amber badge
Given workers are reachable but queue depth > 50 OR schedule-to-start P95 > 30 s
Then an amber "High task latency" badge is shown; copy reads: "Workers are responsive but the task backlog is growing. Current depth: <N>, P95 schedule-to-start: <value> s."; a "View Temporal metrics" link opens the Grafana Temporal panel
Scenario: Workers down — red badge, distinct copy
Given Temporal workers are reported as unreachable
Then a red "Workers unavailable" badge is shown with copy: "Temporal workers are not reachable. No activities will be processed." (distinct from the "high latency" case)
Scenario: Thresholds are configurable
Given the runtime config endpoint returns a custom depth threshold (e.g. 100)
Then the amber warning fires at the configured threshold, not the hardcoded default of 50
| Endpoint / DB | Purpose |
|---|---|
| `GET /healthz` | Includes Temporal task-queue metrics |
| `GET /ops/temporal-queue` | Dedicated queue-depth and P95 endpoint |
| `GET /config` | Configurable queue-depth and P95 thresholds |
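A sketch of the badge classification these four scenarios pin down. The defaults (depth 50, P95 30 s) come from the scenarios; the shape of the `GET /config` override is an assumption, as are the field names:

```typescript
// Thresholds default to the scenario values; GET /config may override them
// (field names here are illustrative).
interface QueueThresholds {
  queueDepth: number; // default 50
  p95Seconds: number; // default 30
}

interface QueueReport {
  workersReachable: boolean;
  queueDepth: number;
  scheduleToStartP95Seconds: number;
}

type TemporalBadge = "healthy" | "high-task-latency" | "workers-unavailable";

function classifyQueue(
  report: QueueReport,
  t: QueueThresholds = { queueDepth: 50, p95Seconds: 30 },
): TemporalBadge {
  // Red wins: unreachable workers mean no activities are processed at all,
  // which gets distinct copy from the "high latency" case.
  if (!report.workersReachable) return "workers-unavailable";
  // Amber when either backlog signal crosses its configured threshold.
  if (
    report.queueDepth > t.queueDepth ||
    report.scheduleToStartP95Seconds > t.p95Seconds
  ) {
    return "high-task-latency";
  }
  return "healthy";
}
```

The OR in the amber branch mirrors the scenario text: either signal alone is enough to warn, and the red state short-circuits so "workers down" is never reported as mere latency.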
Story 14.3 — Operator Panic-Button (Force Terminate Stalled Workflow)
- As an OP
- I want a confirmed, audited action to force-terminate a stalled workflow via the API Gateway
- So that I can unblock the system without bypassing AMTP auth or accessing Temporal directly
Scenario: Panic-button visible only above Critical P95 threshold
Given the schedule-to-start P95 exceeds the configurable Critical threshold (default 5 min)
Then an OP-only "Force terminate stalled workflow" action appears next to the "High task latency" badge; a separate read-only "Open Temporal Web UI" link is also provided for inspection
Scenario: Panic-button requires 2-step confirmation modal
Given the Operator clicks "Force terminate stalled workflow"
When the confirmation modal opens
Then the modal displays the target `run_id`, the queue depth at decision time, the schedule-to-start P95 at decision time, and `CorrelationChip`; a "Confirm terminate" button and a "Cancel" button are rendered
Scenario: Termination issued via API Gateway — not direct Temporal API
Given the Operator has confirmed the termination modal
Then `POST /ops/runs/{run_id}/terminate` is issued to the AMTP API Gateway (which proxies to Temporal `WorkflowTerminate` server-side); an Epic 15 audit-log entry is written with `run_id`, `queue_depth`, `p95_at_decision`, and the Operator's identity; no direct browser call to Temporal's Terminate API is ever made
Scenario: Panic-button also surfaces on Activity Heartbeat Timeout stage card (Epic 4 cross-link)
Given a stage has entered the Activity Heartbeat Timeout state (Story 4.5), the P95 is above the Critical threshold, and the viewing user has the OP role
Then the "Force terminate stalled workflow" button appears on the offending stage card; clicking it opens the same 2-step confirmation modal
| Endpoint / DB | Purpose |
|---|---|
| `POST /ops/runs/{run_id}/terminate` | AMTP API Gateway proxied termination (OP scope required) |
| `GET /ops/temporal-queue` | P95 value at decision time (for modal display) |
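A sketch of the gateway-proxied call, assuming a JSON body that carries the audited decision-time metrics. Only the path and the audited field names (`run_id`, `queue_depth`, `p95_at_decision`) are fixed by the spec; the body shape, headers, and error handling are assumptions, and nothing here talks to Temporal directly:

```typescript
// The spec fixes the endpoint path and audited fields; the request body
// shape and response handling below are assumptions.
interface TerminateRequest {
  run_id: string;
  queue_depth: number;     // captured at decision time, as shown in the modal
  p95_at_decision: number; // schedule-to-start P95 (seconds) at decision time
}

async function forceTerminate(req: TerminateRequest): Promise<void> {
  // Issued to the AMTP API Gateway, never to Temporal's Terminate API from
  // the browser; the gateway performs the server-side termination and writes
  // the Epic 15 audit-log entry with the Operator's identity.
  const res = await fetch(
    `/ops/runs/${encodeURIComponent(req.run_id)}/terminate`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(req),
    },
  );
  if (!res.ok) throw new Error(`terminate failed: ${res.status}`);
}
```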
Story 14.4 — OTel / Tempo / Prometheus Degraded States
- As an OP
- I want the service map to surface degraded observability-stack states with specific copy
- So that I know whether Grafana deep-links and trace data are trustworthy before acting on them
Scenario: Prometheus unreachable — metric deep-links show stale-data warning
Given `GET /healthz` returns Prometheus with `status: "unreachable"`
Then Grafana metric deep-links are still shown but carry a tooltip: "Metrics source unavailable — data may be stale"; the Prometheus node shows a red "Unreachable" badge
Scenario: Tempo unreachable — trace links show warning
Given `GET /healthz` returns Tempo with `status: "unreachable"`
Then "View trace" links are shown but prefixed with a "Tempo unavailable" tooltip warning
Scenario: OTel Collector unreachable — global banner
Given `GET /healthz` returns OTel Collector with `status: "unreachable"`
Then a global banner reads: "Telemetry pipeline is down. Trace and metric data is not being collected. Investigate the OTel Collector service."; local-dev hints (`docker compose ps otel-collector`) are shown if `EnvProvenance` resolves to development; production remediation hints are shown for production (distinct copy per the Dev-Prod Parity NFR)
| Endpoint / DB | Purpose |
|---|---|
| `GET /healthz` | Includes `otel_collector`, `tempo`, `prometheus` component statuses |
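A sketch of the per-environment hint selection behind the OTel Collector banner, assuming `EnvProvenance` resolves to a plain environment string. The development hint is taken from the scenario; the production copy below is a placeholder, since the spec only requires that the two environments get distinct copy (Dev-Prod Parity NFR):

```typescript
// Development hint is from the scenario; production copy is a placeholder,
// as the spec fixes only that it must differ from the dev hint.
type Environment = "development" | "production";

function otelCollectorHint(env: Environment): string {
  if (env === "development") {
    return "Local dev: check the collector with `docker compose ps otel-collector`.";
  }
  return "Production: investigate the OTel Collector service deployment and its logs.";
}
```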