Epic 14 — Admin · System Health & Diagnostics

Covers the service map (Postgres, Valkey, Temporal, GitHub MCP, OTel Collector), /healthz summaries, the Temporal task-queue backlog monitor, degraded-state banners for OTel/Tempo/Prometheus, Grafana/Tempo deep-links, and the Operator panic-button for force-terminating stalled workflows. All stories require the OP role.

Personas: OP (exclusive)

Shared modules: CorrelationChip · LastSyncedBadge · EnvProvenance

Story 14.1 — Service Map Overview

As an OP, I want to see a real-time service map showing the health of each AMTP dependency, so that I can identify which component is causing a system degradation at a glance.
Scenario: All services healthy — service map renders green
Given the Operator navigates to /admin/health
When GET /healthz resolves with all components healthy
Then the service map renders nodes for Postgres, Valkey, Temporal (cluster + workers), GitHub MCP (amtp-github-mcp), and OTel Collector; each node shows a green Healthy badge with a last-checked timestamp; LastSyncedBadge reads Synced at: <time> · Snapshot
Scenario: A component is degraded — amber badge
Given GET /healthz returns Valkey with status: "degraded"
Then the Valkey node shows an amber Degraded badge with a description of the degraded condition; a View metrics link opens the Grafana Valkey panel
Scenario: A component is unreachable — red badge
Given GET /healthz returns Postgres with status: "unreachable"
Then the Postgres node shows a red Unreachable badge; CorrelationChip shows the request_id for the health poll
| Endpoint / DB | Purpose |
| --- | --- |
| GET /healthz | Full health summary for all components |
| GET /config | Grafana base URL for deep-links |
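
A minimal sketch of the /healthz contract the service map consumes, in TypeScript. The field names (components, checked_at, request_id) and the status vocabulary are assumptions inferred from the scenarios above, not a confirmed schema:

```ts
// Assumed /healthz response shape; illustrative only, not the confirmed contract.
type ComponentStatus = "healthy" | "degraded" | "unreachable";

interface HealthComponent {
  name: string;       // e.g. "postgres", "valkey", "temporal", "github_mcp", "otel_collector"
  status: ComponentStatus;
  detail?: string;    // human-readable description of a degraded condition
  checked_at: string; // ISO-8601 timestamp of the last health poll
}

interface HealthzResponse {
  request_id: string; // surfaced via CorrelationChip
  components: HealthComponent[];
}

// Map a component status to the badge colour the service map renders.
function badgeFor(status: ComponentStatus): "green" | "amber" | "red" {
  switch (status) {
    case "healthy":     return "green";
    case "degraded":    return "amber";
    case "unreachable": return "red";
  }
}
```

The three statuses are mutually exclusive, matching the green/amber/red scenarios above.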

Story 14.2 — Temporal Task-Queue Backlog Monitor

As an OP, I want to see the queue depth and schedule-to-start P95 latency for the amtp-activities Temporal task queue, so that I can distinguish "workers are online" from "the system is actually making progress".
Scenario: Task queue healthy — Temporal node renders green
Given the report shows workers reachable, queue depth ≤ 50, and schedule-to-start P95 ≤ 30 s
Then the Temporal node is green with a Healthy badge; the queue depth and P95 values are shown as sub-metrics beneath the node
Scenario: High task latency — amber badge
Given workers are reachable but queue depth > 50 OR schedule-to-start P95 > 30 s
Then an amber High task latency badge is shown; copy reads: Workers are responsive but the task backlog is growing. Current depth: <N>, P95 schedule-to-start: <value> s.; a View Temporal metrics link opens the Grafana Temporal panel
Scenario: Workers down — red badge, distinct copy
Given Temporal workers are reported as unreachable
Then a red Workers unavailable badge is shown with copy: Temporal workers are not reachable. No activities will be processed. (distinct from the "high latency" case)
Scenario: Thresholds are configurable
Given the runtime config endpoint returns a custom depth threshold (e.g. 100)
Then the amber warning fires at the configured threshold, not the hardcoded default of 50
| Endpoint / DB | Purpose |
| --- | --- |
| GET /healthz | Includes Temporal task-queue metrics |
| GET /ops/temporal-queue | Dedicated queue-depth and P95 endpoint |
| GET /config | Configurable queue-depth and P95 thresholds |
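
The badge logic in the scenarios above could look roughly like the following, with both thresholds read from GET /config rather than hardcoded. All field names here are illustrative assumptions:

```ts
// Assumed /config threshold fields; names are illustrative, not the confirmed contract.
interface QueueThresholds {
  queue_depth_warn: number;             // default 50
  schedule_to_start_p95_warn_s: number; // default 30
}

// Assumed shape of the task-queue report from /healthz or /ops/temporal-queue.
interface QueueReport {
  workers_reachable: boolean;
  queue_depth: number;
  schedule_to_start_p95_s: number;
}

type TemporalBadge =
  | { kind: "healthy" }
  | { kind: "high-task-latency"; depth: number; p95: number }
  | { kind: "workers-unavailable" };

function temporalBadge(report: QueueReport, t: QueueThresholds): TemporalBadge {
  // Red takes priority: no reachable workers means no progress at all.
  if (!report.workers_reachable) return { kind: "workers-unavailable" };
  // Amber: workers respond but the backlog is growing.
  if (
    report.queue_depth > t.queue_depth_warn ||
    report.schedule_to_start_p95_s > t.schedule_to_start_p95_warn_s
  ) {
    return {
      kind: "high-task-latency",
      depth: report.queue_depth,
      p95: report.schedule_to_start_p95_s,
    };
  }
  return { kind: "healthy" };
}
```

Checking worker reachability before the depth/P95 comparison keeps the "workers down" and "high latency" cases distinct, as the scenarios require.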

Story 14.3 — Operator Panic-Button (Force Terminate Stalled Workflow)

As an OP, I want a confirmed, audited action to force-terminate a stalled workflow via the API Gateway, so that I can unblock the system without bypassing AMTP auth or accessing Temporal directly.
Scenario: Panic-button visible only above Critical P95 threshold
Given the schedule-to-start P95 exceeds the configurable Critical threshold (default 5 min)
Then an OP-only Force terminate stalled workflow action appears next to the High task latency badge; a separate read-only Open Temporal Web UI link is also provided for inspection
Scenario: Panic-button requires 2-step confirmation modal
Given the Operator clicks Force terminate stalled workflow
When the confirmation modal opens
Then the modal displays the target run_id, current queue depth at decision time, current schedule-to-start P95 at decision time, and CorrelationChip; a Confirm terminate button and a Cancel button are rendered
Scenario: Termination issued via API Gateway — not direct Temporal API
Given the Operator has confirmed the termination modal
Then POST /ops/runs/{run_id}/terminate is issued to the AMTP API Gateway (which proxies to Temporal WorkflowTerminate server-side); an Epic 15 audit-log entry is written with run_id, queue_depth, p95_at_decision, and the Operator's identity; no direct browser call to Temporal's Terminate API is ever made
Scenario: Panic-button also surfaces on Activity Heartbeat Timeout stage card (Epic 4 cross-link)
Given a stage has entered Activity Heartbeat Timeout state (Story 4.5)
And the P95 is above the Critical threshold
And the viewing user has the OP role
Then the Force terminate stalled workflow button appears on the offending stage card; clicking it opens the same 2-step confirmation modal
| Endpoint / DB | Purpose |
| --- | --- |
| POST /ops/runs/{run_id}/terminate | AMTP API Gateway proxied termination (OP scope required) |
| GET /ops/temporal-queue | P95 value at decision time (for modal display) |
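
A hedged sketch of the client call issued once the Operator confirms the modal. The request-body field names mirror the audit fields the scenario lists (queue_depth, p95_at_decision), but the actual wire format is an assumption:

```ts
// Issue the termination through the AMTP API Gateway, never directly to Temporal.
// Body field names mirror the Epic 15 audit entry; exact names are assumptions.
async function terminateStalledWorkflow(
  runId: string,
  decisionContext: { queue_depth: number; p95_at_decision: number },
): Promise<void> {
  const res = await fetch(`/ops/runs/${encodeURIComponent(runId)}/terminate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Operator identity comes from the authenticated session; the gateway
    // attaches it to the audit-log entry server-side.
    body: JSON.stringify(decisionContext),
  });
  if (!res.ok) throw new Error(`Terminate failed: ${res.status}`);
}
```

Sending the decision-time depth and P95 in the body lets the server record exactly what the Operator saw in the confirmation modal, rather than re-sampling after the fact.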

Story 14.4 — OTel / Tempo / Prometheus Degraded States

As an OP, I want the service map to surface degraded observability-stack states with specific copy, so that I know whether Grafana deep-links and trace data are trustworthy before acting on them.
Scenario: Prometheus unreachable — metric deep-links show stale-data warning
Given GET /healthz returns Prometheus with status: "unreachable"
Then Grafana metric deep-links are still shown but carry a tooltip: Metrics source unavailable — data may be stale; the Prometheus node shows a red Unreachable badge
Scenario: Tempo unreachable — trace links show warning
Given GET /healthz returns Tempo with status: "unreachable"
Then View trace links are shown but prefixed with a Tempo unavailable tooltip warning
Scenario: OTel Collector unreachable — global banner
Given GET /healthz returns OTel Collector with status: "unreachable"
Then a global banner reads: Telemetry pipeline is down. Trace and metric data is not being collected. Investigate the OTel Collector service.; local-dev hints (docker compose ps otel-collector) are shown if EnvProvenance resolves to development; production remediation hints are shown for production (distinct copy per the Dev-Prod Parity NFR)
| Endpoint / DB | Purpose |
| --- | --- |
| GET /healthz | Includes otel_collector, tempo, prometheus component statuses |
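
A sketch of how the degraded-observability copy might be selected. The banner text and the stale-metrics tooltip are quoted from the scenarios; the production remediation hint is an assumed placeholder, since the spec only requires that it differ from the dev copy:

```ts
// Environment resolved via EnvProvenance.
type Env = "development" | "production";

// Global banner for an unreachable OTel Collector, with environment-specific
// remediation hints per the Dev-Prod Parity NFR. Production hint text is assumed.
function otelCollectorBanner(env: Env): { message: string; hint: string } {
  const message =
    "Telemetry pipeline is down. Trace and metric data is not being collected. " +
    "Investigate the OTel Collector service.";
  const hint =
    env === "development"
      ? "Local dev: check the container with `docker compose ps otel-collector`."
      : "Production: check the OTel Collector deployment and its recent logs."; // assumed copy
  return { message, hint };
}

// Tooltip shown on Grafana metric deep-links while Prometheus is unreachable.
const STALE_METRICS_TOOLTIP = "Metrics source unavailable — data may be stale";
```

Keeping the deep-links visible but annotated, rather than hiding them, matches the scenarios' intent: the Operator can still navigate to Grafana while knowing the data behind it may be stale.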