Epic 14 — Admin · System Health & Diagnostics

Covers the service map (Postgres, Valkey, Temporal, GitHub MCP, OTel Collector), /healthz summaries, the Temporal task-queue backlog monitor, degraded-state banners for OTel/Tempo/Prometheus, Grafana/Tempo deep-links, and the Operator panic-button for force-terminating stalled workflows. All stories require the OP role.

Personas: OP (exclusive)

Shared modules: CorrelationChip · LastSyncedBadge · EnvProvenance

Story 14.1 — Service Map Overview

As an OP, I want to see a real-time service map showing the health of each AMTP dependency, so that I can identify which component is causing a system degradation at a glance.
Scenario: All services healthy — service map renders green
Given the Operator navigates to /admin/health
When GET /healthz resolves with all components healthy
Then the service map renders nodes for Postgres, Valkey, Temporal (cluster + workers), GitHub MCP (amtp-github-mcp), and OTel Collector; each node shows a green Healthy badge with a last-checked timestamp; LastSyncedBadge reads Synced at: <time> · Snapshot
Scenario: A component is degraded — amber badge
Given GET /healthz returns Valkey with status: "degraded"
Then the Valkey node shows an amber Degraded badge with a description of the degraded condition; a View metrics link opens the Grafana Valkey panel
Scenario: A component is unreachable — red badge
Given GET /healthz returns Postgres with status: "unreachable"
Then the Postgres node shows a red Unreachable badge; CorrelationChip shows the request_id for the health poll
| Endpoint / DB | Purpose |
| --- | --- |
| GET /healthz | Full health summary for all components |
| GET /config | Grafana base URL for deep-links |
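
A minimal sketch of the /healthz contract the service map consumes, in TypeScript. The field names (components, checked_at, request_id) and the status vocabulary are assumptions inferred from the scenarios above, not a confirmed schema:

```ts
// Assumed /healthz response shape; illustrative only, not the confirmed contract.
type ComponentStatus = "healthy" | "degraded" | "unreachable";

interface HealthComponent {
  name: string;       // e.g. "postgres", "valkey", "temporal", "github_mcp", "otel_collector"
  status: ComponentStatus;
  detail?: string;    // human-readable description of a degraded condition
  checked_at: string; // ISO-8601 timestamp of the last health poll
}

interface HealthzResponse {
  request_id: string; // surfaced via CorrelationChip
  components: HealthComponent[];
}

// Map a component status to the badge colour the service map renders.
function badgeFor(status: ComponentStatus): "green" | "amber" | "red" {
  switch (status) {
    case "healthy":     return "green";
    case "degraded":    return "amber";
    case "unreachable": return "red";
  }
}
```

The three statuses are mutually exclusive, matching the green/amber/red scenarios above.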

Story 14.2 — Temporal Task-Queue Backlog Monitor

As an OP, I want to see the queue depth and schedule-to-start P95 latency for the amtp-activities Temporal task queue, so that I can distinguish "workers are online" from "the system is actually making progress".
Scenario: Task queue healthy — Temporal node renders green
Given the report shows workers reachable, queue depth ≤ 50, and schedule-to-start P95 ≤ 30 s
Then the Temporal node is green with a Healthy badge; the queue depth and P95 values are shown as sub-metrics beneath the node
Scenario: High task latency — amber badge
Given workers are reachable but queue depth > 50 OR schedule-to-start P95 > 30 s
Then an amber High task latency badge is shown; copy reads: Workers are responsive but the task backlog is growing. Current depth: <N>, P95 schedule-to-start: <value> s.; a View Temporal metrics link opens the Grafana Temporal panel
Scenario: Workers down — red badge, distinct copy
Given Temporal workers are reported as unreachable
Then a red Workers unavailable badge is shown with copy: Temporal workers are not reachable. No activities will be processed. (distinct from the "high latency" case)
Scenario: Thresholds are configurable
Given the runtime config endpoint returns a custom depth threshold (e.g. 100)
Then the amber warning fires at the configured threshold, not the hardcoded default of 50
| Endpoint / DB | Purpose |
| --- | --- |
| GET /healthz | Includes Temporal task-queue metrics |
| GET /ops/temporal-queue | Dedicated queue-depth and P95 endpoint |
| GET /config | Configurable queue-depth and P95 thresholds |
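
The badge logic in the scenarios above could look roughly like the following, with both thresholds read from GET /config rather than hardcoded. All field names here are illustrative assumptions:

```ts
// Assumed /config threshold fields; names are illustrative, not the confirmed contract.
interface QueueThresholds {
  queue_depth_warn: number;             // default 50
  schedule_to_start_p95_warn_s: number; // default 30
}

// Assumed shape of the task-queue report from /healthz or /ops/temporal-queue.
interface QueueReport {
  workers_reachable: boolean;
  queue_depth: number;
  schedule_to_start_p95_s: number;
}

type TemporalBadge =
  | { kind: "healthy" }
  | { kind: "high-task-latency"; depth: number; p95: number }
  | { kind: "workers-unavailable" };

function temporalBadge(report: QueueReport, t: QueueThresholds): TemporalBadge {
  // Red takes priority: no reachable workers means no progress at all.
  if (!report.workers_reachable) return { kind: "workers-unavailable" };
  // Amber: workers respond but the backlog is growing.
  if (
    report.queue_depth > t.queue_depth_warn ||
    report.schedule_to_start_p95_s > t.schedule_to_start_p95_warn_s
  ) {
    return {
      kind: "high-task-latency",
      depth: report.queue_depth,
      p95: report.schedule_to_start_p95_s,
    };
  }
  return { kind: "healthy" };
}
```

Checking worker reachability before the depth/P95 comparison keeps the "workers down" and "high latency" cases distinct, as the scenarios require.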

Story 14.3 — Operator Panic-Button (Force Terminate Stalled Workflow)

As an OP, I want a confirmed, audited action to force-terminate a stalled workflow via the API Gateway, so that I can unblock the system without bypassing AMTP auth or accessing Temporal directly.
Scenario: Panic-button visible only above Critical P95 threshold
Given the schedule-to-start P95 exceeds the configurable Critical threshold (default 5 min)
Then an OP-only Force terminate stalled workflow action appears next to the High task latency badge; a separate read-only Open Temporal Web UI link is also provided for inspection
Scenario: Panic-button requires 2-step confirmation modal
Given the Operator clicks Force terminate stalled workflow
When the confirmation modal opens
Then the modal displays the target run_id, current queue depth at decision time, current schedule-to-start P95 at decision time, and CorrelationChip; a Confirm terminate button and a Cancel button are rendered
Scenario: Termination issued via API Gateway — not direct Temporal API
Given the Operator has confirmed the termination modal
Then POST /ops/runs/{run_id}/terminate is issued to the AMTP API Gateway (which proxies to Temporal WorkflowTerminate server-side); an Epic 15 audit-log entry is written with run_id, queue_depth, p95_at_decision, and the Operator's identity; no direct browser call to Temporal's Terminate API is ever made
Scenario: Panic-button also surfaces on Activity Heartbeat Timeout stage card (Epic 4 cross-link)
Given a stage has entered Activity Heartbeat Timeout state (Story 4.5)
And the P95 is above the Critical threshold
And the viewing user has the OP role
Then the Force terminate stalled workflow button appears on the offending stage card; clicking it opens the same 2-step confirmation modal
| Endpoint / DB | Purpose |
| --- | --- |
| POST /ops/runs/{run_id}/terminate | AMTP API Gateway proxied termination (OP scope required) |
| GET /ops/temporal-queue | P95 value at decision time (for modal display) |
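
A hedged sketch of the client call issued once the Operator confirms the modal. The request-body field names mirror the audit fields the scenario lists (queue_depth, p95_at_decision), but the actual wire format is an assumption:

```ts
// Issue the termination through the AMTP API Gateway, never directly to Temporal.
// Body field names mirror the Epic 15 audit entry; exact names are assumptions.
async function terminateStalledWorkflow(
  runId: string,
  decisionContext: { queue_depth: number; p95_at_decision: number },
): Promise<void> {
  const res = await fetch(`/ops/runs/${encodeURIComponent(runId)}/terminate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Operator identity comes from the authenticated session; the gateway
    // attaches it to the audit-log entry server-side.
    body: JSON.stringify(decisionContext),
  });
  if (!res.ok) throw new Error(`Terminate failed: ${res.status}`);
}
```

Sending the decision-time depth and P95 in the body lets the server record exactly what the Operator saw in the confirmation modal, rather than re-sampling after the fact.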

Story 14.4 — OTel / Tempo / Prometheus Degraded States

As an OP, I want the service map to surface degraded observability-stack states with specific copy, so that I know whether Grafana deep-links and trace data are trustworthy before acting on them.
Scenario: Prometheus unreachable — metric deep-links show stale-data warning
Given GET /healthz returns Prometheus with status: "unreachable"
Then Grafana metric deep-links are still shown but carry a tooltip: Metrics source unavailable — data may be stale; the Prometheus node shows a red Unreachable badge
Scenario: Tempo unreachable — trace links show warning
Given GET /healthz returns Tempo with status: "unreachable"
Then View trace links are shown but prefixed with a Tempo unavailable tooltip warning
Scenario: OTel Collector unreachable — global banner
Given GET /healthz returns OTel Collector with status: "unreachable"
Then a global banner reads: Telemetry pipeline is down. Trace and metric data is not being collected. Investigate the OTel Collector service.; local-dev hints (docker compose ps otel-collector) are shown if EnvProvenance resolves to development; production remediation hints are shown for production (distinct copy per the Dev-Prod Parity NFR)
| Endpoint / DB | Purpose |
| --- | --- |
| GET /healthz | Includes otel_collector, tempo, prometheus component statuses |
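
A sketch of how the degraded-observability copy might be selected. The banner text and the stale-metrics tooltip are quoted from the scenarios; the production remediation hint is an assumed placeholder, since the spec only requires that it differ from the dev copy:

```ts
// Environment resolved via EnvProvenance.
type Env = "development" | "production";

// Global banner for an unreachable OTel Collector, with environment-specific
// remediation hints per the Dev-Prod Parity NFR. Production hint text is assumed.
function otelCollectorBanner(env: Env): { message: string; hint: string } {
  const message =
    "Telemetry pipeline is down. Trace and metric data is not being collected. " +
    "Investigate the OTel Collector service.";
  const hint =
    env === "development"
      ? "Local dev: check the container with `docker compose ps otel-collector`."
      : "Production: check the OTel Collector deployment and its recent logs."; // assumed copy
  return { message, hint };
}

// Tooltip shown on Grafana metric deep-links while Prometheus is unreachable.
const STALE_METRICS_TOOLTIP = "Metrics source unavailable — data may be stale";
```

Keeping the deep-links visible but annotated, rather than hiding them, matches the scenarios' intent: the Operator can still navigate to Grafana while knowing the data behind it may be stale.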