Observability

AMTP ships a full observability stack deployed as a separate Docker Compose project (docker-compose.observability.yml). LLM agents emit structured OpenTelemetry traces and token-count metrics through an OTel Collector, which forwards traces to Grafana Tempo and metrics to Prometheus. Grafana aggregates both signals into a pre-built LLM Token Accounting dashboard and an alerting pipeline that fires to Slack and email.

Pipeline Overview #

The end-to-end signal flow from agent execution to Grafana visualization:

LLM Agents (OTLP HTTP :4318 / gRPC :4317)
       │
       ▼
┌──────────────────────────────────────────────────────────┐
│  OpenTelemetry Collector                                 │
│  (otel/opentelemetry-collector-contrib:0.100.0)          │
│                                                          │
│  receivers:  otlp (grpc :4317, http :4318)               │
│  processors: memory_limiter → attributes/llm_meta        │
│              → batch                                     │
│  exporters:  otlp/tempo (traces) · prometheus (metrics)  │
└───────────┬────────────────────────┬─────────────────────┘
            │ traces (OTLP gRPC)     │ metrics (Prometheus scrape :8889)
            ▼                        ▼
   ┌─────────────────┐     ┌───────────────────────┐
   │  Grafana Tempo  │     │  Prometheus 2.52      │
   │  2.4.1          │     │  (amtp-prometheus)    │
   │  HTTP :3200     │     │  :9091 (host)         │
   │  gRPC :9095     │     │  15 d retention       │
   └────────┬────────┘     └──────────┬────────────┘
            │ TraceQL                 │ PromQL
            └────────────┬────────────┘
                         ▼
                ┌─────────────────┐
                │  Grafana 10.4.2 │
                │  :3000 (host)   │
                │  Datasources:   │
                │    Tempo · Prom │
                │  Dashboard:     │
                │    LLM Token    │
                │    Accounting   │
                └─────────────────┘
AMTP observability signal flow — docker-compose.observability.yml
| Service | Image | Role | Host port(s) |
|---|---|---|---|
| otel-collector | otel/opentelemetry-collector-contrib:0.100.0 | Receive, process, and route all telemetry | ${OBS_OTEL_GRPC_PORT:-4317}, ${OBS_OTEL_HTTP_PORT:-4318}, ${OBS_OTEL_METRICS_PORT:-8888}, ${OBS_OTEL_HEALTH_PORT:-13133}, ${OBS_OTEL_ZPAGES_PORT:-55679} |
| tempo | grafana/tempo:2.4.1 | Distributed trace storage & query | ${OBS_TEMPO_HTTP_PORT:-3200}, ${OBS_TEMPO_GRPC_PORT:-9095} |
| amtp-prometheus | prom/prometheus:v2.52.0 | Metrics storage, scraping, & alerting | ${OBS_PROMETHEUS_PORT:-9091} → container 9090 |
| grafana | grafana/grafana:10.4.2 | Visualization, dashboards, alerting UI | ${OBS_GRAFANA_PORT:-3000} |
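
After bring-up, a quick end-to-end liveness check can hit each service's health endpoint. The sketch below is not part of the repo; it assumes the default OBS_* host ports and localhost access (adjust if any port is overridden):

import urllib.request
# Health endpoints, assuming the default OBS_* host ports from the table above.
CHECKS = {
    "otel-collector":  "http://localhost:13133/",           # health_check extension
    "tempo":           "http://localhost:3200/ready",       # Tempo readiness endpoint
    "amtp-prometheus": "http://localhost:9091/-/healthy",   # Prometheus health endpoint
    "grafana":         "http://localhost:3000/api/health",  # Grafana health API
}
for name, url in CHECKS.items():
    try:
        result = urllib.request.urlopen(url, timeout=5).status
    except OSError as exc:  # connection refused, timeout, HTTP error
        result = exc
    print(f"{name:15s} -> {result}")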

OpenTelemetry Collector #

The Collector is the single ingestion point for all AMTP telemetry. Configuration: infra/observability/otel-collector-config.yaml.

Receivers #

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["http://*", "https://*"]
infra/observability/otel-collector-config.yaml § receivers
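
From the agent side, these are the endpoints an OTLP exporter must target. A minimal sketch using the SDK packages pinned in apps/otel-verify/requirements.txt; the service name my-llm-agent is illustrative, and the otel-collector hostname resolves only inside the compose network (use localhost from the host):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
# OTLP/HTTP targets the http receiver (:4318); the gRPC exporter would use :4317.
provider = TracerProvider(resource=Resource.create({"service.name": "my-llm-agent"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces"))
)
trace.set_tracer_provider(provider)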

Processors #

| Processor | Purpose |
|---|---|
| memory_limiter | Hard cap at 512 MiB; spike limit 128 MiB; checked every 1 s. |
| attributes/llm_meta | Upserts telemetry.sdk.name=opentelemetry; inserts pipeline.source=zt-amtp on every span for downstream filtering. |
| batch | Aggregates spans into batches (timeout 1 s, size 1024, max 2048) before forwarding, to reduce network overhead. |

Pipeline execution order: memory_limiter → attributes/llm_meta → batch.

Exporters #

| Exporter | Target | Notes |
|---|---|---|
| otlp/tempo | tempo:4317 (gRPC, insecure) | Retry on failure: initial 1 s, max 30 s, elapsed 120 s. Queue: 4 consumers, 1000 entries. |
| prometheus | Scrape endpoint 0.0.0.0:8889 | Namespace llm → metric names become llm_prompt_tokens_total, llm_completion_tokens_total, llm_total_tokens_total. 5-minute metric expiration. Resource attributes converted to labels. |
| debug | Collector stdout | Verbosity: basic. Development aid only. |

Extensions #

| Extension | Endpoint | Use |
|---|---|---|
| health_check | 0.0.0.0:13133 | External liveness probe (not usable inside the distroless container itself). |
| pprof | 0.0.0.0:1777 | Go pprof profiling for collector performance analysis. |
| zpages | 0.0.0.0:55679 | In-process debug page showing pipeline stats, latency, errors. |

Grafana Tempo 2.4.1 #

Distributed trace backend. Receives spans from the OTel Collector via OTLP gRPC on its internal port 4317, and exposes its HTTP API on port 3200, which Grafana queries as a TraceQL datasource. Configuration: infra/observability/tempo.yaml.

Metrics Generator #

Tempo’s built-in metrics_generator derives span metrics and service graphs from ingested traces, then remote-writes them to Prometheus:

metrics_generator:
  storage:
    remote_write:
      - url: http://amtp-prometheus:9090/api/v1/write
        send_exemplars: true
  processor:
    service_graphs:
      dimensions: [service.name, span.kind]
    span_metrics:
      dimensions:
        - service.name
        - span.name
        - span.kind
        - status.code
        - llm.model
        - llm.provider
        - gen_ai.system
      enable_target_info: true
infra/observability/tempo.yaml § metrics_generator

The span-metric dimensions include llm.model, llm.provider, and gen_ai.system so Grafana can break down P95 latency and throughput by model and provider without requiring a separate labeling step.
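
For example, a per-model call-rate breakdown can be read straight from Prometheus. A sketch against the HTTP query API, assuming the default host port 9091 and that Tempo sanitizes the llm.model dimension to the label llm_model (its usual dot-to-underscore conversion):

import json
import urllib.parse
import urllib.request
# LLM call rate per model, derived from Tempo-generated span metrics.
promql = 'sum by (llm_model) (rate(traces_spanmetrics_calls_total{span_name="llm.completion"}[5m]))'
url = "http://localhost:9091/api/v1/query?" + urllib.parse.urlencode({"query": promql})
with urllib.request.urlopen(url, timeout=10) as resp:
    for series in json.load(resp)["data"]["result"]:
        print(series["metric"].get("llm_model", "<unset>"), series["value"][1])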

Storage & Retention #

Prometheus 2.52 (amtp-prometheus) #

Metrics store. The service is named amtp-prometheus (not the default prometheus) to avoid collisions with other Prometheus instances on the amtp_net network. Host port ${OBS_PROMETHEUS_PORT:-9091} maps to container port 9090. Configuration: infra/observability/prometheus.yml.

Scrape Targets #

scrape_configs:
  # LLM token metrics emitted by agents via OTEL collector
  - job_name: otel-collector
    static_configs:
      - targets: ['otel-collector:8889']

  # Tempo span metrics (service graphs, span durations) via metrics_generator
  - job_name: tempo
    static_configs:
      - targets: ['tempo:3200']
    metrics_path: /metrics
infra/observability/prometheus.yml § scrape_configs

Runtime Flags #
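
The exact command: flags live in docker-compose.observability.yml and are not reproduced here. Two can be inferred from the rest of this page: the 15 d retention shown in the pipeline diagram implies --storage.tsdb.retention.time=15d, and Tempo's metrics_generator remote write to /api/v1/write requires --web.enable-remote-write-receiver, which Prometheus leaves disabled by default.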

Grafana 10.4.2 #

Visualization layer. Depends on both Tempo and Prometheus being healthy before starting. Provisioned via Git-backed YAML files mounted from infra/observability/grafana/provisioning/.

Provisioned Datasources #

| UID | Type | URL | Notes |
|---|---|---|---|
| tempo | Tempo | http://tempo:3200 | TraceQL; linked to Prometheus for trace→metrics correlation. |
| prometheus | Prometheus | http://amtp-prometheus:9090 | Default datasource. PromQL for LLM token panels and span metrics. |

LLM Token Accounting Dashboard #

UID: llm-token-accounting. Auto-provisioned from infra/observability/grafana/provisioning/dashboards/llm-token-accounting.json. Refresh: 30 s. Default time range: last 3 hours.

| Panel | Type | Query (simplified) |
|---|---|---|
| Total Tokens | Stat | sum(llm_prompt_tokens_total) + sum(llm_completion_tokens_total) |
| Prompt Tokens | Stat | sum(llm_prompt_tokens_total) |
| Completion Tokens | Stat | sum(llm_completion_tokens_total) |
| P95 LLM Latency | Stat (ms) | histogram_quantile(0.95, sum by(le)(rate(traces_spanmetrics_duration_seconds_bucket{span_name="llm.completion"}[5m]))) * 1000 |
| LLM Calls / sec | Stat (req/s) | sum(rate(traces_spanmetrics_calls_total{span_name="llm.completion"}[5m])) |
| Token Rate (per minute) | Time series | Prompt + Completion + Total rates over 1 m window |
| LLM Trace Search | Traces (Tempo) | Native Tempo search filtered by service_name / span_name template variables |
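
To confirm the dashboard was actually provisioned after a fresh bring-up, the Grafana HTTP API can fetch it by UID. A sketch assuming the default host port 3000 and Grafana's out-of-the-box admin:admin credentials (substitute a service-account token for anything beyond a throwaway dev host):

import base64
import json
import urllib.request
# Fetch the provisioned dashboard by UID via the Grafana HTTP API.
req = urllib.request.Request(
    "http://localhost:3000/api/dashboards/uid/llm-token-accounting",
    headers={"Authorization": "Basic " + base64.b64encode(b"admin:admin").decode()},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    dash = json.load(resp)["dashboard"]
    print(dash["title"], "-", len(dash["panels"]), "panels")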

Generic OAuth / OIDC SSO #

Grafana supports any OAuth 2.0 / OIDC provider via the GF_AUTH_GENERIC_OAUTH_* environment-variable convention (each variable maps to a key in the [auth.generic_oauth] ini section). SSO is disabled by default (GF_AUTH_GENERIC_OAUTH_ENABLED=false). The default role mapping grants Admin to members of the grafana-admins group and Viewer to everyone else.

SMTP for email alerts is likewise disabled by default (GF_SMTP_ENABLED=false). All configurable values are documented in .env.example § Observability stack.

Alerting #

Alerting is provisioned from infra/observability/grafana/provisioning/alerting/alerting.yaml. The provisioning file defines contact points, notification policies, mute timings, and alert rules in a single YAML document.

Contact Points #

| Name | Type | Channel |
|---|---|---|
| slack-llm-alerts | Slack | #llm-alerts via CHANGEME_SLACK_WEBHOOK_URL |
| email-llm-alerts | Email | CHANGEME_ALERT_EMAIL_ADDRESSES (single combined email) |

Notification Policies #

Mute Timings #

A maintenance-window mute timing suppresses all notifications on Saturday and Sunday 00:00–23:59 (server timezone).

Alert Rules #

| Rule UID | Title | Severity | For | Summary |
|---|---|---|---|---|
| otel-collector-down | OTEL Collector Unreachable | critical | 2 m | No scrape data from otel-collector:8888 for >2 minutes. Traces may be lost. |

Pipeline Verification (apps/otel-verify/) #

apps/otel-verify/verify_trace.py is a self-contained Python script that emits one structured LLM trace plus LLM token counters through the real OTel Collector to verify the full observability pipeline end-to-end. It is used by .github/workflows/observability-ci-cd.yml as the post-deploy acceptance gate.

What it emits #
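
Per the LLM Instrumentation Contract below: one LLM span (presumably named llm.completion, matching the dashboard queries above) carrying the required llm.* and gen_ai.* span attributes, plus the three token counters prompt_tokens, completion_tokens, and total_tokens with the required metric attributes. The script prints the resulting trace ID to stdout, which the CI steps below extract and verify; a Python sketch of the emission shape appears under the contract section.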

Trace ID extraction #

verify_trace.py prints the phrase “Trace ID” on multiple lines — once with the 32-character hex value and once in the user-hint line “search by Trace ID above”. A naïve awk '{print $NF}' therefore produces a multi-line string that breaks the subsequent curl URL. The canonical extraction in CI uses a strict regex:

TRACE_ID=$(
  python apps/otel-verify/verify_trace.py \
    | grep -oE '[0-9a-f]{32}' \
    | head -n 1
)
An OTEL trace ID is exactly 32 lowercase hex characters by spec. grep -oE '[0-9a-f]{32}' extracts only that token; head -n 1 collapses duplicate matches.

Tempo polling #

After emitting the trace, CI polls GET /api/traces/{TRACE_ID} on Tempo in a retry loop (20 attempts × 3 s = 60 s budget):

for i in $(seq 1 20); do
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
    "http://localhost:${OBS_TEMPO_HTTP_PORT}/api/traces/${TRACE_ID}" || true)
  if [ "$STATUS" = "200" ]; then
    echo "Trace ${TRACE_ID} confirmed in Tempo (attempt $i)"
    exit 0
  fi
  echo "Waiting for Tempo to ingest trace... ($i/20, status=$STATUS)"
  sleep 3
done
exit 1
The 60 s budget covers cold ingester starts, where the latency from OTLP emit to queryable trace can exceed 10 s.
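
The same check is easy to run outside CI when debugging locally. A Python sketch of the identical retry loop, assuming the default Tempo host port:

import time
import urllib.request
def wait_for_trace(trace_id: str, attempts: int = 20, delay: float = 3.0) -> bool:
    """Poll Tempo until the trace is queryable (mirrors the CI loop above)."""
    url = f"http://localhost:3200/api/traces/{trace_id}"
    for i in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    print(f"Trace {trace_id} confirmed in Tempo (attempt {i})")
                    return True
        except OSError:  # 404 before ingestion, connection errors
            pass
        print(f"Waiting for Tempo to ingest trace... ({i}/{attempts})")
        time.sleep(delay)
    return False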

Dependencies #

opentelemetry-api==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
apps/otel-verify/requirements.txt — installed into an isolated .venv during each CI run.

LLM Instrumentation Contract #

All LLM agent spans must carry the following attributes to be correctly bucketed by the LLM Token Accounting dashboard and the Tempo span-metric dimensions.

Required span attributes #

| Attribute | Type | Example |
|---|---|---|
| llm.provider | string | openai |
| llm.model | string | gpt-4o |
| llm.request.type | string | chat |
| llm.tokens.prompt | int | 128 |
| llm.tokens.completion | int | 64 |
| llm.tokens.total | int | 192 |
| gen_ai.system | string | openai |
| gen_ai.request.model | string | gpt-4o |

Required metric counters #

Emitted as OTLP counter metrics (no llm_ prefix in the instrument name — the Collector’s Prometheus exporter adds the llm namespace and the _total suffix):

| Instrument name | Prometheus name | Unit |
|---|---|---|
| prompt_tokens | llm_prompt_tokens_total | token |
| completion_tokens | llm_completion_tokens_total | token |
| total_tokens | llm_total_tokens_total | token |

Required metric attributes: llm.model, llm.provider, gen_ai.system.
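
A minimal call-site sketch that satisfies both halves of the contract, using the SDK versions pinned in apps/otel-verify/requirements.txt. It assumes a TracerProvider and MeterProvider already configured to export OTLP to the Collector; the helper name and token values here are illustrative:

from opentelemetry import metrics, trace
tracer = trace.get_tracer("llm-agent")
meter = metrics.get_meter("llm-agent")
# Un-prefixed instrument names: the Collector's Prometheus exporter adds the
# llm namespace and _total suffix (prompt_tokens -> llm_prompt_tokens_total).
prompt_tokens = meter.create_counter("prompt_tokens", unit="token")
completion_tokens = meter.create_counter("completion_tokens", unit="token")
total_tokens = meter.create_counter("total_tokens", unit="token")

def record_llm_call(provider: str, model: str, n_prompt: int, n_completion: int) -> None:
    """Record one completion so the dashboard and span-metric dimensions can bucket it."""
    labels = {"llm.model": model, "llm.provider": provider, "gen_ai.system": provider}
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attributes({
            **labels,
            "llm.request.type": "chat",
            "llm.tokens.prompt": n_prompt,
            "llm.tokens.completion": n_completion,
            "llm.tokens.total": n_prompt + n_completion,
            "gen_ai.request.model": model,
        })
        prompt_tokens.add(n_prompt, labels)
        completion_tokens.add(n_completion, labels)
        total_tokens.add(n_prompt + n_completion, labels)

record_llm_call("openai", "gpt-4o", 128, 64)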

GitHub MCP instrumentation #

The amtp-github-mcp service emits its own OpenTelemetry spans and metrics via the same OTLP HTTP endpoint (http://otel-collector:4318). OTEL_SERVICE_NAME=amtp-github-mcp is set in the container environment so traces and metrics are attributed separately from the LLM agents. Key span names: repo.tree, repo.read_file, github.api, valkey.get, valkey.set. Key metric histograms: mcp.tool.duration, mcp.github_api.duration. These use MCP-specific attributes (mcp.tool, mcp.cache_hit, mcp.exclusion_reason) and are not subject to the LLM instrumentation contract above.

Host Ports & OBS_* Overrides #

Every observability host port is parameterized via an OBS_* environment variable so that port conflicts on a shared dev host can be resolved without code changes — set the relevant variable in GitHub Actions Settings → Environments → dev → Variables.

| Variable | Default | Service | Purpose |
|---|---|---|---|
| OBS_OTEL_GRPC_PORT | 4317 | otel-collector | OTLP gRPC ingestion endpoint |
| OBS_OTEL_HTTP_PORT | 4318 | otel-collector | OTLP HTTP ingestion endpoint |
| OBS_OTEL_METRICS_PORT | 8888 | otel-collector | Collector self-metrics (Prometheus scrape) |
| OBS_OTEL_HEALTH_PORT | 13133 | otel-collector | health_check extension endpoint |
| OBS_OTEL_ZPAGES_PORT | 55679 | otel-collector | zpages debug UI |
| OBS_TEMPO_HTTP_PORT | 3200 | tempo | Tempo HTTP API (Grafana datasource + trace lookup) |
| OBS_TEMPO_GRPC_PORT | 9095 | tempo | Tempo gRPC API |
| OBS_PROMETHEUS_PORT | 9091 | amtp-prometheus | Prometheus HTTP API (maps to container :9090) |
| OBS_GRAFANA_PORT | 3000 | grafana | Grafana UI and API |

Network Dependency (amtp_net) #

The observability compose project attaches to the external Docker network amtp_net, which the base docker-compose.yml stack normally creates. On a pristine host where the base stack has never been brought up, the observability deployment fails because the external network does not exist. Two remediation paths (apply one):

  1. Workflow-level sequencing (recommended). Add needs: healthcheck to the observability job in .github/workflows/ci-cd.yml. This makes the base-stack bring-up a precondition, visible in the workflow graph.
  2. Bootstrap-level creation. Add the following idempotent command to infra/bootstrap.sh so the network pre-exists any CI run:
    docker network create amtp_net --label amtp.owner=bootstrap 2>/dev/null || true
    Add after the Docker Engine installation block in infra/bootstrap.sh.

Until one path is applied, on a pristine host manually run either docker network create amtp_net or docker compose -f docker-compose.yml up -d --wait healthcheck before triggering the observability workflow.

See also: Observability CI/CD workflow and Dev-host bootstrap runbook.