Observability

AMTP ships a full observability stack deployed as a separate Docker Compose project (docker-compose.observability.yml). LLM agents emit structured OpenTelemetry traces and token-count metrics through an OTel Collector, which forwards traces to Grafana Tempo and metrics to Prometheus. Grafana aggregates both signals into a pre-built LLM Token Accounting dashboard and an alerting pipeline that fires to Slack and email.

Pipeline Overview #

The end-to-end signal flow from agent execution to Grafana visualization:

LLM Agents (OTLP HTTP :4318 / gRPC :4317)
       │
       ▼
┌──────────────────────────────────────────────────────────┐
│  OpenTelemetry Collector                                 │
│  (otel/opentelemetry-collector-contrib:0.100.0)          │
│                                                          │
│  receivers:  otlp (grpc :4317, http :4318)               │
│  processors: memory_limiter → attributes/llm_meta        │
│              → batch                                     │
│  exporters:  otlp/tempo (traces) · prometheus (metrics)  │
└───────────┬────────────────────────┬─────────────────────┘
            │ traces (OTLP gRPC)     │ metrics (Prometheus scrape :8889)
            ▼                        ▼
   ┌─────────────────┐     ┌───────────────────────┐
   │  Grafana Tempo  │     │  Prometheus 2.52      │
   │  2.4.1          │     │  (amtp-prometheus)    │
   │  HTTP :3200     │     │  :9091 (host)         │
   │  gRPC :9095     │     │  15 d retention       │
   └────────┬────────┘     └──────────┬────────────┘
            │ TraceQL                 │ PromQL
            └────────────┬────────────┘
                         ▼
                ┌─────────────────┐
                │  Grafana 10.4.2 │
                │  :3000 (host)   │
                │  Datasources:   │
                │    Tempo · Prom │
                │  Dashboard:     │
                │    LLM Token    │
                │    Accounting   │
                └─────────────────┘
AMTP observability signal flow — docker-compose.observability.yml
| Service | Image | Role | Host port(s) |
|---|---|---|---|
| otel-collector | otel/opentelemetry-collector-contrib:0.100.0 | Receive, process, and route all telemetry | ${OBS_OTEL_GRPC_PORT:-4317}, ${OBS_OTEL_HTTP_PORT:-4318}, ${OBS_OTEL_METRICS_PORT:-8888}, ${OBS_OTEL_HEALTH_PORT:-13133}, ${OBS_OTEL_ZPAGES_PORT:-55679} |
| tempo | grafana/tempo:2.4.1 | Distributed trace storage & query | ${OBS_TEMPO_HTTP_PORT:-3200}, ${OBS_TEMPO_GRPC_PORT:-9095} |
| amtp-prometheus | prom/prometheus:v2.52.0 | Metrics storage, scraping, & alerting | ${OBS_PROMETHEUS_PORT:-9091} → container 9090 |
| grafana | grafana/grafana:10.4.2 | Visualization, dashboards, alerting UI | ${OBS_GRAFANA_PORT:-3000} |
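
After bring-up, a quick end-to-end liveness check can hit each service's health endpoint. The sketch below is not part of the repo; it assumes the default OBS_* host ports and localhost access (adjust if any port is overridden):

import urllib.request
# Health endpoints, assuming the default OBS_* host ports from the table above.
CHECKS = {
    "otel-collector":  "http://localhost:13133/",           # health_check extension
    "tempo":           "http://localhost:3200/ready",       # Tempo readiness endpoint
    "amtp-prometheus": "http://localhost:9091/-/healthy",   # Prometheus health endpoint
    "grafana":         "http://localhost:3000/api/health",  # Grafana health API
}
for name, url in CHECKS.items():
    try:
        result = urllib.request.urlopen(url, timeout=5).status
    except OSError as exc:  # connection refused, timeout, HTTP error
        result = exc
    print(f"{name:15s} -> {result}")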

OpenTelemetry Collector #

The Collector is the single ingestion point for all AMTP telemetry. Configuration: infra/observability/otel-collector-config.yaml.

Receivers #

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["http://*", "https://*"]
infra/observability/otel-collector-config.yaml § receivers
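
From the agent side, these are the endpoints an OTLP exporter must target. A minimal sketch using the SDK packages pinned in apps/otel-verify/requirements.txt; the service name my-llm-agent is illustrative, and the otel-collector hostname resolves only inside the compose network (use localhost from the host):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
# OTLP/HTTP targets the http receiver (:4318); the gRPC exporter would use :4317.
provider = TracerProvider(resource=Resource.create({"service.name": "my-llm-agent"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces"))
)
trace.set_tracer_provider(provider)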

Processors #

| Processor | Purpose |
|---|---|
| memory_limiter | Hard cap at 512 MiB; spike limit 128 MiB; checked every 1 s. |
| attributes/llm_meta | Upserts telemetry.sdk.name=opentelemetry; inserts pipeline.source=zt-amtp on every span for downstream filtering. |
| batch | Aggregates spans into batches (timeout 1 s, size 1024, max 2048) before forwarding, to reduce network overhead. |

Pipeline execution order: memory_limiter → attributes/llm_meta → batch.

Exporters #

| Exporter | Target | Notes |
|---|---|---|
| otlp/tempo | tempo:4317 (gRPC, insecure) | Retry on failure: initial 1 s, max 30 s, elapsed 120 s. Queue: 4 consumers, 1000 entries. |
| prometheus | Scrape endpoint 0.0.0.0:8889 | Namespace llm → metric names become llm_prompt_tokens_total, llm_completion_tokens_total, llm_total_tokens_total. 5-minute metric expiration. Resource attributes converted to labels. |
| debug | Collector stdout | Verbosity: basic. Development aid only. |

Extensions #

| Extension | Endpoint | Use |
|---|---|---|
| health_check | 0.0.0.0:13133 | External liveness probe (not usable inside the distroless container itself). |
| pprof | 0.0.0.0:1777 | Go pprof profiling for collector performance analysis. |
| zpages | 0.0.0.0:55679 | In-process debug page showing pipeline stats, latency, errors. |

Grafana Tempo 2.4.1 #

Distributed trace backend. Receives spans from the OTel Collector via OTLP gRPC on its internal port 4317, and exposes its HTTP API on port 3200, which Grafana queries as a TraceQL datasource. Configuration: infra/observability/tempo.yaml.

Metrics Generator #

Tempo’s built-in metrics_generator derives span metrics and service graphs from ingested traces, then remote-writes them to Prometheus:

metrics_generator:
  storage:
    remote_write:
      - url: http://amtp-prometheus:9090/api/v1/write
        send_exemplars: true
  processor:
    service_graphs:
      dimensions: [service.name, span.kind]
    span_metrics:
      dimensions:
        - service.name
        - span.name
        - span.kind
        - status.code
        - llm.model
        - llm.provider
        - gen_ai.system
      enable_target_info: true
infra/observability/tempo.yaml § metrics_generator

The span-metric dimensions include llm.model, llm.provider, and gen_ai.system so Grafana can break down P95 latency and throughput by model and provider without requiring a separate labeling step.
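
For example, a per-model call-rate breakdown can be read straight from Prometheus. A sketch against the HTTP query API, assuming the default host port 9091 and that Tempo sanitizes the llm.model dimension to the label llm_model (its usual dot-to-underscore conversion):

import json
import urllib.parse
import urllib.request
# LLM call rate per model, derived from Tempo-generated span metrics.
promql = 'sum by (llm_model) (rate(traces_spanmetrics_calls_total{span_name="llm.completion"}[5m]))'
url = "http://localhost:9091/api/v1/query?" + urllib.parse.urlencode({"query": promql})
with urllib.request.urlopen(url, timeout=10) as resp:
    for series in json.load(resp)["data"]["result"]:
        print(series["metric"].get("llm_model", "<unset>"), series["value"][1])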

Storage & Retention #

Prometheus 2.52 (amtp-prometheus) #

Metrics store. The service is named amtp-prometheus (not the default prometheus) to avoid collisions with other Prometheus instances on the amtp_net network. Host port ${OBS_PROMETHEUS_PORT:-9091} maps to container port 9090. Configuration: infra/observability/prometheus.yml.

Scrape Targets #

scrape_configs:
  # LLM token metrics emitted by agents via OTEL collector
  - job_name: otel-collector
    static_configs:
      - targets: ['otel-collector:8889']

  # Tempo span metrics (service graphs, span durations) via metrics_generator
  - job_name: tempo
    static_configs:
      - targets: ['tempo:3200']
    metrics_path: /metrics
infra/observability/prometheus.yml § scrape_configs

Runtime Flags #
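
The exact command: flags live in docker-compose.observability.yml and are not reproduced here. Two can be inferred from the rest of this page: the 15 d retention shown in the pipeline diagram implies --storage.tsdb.retention.time=15d, and Tempo's metrics_generator remote write to /api/v1/write requires --web.enable-remote-write-receiver, which Prometheus leaves disabled by default.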

Grafana 10.4.2 #

Visualization layer. Depends on both Tempo and Prometheus being healthy before starting. Provisioned via Git-backed YAML files mounted from infra/observability/grafana/provisioning/.

Provisioned Datasources #

| UID | Type | URL | Notes |
|---|---|---|---|
| tempo | Tempo | http://tempo:3200 | TraceQL; linked to Prometheus for trace→metrics correlation. |
| prometheus | Prometheus | http://amtp-prometheus:9090 | Default datasource. PromQL for LLM token panels and span metrics. |

LLM Token Accounting Dashboard #

UID: llm-token-accounting. Auto-provisioned from infra/observability/grafana/provisioning/dashboards/llm-token-accounting.json. Refresh: 30 s. Default time range: last 3 hours.

| Panel | Type | Query (simplified) |
|---|---|---|
| Total Tokens | Stat | sum(llm_prompt_tokens_total) + sum(llm_completion_tokens_total) |
| Prompt Tokens | Stat | sum(llm_prompt_tokens_total) |
| Completion Tokens | Stat | sum(llm_completion_tokens_total) |
| P95 LLM Latency | Stat (ms) | histogram_quantile(0.95, sum by(le)(rate(traces_spanmetrics_duration_seconds_bucket{span_name="llm.completion"}[5m]))) * 1000 |
| LLM Calls / sec | Stat (req/s) | sum(rate(traces_spanmetrics_calls_total{span_name="llm.completion"}[5m])) |
| Token Rate (per minute) | Time series | Prompt + Completion + Total rates over 1 m window |
| LLM Trace Search | Traces (Tempo) | Native Tempo search filtered by service_name / span_name template variables |
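
To confirm the dashboard was actually provisioned after a fresh bring-up, the Grafana HTTP API can fetch it by UID. A sketch assuming the default host port 3000 and Grafana's out-of-the-box admin:admin credentials (substitute a service-account token for anything beyond a throwaway dev host):

import base64
import json
import urllib.request
# Fetch the provisioned dashboard by UID via the Grafana HTTP API.
req = urllib.request.Request(
    "http://localhost:3000/api/dashboards/uid/llm-token-accounting",
    headers={"Authorization": "Basic " + base64.b64encode(b"admin:admin").decode()},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    dash = json.load(resp)["dashboard"]
    print(dash["title"], "-", len(dash["panels"]), "panels")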

Generic OAuth / OIDC SSO #

Grafana supports any OAuth 2.0 / OIDC provider via the GF_AUTH_GENERIC_OAUTH_* environment-variable convention (each variable maps to a key in the [auth.generic_oauth] ini section). SSO is disabled by default (GF_AUTH_GENERIC_OAUTH_ENABLED=false). The default role mapping grants Admin to members of the grafana-admins group and Viewer to everyone else.

SMTP for email alerts is likewise disabled by default (GF_SMTP_ENABLED=false). All configurable values are documented in .env.example § Observability stack.

Alerting #

Alerting is provisioned from infra/observability/grafana/provisioning/alerting/alerting.yaml. The provisioning file defines contact points, notification policies, mute timings, and alert rules in a single YAML document.

Contact Points #

| Name | Type | Channel |
|---|---|---|
| slack-llm-alerts | Slack | #llm-alerts via CHANGEME_SLACK_WEBHOOK_URL |
| email-llm-alerts | Email | CHANGEME_ALERT_EMAIL_ADDRESSES (single combined email) |

Notification Policies #

Mute Timings #

A maintenance-window mute timing suppresses all notifications on Saturday and Sunday 00:00–23:59 (server timezone).

Alert Rules #

| Rule UID | Title | Severity | For | Summary |
|---|---|---|---|---|
| otel-collector-down | OTEL Collector Unreachable | critical | 2 m | No scrape data from otel-collector:8888 for >2 minutes. Traces may be lost. |

Pipeline Verification (apps/otel-verify/) #

apps/otel-verify/verify_trace.py is a self-contained Python script that emits one structured LLM trace plus LLM token counters through the real OTel Collector to verify the full observability pipeline end-to-end. It is used by .github/workflows/observability-ci-cd.yml as the post-deploy acceptance gate.

What it emits #
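
Per the LLM Instrumentation Contract below: one LLM span (presumably named llm.completion, matching the dashboard queries above) carrying the required llm.* and gen_ai.* span attributes, plus the three token counters prompt_tokens, completion_tokens, and total_tokens with the required metric attributes. The script prints the resulting trace ID to stdout, which the CI steps below extract and verify; a Python sketch of the emission shape appears under the contract section.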

Trace ID extraction #

verify_trace.py prints the phrase “Trace ID” on multiple lines — once with the 32-character hex value and once in the user-hint line “search by Trace ID above”. A naïve awk '{print $NF}' therefore produces a multi-line string that breaks the subsequent curl URL. The canonical extraction in CI uses a strict regex:

TRACE_ID=$(
  python apps/otel-verify/verify_trace.py \
    | grep -oE '[0-9a-f]{32}' \
    | head -n 1
)
An OTEL trace ID is exactly 32 lowercase hex characters by spec. grep -oE '[0-9a-f]{32}' extracts only that token; head -n 1 collapses duplicate matches.

Tempo polling #

After emitting the trace, CI polls GET /api/traces/{TRACE_ID} on Tempo in a retry loop (20 attempts × 3 s = 60 s budget):

for i in $(seq 1 20); do
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
    "http://localhost:${OBS_TEMPO_HTTP_PORT}/api/traces/${TRACE_ID}" || true)
  if [ "$STATUS" = "200" ]; then
    echo "Trace ${TRACE_ID} confirmed in Tempo (attempt $i)"
    exit 0
  fi
  echo "Waiting for Tempo to ingest trace... ($i/20, status=$STATUS)"
  sleep 3
done
exit 1
The 60 s budget covers cold ingester starts, where the latency from OTLP emit to queryable trace can exceed 10 s.
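
The same check is easy to run outside CI when debugging locally. A Python sketch of the identical retry loop, assuming the default Tempo host port:

import time
import urllib.request
def wait_for_trace(trace_id: str, attempts: int = 20, delay: float = 3.0) -> bool:
    """Poll Tempo until the trace is queryable (mirrors the CI loop above)."""
    url = f"http://localhost:3200/api/traces/{trace_id}"
    for i in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    print(f"Trace {trace_id} confirmed in Tempo (attempt {i})")
                    return True
        except OSError:  # 404 before ingestion, connection errors
            pass
        print(f"Waiting for Tempo to ingest trace... ({i}/{attempts})")
        time.sleep(delay)
    return False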

Dependencies #

opentelemetry-api==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
apps/otel-verify/requirements.txt — installed into an isolated .venv during each CI run.

LLM Instrumentation Contract #

All LLM agent spans must carry the following attributes to be correctly bucketed by the LLM Token Accounting dashboard and the Tempo span-metric dimensions.

Required span attributes #

| Attribute | Type | Example |
|---|---|---|
| llm.provider | string | openai |
| llm.model | string | gpt-4o |
| llm.request.type | string | chat |
| llm.tokens.prompt | int | 128 |
| llm.tokens.completion | int | 64 |
| llm.tokens.total | int | 192 |
| gen_ai.system | string | openai |
| gen_ai.request.model | string | gpt-4o |

Required metric counters #

Emitted as OTLP counter metrics (no llm_ prefix in the instrument name — the Collector’s Prometheus exporter adds the llm namespace and the _total suffix):

| Instrument name | Prometheus name | Unit |
|---|---|---|
| prompt_tokens | llm_prompt_tokens_total | token |
| completion_tokens | llm_completion_tokens_total | token |
| total_tokens | llm_total_tokens_total | token |

Required metric attributes: llm.model, llm.provider, gen_ai.system.
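
A minimal call-site sketch that satisfies both halves of the contract, using the SDK versions pinned in apps/otel-verify/requirements.txt. It assumes a TracerProvider and MeterProvider already configured to export OTLP to the Collector; the helper name and token values here are illustrative:

from opentelemetry import metrics, trace
tracer = trace.get_tracer("llm-agent")
meter = metrics.get_meter("llm-agent")
# Un-prefixed instrument names: the Collector's Prometheus exporter adds the
# llm namespace and _total suffix (prompt_tokens -> llm_prompt_tokens_total).
prompt_tokens = meter.create_counter("prompt_tokens", unit="token")
completion_tokens = meter.create_counter("completion_tokens", unit="token")
total_tokens = meter.create_counter("total_tokens", unit="token")

def record_llm_call(provider: str, model: str, n_prompt: int, n_completion: int) -> None:
    """Record one completion so the dashboard and span-metric dimensions can bucket it."""
    labels = {"llm.model": model, "llm.provider": provider, "gen_ai.system": provider}
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attributes({
            **labels,
            "llm.request.type": "chat",
            "llm.tokens.prompt": n_prompt,
            "llm.tokens.completion": n_completion,
            "llm.tokens.total": n_prompt + n_completion,
            "gen_ai.request.model": model,
        })
        prompt_tokens.add(n_prompt, labels)
        completion_tokens.add(n_completion, labels)
        total_tokens.add(n_prompt + n_completion, labels)

record_llm_call("openai", "gpt-4o", 128, 64)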

GitHub MCP instrumentation #

The amtp-github-mcp service emits its own OpenTelemetry spans and metrics via the same OTLP HTTP endpoint (http://otel-collector:4318). OTEL_SERVICE_NAME=amtp-github-mcp is set in the container environment so traces and metrics are attributed separately from the LLM agents. Key span names: repo.tree, repo.read_file, github.api, valkey.get, valkey.set. Key metric histograms: mcp.tool.duration, mcp.github_api.duration. These use MCP-specific attributes (mcp.tool, mcp.cache_hit, mcp.exclusion_reason) and are not subject to the LLM instrumentation contract above.

Host Ports & OBS_* Overrides #

Every observability host port is parameterized via an OBS_* environment variable so that port conflicts on a shared dev host can be resolved without code changes — set the relevant variable in GitHub Actions Settings → Environments → dev → Variables.

| Variable | Default | Service | Purpose |
|---|---|---|---|
| OBS_OTEL_GRPC_PORT | 4317 | otel-collector | OTLP gRPC ingestion endpoint |
| OBS_OTEL_HTTP_PORT | 4318 | otel-collector | OTLP HTTP ingestion endpoint |
| OBS_OTEL_METRICS_PORT | 8888 | otel-collector | Collector self-metrics (Prometheus scrape) |
| OBS_OTEL_HEALTH_PORT | 13133 | otel-collector | health_check extension endpoint |
| OBS_OTEL_ZPAGES_PORT | 55679 | otel-collector | zpages debug UI |
| OBS_TEMPO_HTTP_PORT | 3200 | tempo | Tempo HTTP API (Grafana datasource + trace lookup) |
| OBS_TEMPO_GRPC_PORT | 9095 | tempo | Tempo gRPC API |
| OBS_PROMETHEUS_PORT | 9091 | amtp-prometheus | Prometheus HTTP API (maps to container :9090) |
| OBS_GRAFANA_PORT | 3000 | grafana | Grafana UI and API |

Network Dependency (amtp_net) #

The observability compose project attaches to the external Docker network amtp_net, which the base docker-compose.yml stack normally creates. On a pristine host where the base stack has never been brought up, the observability deployment fails because the external network does not exist. Two remediation paths (apply one):

  1. Workflow-level sequencing (recommended). Add needs: healthcheck to the observability job in .github/workflows/ci-cd.yml. This makes the base-stack bring-up a precondition, visible in the workflow graph.
  2. Bootstrap-level creation. Add the following idempotent command to infra/bootstrap.sh so the network pre-exists any CI run:
    docker network create amtp_net --label amtp.owner=bootstrap 2>/dev/null || true
    Add after the Docker Engine installation block in infra/bootstrap.sh.

Until one path is applied, on a pristine host manually run either docker network create amtp_net or docker compose -f docker-compose.yml up -d --wait healthcheck before triggering the observability workflow.

See also: Observability CI/CD workflow and Dev-host bootstrap runbook.