CI/CD & Self-hosted Runner

AMTP uses a dedicated RHEL self-hosted GitHub Actions runner (amtp-dev-runner, label amtp-dev) to execute all CI/CD jobs. A root orchestrator (.github/workflows/ci-cd.yml) fans out to five reusable child workflows via uses: / workflow_call.

Self-hosted Runner #

Attribute | Value
Operating system | RHEL / Rocky / Alma Linux (dnf-based)
Runner name | amtp-dev-runner
Runner label | amtp-dev (custom); self-hosted, Linux, X64 (auto-added)
runs-on value | [self-hosted, Linux, X64, amtp-dev]
Runner type | Persistent, non-ephemeral
Docker availability | Required; installed by infra/bootstrap.sh. All CI jobs use Docker Compose.
Provisioning | One-time: sudo ./infra/bootstrap.sh <runner_user>. See Dev-host bootstrap runbook.

Because the runner is persistent, any files written to the checkout directory persist across job runs unless explicitly deleted. The .env cleanup step documented in Secret-File Hygiene addresses this.

Workflow Topology #

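The root .github/workflows/ci-cd.yml fans out to five reusable child workflows. Four start in parallel; the github-mcp job declares needs: healthcheck and starts only after the healthcheck workflow succeeds.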

ci-cd.yml dispatches the following child workflows via workflow_call (in parallel, except github-mcp-ci-cd.yml, which waits for healthcheck-ci-cd.yml):

db-ci-cd.yml: database validation & migration (jobs: validate, migrate)
healthcheck-ci-cd.yml: image build & live probe (jobs: validate, deploy)
docs-ci-cd.yml: build, index, deploy docs (jobs: validate, deploy)
github-mcp-ci-cd.yml: typecheck, test, build & deploy (jobs: validate, deploy)
observability-ci-cd.yml: OTel stack + trace verify (jobs: validate, deploy)

.github/workflows/ci-cd.yml § jobs. All five children use secrets: inherit.

Trigger paths #

on:
  push:
    branches: [main, dev]
    paths:
      - "migrations/**"
      - "apps/healthcheck/**"
      - "apps/github-mcp/**"
      - "apps/otel-verify/**"
      - "docs/**"
      - "infra/nginx/**"
      - "infra/observability/**"
      - "docker-compose.yml"
      - "docker-compose.observability.yml"
      - "docker-compose.obs-ci.yml"
      - ".github/workflows/ci-cd.yml"
      - ".github/workflows/db-ci-cd.yml"
      - ".github/workflows/healthcheck-ci-cd.yml"
      - ".github/workflows/github-mcp-ci-cd.yml"
      - ".github/workflows/docs-ci-cd.yml"
      - ".github/workflows/observability-ci-cd.yml"
  pull_request:
    branches: [main, dev]
    paths:
      - "migrations/**"
      - "apps/healthcheck/**"
      - "apps/github-mcp/**"
      - "apps/otel-verify/**"
      - "docs/**"
      - "infra/nginx/**"
      - "infra/observability/**"
  workflow_dispatch:
.github/workflows/ci-cd.yml § on
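
One detail of the filters above is easy to miss: the compose files and the workflow files trigger the pipeline on push but not on pull_request. The shell sketch below mirrors the two path lists so that difference can be checked locally. The function names are hypothetical, and the case patterns only approximate GitHub's ** glob matching.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the path filters of ci-cd.yml.
# In a shell `case` pattern, * also matches "/", so "docs/*" approximates "docs/**".

# Succeeds when the given path matches the pull_request path filters.
matches_pr_paths() {
  case "$1" in
    migrations/*|apps/healthcheck/*|apps/github-mcp/*|apps/otel-verify/*) return 0 ;;
    docs/*|infra/nginx/*|infra/observability/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds when the given path matches the push path filters,
# which are the pull_request filters plus compose and workflow files.
matches_push_paths() {
  matches_pr_paths "$1" && return 0
  case "$1" in
    docker-compose.yml|docker-compose.observability.yml|docker-compose.obs-ci.yml) return 0 ;;
    .github/workflows/ci-cd.yml|.github/workflows/db-ci-cd.yml) return 0 ;;
    .github/workflows/healthcheck-ci-cd.yml|.github/workflows/github-mcp-ci-cd.yml) return 0 ;;
    .github/workflows/docs-ci-cd.yml|.github/workflows/observability-ci-cd.yml) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, matches_push_paths docker-compose.yml succeeds while matches_pr_paths docker-compose.yml fails, so a pull request that changes only a compose file would not start the pipeline under these filters.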

Root workflow jobs #

jobs:
  database:
    name: Database CI/CD
    uses: ./.github/workflows/db-ci-cd.yml
    secrets: inherit

  healthcheck:
    name: Healthcheck CI/CD
    uses: ./.github/workflows/healthcheck-ci-cd.yml
    secrets: inherit

  github-mcp:
    name: GitHub MCP CI/CD
    uses: ./.github/workflows/github-mcp-ci-cd.yml
    secrets: inherit
    needs: healthcheck

  docs:
    name: Docs CI/CD
    uses: ./.github/workflows/docs-ci-cd.yml
    secrets: inherit

  observability:
    name: Observability CI/CD
    uses: ./.github/workflows/observability-ci-cd.yml
    secrets: inherit
.github/workflows/ci-cd.yml § jobs

secrets: inherit is mandatory on every child-workflow call (see Required GitHub Secrets & Vars).

Database Workflow (db-ci-cd.yml) #

Two jobs: validate (always runs) and migrate (runs only on dev via workflow_call or workflow_dispatch).

Job: validate #

Spins up an ephemeral Postgres container namespaced with the run ID (COMPOSE_PROJECT_NAME=amtp-ci-${{ github.run_id }}), runs flyway migrate against it, prints migration status, then tears down the stack including all volumes.

- name: Start ephemeral Postgres
  run: docker compose -f docker-compose.yml up -d postgres

- name: Run Flyway migrate (ephemeral)
  run: docker compose -f docker-compose.yml --profile migrate run --rm flyway migrate

- name: Show migration status
  run: docker compose -f docker-compose.yml --profile migrate run --rm flyway info

- name: Tear down ephemeral stack
  if: always()
  run: docker compose -f docker-compose.yml down -v --remove-orphans
.github/workflows/db-ci-cd.yml § validate
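
The run-ID namespacing used by this job can be sketched in shell as follows. The helper name ci_project_name is hypothetical; the validation reflects Docker Compose's rule that a project name may contain only lowercase letters, digits, dashes, and underscores, and must start with a lowercase letter or digit.

```shell
#!/bin/sh
# Hypothetical helper: build a per-run Compose project name and reject
# values that Docker Compose would refuse.
ci_project_name() {
  local LC_ALL=C            # make the [a-z] ranges below ASCII-only
  local name="$1-$2"        # e.g. prefix "amtp-ci" + run ID "123456"
  case "$name" in
    [!a-z0-9]*|*[!a-z0-9_-]*)
      echo "invalid compose project name: $name" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$name"
}

# Usage on a runner (GITHUB_RUN_ID is set by GitHub Actions):
#   COMPOSE_PROJECT_NAME=$(ci_project_name amtp-ci "$GITHUB_RUN_ID")
#   export COMPOSE_PROJECT_NAME
```

Because every run gets its own project name, the down -v teardown removes only the containers and volumes that run created.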

Job: migrate #

Applies validated migrations to the persistent dev environment. Gated to the dev branch. Uses concurrency group amtp-db-dev with cancel-in-progress: false to ensure migrations are never cancelled mid-flight.

Healthcheck Workflow (healthcheck-ci-cd.yml) #

Two jobs: validate (builds the Docker image) and deploy (brings up Valkey + healthcheck on dev and probes /health).

Job: validate #

- name: Build healthcheck image
  run: docker compose -f docker-compose.yml build healthcheck

- name: Tear down
  if: always()
  run: docker compose -f docker-compose.yml down --volumes --remove-orphans
.github/workflows/healthcheck-ci-cd.yml § validate

Job: deploy #

Builds and starts Valkey and the healthcheck service, blocks until both services report healthy, then smoke-tests http://localhost:8083/health.

- name: Build and start Valkey + Healthcheck (block until healthy)
  run: docker compose -f docker-compose.yml up -d --build --wait valkey healthcheck

- name: Smoke-test health endpoint from host
  run: curl -sf http://localhost:8083/health
.github/workflows/healthcheck-ci-cd.yml § deploy

Concurrency group: amtp-healthcheck-dev, cancel-in-progress: false.

GitHub MCP Workflow (github-mcp-ci-cd.yml) #

Two jobs: validate (TypeScript typecheck and unit tests inside a Docker build, followed by a production image build) and deploy (writes GitHub App credentials, brings up the service, and smoke-tests /healthz). The deploy job runs only on the dev branch. In the root orchestrator, the github-mcp job declares needs: healthcheck so that Valkey is healthy before the MCP container starts.

Job: validate #

- name: Create ephemeral .env and dummy key
  run: |
    cat > .env << 'EOF'
    GITHUB_APP_ID=0
    GITHUB_APP_INSTALLATION_ID=0
    EOF
    mkdir -p secrets
    printf '-----BEGIN RSA PRIVATE KEY-----\nMIIE...\n-----END RSA PRIVATE KEY-----\n' \
      > secrets/github_app_key.pem

- name: Typecheck + test (inside Docker)
  run: docker build --target test -t amtp-github-mcp-test:${{ github.run_id }} apps/github-mcp

- name: Build production Docker image
  run: docker compose -f docker-compose.yml build github-mcp

- name: Tear down
  if: always()
  run: |
    docker compose -f docker-compose.yml down --volumes --remove-orphans
    docker rmi amtp-github-mcp-test:${{ github.run_id }} || true
.github/workflows/github-mcp-ci-cd.yml § validate

The --target test stage runs npm run typecheck and then npm test inside the Docker build, so Node.js is not required on the runner.

Job: deploy #

- name: Write .env from secrets
  run: |
    printf '...GITHUB_APP_ID=%s\nGITHUB_APP_INSTALLATION_ID=%s\n' \
      "${{ secrets.GH_APP_ID }}" \
      "${{ secrets.GH_APP_INSTALLATION_ID }}" >> .env

- name: Write GitHub App private key
  run: |
    mkdir -p secrets
    printf '%s' "${{ secrets.GH_APP_PRIVATE_KEY_B64 }}" | base64 -d \
      > secrets/github_app_key.pem
    chmod 644 secrets/github_app_key.pem

- name: Deploy GitHub MCP (wait for healthy)
  run: docker compose -f docker-compose.yml up -d --build --wait github-mcp

- name: Smoke-test health endpoint
  run: curl -sf http://localhost:8090/healthz
.github/workflows/github-mcp-ci-cd.yml § deploy

The PEM file is decoded from a base64-encoded GitHub secret to avoid newline-stripping issues. Concurrency group: amtp-github-mcp-dev, cancel-in-progress: false.
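
The base64 round trip behind GH_APP_PRIVATE_KEY_B64 can be sketched as below. Both function names are hypothetical: the encoder is what someone might run once when creating the secret, and the decoder mirrors the workflow step while adding a PEM header check that the workflow itself is not documented to perform.

```shell
#!/bin/sh
# Hypothetical helper: encode a PEM file as a single base64 line,
# suitable for pasting into a GitHub secret.
encode_pem_secret() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: decode the base64 value to a file and check that
# the result looks like a PEM private key (an added safety check).
decode_pem_secret() {
  local b64="$1" out="$2"
  printf '%s' "$b64" | base64 --decode > "$out" || return 1
  if ! head -n 1 "$out" | grep -q -- '-----BEGIN .*PRIVATE KEY-----'; then
    echo "decoded data is not a PEM private key" >&2
    rm -f "$out"
    return 1
  fi
}
```

The encoded value contains no newlines, so the multi-line key passes through secret storage and shell interpolation as one token and is restored byte for byte by the decode step.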

Docs Workflow (docs-ci-cd.yml) #

Builds the static docs site (node build.js + Pagefind index), validates it by serving it from an ephemeral nginx container, and deploys it to the persistent docs nginx service.

Job: validate #

Runs in an isolated compose project (COMPOSE_PROJECT_NAME=amtp-docs-validate-${{ github.run_id }}) with DOCS_HOST_PORT=8088 so the ephemeral validation container cannot collide with the persistent docs container that owns port 80.

- name: Guard against non-conf files in infra/nginx/
  run: test -z "$(find infra/nginx -maxdepth 1 -type f ! -name '*.conf')"

- name: Install docs build dependencies
  run: npm install
  working-directory: docs/AMTP_Docs_Website

- name: Hydrate diagrams and inject build timestamp
  run: node build.js
  working-directory: docs/AMTP_Docs_Website

- name: Build Pagefind search index
  run: npx pagefind --site . --output-path pagefind
  working-directory: docs/AMTP_Docs_Website

- name: Remove node_modules before serving
  run: rm -rf docs/AMTP_Docs_Website/node_modules

- name: Normalize file permissions for nginx DAC
  run: chmod -R a+rX docs infra/nginx

- name: Start docs (block until healthy)
  run: docker compose -f docker-compose.yml up -d --wait docs

- name: Smoke-test healthz
  run: curl -sf "http://127.0.0.1:${DOCS_HOST_PORT}/healthz"

- name: Smoke-test index page
  run: curl -sf "http://127.0.0.1:${DOCS_HOST_PORT}/" -o /dev/null

- name: Tear down ephemeral stack
  if: always()
  run: docker compose -f docker-compose.yml down --volumes --remove-orphans
.github/workflows/docs-ci-cd.yml § validate

The nginx conf guard (find infra/nginx -maxdepth 1 -type f ! -name '*.conf') catches any non-.conf file accidentally added to infra/nginx/. That directory is bind-mounted over /etc/nginx/conf.d/; nginx will crash on any file that isn’t valid conf syntax.
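
The guard's behavior can be reproduced outside CI with a short shell sketch. The function names are hypothetical; the find expression is the one used in the workflow step, wrapped so that offending files are listed before the check fails.

```shell
#!/bin/sh
# Hypothetical helper: list regular files directly inside DIR whose
# names do not end in .conf (subdirectories are not descended into).
list_non_conf_files() {
  find "$1" -maxdepth 1 -type f ! -name '*.conf'
}

# Hypothetical helper: fail, naming the offenders, when any are found.
guard_conf_dir() {
  local offenders
  offenders=$(list_non_conf_files "$1")
  if [ -n "$offenders" ]; then
    echo "non-.conf files found in $1:" >&2
    echo "$offenders" >&2
    return 1
  fi
}

# Usage: guard_conf_dir infra/nginx
```

Because of -maxdepth 1, files in subdirectories of infra/nginx are not inspected by this check.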

The DAC normalization (chmod -R a+rX docs infra/nginx) ensures the unprivileged nginx worker can read files that a restrictive runner umask may have left at mode 600 or 700, which would otherwise produce 403s.
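
The effect of a+rX is worth spelling out: r adds read permission for everyone, while the capital X adds execute (traverse) permission only to directories and to files that are already executable by someone. The sketch below wraps the same chmod call and adds a small helper for inspecting the result; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the normalization step from the workflow.
normalize_for_nginx() {
  chmod -R a+rX "$@"
}

# Hypothetical helper: print the ten-character mode string of a path,
# for example "-rw-r--r--" or "drwxr-xr-x".
mode_string() {
  ls -ld "$1" | cut -c1-10
}

# Usage: normalize_for_nginx docs infra/nginx
```

Under this rule a directory at mode 700 becomes 755 and a plain file at mode 600 becomes 644, so static files become readable without being marked executable.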

Job: deploy #

Same build steps as validate, then deploys to the persistent docs service on port 80. Concurrency group: amtp-docs-dev, cancel-in-progress: false.

- name: Deploy docs (block until healthy)
  run: docker compose -f docker-compose.yml up -d --wait docs

- name: Smoke-test healthz from host
  run: curl -sf http://localhost/healthz

- name: Smoke-test index page from host
  run: curl -sf http://localhost/ -o /dev/null
.github/workflows/docs-ci-cd.yml § deploy

Observability Workflow (observability-ci-cd.yml) #

Deploys the four-service observability stack (OTel Collector, Tempo, Prometheus, Grafana) and verifies the end-to-end trace pipeline with a synthetic LLM span. See Observability for the full stack reference.

Job: validate #

Uses docker-compose.obs-ci.yml which defines amtp_net as a local network (not external), so each validate run creates its own isolated network and never collides with concurrent runs or the production stack. No host port bindings — all inter-service traffic uses container names.

Job: deploy #

Compose project isolation

Legacy-orphan self-healing

Before deploying, the workflow removes any observability containers left over from a previous run that used the incorrect project name (amtp instead of amtp-obs). Without this step, those containers would hold the host ports the new deploy needs, causing Bind for 0.0.0.0:<port> failed: port is already allocated.
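
One way such a cleanup could be written is sketched below; it is not taken from the workflow. It relies on the com.docker.compose.project and com.docker.compose.service labels that Docker Compose attaches to the containers it creates. The function name is hypothetical, and the service names in the usage line are assumptions based on the four-service stack described earlier.

```shell
#!/bin/sh
# Hypothetical helper: remove containers that belong to LEGACY_PROJECT
# and to one of the named services, leaving that project's other
# containers untouched.
remove_legacy_obs_containers() {
  local legacy_project="$1" svc ids
  shift
  for svc in "$@"; do
    ids=$(docker ps -aq \
      --filter "label=com.docker.compose.project=${legacy_project}" \
      --filter "label=com.docker.compose.service=${svc}")
    if [ -n "$ids" ]; then
      echo "removing legacy ${legacy_project}/${svc} containers: ${ids}"
      # Unquoted on purpose: $ids may hold several container IDs.
      docker rm -f $ids
    fi
  done
}

# Usage (service names are assumed, not confirmed by the compose file):
#   remove_legacy_obs_containers amtp otel-collector tempo prometheus grafana
```

Filtering on the service label as well as the project label matters here, because the legacy project name amtp may also own containers that are not part of the observability stack.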

Rolling update

- name: Rolling update — observability stack (block until all healthy)
  run: >
    docker compose -f docker-compose.observability.yml
    up -d --wait --wait-timeout 180 --remove-orphans
--wait blocks until every service passes its Compose-defined healthcheck, then returns. The 180 s timeout covers cold Grafana and Tempo startup on a loaded runner.

Post-deploy trace verification

After the stack is healthy, CI emits a synthetic LLM trace via apps/otel-verify/verify_trace.py and confirms it appeared in Tempo.

TRACE_ID=$(
  .venv/bin/python verify_trace.py \
    --endpoint "http://localhost:${OBS_OTEL_HTTP_PORT}" \
    2>/dev/null \
  | grep -oE '[0-9a-f]{32}' \
  | head -n 1
)
Post-deploy trace-ID extraction. head -n 1 collapses the duplicate match on the hint line.
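
Because stderr is discarded in the extraction above, a failed verifier run leaves TRACE_ID empty without any message. A guarded variant might look like the sketch below; extract_trace_id is a hypothetical name, and the sample verifier output in the usage comment is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read verifier output on stdin, print the first
# 32-character lowercase hex string, and fail if there is none.
extract_trace_id() {
  local id
  id=$(grep -oE '[0-9a-f]{32}' | head -n 1)
  if [ -z "$id" ]; then
    echo "no 32-hex trace ID found in verifier output" >&2
    return 1
  fi
  printf '%s\n' "$id"
}

# Usage with invented sample output (the real format may differ):
#   printf 'trace_id=%s\nhint: /api/traces/%s\n' "$ID" "$ID" | extract_trace_id
```

With this guard, the job can stop before polling Tempo with an empty trace ID.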
for i in $(seq 1 20); do
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
    "http://localhost:${OBS_TEMPO_HTTP_PORT}/api/traces/${TRACE_ID}" || true)
  if [ "$STATUS" = "200" ]; then
    echo "Trace ${TRACE_ID} confirmed in Tempo (attempt $i)"
    exit 0
  fi
  echo "Waiting for Tempo to ingest trace... ($i/20, status=$STATUS)"
  sleep 3
done
exit 1
Post-deploy Tempo polling. .github/workflows/observability-ci-cd.yml § post-deploy trace verification
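
The polling loop above follows a general retry pattern that could be factored into a reusable helper, sketched here. Both function names are hypothetical. Unlike the loop in the workflow, this version does not sleep after the final failed attempt.

```shell
#!/bin/sh
# Hypothetical helper: run COMMAND up to ATTEMPTS times, sleeping
# DELAY seconds between failed attempts.
# Usage: poll_until ATTEMPTS DELAY COMMAND [ARGS...]
poll_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $i/$attempts"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "gave up after $attempts attempts" >&2
  return 1
}

# Hypothetical probe: succeeds when Tempo answers 200 for the trace.
# Usage: tempo_has_trace BASE_URL TRACE_ID
tempo_has_trace() {
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' "$1/api/traces/$2" || true)
  [ "$status" = "200" ]
}

# Equivalent of the loop above:
#   poll_until 20 3 tempo_has_trace "http://localhost:${OBS_TEMPO_HTTP_PORT}" "$TRACE_ID"
```

Separating the probe from the retry logic lets the same helper wrap other smoke tests, such as the /health and /healthz checks.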

Secret-File Hygiene on the Persistent Runner #

Every job that writes a .env file must include the following step as its final step with if: always():

- name: Purge ephemeral secret files
  if: always()
  shell: bash
  run: |
    rm -f .env
    find . -maxdepth 2 -type f -name '.env.*' -delete || true
    test ! -e .env
Required cleanup step for all jobs that write .env.
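
The reach of this cleanup step can be checked locally with the sketch below, which wraps the same three commands in a hypothetical function that takes the checkout directory as an argument. As written, the step leaves two kinds of file in place: a file named exactly .env inside a subdirectory, and any .env.* file more than two levels below the checkout root.

```shell
#!/bin/sh
# Hypothetical wrapper around the purge step; DIR defaults to the
# current directory, matching the workflow's behavior.
purge_env_files() {
  local dir="${1:-.}"
  rm -f "$dir/.env"
  # Depth 1 is DIR/<file>, depth 2 is DIR/<subdir>/<file>.
  find "$dir" -maxdepth 2 -type f -name '.env.*' -delete || true
  # Final assertion: the top-level .env must be gone.
  test ! -e "$dir/.env"
}
```

A job that writes secret files to other locations would need its own explicit cleanup for those paths.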

Additional hardening recommendations #

Required GitHub Secrets & Vars #

Core secrets (all workflows) #

Secret name | Used in | Description
POSTGRES_PASSWORD | db-ci-cd.yml, healthcheck-ci-cd.yml, observability-ci-cd.yml | PostgreSQL superuser password.
VALKEY_PASSWORD | healthcheck-ci-cd.yml (deploy), observability-ci-cd.yml | Valkey requirepass value.
GH_APP_ID | github-mcp-ci-cd.yml (deploy) | GitHub App numeric ID. The GITHUB_ prefix is reserved by Actions; secret must be named GH_APP_ID.
GH_APP_INSTALLATION_ID | github-mcp-ci-cd.yml (deploy) | GitHub App installation ID for the target org/user.
GH_APP_PRIVATE_KEY_B64 | github-mcp-ci-cd.yml (deploy) | Base64-encoded PEM private key for the GitHub App. Decoded on-runner to secrets/github_app_key.pem and mounted as a Docker secret.

Grafana secrets #

Secret name | Description
GF_SECURITY_ADMIN_PASSWORD | Grafana admin password. Must change from default changeme.
GF_SECURITY_SECRET_KEY | 32-char secret key for Grafana session signing.
GF_AUTH_GENERIC_OAUTH_CLIENT_ID | OAuth 2.0 / OIDC client ID (required if SSO enabled).
GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET | OAuth 2.0 / OIDC client secret.
GF_SMTP_USER | SMTP authentication username (required if SMTP enabled).
GF_SMTP_PASSWORD | SMTP authentication password.
SLACK_WEBHOOK_URL | Incoming Webhook URL for the #llm-alerts Slack channel. Also set as the literal value in CHANGEME_SLACK_WEBHOOK_URL inside alerting.yaml before container start.

GitHub Actions vars.* (non-secret) #

Variable | Default | Description
OBS_OTEL_GRPC_PORT | 4317 | OTel Collector OTLP gRPC host port
OBS_OTEL_HTTP_PORT | 4318 | OTel Collector OTLP HTTP host port
OBS_OTEL_METRICS_PORT | 8888 | OTel Collector self-metrics host port
OBS_OTEL_HEALTH_PORT | 13133 | OTel health_check extension host port
OBS_OTEL_ZPAGES_PORT | 55679 | OTel zpages debug UI host port
OBS_TEMPO_HTTP_PORT | 3200 | Tempo HTTP API host port
OBS_TEMPO_GRPC_PORT | 9095 | Tempo gRPC API host port
OBS_PROMETHEUS_PORT | 9091 | Prometheus HTTP API host port (maps to container :9090)
OBS_GRAFANA_PORT | 3000 | Grafana UI and API host port
GF_AUTH_GENERIC_OAUTH_ENABLED | false | Enable generic OAuth SSO
GF_SMTP_ENABLED | false | Enable SMTP for email alerts
ALERT_EMAIL_ADDRESSES | (none) | Comma-separated list of alert recipients (also set in alerting.yaml)

All secrets are scoped to the dev GitHub environment (Settings → Environments → dev). Non-secret variables that may conflict with local port allocations are set under Settings → Environments → dev → Variables.
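
When overriding any of the port variables, it can help to confirm that no two of them resolve to the same host port. The sketch below applies the defaults from the table and reports duplicates. Both function names are hypothetical, and the check only compares the variables with each other; it does not detect ports already in use on the host.

```shell
#!/bin/sh
# Hypothetical helper: fill in the documented default for every port
# variable that is unset or empty.
apply_obs_port_defaults() {
  : "${OBS_OTEL_GRPC_PORT:=4317}"
  : "${OBS_OTEL_HTTP_PORT:=4318}"
  : "${OBS_OTEL_METRICS_PORT:=8888}"
  : "${OBS_OTEL_HEALTH_PORT:=13133}"
  : "${OBS_OTEL_ZPAGES_PORT:=55679}"
  : "${OBS_TEMPO_HTTP_PORT:=3200}"
  : "${OBS_TEMPO_GRPC_PORT:=9095}"
  : "${OBS_PROMETHEUS_PORT:=9091}"
  : "${OBS_GRAFANA_PORT:=3000}"
}

# Hypothetical helper: print each port value assigned to more than one
# variable; prints nothing when all nine ports are distinct.
duplicate_obs_ports() {
  apply_obs_port_defaults
  printf '%s\n' \
    "$OBS_OTEL_GRPC_PORT" "$OBS_OTEL_HTTP_PORT" "$OBS_OTEL_METRICS_PORT" \
    "$OBS_OTEL_HEALTH_PORT" "$OBS_OTEL_ZPAGES_PORT" "$OBS_TEMPO_HTTP_PORT" \
    "$OBS_TEMPO_GRPC_PORT" "$OBS_PROMETHEUS_PORT" "$OBS_GRAFANA_PORT" \
    | sort | uniq -d
}
```

For instance, setting OBS_GRAFANA_PORT=3200 while leaving the Tempo default in place makes duplicate_obs_ports print 3200.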