Deployment Runbook

This runbook covers local and self-hosted deployment of the AMTP infrastructure stack. All commands use Docker Compose. Secrets are managed via a .env file that must never be committed to the repository.

Dev-host Bootstrap Runbook #

A fresh RHEL / Rocky / Alma Linux dev host requires a one-time bootstrap before any CI/CD job can run. The infra/bootstrap.sh script handles Docker CE + Compose installation, firewall ports, and runner-user Docker group membership.
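
After the script finishes, its effects can be spot-checked by hand. The following is a minimal sketch and not part of infra/bootstrap.sh. The helper names user_in_group and ports_open are hypothetical, and the sketch assumes firewall-cmd --list-ports prints space-separated entries such as 80/tcp.

```shell
# Hypothetical helpers for checking the bootstrap results by hand.

# Succeeds if user $1 is a member of group $2.
user_in_group() {
  id -nG "$1" | tr ' ' '\n' | grep -Fqx "$2"
}

# Reads a port list (e.g. the output of `firewall-cmd --list-ports`) on stdin
# and succeeds only if every port given as an argument is listed as <port>/tcp.
ports_open() {
  local listed port
  listed=$(tr ' ' '\n')
  for port in "$@"; do
    printf '%s\n' "$listed" | grep -Fqx "${port}/tcp" || return 1
  done
}
```

Possible usage: user_in_group <runner_user> docker, followed by sudo firewall-cmd --list-ports | ports_open 80 8083. Both commands exit non-zero if the expected state is missing.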

Step-by-step #

  1. Run bootstrap script.
    sudo ./infra/bootstrap.sh <runner_user>
    Installs Docker CE + Compose, opens firewalld ports 80 and 8083, adds <runner_user> to the docker group, and restarts the runner service.
  2. Register the self-hosted runner. In the GitHub repository, go to Settings → Actions → Runners → New self-hosted runner. Follow the on-screen instructions to download and configure the runner with label amtp-dev.
  3. Configure secrets. Add all secrets listed in CI/CD → Required GitHub Secrets & Vars to Settings → Environments → dev.
  4. Create .env. On the runner host, copy .env.example to .env and fill in production values. See Creating .env below.
  5. Provision GitHub App PEM secret. Base64-encode your GitHub App private key and add it as a GitHub Actions secret named GH_APP_PRIVATE_KEY_B64 in Settings → Environments → dev. Also add GH_APP_ID and GH_APP_INSTALLATION_ID. On the runner host, the deploy job decodes the key to secrets/github_app_key.pem automatically.
  6. Bring up the main stack first. Before the observability stack can deploy, amtp_net must exist. Run:
    docker compose up -d --wait postgres pgbouncer valkey healthcheck docs
    Creates amtp_net and brings all main-stack services to healthy. The observability CI/CD workflow can now attach to the network.
  7. Trigger CI. Push to dev or run Actions → CI/CD → Run workflow. All five child workflows will execute. On first run, the observability workflow will deploy Grafana, Tempo, Prometheus, and the OTel Collector, then run the post-deploy trace verification.
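
Steps 5 and 6 can be sketched as two small helpers. This is an illustration under assumptions: the function names are hypothetical, and the key path passed in is whichever private key file was downloaded for the GitHub App. The network check relies only on docker network inspect exiting non-zero when the network is missing.

```shell
# Hypothetical helper: print a PEM file as single-line base64, suitable for
# pasting into the GH_APP_PRIVATE_KEY_B64 secret.
encode_pem_b64() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: succeeds if the Docker network named $1
# (default: amtp_net) exists.
network_exists() {
  docker network inspect "${1:-amtp_net}" > /dev/null 2>&1
}
```

Stripping newlines keeps the encoded key on one line, which avoids wrapping differences between base64 implementations. Running network_exists before triggering the observability workflow gives an early signal that step 6 was skipped.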

Prerequisites #

Creating .env #

cp .env.example .env
chmod 600 .env
# Edit .env and replace all placeholder values:
#   POSTGRES_PASSWORD=change-me  ->  a strong random password
#   VALKEY_PASSWORD=change-me    ->  a strong random password
Template: .env.example
POSTGRES_DB=amtp
POSTGRES_USER=amtp
POSTGRES_PASSWORD=change-me
COMPOSE_PROJECT_NAME=amtp
VALKEY_HOST=valkey
VALKEY_PORT=6379
VALKEY_PASSWORD=change-me
DOCS_HOST_PORT=80

# GitHub MCP server
GITHUB_APP_ID=
GITHUB_APP_INSTALLATION_ID=
# PEM key is mounted via Docker secret: ./secrets/github_app_key.pem
.env.example: every placeholder value must be replaced before first deployment.
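
Before the first deployment it can help to confirm that no placeholder survived the edit. A minimal sketch, assuming placeholders are exactly the change-me strings shown in the template. gen_password and env_has_placeholders are hypothetical helper names.

```shell
# Hypothetical helper: print a random 48-character hex string,
# usable as a password.
gen_password() {
  head -c 24 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: succeeds if the given env file still contains
# a change-me placeholder.
env_has_placeholders() {
  grep -q '=change-me$' "$1"
}
```

Hex output avoids characters that would need quoting in .env or in a connection URL. A guard such as env_has_placeholders .env && echo "replace placeholders first" can run before docker compose up.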

Bring-up Order #

Services must be started in the following order. PgBouncer depends on Postgres being healthy, and the healthcheck service depends on Valkey being healthy; the depends_on: condition: service_healthy clauses in docker-compose.yml enforce this automatically.

Step 1 — Start Postgres #

docker compose up -d postgres

Wait for the healthcheck to pass (up to 100 seconds: 20 retries × 5 s interval):

docker compose ps postgres   # look for "(healthy)" in the STATUS column
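
The manual check can be replaced by a polling loop. A sketch, assuming the service defines a healthcheck. wait_healthy is a hypothetical helper, and its defaults mirror the 20 retries at a 5 s interval mentioned above.

```shell
# Hypothetical helper: poll a Compose service until Docker reports it healthy.
# Usage: wait_healthy <service> [retries] [interval_seconds]
wait_healthy() {
  local service retries interval cid status i
  service=$1
  retries=${2:-20}
  interval=${3:-5}
  i=0
  while [ "$i" -lt "$retries" ]; do
    # Resolve the container ID, then read its health status.
    cid=$(docker compose ps -q "$service")
    status=$(docker inspect --format '{{.State.Health.Status}}' "$cid" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      echo "$service is healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$service not healthy after $retries checks" >&2
  return 1
}
```

Calling wait_healthy postgres blocks until the container is healthy or the retries run out. Passing --wait to docker compose up -d is a built-in alternative when no custom retry logic is needed.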

Step 2 — Run Flyway migrations #

The flyway service uses the opt-in migrate profile and runs as a one-shot container. It will exit 0 on success.

docker compose --profile migrate run --rm flyway migrate
# Verify migration state:
docker compose --profile migrate run --rm flyway info

Expected output: all six migrations (V1–V6) listed with state Success. If any migration shows Failed, do not proceed. Review the Flyway logs and consult Data Model for the offending migration.
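
The check for failed or pending migrations can be automated. This sketch assumes flyway info prints a pipe-delimited table with a State column. migrations_clean is a hypothetical helper, and the pattern may need adjusting for your Flyway version.

```shell
# Hypothetical helper: read `flyway info` output on stdin and succeed only if
# no table cell reports a Failed or Pending state.
migrations_clean() {
  ! grep -Eq '\|[[:space:]]*(Failed|Pending)[[:space:]]*\|'
}
```

Possible usage: docker compose --profile migrate run -T --rm flyway info | migrations_clean || echo "do not proceed". The -T flag disables pseudo-TTY allocation so the output pipes cleanly.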

Step 3 — Start PgBouncer #

docker compose up -d pgbouncer

PgBouncer waits for Postgres to be healthy before starting (depends_on: condition: service_healthy).

Step 4 — Start Valkey #

docker compose up -d valkey

Wait for the Valkey healthcheck to pass:

docker compose ps valkey   # look for "(healthy)"

Step 5 — Start Healthcheck service #

docker compose up -d --build healthcheck

The healthcheck service depends on Valkey being healthy before starting.

One-shot full stack (after first migration) #

Once migrations have been applied at least once, the entire stack (excluding Flyway) can be brought up in one command:

docker compose up -d postgres pgbouncer valkey healthcheck

Verification #

Healthcheck probe #

curl -sf http://localhost:8083/health

Expected response (200 OK):

{ "status": "ok", "valkey": "up", "latency_ms": 0.42 }

If the response is 503, check the Valkey logs: docker compose logs valkey. With the current PING-only implementation, "valkey": "down" in the body indicates a connectivity failure. Once the three-step probe is implemented, "probe": "write" in the error body will indicate an out-of-memory (OOM) condition.
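
These triage rules can be captured in a small classifier. A sketch under assumptions: classify_health is a hypothetical helper, the field names are the ones described in this section, and the matching is plain text rather than real JSON parsing.

```shell
# Hypothetical helper: classify a /health response from its HTTP status code
# and body. Prints one of: ok, oom-suspected, valkey-unreachable, unknown.
classify_health() {
  local code body
  code=$1
  body=$2
  if [ "$code" = "200" ]; then
    echo ok
  elif printf '%s' "$body" | grep -Eq '"probe"[[:space:]]*:[[:space:]]*"write"'; then
    echo oom-suspected
  elif printf '%s' "$body" | grep -Eq '"valkey"[[:space:]]*:[[:space:]]*"down"'; then
    echo valkey-unreachable
  else
    echo unknown
  fi
}
```

The inputs can be captured with curl, for example code=$(curl -s -o /tmp/health.json -w '%{http_code}' http://localhost:8083/health), then classify_health "$code" "$(cat /tmp/health.json)".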

Postgres connectivity #

# Via PgBouncer (applications use this path)
psql "postgresql://amtp:<password>@localhost:6432/amtp" -c "SELECT version();"

# Direct to Postgres (requires a Compose override file that publishes port 5432)
psql "postgresql://amtp:<password>@localhost:5432/amtp" -c "\dt"
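
To avoid typing the password into a connection string, it can be read from .env and passed through the PGPASSWORD environment variable, which psql honours. A sketch: env_get and check_pgbouncer are hypothetical helpers, and env_get assumes simple unquoted KEY=value lines like those in .env.example.

```shell
# Hypothetical helper: print the value of KEY from an env file.
# Usage: env_get <KEY> [file]
env_get() {
  sed -n "s/^$1=//p" "${2:-.env}" | tail -n 1
}

# Hypothetical helper: run a trivial query through PgBouncer as the amtp user.
# Usage: check_pgbouncer [env_file]
check_pgbouncer() {
  PGPASSWORD=$(env_get POSTGRES_PASSWORD "${1:-.env}") \
    psql -h localhost -p 6432 -U amtp -d amtp -c 'SELECT 1;'
}
```

This keeps the password out of shell history and out of the process list, where a full connection URL would be visible.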

Valkey connectivity #

redis-cli -h localhost -p 6379 -a "<password>" PING
# Expected: PONG

redis-cli -h localhost -p 6379 -a "<password>" SET test_key "1" EX 5
redis-cli -h localhost -p 6379 -a "<password>" GET test_key
# Expected: 1
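
The same round trip can be scripted without placing the password on the command line, since redis-cli reads it from the REDISCLI_AUTH environment variable. A sketch: valkey_roundtrip and the VALKEY_CLI override are hypothetical names.

```shell
# Hypothetical helper: write a short-lived key and read it back.
# Expects the password in REDISCLI_AUTH; VALKEY_CLI overrides the client binary.
valkey_roundtrip() {
  local cli got
  cli=${VALKEY_CLI:-redis-cli}
  "$cli" -h localhost -p 6379 SET test_key 1 EX 5 > /dev/null || return 1
  got=$("$cli" -h localhost -p 6379 GET test_key)
  [ "$got" = "1" ]
}
```

With REDISCLI_AUTH exported, valkey_roundtrip && echo "valkey ok" confirms both write and read access. The key expires after 5 seconds, so no cleanup is needed.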

Migration status #

docker compose --profile migrate run --rm flyway info

All six migrations should show Success. No pending migrations should remain.

Running Migrations #

Flyway is the sole migration mechanism. Never apply schema changes manually via psql. All changes must go through versioned SQL files in migrations/sql/.

Apply new migrations #

docker compose --profile migrate run --rm flyway migrate

Check current state #

docker compose --profile migrate run --rm flyway info

Validate checksums #

docker compose --profile migrate run --rm flyway validate

Validation fails if any previously-applied migration file has been modified. Do not alter files in migrations/sql/ after they have been applied to any environment.
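
A check of this kind can also run before Flyway does, for example as a pre-merge step. The sketch below is an assumption layered on top of this runbook: it reads git diff --name-status output and rejects anything other than newly added files under migrations/sql/. The helper name is hypothetical.

```shell
# Hypothetical helper: read `git diff --name-status` lines on stdin and succeed
# only if every change under migrations/sql/ is an addition (status A).
only_new_migrations() {
  local status path rest bad
  bad=0
  while read -r status path rest; do
    case "$path $rest" in
      *migrations/sql/*)
        if [ "$status" != "A" ]; then
          echo "existing migration changed: $status $path" >&2
          bad=1
        fi ;;
    esac
  done
  return "$bad"
}
```

Possible usage, with origin/main standing in for whichever branch reflects the applied state: git diff --name-status origin/main -- migrations/sql/ | only_new_migrations.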

Rollback & Teardown #

Stop all services (preserve data) #

docker compose down

Stop and remove all data (full reset) #

docker compose down -v --remove-orphans

Database schema rollback #

Flyway Community Edition does not support undo migrations. The rollback procedure for a failed migration is:

  1. Stop all application services.
  2. Manually restore from the last known-good Postgres backup.
  3. Redeploy from the last known-good migration version.
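
The runbook does not prescribe a backup tool. As one possible approach, a custom-format dump can be taken before each migration and restored in step 2. A sketch under assumptions: the function names and the dump path are hypothetical, and pg_dump and pg_restore are assumed to be available inside the postgres container and able to connect as amtp without a password prompt.

```shell
# Hypothetical helper: dump the amtp database to a custom-format archive.
# Usage: pg_backup <output_file>
pg_backup() {
  docker compose exec -T postgres pg_dump -U amtp -Fc amtp > "$1"
}

# Hypothetical helper: restore an archive, dropping existing objects first.
# Usage: pg_restore_backup <archive_file>
pg_restore_backup() {
  docker compose exec -T postgres \
    pg_restore -U amtp -d amtp --clean --if-exists < "$1"
}
```

The -T flag disables pseudo-TTY allocation so the binary archive passes through the pipe unmodified. Stop PgBouncer and the application services before restoring so that no client holds open connections.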

Secret Rotation #

Rotating POSTGRES_PASSWORD #

  1. Connect to Postgres directly:
    ALTER USER amtp WITH PASSWORD '<new-password>';
  2. Update POSTGRES_PASSWORD in .env and in GitHub repository secrets.
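  3. Recreate PgBouncer so it picks up the new password. A plain docker compose restart reuses the existing container and does not re-read .env:
    docker compose up -d --force-recreate pgbouncer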
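
Step 2 can be scripted so that the file keeps its restrictive permissions. A sketch: set_env_value is a hypothetical helper that rewrites a single KEY=value line. It appends the key at the end of the file rather than preserving its position.

```shell
# Hypothetical helper: set KEY to VALUE in an env file, replacing any
# existing line for that key. Usage: set_env_value <KEY> <VALUE> [file]
set_env_value() {
  local key value file tmp
  key=$1
  value=$2
  file=${3:-.env}
  tmp=$(mktemp) || return 1
  # Copy every line except the old assignment; grep exits 1 if nothing is left.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  chmod 600 "$tmp"
  mv "$tmp" "$file"
}
```

Writing to a temporary file and moving it into place avoids leaving a half-written .env behind if the command is interrupted.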

Rotating VALKEY_PASSWORD #

  1. Update VALKEY_PASSWORD in .env and in GitHub repository secrets.
  2. Restart Valkey and the healthcheck service:
    docker compose up -d --force-recreate valkey healthcheck
  3. Note: Valkey is ephemeral (no persistence). All cached data is lost on restart. MCP caches will be rebuilt on next use. Rate-limit state resets.

Pinned image tags #

All images in docker-compose.yml use pinned version tags rather than latest (postgres:15-alpine, edoburu/pgbouncer:1.22.1, flyway/flyway:10-alpine, valkey/valkey:8.0). The granularity differs: some tags pin only a major version, others a minor or patch release. Before updating a tag, review the upstream changelog for breaking configuration changes, then update docker-compose.yml and re-run the full bring-up and verification sequence in a non-production environment first.
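
A quick audit can confirm that nothing in the stack floats on latest. A sketch: unpinned_images is a hypothetical helper that reads image references, one per line, such as the output of docker compose config --images.

```shell
# Hypothetical helper: read image references on stdin and print those that
# have no tag or use :latest. Succeeds only if nothing was printed.
unpinned_images() {
  local image name found
  found=0
  while read -r image; do
    [ -n "$image" ] || continue
    name=${image##*/}                 # drop registry and namespace
    case "$name" in
      *@*) ;;                         # pinned by digest
      *:latest) echo "$image"; found=1 ;;
      *:*) ;;                         # explicit tag
      *) echo "$image"; found=1 ;;    # no tag, defaults to latest
    esac
  done
  [ "$found" -eq 0 ]
}
```

Possible usage: docker compose config --images | unpinned_images. Locally built services may appear without a tag and would be reported as well, so treat the output as a prompt for review rather than a hard failure.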