Deployment Runbook
This runbook covers local and self-hosted deployment of the AMTP infrastructure stack. All commands use Docker Compose. Secrets are managed via a .env file that must never be committed to the repository.
Dev-host Bootstrap Runbook
A fresh RHEL / Rocky / Alma Linux dev host requires a one-time bootstrap before any CI/CD job can run. The infra/bootstrap.sh script handles Docker CE + Compose installation, firewall ports, and runner-user Docker group membership.
Step-by-step
- Run the bootstrap script.
  sudo ./infra/bootstrap.sh <runner_user>
  Installs Docker CE + Compose, opens firewalld ports 80 and 8083, adds <runner_user> to the docker group, and restarts the runner service.
- Register the self-hosted runner. In the GitHub repository, go to Settings → Actions → Runners → New self-hosted runner. Follow the on-screen instructions to download and configure the runner with the label amtp-dev.
- Configure secrets. Add all secrets listed in CI/CD → Required GitHub Secrets & Vars to Settings → Environments → dev.
- Create .env. On the runner host, copy .env.example to .env and fill in production values. See Creating .env below.
- Provision the GitHub App PEM secret. Base64-encode your GitHub App private key and add it as a GitHub Actions secret named GH_APP_PRIVATE_KEY_B64 in Settings → Environments → dev. Also add GH_APP_ID and GH_APP_INSTALLATION_ID. On the runner host, the deploy job decodes the key to secrets/github_app_key.pem automatically.
- Bring up the main stack first. The observability stack cannot deploy until the amtp_net network exists. Run:
  docker compose up -d --wait postgres pgbouncer valkey healthcheck docs
  This creates amtp_net and brings all main-stack services to healthy, so the observability CI/CD workflow can attach to the network.
- Trigger CI. Push to dev or run Actions → CI/CD → Run workflow. All five child workflows will execute. On the first run, the observability workflow deploys Grafana, Tempo, Prometheus, and the OTel Collector, then runs the post-deploy trace verification.
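The PEM round trip in the bootstrap steps can be sketched with two small shell helpers. This is an illustration under assumptions, not the deploy job's actual code: encode_pem and decode_pem are hypothetical names, and the decode side assumes the job sees the secret as an environment variable named GH_APP_PRIVATE_KEY_B64.

```shell
# Sketch: PEM <-> base64 helpers for the GH_APP_PRIVATE_KEY_B64 secret.
# encode_pem and decode_pem are hypothetical names, not part of the repo.

# Print PEM file $1 as one unwrapped base64 line (tr strips the line
# wrapping that GNU base64 adds), ready to paste into the GitHub secret.
encode_pem() {
  base64 < "$1" | tr -d '\n'
}

# Decode base64 text $1 into file $2, creating the parent directory
# (e.g. secrets/) if needed. umask 077 makes a newly created file 0600.
decode_pem() {
  mkdir -p "$(dirname "$2")" || return 1
  (umask 077 && printf '%s' "$1" | base64 -d > "$2")
}
```

For example, encode_pem path/to/app-key.pem prints the single-line value to paste into the secret, and decode_pem "$GH_APP_PRIVATE_KEY_B64" secrets/github_app_key.pem writes the key back out. Stripping newlines matters because GNU base64 wraps its output at 76 characters, which is awkward in a single-line secret field.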
Prerequisites
- Docker Engine ≥ 24 and Docker Compose v2 installed.
- Repository checked out at the target commit.
- .env file created from .env.example with production values (see below).
- Outbound network access to Docker Hub for image pulls (on first run).
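A quick way to confirm the first two prerequisites on a host is sketched below, assuming a POSIX shell with awk. version_ge and check_prereqs are hypothetical helpers; the Docker commands they call (docker version --format and docker compose version --short) are standard CLI options.

```shell
# Sketch: check the Docker prerequisites above.
# version_ge and check_prereqs are hypothetical helper names.

# Succeed if dotted version $1 >= $2, comparing numeric fields left to
# right; missing fields count as 0, so 24.0.7 >= 24.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Require Docker Engine >= 24 and the Compose v2 plugin.
check_prereqs() {
  local engine compose
  engine=$(docker version --format '{{.Server.Version}}') || return 1
  compose=$(docker compose version --short) || {
    echo "Docker Compose v2 plugin not found" >&2; return 1; }
  version_ge "$engine" 24 || {
    echo "Docker Engine $engine is older than 24" >&2; return 1; }
  # Some builds print a leading "v"; strip it before comparing.
  version_ge "${compose#v}" 2 || {
    echo "Docker Compose $compose is not v2" >&2; return 1; }
}
```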
Creating .env
cp .env.example .env
chmod 600 .env
# Edit .env and replace all placeholder values:
# POSTGRES_PASSWORD=change-me -> a strong random password
# VALKEY_PASSWORD=change-me -> a strong random password
POSTGRES_DB=amtp
POSTGRES_USER=amtp
POSTGRES_PASSWORD=change-me
COMPOSE_PROJECT_NAME=amtp
VALKEY_HOST=valkey
VALKEY_PORT=6379
VALKEY_PASSWORD=change-me
DOCS_HOST_PORT=80
# GitHub MCP server
GITHUB_APP_ID=
GITHUB_APP_INSTALLATION_ID=
# PEM key is mounted via Docker secret: ./secrets/github_app_key.pem
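Filling the two placeholder passwords and catching any that were missed can be scripted. The helpers below are a hypothetical sketch; they assume GNU sed (the default on RHEL-family hosts) and generate 32-character alphanumeric passwords from /dev/urandom.

```shell
# Sketch: fill and sanity-check .env. gen_password, fill_passwords and
# check_env are hypothetical helpers; sed -i assumes GNU sed.

# Print a random alphanumeric password of length $1 (default 32).
gen_password() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom 2>/dev/null | head -c "${1:-32}"
}

# Replace the change-me placeholders for both passwords in file $1.
# Alphanumeric values contain no sed metacharacters, so inline
# substitution is safe here.
fill_passwords() {
  local file="${1:-.env}" key
  for key in POSTGRES_PASSWORD VALKEY_PASSWORD; do
    sed -i "s/^${key}=change-me\$/${key}=$(gen_password)/" "$file" || return 1
  done
}

# Fail (printing offending lines) if any value is still change-me.
check_env() {
  local file="${1:-.env}"
  [ -f "$file" ] || { echo "$file not found" >&2; return 1; }
  if grep -nE '^[A-Z_]+=change-me$' "$file"; then
    echo "replace the placeholder values above before deploying" >&2
    return 1
  fi
}
```

On the runner host: cp .env.example .env, chmod 600 .env, then fill_passwords .env && check_env .env. check_env can also serve as a pre-deploy guard, since it exits non-zero while any change-me value remains.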
Bring-up Order
Services must be started in the following order. PgBouncer depends on Postgres being healthy, and the healthcheck service depends on Valkey being healthy; the depends_on: condition: service_healthy clauses in docker-compose.yml enforce these dependencies automatically.
Step 1 — Start Postgres
docker compose up -d postgres
Wait for the healthcheck to pass (up to 100 seconds: 20 retries × 5 s interval):
docker compose ps postgres # look for "(healthy)" in the STATUS column
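Watching for "(healthy)" works interactively; for scripts, a polling loop like the one below is one option. wait_until and is_healthy are hypothetical helpers, and the loop defaults mirror the 20 × 5 s healthcheck budget above. is_healthy reads the container's health status with docker inspect.

```shell
# Sketch: poll until a Compose service's container reports "healthy".
# wait_until and is_healthy are hypothetical helpers.

# Run "$@" until it succeeds, at most $RETRIES times (default 20),
# sleeping $INTERVAL seconds (default 5) between attempts.
wait_until() {
  local retries="${RETRIES:-20}" interval="${INTERVAL:-5}" i=1
  while [ "$i" -le "$retries" ]; do
    "$@" && return 0
    [ "$i" -lt "$retries" ] && sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Succeed when the container for service $1 has health status "healthy".
is_healthy() {
  local id
  id=$(docker compose ps -q "$1") && [ -n "$id" ] &&
    [ "$(docker inspect --format '{{.State.Health.Status}}' "$id")" = healthy ]
}
```

For example: wait_until is_healthy postgres. Alternatively, docker compose up -d --wait postgres, the flag used in the bootstrap steps, blocks until the service is healthy without a custom loop.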
Step 2 — Run Flyway migrations
The flyway service uses the opt-in migrate profile and runs as a one-shot container. It exits with status 0 on success.
docker compose --profile migrate run --rm flyway migrate
# Verify migration state:
docker compose --profile migrate run --rm flyway info
Expected output: all six migrations (V1–V6) listed with state Success. If any migration shows Failed, do not proceed. Review the Flyway logs and consult the Data Model page for the offending migration.
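The expected-output check can be automated so a pipeline stops on Failed or Pending rows. This sketch assumes Flyway's default table output for flyway info, where each migration row carries a State cell such as Success, Pending, or Failed; check_migrations is a hypothetical helper, and grep matching on the table is a heuristic rather than a parser.

```shell
# Sketch: gate bring-up on "flyway info" output (default table format
# assumed). check_migrations is a hypothetical helper.

# Read "flyway info" output on stdin. Fail if any row is Failed or
# Pending, or if fewer than $1 rows report Success.
check_migrations() {
  local expected="$1" out ok
  out=$(cat)
  # Print offending rows to stderr; grep succeeds only if it found any.
  if printf '%s\n' "$out" | grep -E '\|[[:space:]]*(Failed|Pending)[[:space:]]*\|' >&2; then
    echo "failed or pending migrations found (listed above)" >&2
    return 1
  fi
  ok=$(printf '%s\n' "$out" | grep -cE '\|[[:space:]]*Success[[:space:]]*\|' || true)
  [ "$ok" -ge "$expected" ]
}
```

For example: docker compose --profile migrate run --rm -T flyway info | check_migrations 6. The -T flag disables pseudo-TTY allocation so the piped output is plain text.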
Step 3 — Start PgBouncer
docker compose up -d pgbouncer
PgBouncer waits for Postgres to be healthy before starting (depends_on: condition: service_healthy).
Step 4 — Start Valkey
docker compose up -d valkey
Wait for the Valkey healthcheck to pass:
docker compose ps valkey # look for "(healthy)"
Step 5 — Start Healthcheck service
docker compose up -d --build healthcheck
The healthcheck service depends on Valkey being healthy before starting.
One-shot full stack (after first migration)
Once migrations have been applied at least once, the entire stack (excluding Flyway) can be brought up in one command:
docker compose up -d postgres pgbouncer valkey healthcheck
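For repeat deployments, Steps 1 to 5 can be wrapped in one function. This is a hypothetical sketch (bring_up and the MIGRATE toggle are not part of the repository) that replaces manual polling with docker compose up --wait, the same flag used in the bootstrap steps; --wait returns once the named services are running, or healthy where they define a healthcheck.

```shell
# Sketch: the bring-up order above as one function.
# bring_up and the MIGRATE toggle are hypothetical.

bring_up() {
  # Step 1: Postgres first, waiting for its healthcheck.
  docker compose up -d --wait postgres || return 1
  # Step 2: apply pending Flyway migrations unless MIGRATE=0.
  if [ "${MIGRATE:-1}" = 1 ]; then
    docker compose --profile migrate run --rm flyway migrate || return 1
  fi
  # Steps 3 and 4: PgBouncer and Valkey.
  docker compose up -d --wait pgbouncer valkey || return 1
  # Step 5: rebuild and start the healthcheck service.
  docker compose up -d --wait --build healthcheck
}
```

Setting MIGRATE=0 skips the Flyway step, which corresponds to the one-shot command above for hosts where migrations are already applied.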
Verification
Healthcheck probe
curl -sf http://localhost:8083/health
Expected response (200 OK):
{ "status": "ok", "valkey": "up", "latency_ms": 0.42 }
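If the probe returns 503, curl -sf prints nothing and exits non-zero, because -f suppresses the response body on HTTP errors. Rerun without -f (curl -s http://localhost:8083/health) to see the error body, and check the Valkey logs with docker compose logs valkey. With the current PING-only implementation, a "valkey": "down" response indicates a connectivity failure. Once the three-step probe is implemented, "probe": "write" in the error body will indicate an OOM condition.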
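Mapping the probe result to the cases described above can be sketched as follows. classify_health and probe_health are hypothetical helpers; the body checks use grep rather than a JSON parser, and the "probe": "write" branch only becomes relevant once the three-step probe exists. probe_health deliberately calls curl without -f so the 503 body is kept.

```shell
# Sketch: turn the /health probe into a verdict.
# classify_health and probe_health are hypothetical helpers.

# Classify HTTP status $1 and response body $2.
classify_health() {
  if [ "$1" = 200 ]; then
    echo ok
  elif [ "$1" = 000 ]; then
    echo unreachable            # curl could not connect at all
  elif printf '%s' "$2" | grep -Eq '"probe"[[:space:]]*:[[:space:]]*"write"'; then
    echo valkey-oom
  elif printf '%s' "$2" | grep -Eq '"valkey"[[:space:]]*:[[:space:]]*"down"'; then
    echo valkey-unreachable
  else
    echo unknown
  fi
}

# Fetch status and body separately; no -f, so an error body is kept.
probe_health() {
  local url="${1:-http://localhost:8083/health}" body code
  body=$(mktemp) || return 1
  code=$(curl -s -o "$body" -w '%{http_code}' "$url") || code=000
  classify_health "$code" "$(cat "$body")"
  rm -f "$body"
}
```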
Postgres connectivity
# Via PgBouncer (applications use this path)
psql "postgresql://amtp:<password>@localhost:6432/amtp" -c "SELECT version();"
# Direct Postgres (requires a Compose override file that publishes port 5432)
psql "postgresql://amtp:<password>@localhost:5432/amtp" -c "\dt"
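To keep the password out of the connection URI and shell history, psql can read it from the PGPASSWORD environment variable instead. The env_get helper below is a hypothetical sketch that pulls a value from .env; it also avoids percent-encoding, which a password containing characters such as @ or / would otherwise need inside a postgresql:// URI.

```shell
# Sketch: read a value from .env. env_get is a hypothetical helper.

# Print the value of KEY $1 from env file $2 (default .env). The last
# assignment wins; cut -f2- keeps any "=" inside the value.
env_get() {
  grep -E "^$1=" "${2:-.env}" | tail -n 1 | cut -d= -f2-
}

# Usage, via PgBouncer (psql reads the password from PGPASSWORD):
#   PGPASSWORD="$(env_get POSTGRES_PASSWORD)" \
#     psql -h localhost -p 6432 -U amtp -d amtp -c 'SELECT version();'
```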
Valkey connectivity
redis-cli -h localhost -p 6379 -a "<password>" PING
# Expected: PONG
redis-cli -h localhost -p 6379 -a "<password>" SET test_key "1" EX 5
redis-cli -h localhost -p 6379 -a "<password>" GET test_key
# Expected: 1
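The three Valkey checks can be combined into one round trip. In this hypothetical sketch, valkey_roundtrip and the VALKEY_CLI override are illustrative names; the password comes from the REDISCLI_AUTH environment variable, which redis-cli reads in place of -a (avoiding the warning -a prints and keeping the password off the command line). valkey-cli is assumed to honor the same variable.

```shell
# Sketch: PING, SET and GET as one round trip.
# valkey_roundtrip and VALKEY_CLI are hypothetical; authentication is
# expected via the REDISCLI_AUTH environment variable.

valkey_roundtrip() {
  local cli="${VALKEY_CLI:-redis-cli}" host="${1:-localhost}" port="${2:-6379}"
  # Value tied to this shell's PID, so a leftover key is unlikely to match.
  local val="check-$$"
  [ "$("$cli" -h "$host" -p "$port" PING)" = PONG ] || return 1
  "$cli" -h "$host" -p "$port" SET test_key "$val" EX 5 > /dev/null || return 1
  [ "$("$cli" -h "$host" -p "$port" GET test_key)" = "$val" ]
}
```

For example: export REDISCLI_AUTH='<password>', then valkey_roundtrip && echo OK. Set VALKEY_CLI=valkey-cli to use the Valkey-branded client.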
Migration status
docker compose --profile migrate run --rm flyway info
All six migrations should show Success, and no migrations should remain pending.
Running Migrations
Flyway is the sole migration mechanism. Never apply schema changes manually via psql. All changes must go through versioned SQL files in migrations/sql/.
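Because every schema change lands as a new versioned file, picking the next version number is a recurring small chore. The helper below is a hypothetical sketch that assumes plain integer versions in the V<N>__<description>.sql form the existing V1 to V6 files suggest; it ignores dotted versions.

```shell
# Sketch: choose the file name for the next versioned migration.
# next_migration is a hypothetical helper; dotted versions are skipped.

next_migration() {
  local dir="${1:-migrations/sql}" desc="$2" max=0 f n
  for f in "$dir"/V*__*.sql; do
    [ -e "$f" ] || continue              # glob matched nothing
    n=${f##*/V}; n=${n%%__*}             # V12__add_x.sql -> 12
    case "$n" in ''|*[!0-9]*) continue ;; esac
    n=${n#"${n%%[!0]*}"}; n=${n:-0}      # strip leading zeros (007 -> 7)
    [ "$n" -gt "$max" ] && max=$n
  done
  printf '%s/V%d__%s.sql\n' "$dir" "$((max + 1))" "$desc"
}
```

For example, with V1 to V6 present, next_migration migrations/sql add_audit_table prints migrations/sql/V7__add_audit_table.sql.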
Apply new migrations
docker compose --profile migrate run --rm flyway migrate
Check current state
docker compose --profile migrate run --rm flyway info
Validate checksums
docker compose --profile migrate run --rm flyway validate
Validation fails if any previously applied migration file has been modified. Do not alter files in migrations/sql/ after they have been applied to any environment.
Rollback & Teardown
Stop all services (preserve data)
docker compose down
Stop and remove all data (full reset)
docker compose down -v --remove-orphans
Database schema rollback
Flyway Community Edition does not support undo migrations. The rollback procedure for a failed migration is:
- Stop all application services.
- Manually restore from the last known-good Postgres backup.
- Redeploy from the last known-good migration version.
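The rollback procedure assumes a known-good backup exists, but the runbook does not say how one is taken. One option, sketched below under assumptions, is a custom-format pg_dump through the postgres container before each flyway migrate; pg_backup and pg_restore_dump are hypothetical helpers. Because Flyway by default records applied versions in a flyway_schema_history table inside the same database, restoring the dump also rolls back Flyway's record of which migrations ran.

```shell
# Sketch: take and restore a custom-format dump via the postgres container.
# pg_backup and pg_restore_dump are hypothetical; user and database
# default to the amtp values from .env.

# Write a dump to $1 (default: timestamped file) and print its name.
pg_backup() {
  local out="${1:-amtp-$(date +%Y%m%dT%H%M%S).dump}"
  docker compose exec -T postgres \
    pg_dump -U "${POSTGRES_USER:-amtp}" -Fc "${POSTGRES_DB:-amtp}" > "$out" &&
    echo "$out"
}

# Restore dump $1 over the existing database. --clean --if-exists drops
# objects before recreating them; stop application services first so no
# open connections hold locks.
pg_restore_dump() {
  docker compose exec -T postgres \
    pg_restore -U "${POSTGRES_USER:-amtp}" -d "${POSTGRES_DB:-amtp}" \
    --clean --if-exists < "$1"
}
```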
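Secret Rotation
Rotating POSTGRES_PASSWORD
- Connect to Postgres directly (for example, docker compose exec postgres psql -U amtp -d amtp) and set the new password. Editing .env alone has no effect on an existing database, because the postgres image only reads POSTGRES_PASSWORD when initializing an empty data directory:
  ALTER USER amtp WITH PASSWORD '<new-password>';
- Update POSTGRES_PASSWORD in .env and in GitHub repository secrets.
- Recreate PgBouncer so it picks up the new password. A plain docker compose restart reuses the container's existing environment and does not re-read .env:
  docker compose up -d --no-deps --force-recreate pgbouncer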
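The rotation steps can be sketched as a single function. env_set and rotate_postgres_password are hypothetical helpers; the sketch generates an alphanumeric password, applies it with ALTER USER through the postgres container, writes it to .env, and recreates PgBouncer, since docker compose restart keeps the container's original environment. The GitHub secret still needs updating separately.

```shell
# Sketch: POSTGRES_PASSWORD rotation. env_set and
# rotate_postgres_password are hypothetical helpers.

# Set KEY $1 to VALUE $2 in env file $3 (no-op if KEY is absent).
# Values pass through ENVIRON, so awk does not process backslash escapes;
# writing back with cat keeps the file's existing permissions (e.g. 600).
env_set() {
  local tmp rc
  tmp=$(mktemp) || return 1
  KEY="$1" VAL="$2" awk '
    index($0, ENVIRON["KEY"] "=") == 1 { print ENVIRON["KEY"] "=" ENVIRON["VAL"]; next }
    { print }' "$3" > "$tmp" && cat "$tmp" > "$3"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

rotate_postgres_password() {
  local new
  # Alphanumeric only, so it is safe inside the single-quoted SQL literal.
  new=$(LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom 2>/dev/null | head -c 32)
  [ "${#new}" -eq 32 ] || return 1
  docker compose exec -T postgres psql -U amtp -d amtp -v ON_ERROR_STOP=1 \
    -c "ALTER USER amtp WITH PASSWORD '$new';" || return 1
  env_set POSTGRES_PASSWORD "$new" .env || return 1
  # Recreate rather than restart so the new .env value is applied.
  docker compose up -d --no-deps --force-recreate pgbouncer
}
```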
Rotating VALKEY_PASSWORD
- Update VALKEY_PASSWORD in .env and in GitHub repository secrets.
- Recreate Valkey and the healthcheck service so both pick up the new password:
  docker compose up -d --force-recreate valkey healthcheck
Note: Valkey is ephemeral (no persistence), so all cached data is lost on restart. MCP caches are rebuilt on next use, and rate-limit state resets.
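Pinned image tags
All images in docker-compose.yml use explicitly pinned version tags rather than latest (postgres:15-alpine, edoburu/pgbouncer:1.22.1, flyway/flyway:10-alpine, valkey/valkey:8.0). The pin granularity differs per image: 15-alpine and 10-alpine track a major version and 8.0 tracks a minor version, so a fresh pull can still bring in newer point releases. Before updating a tag, review the upstream changelog for breaking configuration changes, then update docker-compose.yml and re-run the full bring-up and verification sequence in a non-production environment first.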
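A cheap guard against accidentally unpinned images is sketched below. unpinned_images is a hypothetical helper; it greps image: lines rather than parsing YAML, treats a missing tag as latest, and accepts digest pins (@sha256:...), so treat it as a heuristic.

```shell
# Sketch: list images in a Compose file that are unpinned or use :latest.
# unpinned_images is a hypothetical helper (grep-based, not a YAML parser).

unpinned_images() {
  local file="${1:-docker-compose.yml}" img name
  grep -E '^[[:space:]]*image:' "$file" |
    sed -E 's/^[[:space:]]*image:[[:space:]]*//; s/[[:space:]]+#.*$//' |
    tr -d "\"'" |
    while read -r img; do
      name=${img##*/}                 # drop registry host and namespace
      case "$name" in
        *@sha256:*) ;;                # pinned by digest
        *:latest) echo "$img" ;;      # floating tag
        *:*) ;;                       # explicit tag
        *) echo "$img" ;;             # no tag means latest
      esac
    done
}
```

For image lines like the four listed above, unpinned_images docker-compose.yml prints nothing; any output names an image that needs a pin.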