Deployment Runbook

This runbook covers local and self-hosted deployment of the AMTP infrastructure stack. All commands use Docker Compose. Secrets are managed via a .env file that must never be committed to the repository.

Dev-host Bootstrap Runbook #

A fresh RHEL / Rocky / Alma Linux dev host requires a one-time bootstrap before any CI/CD job can run. The infra/bootstrap.sh script handles Docker CE + Compose installation, firewall ports, and runner-user Docker group membership.

Step-by-step #

1. Run bootstrap script.
```
sudo ./infra/bootstrap.sh <runner_user>
```
   Installs Docker CE + Compose, opens firewalld ports 80 and 8083, adds <runner_user> to the docker group, and restarts the runner service.
2. Register the self-hosted runner. In the GitHub repository, go to Settings → Actions → Runners → New self-hosted runner. Follow the on-screen instructions to download and configure the runner with label amtp-dev.
3. Configure secrets. Add all secrets listed in CI/CD → Required GitHub Secrets & Vars to Settings → Environments → dev.
4. Create .env. On the runner host, copy .env.example to .env and fill in production values. See Creating .env below.
5. Provision GitHub App PEM secret. Base64-encode your GitHub App private key and add it as a GitHub Actions secret named GH_APP_PRIVATE_KEY_B64 in Settings → Environments → dev. Also add GH_APP_ID and GH_APP_INSTALLATION_ID. On the runner host, the deploy job decodes the key to secrets/github_app_key.pem automatically.
6. Bring up the main stack first. Before the observability stack can deploy, amtp_net must exist. Run:
```
docker compose up -d --wait postgres pgbouncer valkey healthcheck docs
```
   Creates amtp_net and brings all main-stack services to healthy. The observability CI/CD workflow can now attach to the network.
7. Trigger CI. Push to dev or run Actions → CI/CD → Run workflow. All five child workflows will execute. On first run, the observability workflow will deploy Grafana, Tempo, Prometheus, and the OTel Collector, then run the post-deploy trace verification.
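
The GitHub App PEM step above asks for a Base64-encoded key without showing the encoding. The following is a minimal sketch, assuming GNU coreutils base64 on the host; encode_pem and decode_pem are hypothetical helper names, and the decode half only approximates what the deploy job is described as doing.

```shell
# Sketch only: encode_pem / decode_pem are hypothetical helpers, not part of
# the repository.

# Print FILE as single-line Base64, suitable for pasting into a GitHub secret.
encode_pem() {
  base64 < "$1" | tr -d '\n'
}

# Read Base64 on stdin and write the decoded key to FILE.
# The subshell umask makes a newly created file readable by its owner only.
decode_pem() {
  ( umask 077; base64 -d > "$1" )
}

# Usage (the input path is an assumption):
#   encode_pem ./my-app.private-key.pem
#   printf '%s' "$GH_APP_PRIVATE_KEY_B64" | decode_pem secrets/github_app_key.pem
```

Stripping newlines from the encoded output keeps the secret on one line, which avoids wrapping differences between base64 implementations.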

amtp_net must exist before the observability stack deploys. docker-compose.observability.yml declares amtp_net as external: true. If the network does not exist when the observability workflow runs, Docker will fail with:

network amtp_net declared as external, but could not be found

Remediation paths (apply one):

Bootstrap-level: Add docker network create amtp_net 2>/dev/null || true to infra/bootstrap.sh so the network pre-exists any CI run (idempotent — safe to run repeatedly).
Workflow-level: Add needs: healthcheck to the observability job in .github/workflows/ci-cd.yml to enforce the main stack as a precondition.

See Observability → Network Dependency.
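
As an illustration of the bootstrap-level remediation, the sketch below creates the network only when it is missing and reports which case occurred. ensure_network is a hypothetical helper name; the one-liner quoted above achieves the same result more tersely.

```shell
# Sketch only: ensure_network is a hypothetical helper, not part of
# infra/bootstrap.sh.

# Create the Docker network NAME unless it already exists.
ensure_network() {
  if docker network inspect "$1" >/dev/null 2>&1; then
    echo "network $1 already exists"
  else
    docker network create "$1" >/dev/null && echo "network $1 created"
  fi
}

# Usage: ensure_network amtp_net
```

One caveat to check before adopting this: if the main docker-compose.yml creates amtp_net as a Compose-managed network rather than declaring it external, Compose may reject a pre-created network because it lacks the Compose labels.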

Prerequisites #

Docker Engine ≥ 24 and Docker Compose v2 installed.
Repository checked out at the target commit.
.env file created from .env.example with production values (see below).
Outbound network access to Docker Hub for image pulls (on first run).

Creating `.env` #

cp .env.example .env
chmod 600 .env
# Edit .env and replace all placeholder values:
#   POSTGRES_PASSWORD=change-me  ->  a strong random password
#   VALKEY_PASSWORD=change-me    ->  a strong random password
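
One possible way to produce a strong random password is sketched below. gen_password is a hypothetical helper name; it assumes /dev/urandom and a base64 utility are available on the host.

```shell
# Sketch only: gen_password is a hypothetical helper, not part of the repository.

# Print a 32-character alphanumeric password drawn from /dev/urandom.
gen_password() {
  head -c 48 /dev/urandom | base64 | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-32
}

# Usage (GNU sed assumed for -i):
#   sed -i "s/^POSTGRES_PASSWORD=.*/POSTGRES_PASSWORD=$(gen_password)/" .env
#   sed -i "s/^VALKEY_PASSWORD=.*/VALKEY_PASSWORD=$(gen_password)/" .env
```

Restricting the output to letters and digits is a deliberate choice: the password can then be used in a postgresql:// URI or a sed replacement without escaping.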

Template: .env.example

POSTGRES_DB=amtp
POSTGRES_USER=amtp
POSTGRES_PASSWORD=change-me
COMPOSE_PROJECT_NAME=amtp
VALKEY_HOST=valkey
VALKEY_PORT=6379
VALKEY_PASSWORD=change-me
DOCS_HOST_PORT=80

# GitHub MCP server
GITHUB_APP_ID=
GITHUB_APP_INSTALLATION_ID=
# PEM key is mounted via Docker secret: ./secrets/github_app_key.pem

.env.example: replace every placeholder value before first deployment.

Never commit .env to the repository. It is listed in .gitignore. Verify with git status before every commit.
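
The two checks described in this section (no leftover placeholders, .env not tracked) could be scripted along the following lines. check_env is a hypothetical helper name; it prints only the variable names, never the values.

```shell
# Sketch only: check_env is a hypothetical helper, not part of the repository.

# Fail if FILE still contains the change-me placeholder from .env.example.
check_env() {
  if grep -q 'change-me' "$1"; then
    echo "placeholder values remain in $1:" >&2
    # Show line number and variable name only, not the value.
    grep -n 'change-me' "$1" | cut -d= -f1 >&2
    return 1
  fi
  echo "no placeholders left in $1"
}

# Usage:
#   check_env .env
#   git check-ignore -q .env && echo ".env is ignored by git"
```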

Bring-up Order #

Services must be started in the following order. PgBouncer depends on Postgres being healthy, and the healthcheck service depends on Valkey being healthy; the depends_on: condition: service_healthy clauses in docker-compose.yml enforce this automatically.

Step 1 — Start Postgres #

docker compose up -d postgres

Wait for the healthcheck to pass (up to 100 seconds: 20 retries × 5 s interval):

docker compose ps postgres   # look for "(healthy)" in the STATUS column
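
Instead of re-running docker compose ps by hand, the wait can be scripted. The sketch below polls the container's health status with the same budget as the healthcheck (20 polls, 5 s apart); wait_healthy is a hypothetical helper name. Passing --wait to docker compose up -d is a simpler alternative where it fits.

```shell
# Sketch only: wait_healthy is a hypothetical helper, not part of the repository.

# Poll SERVICE until Docker reports it healthy, or give up after RETRIES polls
# spaced INTERVAL seconds apart (defaults: 20 polls, 5 s).
wait_healthy() {
  service="$1"
  retries="${2:-20}"
  interval="${3:-5}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    cid=$(docker compose ps -q "$service" 2>/dev/null || true)
    status=$(docker inspect --format '{{.State.Health.Status}}' "$cid" 2>/dev/null || true)
    if [ "$status" = "healthy" ]; then
      echo "$service is healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$service not healthy after $((retries * interval)) s" >&2
  return 1
}

# Usage: wait_healthy postgres 20 5
```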

Step 2 — Run Flyway migrations #

The flyway service uses the opt-in migrate profile and runs as a one-shot container. It will exit 0 on success.

docker compose --profile migrate run --rm flyway migrate
# Verify migration state:
docker compose --profile migrate run --rm flyway info

Expected output: all six migrations (V1–V6) listed with state Success. If any migration shows Failed, do not proceed. Review the Flyway logs and consult Data Model for the offending migration.
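
Because the flyway container exits 0 only on success, a wrapper can refuse to continue when a migration fails. The following is a sketch under that assumption; run_migrations is a hypothetical helper name.

```shell
# Sketch only: run_migrations is a hypothetical helper, not part of the repository.

# Apply migrations, stop on a non-zero exit, then print the migration state.
run_migrations() {
  if ! docker compose --profile migrate run --rm flyway migrate; then
    echo "flyway migrate failed; do not start dependent services" >&2
    return 1
  fi
  docker compose --profile migrate run --rm flyway info
}

# Usage: run_migrations && docker compose up -d pgbouncer
```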

Step 3 — Start PgBouncer #

docker compose up -d pgbouncer

PgBouncer waits for Postgres to be healthy before starting (depends_on: condition: service_healthy).

Step 4 — Start Valkey #

docker compose up -d valkey

Wait for the Valkey healthcheck to pass:

docker compose ps valkey   # look for "(healthy)"

Step 5 — Start Healthcheck service #

docker compose up -d --build healthcheck

The healthcheck service depends on Valkey being healthy before starting.

One-shot full stack (after first migration) #

Once migrations have been applied at least once, the entire stack (excluding Flyway) can be brought up in one command:

docker compose up -d postgres pgbouncer valkey healthcheck

Verification #

Healthcheck probe #

curl -sf http://localhost:8083/health

Expected response (200 OK):

{ "status": "ok", "valkey": "up", "latency_ms": 0.42 }

If the response is 503, check the Valkey logs with docker compose logs valkey. With the current PING-only implementation, "valkey": "down" in the response indicates a connectivity failure. Once the three-step probe is implemented, "probe": "write" in the error body will indicate an out-of-memory (OOM) condition.
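
For use in scripts, the probe can report both the HTTP status and the body, and succeed only when both look right. This is a sketch assuming the response shape shown above; probe_health is a hypothetical helper name.

```shell
# Sketch only: probe_health is a hypothetical helper, not part of the repository.

# Print the HTTP status and body of the health endpoint. Succeed only on a
# 200 response whose body contains "status": "ok".
probe_health() {
  url="${1:-http://localhost:8083/health}"
  # -w appends the status code on its own line after the body.
  resp=$(curl -s -w '\n%{http_code}' "$url" || true)
  code=$(printf '%s\n' "$resp" | tail -n 1)
  body=$(printf '%s\n' "$resp" | sed '$d')
  echo "HTTP $code: $body"
  [ "$code" = "200" ] &&
    printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage: probe_health || docker compose logs valkey
```

The body is matched with grep rather than a JSON parser so the sketch has no extra dependencies; a parser such as jq would be more robust if it is installed.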

Postgres connectivity #

# Via PgBouncer (applications use this path)
psql "postgresql://amtp:<password>@localhost:6432/amtp" -c "SELECT version();"

# Direct Postgres (override file required for port 5432)
psql "postgresql://amtp:<password>@localhost:5432/amtp" -c "\dt"

Valkey connectivity #

redis-cli -h localhost -p 6379 -a "<password>" PING
# Expected: PONG

redis-cli -h localhost -p 6379 -a "<password>" SET test_key "1" EX 5
redis-cli -h localhost -p 6379 -a "<password>" GET test_key
# Expected: 1

Migration status #

docker compose --profile migrate run --rm flyway info

All six migrations should show Success. No pending migrations should remain.

Running Migrations #

Flyway is the sole migration mechanism. Never apply schema changes manually via psql. All changes must go through versioned SQL files in migrations/sql/.

Apply new migrations #

docker compose --profile migrate run --rm flyway migrate

Check current state #

docker compose --profile migrate run --rm flyway info

Validate checksums #

docker compose --profile migrate run --rm flyway validate

Validation fails if any previously-applied migration file has been modified. Do not alter files in migrations/sql/ after they have been applied to any environment.

flyway.cleanDisabled=true is set in migrations/flyway.conf. The flyway clean command is permanently disabled to prevent accidental schema destruction. There is no override. To reset a development schema, drop and recreate the database manually, then run flyway migrate.

Rollback & Teardown #

Stop all services (preserve data) #

docker compose down

Stop and remove all data (full reset) #

docker compose down -v --remove-orphans

-v deletes the amtp_pgdata named volume. All Postgres data is permanently lost. Do not run this in any environment that holds data you need.

Database schema rollback #

Flyway Community Edition does not support undo migrations. The rollback procedure for a failed migration is:

1. Stop all application services.
2. Manually restore from the last known-good Postgres backup.
3. Redeploy from the last known-good migration version.
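
The restore step presupposes that a backup exists. The runbook does not prescribe a backup mechanism, so the following is only a sketch of one option: a pg_dump taken through the postgres service before migrating. backup_db is a hypothetical helper name, and the user and database names are taken from .env.example.

```shell
# Sketch only: backup_db is a hypothetical helper; the runbook does not
# define a backup procedure.

# Dump the amtp database in pg_dump custom format to OUT.
# A partial file is removed if the dump fails.
backup_db() {
  out="${1:-amtp-$(date +%Y%m%dT%H%M%S).dump}"
  if ! docker compose exec -T postgres pg_dump -U amtp -Fc amtp > "$out"; then
    rm -f "$out"
    echo "backup failed" >&2
    return 1
  fi
  echo "wrote $out"
}

# Usage: backup_db pre-migration.dump
# A custom-format dump is restored with pg_restore, for example:
#   docker compose exec -T postgres pg_restore -U amtp -d amtp --clean --if-exists < pre-migration.dump
```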

Secret Rotation #

Rotating `POSTGRES_PASSWORD` #

Connect to Postgres directly:

ALTER USER amtp WITH PASSWORD '<new-password>';

Update POSTGRES_PASSWORD in .env and in GitHub repository secrets.
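Recreate PgBouncer so that it picks up the new password. A plain docker compose restart reuses the existing container and does not re-read .env:
```
docker compose up -d --force-recreate pgbouncer
```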

Rotating `VALKEY_PASSWORD` #

Update VALKEY_PASSWORD in .env and in GitHub repository secrets.

Restart Valkey and the healthcheck service:

docker compose up -d --force-recreate valkey healthcheck

Note: Valkey is ephemeral (no persistence). All cached data is lost on restart. MCP caches will be rebuilt on next use. Rate-limit state resets.

Pinned image tags #

All images in docker-compose.yml use pinned version tags rather than latest (postgres:15-alpine, edoburu/pgbouncer:1.22.1, flyway/flyway:10-alpine, valkey/valkey:8.0). The pin granularity varies: some tags fix only the major version, others the minor or patch version. Before updating a tag, review the upstream changelog for breaking configuration changes, then update docker-compose.yml and re-run the full bring-up and verification sequence in a non-production environment first.
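
A quick guard against an unpinned image slipping in during a tag update is sketched below. unpinned_images is a hypothetical helper name; it assumes each image reference sits on its own "image:" line in the Compose file.

```shell
# Sketch only: unpinned_images is a hypothetical helper, not part of the repository.

# List image references in a Compose file that have no tag or use :latest.
unpinned_images() {
  awk '/^[[:space:]]*image:/ {
    ref = $2
    gsub(/["\047]/, "", ref)              # strip surrounding quotes
    if (ref !~ ":[^/:]+$" || ref ~ ":latest$") print ref
  }' "$1"
}

# Usage: fail when anything is listed.
#   [ -z "$(unpinned_images docker-compose.yml)" ] || echo "unpinned images found"
```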
