Infrastructure
AMTP’s local infrastructure is defined in
docker-compose.yml and
docker-compose.override.yml. All services share the
amtp_net bridge network. The override file restricts select
port bindings to loopback (127.0.0.1) for local
development. The observability stack runs as a separate Compose project
and also attaches to amtp_net as an external network
— see Observability.
Service Map #
| Service | Image | Host port | Container port | Status |
|---|---|---|---|---|
| postgres | postgres:15-alpine | 5432 (override only) | 5432 | Implemented |
| pgbouncer | edoburu/pgbouncer:1.22.1 | 6432 | 5432 | Implemented |
| flyway | flyway/flyway:10-alpine | — | — | Implemented (opt-in migrate profile) |
| valkey | valkey/valkey:8.0 | 6379 (override only) | 6379 | Implemented |
| healthcheck | Built from apps/healthcheck/Dockerfile | 8083 (override) / 8080 | 8080 | Implemented |
| github-mcp | Built from apps/github-mcp/Dockerfile | 8090 | 8090 | Implemented |
| docs | nginx:1.27-alpine | ${DOCS_HOST_PORT:-80} | 80 | Implemented |
| Garage S3 | Garage (S3-compatible) | 3900 | 3900 | Planned |
PostgreSQL 15 #
Primary persistence layer. Stores all run state, stage state, artifacts, approvals, and project registrations. migrations/sql/V1__projects.sql through migrations/sql/V6__indexes.sql define the complete schema.
Configuration #
```yaml
postgres:
  image: postgres:15-alpine
  restart: unless-stopped
  environment:
    POSTGRES_DB: ${POSTGRES_DB:-amtp}
    POSTGRES_USER: ${POSTGRES_USER:-amtp}
    POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-amtp}
    POSTGRES_INITDB_ARGS: "--auth-host=scram-sha-256 --auth-local=scram-sha-256"
  command:
    - "postgres"
    - "-c"
    - "password_encryption=scram-sha-256"
    - "-c"
    - "max_connections=200"
    - "-c"
    - "shared_buffers=256MB"
    - "-c"
    - "work_mem=16MB"
    - "-c"
    - "log_min_duration_statement=500"
  volumes:
    - amtp_pgdata:/var/lib/postgresql/data
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-amtp} -d ${POSTGRES_DB:-amtp}"]
    interval: 5s
    timeout: 3s
    retries: 20
```
Tuning notes #
- `scram-sha-256` is enforced for both host and local auth.
- `max_connections = 200` — applications connect via PgBouncer (not directly) to stay well below this ceiling.
- `shared_buffers = 256MB`, `work_mem = 16MB` — dev-tuned values; increase for production.
- Queries slower than 500 ms are logged (`log_min_duration_statement = 500`).

Reference: infra/postgres/postgresql.conf.
PgBouncer 1.22 #
Connection pooler in transaction mode. Applications
connect to pgbouncer:5432 (internal) or
localhost:6432 (override). PgBouncer multiplexes up to
500 client connections onto a default pool of 25 server connections.
Key parameters #
| Parameter | Value | Notes |
|---|---|---|
| POOL_MODE | transaction | Connection released to pool after each transaction, not each session. |
| MAX_CLIENT_CONN | 500 | Maximum simultaneous client connections. |
| DEFAULT_POOL_SIZE | 25 | Server connections per database/user pair. |
| RESERVE_POOL_SIZE | 5 | Extra connections for spikes; available after RESERVE_POOL_TIMEOUT = 3s. |
| AUTH_TYPE | scram-sha-256 | Matches Postgres authentication method. |
| MAX_PREPARED_STATEMENTS | 100 | Limits prepared-statement caching in transaction mode. |
| SERVER_IDLE_TIMEOUT | 240s | Idle server connections closed after 4 minutes. |
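Assuming applications read a standard connection URL, pointing at the pooler rather than Postgres directly would look like this (default credentials from the Compose file; illustrative only):

```
postgres://amtp:amtp@localhost:6432/amtp   # from the host (override port)
postgres://amtp:amtp@pgbouncer:5432/amtp   # from inside amtp_net
```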
Note on prepared statements. Transaction-mode pooling
is incompatible with session-persistent prepared statements.
Applications must use named-prepare-per-transaction patterns or rely
on the MAX_PREPARED_STATEMENTS caching mechanism.
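As a sketch of the distinction with node-postgres (a common driver choice; the repository's actual driver is not specified here): omitting the `name` field keeps the prepare unnamed and scoped to a single statement, while a named statement depends on PgBouncer's MAX_PREPARED_STATEMENTS tracking to survive connection multiplexing.

```javascript
// Illustrative query shapes for node-postgres ("pg"); table and id are hypothetical.

// Works without pooler support: no `name`, so the driver issues an unnamed
// prepare that lives only for this single statement.
const safeQuery = {
  text: "SELECT * FROM runs WHERE id = $1",
  values: ["run-123"],
};

// Relies on PgBouncer's MAX_PREPARED_STATEMENTS caching: a named prepare
// would otherwise be stranded on whichever server connection first ran it.
const namedQuery = {
  name: "get-run",
  text: "SELECT * FROM runs WHERE id = $1",
  values: ["run-123"],
};
```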
Flyway 10 (Migration Runner) #
Schema migrations are managed by
Flyway. The service is defined under the opt-in
migrate Compose profile; it does
not start automatically with
docker compose up. See the
Deployment Runbook
for explicit invocation instructions.
Configuration #
```properties
flyway.url=jdbc:postgresql://postgres:5432/amtp
flyway.user=amtp
flyway.schemas=public
flyway.locations=filesystem:/flyway/sql
flyway.baselineOnMigrate=true
flyway.validateOnMigrate=true
flyway.cleanDisabled=true
```
- `cleanDisabled=true` — `flyway clean` is permanently disabled to prevent accidental schema destruction in any environment.
- `validateOnMigrate=true` — checksums of previously applied migrations are verified on every run.
- `baselineOnMigrate=true` — allows Flyway to baseline an existing schema on first run.
- `V6__indexes.sql.conf` sets `executeInTransaction=false` because `CREATE INDEX CONCURRENTLY` statements cannot run inside a transaction block in Postgres.
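The per-script override mentioned above would look like this (exact file contents are an assumption based on Flyway's script-config convention):

```properties
# migrations/sql/V6__indexes.sql.conf
executeInTransaction=false
```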
Valkey 8.0 (Cache & Rate-Limit Layer) #
Single-node Valkey instance serving three purposes: MCP API response caching, per-repo concurrency control, and sliding-window user rate limiting. It is the core token-accounting infrastructure.
Eviction & Persistence Policy (Required Flags) #
```yaml
valkey:
  image: valkey/valkey:8.0
  command:
    - "valkey-server"
    - "--save"
    - ""                 # disable RDB snapshots
    - "--appendonly"
    - "no"               # disable AOF persistence
    - "--requirepass"
    - "${VALKEY_PASSWORD}"
    - "--maxmemory"
    - "512mb"            # hard memory cap
    - "--maxmemory-policy"
    - "allkeys-lru"      # evict LRU keys when cap is reached
```
| Flag | Value | Rationale |
|---|---|---|
| --maxmemory | 512mb | Hard upper bound for the in-memory store. Prevents unbounded memory growth. |
| --maxmemory-policy | allkeys-lru | Evicts least-recently-used keys across all namespaces when the cap is reached. Chosen over volatile-lru because rate-limit sorted-set keys (amtp:rl:user:*:runs) may not carry a TTL at the moment of eviction pressure. |
| --save | "" (empty — disables RDB) | Valkey is intentionally ephemeral. No RDB snapshot file is written. |
| --appendonly | no | AOF persistence disabled. Data is not durable across container restarts by design. |
OOM failure mode if allkeys-lru is omitted
Without --maxmemory-policy allkeys-lru, once
maxmemory is reached Valkey rejects all write commands
with:
OOM command not allowed when used memory > 'maxmemory'
The healthcheck’s PING still returns
PONG in this state, so a PING-only probe
will not detect the degradation. Application-layer
SET, ZADD, and INCR calls
will fail silently from the probe’s perspective. This is
precisely why the healthcheck requires a synthetic write (see
Healthcheck section below).
Key Namespaces & TTL Contract #
All keys are scoped under the amtp: root namespace.
Reference: infra/valkey/NAMESPACES.md.
| Purpose | Key pattern | TTL |
|---|---|---|
| GitHub MCP repo.tree result cache | amtp:mcp:tree:{owner}:{repo}:{ref} | 600 s |
| GitHub MCP .gitignore blob cache | amtp:mcp:blob:{sha} | 3600 s |
| GitHub MCP absent .gitignore sentinel | amtp:mcp:null_gitignore:{owner}:{repo}:{ref}:{path} | 600 s |
| User rate limit (sliding window) | amtp:rl:user:{userId}:runs | 3600 s sliding |
| Repo concurrency counter | amtp:rl:repo:{repo}:concurrency | 3600 s safety net |
| PR serialization lock | amtp:rl:repo:{repo}:pr_lock | 120 s (per acquisition) |
| Prompt cache | amtp:prompt:{model}:{sha256(prompt)} | 3600 s |
Sliding-Window Rate Limit #
A single stable sorted-set key per user
(amtp:rl:user:{userId}:runs) backs the per-hour run
limit. The calling worker computes now and
cutoff
client-side and injects them as literal epoch-millisecond integers.
```javascript
const now = Date.now();
const cutoff = now - 3_600_000;
const key = `amtp:rl:user:${userId}:runs`;

const results = await client
  .multi()
  .zremrangebyscore(key, 0, cutoff) // evict entries older than 1 h
  .zadd(key, now, runId)            // score = epoch ms; member = run id
  .zcard(key)                       // count of runs in current window
  .expire(key, 3600)                // safety TTL
  .exec();

const count = results[2][1]; // [err, value] pairs; index 2 = ZCARD
```
All four commands execute atomically via
MULTI/EXEC. Literal integers, not
expressions, are on the wire.
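The same window arithmetic can be unit-tested without a live Valkey using a pure function (a sketch; the production path is the MULTI/EXEC block above):

```javascript
// Counts run timestamps still inside the sliding window, mirroring
// ZREMRANGEBYSCORE(key, 0, cutoff) + ZCARD: scores <= cutoff drop out.
function runsInWindow(timestampsMs, nowMs, windowMs = 3_600_000) {
  const cutoff = nowMs - windowMs;
  return timestampsMs.filter((t) => t > cutoff).length;
}

const now = 10_000_000;
runsInWindow([now - 4_000_000, now - 3_500_000, now - 100], now); // → 2 (first entry aged out)
```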
Healthcheck Service (Node.js / Express / ioredis) #
Lightweight Express service that probes Valkey availability and exposes the result as a structured JSON response. Used by load-balancers and monitoring pipelines to gate on cache availability. Source: apps/healthcheck/src/server.js and apps/healthcheck/src/valkey.js.
Endpoint #
GET /health
| Condition | Status | Body |
|---|---|---|
| All three probe steps succeed within 500 ms | 200 OK | { "status": "ok", "valkey": "up", "latency_ms": <n> } |
| Timeout or any probe step fails | 503 Service Unavailable | { "status": "degraded", "valkey": "down", "error": "...", "probe": "<step>", "latency_ms": <n> } |
Three-Step Write Probe (Required Contract) #
The required probe sequence, executed within a single 500 ms timeout window:
1. `PING` — verifies Valkey liveness.
2. `SET _amtp_health_ping 1 EX 1` — unconditional overwrite. Verifies write availability. No `NX` flag: concurrent probes within the 1-second TTL window must not cause false negatives.
3. `GET _amtp_health_ping` — verifies the round-trip read path. Fails with `"probe read mismatch"` if the returned value is not `"1"`.
Why NX was rejected
With NX (Set if Not eXists), a second probe fired
within the 1-second TTL window observes the key as existing,
receives nil from SET NX, and a naive
implementation treats this as a write failure, returning
503 degraded even though Valkey is fully healthy.
Unconditional SET always returns OK when
writes are accepted, regardless of concurrent probes.
Reference implementation (not yet applied to repository)
```javascript
const express = require("express");
const Redis = require("ioredis");

// Illustrative setup; the repository wires the client in src/valkey.js.
const app = express();
const client = new Redis({
  host: process.env.VALKEY_HOST || "valkey",
  port: Number(process.env.VALKEY_PORT) || 6379,
  password: process.env.VALKEY_PASSWORD,
});

const PROBE_KEY = "_amtp_health_ping";

app.get("/health", async (req, res) => {
  const start = performance.now();
  try {
    const timeout = new Promise((_, reject) =>
      setTimeout(() => reject(new Error("Timeout")), 500)
    );
    await Promise.race([
      (async () => {
        await client.ping();                       // step 1: liveness
        await client.set(PROBE_KEY, "1", "EX", 1); // step 2: write (no NX)
        const echo = await client.get(PROBE_KEY);  // step 3: read
        if (echo !== "1") throw new Error("probe read mismatch");
      })(),
      timeout,
    ]);
    const latency_ms = Number.parseFloat((performance.now() - start).toFixed(2));
    return res.status(200).json({ status: "ok", valkey: "up", latency_ms });
  } catch (err) {
    const latency_ms = Number.parseFloat((performance.now() - start).toFixed(2));
    return res.status(503).json({
      status: "degraded", valkey: "down",
      error: err.message, latency_ms,
    });
  }
});
```
Probe key hygiene
The probe key _amtp_health_ping intentionally lives
outside the amtp: namespace (per
infra/valkey/NAMESPACES.md § Key Naming Conventions) to
prevent any collision with application keys. It carries a 1-second
TTL and requires no explicit delete step.
Environment Variables #
| Variable | Default | Description |
|---|---|---|
| VALKEY_HOST | valkey | Hostname of the Valkey instance. |
| VALKEY_PORT | 6379 | Port of the Valkey instance. |
| VALKEY_PASSWORD | — | Auth password (requirepass value). Required. |
| PORT | 8080 | Port this service listens on. Host override maps to 8083. |
Reference: apps/healthcheck/README.md, .env.example.
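A matching .env fragment might look like this (values illustrative; see .env.example for the authoritative defaults):

```
VALKEY_HOST=valkey
VALKEY_PORT=6379
VALKEY_PASSWORD=change-me
PORT=8080
```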
Dockerfile #
```dockerfile
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm install --omit=dev

FROM node:20-alpine
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY package.json ./
COPY src/ ./src/
USER node
EXPOSE 8080
CMD ["node", "src/server.js"]
```

Two-stage build: production dependencies are installed in a separate stage, and the runtime image runs as the unprivileged node user.
GitHub MCP Server (amtp-github-mcp)
#
A stateless
Streamable-HTTP MCP server that exposes
two tools — repo.tree and
repo.read_file — over a GitHub App installation. It
connects to Valkey for caching (see namespaces above) and to the OTel
Collector for traces and metrics.
Environment variables #
| Variable | Default | Description |
|---|---|---|
| GITHUB_APP_ID | — | GitHub App numeric ID. Required. |
| GITHUB_APP_INSTALLATION_ID | — | Installation ID for the target org/user. Required. |
| MCP_HTTP_PORT | 8090 | Port the MCP HTTP server listens on. |
| VALKEY_HOST / VALKEY_PORT / VALKEY_PASSWORD | valkey / 6379 / — | Valkey connection. Password required. |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://otel-collector:4318 | OTLP HTTP endpoint for traces and metrics. |
| OTEL_SERVICE_NAME | amtp-github-mcp | Service name attributed in Tempo / Grafana. |
The GitHub App private key is mounted as a Docker secret at
/run/secrets/github_app_key (source file:
./secrets/github_app_key.pem). See
Deployment → Prerequisites.
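The secret wiring described above would look roughly like this in docker-compose.yml (a sketch; the actual file may differ):

```yaml
services:
  github-mcp:
    secrets:
      - github_app_key

secrets:
  github_app_key:
    file: ./secrets/github_app_key.pem   # mounted at /run/secrets/github_app_key
```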
Docs Service (nginx) #
Static documentation server based on
nginx:1.27-alpine. Serves the compiled
docs/AMTP_Docs_Website/ directory over HTTP. Host port is
parameterized via DOCS_HOST_PORT (default
80). Configuration:
infra/nginx/default.conf.
nginx configuration highlights #
- gzip enabled for `text/html`, `text/css`, `application/javascript`, `application/json`, and `image/svg+xml`.
- Security headers: `X-Frame-Options: SAMEORIGIN`, `X-Content-Type-Options: nosniff`, `Referrer-Policy: strict-origin-when-cross-origin`.
- `/healthz` returns `200 OK` with body `ok`. Used by the docker-compose.yml healthcheck and CI smoke tests.
- DAC requirement: all files under docs/AMTP_Docs_Website/ and infra/nginx/ must be readable by the nginx worker (`chmod -R a+rX` applied in CI before serving).
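The /healthz endpoint above can be served by a small location block like this (a sketch; the actual infra/nginx/default.conf may differ):

```nginx
location = /healthz {
    default_type text/plain;
    return 200 "ok";
}
```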
Garage S3 (Artifact Storage) #
Garage is an S3-compatible object store provisioned for AMTP test bundle storage. It is a lightweight, self-hosted alternative to AWS S3.
| Parameter | Value |
|---|---|
| API port | 3900 |
| Object retention | 7 days (lifecycle policy on the test-bundle bucket) |
| API compatibility | S3-compatible (AWS SDK v2/v3 compatible) |
| Usage | Storage of generated test bundle archives referenced by artifacts.storage_uri (see migrations/sql/V4__artifacts.sql) |
Network & Volumes #
- `amtp_net` — bridge network. All services communicate by service name (Docker internal DNS). No service is exposed to the host by default except via the override file.
- Network pinning: the network name is `amtp_net`, fixed by `COMPOSE_PROJECT_NAME=amtp` in .env. Changing the project name changes the network name and breaks all cross-stack service discovery. The observability stack in docker-compose.observability.yml declares `amtp_net` as `external: true` and attaches to this exact name. The main stack must be brought up at least once before the observability stack deploys. See Observability → Network Dependency.
- `amtp_pgdata` — named volume for the Postgres data directory (`/var/lib/postgresql/data`). Persists across container restarts. Must be explicitly removed with `docker compose down -v` to reset.