Temporal Orchestration

AMTP uses Temporal as its workflow orchestration engine. Temporal provides a deterministic event-history model, durable execution across worker restarts, and a structured activity retry framework. The AMTP workflow (TestGenerationWorkflow) is responsible for sequencing the three LLM agent activities, enforcing context resets between them, and managing terminal failure states.

Workflow Determinism Requirements

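Temporal replays the workflow's event history to rebuild state after a worker restart, so workflow code must issue the same sequence of commands on every replay. Any non-determinism in the workflow code (for example reading the wall clock, generating random values, or performing network I/O directly) breaks replay. All non-deterministic operations are therefore delegated to activities, whose results are recorded in the history.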
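
The replay constraint can be illustrated with a toy model in Python. This is only a sketch of the idea, not Temporal's SDK or its actual replay algorithm; `replay`, `NonDeterminismError`, and both workflow functions are hypothetical names introduced here, and the artifact strings are made up.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (activity_name, recorded_result)


class NonDeterminismError(RuntimeError):
    """Raised when replayed workflow code diverges from the recorded history."""


def replay(workflow: Callable[[Callable[[str], str]], str], history: History) -> str:
    """Re-run workflow code, answering each activity call from the history."""
    events = iter(history)

    def run_activity(name: str) -> str:
        event = next(events, None)
        if event is None or event[0] != name:
            raise NonDeterminismError(f"history has {event!r}, code requested {name!r}")
        return event[1]  # the activity itself is not re-executed during replay

    return workflow(run_activity)


def deterministic_workflow(run_activity: Callable[[str], str]) -> str:
    run_activity("CrawlRepo")
    run_activity("GenerateTestCases")
    return run_activity("GenerateTestCode")


def make_clock_dependent_workflow(hour: int) -> Callable[[Callable[[str], str]], str]:
    """Simulates workflow code that branches on the wall clock (hour is injected)."""

    def workflow(run_activity: Callable[[str], str]) -> str:
        if hour < 12:
            run_activity("CrawlRepo")
        return run_activity("GenerateTestCases")

    return workflow


HISTORY: History = [
    ("CrawlRepo", "artifact-1"),
    ("GenerateTestCases", "artifact-2"),
    ("GenerateTestCode", "artifact-3"),
]
```

Replaying `deterministic_workflow` against `HISTORY` returns the recorded result of the last activity. Replaying the clock-dependent workflow with an afternoon hour skips `CrawlRepo`, no longer matches the first recorded event, and raises `NonDeterminismError`.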

Run & Stage Lifecycle

State transitions are persisted to Postgres. The allowed values are enforced by CHECK constraints in the schema. See migrations/sql/V2__runs.sql and migrations/sql/V3__stages.sql.

Run States (runs.status)

Lifecycle states: pending, running, passed, failed, cancelled

| State | Set by | Description |
| --- | --- | --- |
| pending | Webhook receiver | Row created; workflow not yet dispatched to Temporal. |
| running | Workflow start signal | Temporal workflow has started; at least one stage is active. |
| passed | Workflow completion | PR created successfully; all stages completed. |
| failed | Workflow terminal error | Non-retryable failure (e.g. StaleBaseBranch, SchemaValidationError). |
| cancelled | External signal / user action | Workflow cancelled before completion. |
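
As an application-side mirror of the CHECK constraint, the run states could be modeled as an enum. This is a minimal sketch: `RunStatus`, `parse_run_status`, and `is_terminal` are hypothetical names, and treating passed, failed, and cancelled as terminal is inferred from the descriptions above rather than stated by the schema.

```python
from enum import Enum


class RunStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    PASSED = "passed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# ASSUMPTION: these three states end a run; inferred from the state descriptions.
TERMINAL_RUN_STATUSES = frozenset(
    {RunStatus.PASSED, RunStatus.FAILED, RunStatus.CANCELLED}
)


def parse_run_status(value: str) -> RunStatus:
    """Reject any value outside the documented set before it reaches Postgres."""
    try:
        return RunStatus(value)
    except ValueError:
        raise ValueError(f"invalid runs.status: {value!r}") from None


def is_terminal(status: RunStatus) -> bool:
    return status in TERMINAL_RUN_STATUSES
```

Validating in the application gives a clearer error than a constraint violation, while the CHECK constraint remains the source of truth.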

Stage States (stages.status)

Standard path: pending, running, passed, failed, skipped, cancelled

Branch protection path: running, awaiting_approval, passed, failed

| State | Description |
| --- | --- |
| pending | Stage queued; activity not yet scheduled. |
| running | Temporal activity executing. |
| awaiting_approval | Activity blocked on an approvals row (e.g. BranchProtectionViolation). |
| passed | Activity completed successfully. |
| failed | Activity failed terminally. |
| skipped | Stage bypassed by workflow logic (future feature). |
| cancelled | Cancelled before execution. |

LLM Output Sanitization

LLM providers frequently wrap JSON responses in Markdown code fences. The deterministic SanitizeLlmOutput step runs between the raw LLM response and JSON parsing.

Order of operations per agent activity:

| Step | Operation | Output | Failure |
| --- | --- | --- | --- |
| 01 | LLMCall | raw string | |
| 02 | SanitizeLlmOutput | stripped string | MalformedLlmOutput |
| 03 | JSON.parse | object | |
| 04 | JSONSchemaValidate | validated object | SchemaValidationError |
| 05 | PersistArtifact | Postgres artifacts | |

The sanitizer is a pure function: it strips one leading ```json or ``` fence, strips one trailing ``` fence, and trims surrounding whitespace. It does not otherwise mutate the content. The sanitizer's semver is recorded in artifacts.content.meta.sanitizer for replay reproducibility. For the full specification, see Agents § SanitizeLlmOutput.
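
The stripping rules might look as follows in Python. This is a sketch under stated assumptions, not the specified implementation: the function name `sanitize_llm_output` is hypothetical, and raising `MalformedLlmOutput` on an empty result is an assumption, since the exact failure conditions are defined in Agents § SanitizeLlmOutput.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this listing readable

_LEADING_FENCE = re.compile(r"\A" + re.escape(FENCE) + r"(?:json)?[ \t]*\r?\n?")
_TRAILING_FENCE = re.compile(r"\r?\n?[ \t]*" + re.escape(FENCE) + r"\Z")


class MalformedLlmOutput(ValueError):
    """Raised when nothing parseable remains after stripping."""


def sanitize_llm_output(raw: str) -> str:
    """Strip one leading fence, one trailing fence, and surrounding whitespace."""
    text = raw.strip()
    text = _LEADING_FENCE.sub("", text, count=1)
    text = _TRAILING_FENCE.sub("", text, count=1)
    text = text.strip()
    if not text:
        # ASSUMPTION: an empty payload is treated as malformed output.
        raise MalformedLlmOutput("no content left after stripping fences")
    return text


# Steps 02 and 03 of the pipeline: sanitize, then parse.
fenced_response = FENCE + 'json\n{"tests": []}\n' + FENCE
parsed = json.loads(sanitize_llm_output(fenced_response))
```

Because the function only removes delimiters and whitespace at the edges, unfenced JSON passes through unchanged apart from trimming.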

Activity Idempotency Contract

Every Temporal activity that produces a side effect is designed to be safely re-executed. The idempotency key for each activity is derived deterministically:

idem_key = sha256( run_id || ":" || activity_name || ":" || canonical_json(input) )

Here || denotes string concatenation, and canonical_json serializes the input with keys sorted lexicographically and no extra whitespace. The key is stored in a future activity_idempotency table (not present in V1–V6 migrations; documented as required future schema).
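
A minimal Python sketch of the key derivation follows. The function names mirror the formula; encoding the payload as UTF-8, emitting the digest as lowercase hex, and leaving non-ASCII characters unescaped are assumptions that the formula does not pin down.

```python
import hashlib
import json
from typing import Any, Mapping


def canonical_json(value: Any) -> str:
    """Serialize with sorted keys and no insignificant whitespace."""
    # ASSUMPTION: non-ASCII characters are emitted as-is rather than \u-escaped.
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def idem_key(run_id: str, activity_name: str, activity_input: Mapping[str, Any]) -> str:
    """sha256(run_id || ":" || activity_name || ":" || canonical_json(input))."""
    payload = f"{run_id}:{activity_name}:{canonical_json(activity_input)}"
    # ASSUMPTION: UTF-8 bytes in, lowercase hex digest out.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Key order in the input must not change the resulting key.
key_a = idem_key("run-1", "CrawlRepo", {"ref": "main", "depth_level": 2})
key_b = idem_key("run-1", "CrawlRepo", {"depth_level": 2, "ref": "main"})
```

Sorting keys is what makes the hash stable across retries, since two semantically equal inputs may otherwise serialize with different key orders.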

Per-Activity Rules

| Activity | Idempotency mechanism | Replay behavior |
| --- | --- | --- |
| CrawlRepo | (run_id, ref, depth_level) → check for existing artifacts.kind='repo_crawler_output' before LLM call. | If artifact exists, return its artifact_id; skip LLM call entirely. |
| GenerateTestCases | Upstream artifact_id + depth_level form the idem key. | Same upstream SHA → reuse existing test_case_generator_output artifact. |
| GenerateTestCode | Upstream artifact_id + target_framework. | Same inputs → reuse existing test_engineer_output artifact. |
| CreatePullRequest | Git Trees API: tree SHA is deterministic from file contents. POST /git/refs with idempotency key header. Valkey pr_lock for serialization (see below). | 409 (ref exists) → return existing PR URL; non-retryable short-circuit. |
| IncrRepoConcurrency | Paired with DecrRepoConcurrency via run_id; Valkey safety EXPIRE 3600. | INCR is safe on retry; double-increment prevented by run_state gate in the app layer. |
| RateLimitCheck | ZADD member = run_id; duplicate member with same score is idempotent in sorted sets. | Replay does not inflate the sliding-window count. |
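
The RateLimitCheck rule relies on a sorted set holding one score per member. The Python model below sketches why a replayed ZADD leaves the window count unchanged. `SortedSetModel` is a hypothetical in-memory stand-in, not a Valkey client, and using the run's start timestamp as the score with a 60-second window is an assumption made for illustration.

```python
from typing import Dict


class SortedSetModel:
    """In-memory stand-in for a sorted set: each member maps to one score."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}

    def zadd(self, member: str, score: float) -> int:
        """Return 1 if the member is new, 0 if it already existed."""
        is_new = member not in self._scores
        self._scores[member] = score
        return int(is_new)

    def count_since(self, min_score: float) -> int:
        """Count members whose score is at least min_score."""
        return sum(1 for score in self._scores.values() if score >= min_score)


window = SortedSetModel()
now = 1_000.0          # ASSUMPTION: score is the run's start time in seconds
window_seconds = 60.0  # ASSUMPTION: hypothetical sliding-window length

first_attempt = window.zadd("run-a", now)   # original execution
replayed = window.zadd("run-a", now)        # Temporal retry of the same activity
other_run = window.zadd("run-b", now + 5)   # a genuinely different run
in_window = window.count_since(now - window_seconds)
```

Only distinct run_id values contribute to `in_window`, so a retried activity cannot push a repository over its limit on its own.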

Side-Effect Isolation Rule

All activities that touch external systems (GitHub API, LLM provider, Valkey, Postgres) must be at-most-once observable. Implementation requirements:

  • All Postgres writes use INSERT ... ON CONFLICT DO NOTHING or compare-and-swap via unique constraints.
  • All Valkey writes for rate-limiting use the semantics documented in infra/valkey/NAMESPACES.md.
  • GitHub writes use the Git Trees API (append-only; existing tree SHAs are reused).
  • Pure reads are unrestricted and require no idempotency handling.
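
The first requirement can be demonstrated with a small Python example. SQLite stands in for Postgres here purely so the sketch is self-contained (both accept `INSERT ... ON CONFLICT DO NOTHING`); the table layout and the unique key on (run_id, kind) are assumptions, not the schema from the migrations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts ("
    " run_id TEXT NOT NULL,"
    " kind TEXT NOT NULL,"
    " content TEXT NOT NULL,"
    " UNIQUE (run_id, kind))"  # ASSUMPTION: hypothetical uniqueness rule
)


def persist_artifact(run_id: str, kind: str, content: str) -> None:
    """Insert the artifact unless an identical key was already written."""
    conn.execute(
        "INSERT INTO artifacts (run_id, kind, content) VALUES (?, ?, ?) "
        "ON CONFLICT DO NOTHING",
        (run_id, kind, content),
    )
    conn.commit()


# The original execution writes the row; the retried execution is a no-op.
persist_artifact("run-1", "repo_crawler_output", "first attempt")
persist_artifact("run-1", "repo_crawler_output", "retried attempt")
rows = conn.execute("SELECT run_id, kind, content FROM artifacts").fetchall()
```

After both calls the table still holds a single row with the content from the first attempt, which is the observable behavior the rule asks for.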

Activity Retry Policy

| Parameter | Value |
| --- | --- |
| Initial interval | 2 seconds |
| Backoff coefficient | 2.0 |
| Maximum interval | 30 seconds |
| Maximum attempts | 20 |
| Non-retryable error types | StaleBaseBranch, BranchProtectionViolation, SchemaValidationError, GitHubForbidden, RateLimitExceeded |
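
To see what these values imply in practice, the wait before each retry can be computed as the initial interval multiplied by the coefficient raised to the retry index, capped at the maximum interval. The Python sketch below works through that arithmetic; the helper name `retry_delays` is hypothetical, and the total excludes the time the activity itself spends executing.

```python
from typing import List

INITIAL_INTERVAL_S = 2.0
BACKOFF_COEFFICIENT = 2.0
MAXIMUM_INTERVAL_S = 30.0
MAXIMUM_ATTEMPTS = 20


def retry_delays(
    initial: float = INITIAL_INTERVAL_S,
    coefficient: float = BACKOFF_COEFFICIENT,
    maximum_interval: float = MAXIMUM_INTERVAL_S,
    maximum_attempts: int = MAXIMUM_ATTEMPTS,
) -> List[float]:
    """Return the wait before each retry; N attempts allow N - 1 waits."""
    return [
        min(initial * coefficient ** retry, maximum_interval)
        for retry in range(maximum_attempts - 1)
    ]


delays = retry_delays()
total_wait_s = sum(delays)
```

With the table's values the waits are 2, 4, 8, and 16 seconds, followed by fifteen waits of 30 seconds, for 19 waits totalling 480 seconds (8 minutes) before the twentieth attempt fails terminally.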

Per-Repository PR Serialization

Concurrent runs targeting the same repository must not race to open duplicate pull requests. AMTP enforces serialization using a Valkey distributed lock. All activities, including CreatePullRequest, run on the single static task queue amtp-activities. There are no dynamic task queues.

Valkey pr_lock Protocol

Before performing any GitHub API write, CreatePullRequest acquires the per-repo lock:

SET amtp:rl:repo:{repo}:pr_lock {run_id} NX EX 120
  • NX — only sets if the key does not exist (atomic acquisition).
  • EX 120 — 120-second TTL as a safety net against a crashed activity that never releases the lock.
  • If SET NX returns nil (key held by another run), the activity raises RepoPrLockContended (retryable) and defers to Temporal’s standard activity backoff.

The lock is always released in a finally path using a Lua compare-and-delete script to ensure only the holder can release it:

-- Lua CAS release: only delete if the value matches our run_id
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end
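
The acquire and release steps could be wired together roughly as follows in Python. This is a sketch only: `with_pr_lock` and `FakeValkey` are hypothetical names, and the fake client emulates the compare-and-delete in Python instead of running Lua and ignores TTL expiry. The `set(..., nx=True, ex=120)` and `eval(script, numkeys, *keys_and_args)` call shapes follow the redis-py client; which client AMTP actually uses is not stated.

```python
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")

RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end
"""


class RepoPrLockContended(Exception):
    """Retryable: another run currently holds the per-repo lock."""


class FakeValkey:
    """In-memory stand-in exposing only the two calls this sketch needs."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, name: str, value: str, nx: bool = False, ex: Optional[int] = None) -> Optional[bool]:
        if nx and name in self._data:
            return None  # SET NX on an existing key yields nil
        self._data[name] = value  # TTL (ex) is ignored by this fake
        return True

    def eval(self, script: str, numkeys: int, *keys_and_args: str) -> int:
        # Emulates RELEASE_SCRIPT: delete only if the stored value matches.
        key, expected = keys_and_args[0], keys_and_args[1]
        if self._data.get(key) == expected:
            del self._data[key]
            return 1
        return 0


def with_pr_lock(client: FakeValkey, repo: str, run_id: str, critical_section: Callable[[], T]) -> T:
    """Run critical_section while holding the per-repo lock; always release."""
    key = f"amtp:rl:repo:{repo}:pr_lock"
    if not client.set(key, run_id, nx=True, ex=120):
        raise RepoPrLockContended(key)
    try:
        return critical_section()
    finally:
        client.eval(RELEASE_SCRIPT, 1, key, run_id)
```

Passing the run_id as the lock value is what lets the release script refuse to delete a lock that has expired and been re-acquired by a different run.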

CreatePullRequest Failure Classes

| Error | HTTP / source | Retryable | Outcome |
| --- | --- | --- | --- |
| RepoPrLockContended | Valkey SET NX returns nil | Yes (backoff) | Temporal retries with 2s initial interval, coefficient 2, max 30s interval. |
| GitTreesApi409 | GitHub 409 (ref already exists) | No | Idempotent replay: return existing PR artifact_id. |
| StaleBaseBranch | GitHub 422 (base tree out of date) | No | Terminal workflow failure (see section below). |
| BranchProtectionViolation | GitHub 403 | No | Writes approvals row; stage → awaiting_approval. |
| GitHubForbidden | GitHub 403 (not branch protection) | No | Terminal; token scope or repository permission issue. |
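
The GitHub-sourced rows of this table could be encoded as a small classifier, sketched below in Python. `FailureClass` and `classify_github_failure` are hypothetical names. The substring checks used to recognize a stale base tree and to separate the two 403 cases are assumptions for illustration; the document does not specify how response bodies are matched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureClass:
    name: str
    retryable: bool


def classify_github_failure(status: int, message: str) -> Optional[FailureClass]:
    """Map a GitHub error response onto the failure classes in the table."""
    text = message.lower()
    if status == 409:
        return FailureClass("GitTreesApi409", retryable=False)
    if status == 422 and "out of date" in text:
        # ASSUMPTION: a stale base tree is recognized by this phrase in the body.
        return FailureClass("StaleBaseBranch", retryable=False)
    if status == 403:
        # ASSUMPTION: branch-protection rejections mention "protected branch".
        if "protected branch" in text:
            return FailureClass("BranchProtectionViolation", retryable=False)
        return FailureClass("GitHubForbidden", retryable=False)
    return None  # not covered by the table; handling is left to the caller
```

Returning `None` for unlisted responses keeps the sketch from guessing at behavior the table does not define.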

Design rationale. AMTP implements no tree-rebase logic. Serialization is enforced by the Valkey lock, not by dynamic task-queue routing. Different repositories operate in full parallelism; only concurrent runs on the same repository are serialized.

StaleBaseBranch: Terminal Workflow State

A GitHub 422 response with a “base tree is out of date” body indicates that base_branch has advanced since the Repo Crawler captured the commit SHA. Because AMTP does not perform tree rebasing, this is a terminal non-retryable failure.

State Transitions on StaleBaseBranch

  1. Temporal activity fails with StaleBaseBranch (non-retryable).
  2. Parent workflow catches the failure and writes:
    • runs.status = 'failed', runs.finished_at = now()
    • stages.status = 'failed' for the active CreatePullRequest stage
    • A new artifact row: artifacts.kind = 'failure_report' with:
      {
        "error": "StaleBaseBranch",
        "base_branch": "<branch>",
        "observed_head": "<sha-at-crawl-time>",
        "expected_parent": "<sha-used-in-git-tree>"
      }
  3. The Temporal workflow execution terminates in a FAILED state.

Recovery Protocol

Recovery requires provisioning an entirely new pipeline execution via an external trigger:

  1. An external webhook fires — typically a subsequent push event on base_branch, or an explicit user-initiated POST /runs API call.
  2. The receiver provisions a new run_id via gen_random_uuid() (per migrations/sql/V2__runs.sql).
  3. The Repo Crawler re-crawls the repository against the new base_branch HEAD, producing a fresh repo_crawler_output artifact tied to the new run_id.
  4. The full pipeline proceeds as a fully independent workflow instance.

Rationale. A re-crawl against the advanced HEAD is the only way to guarantee the Test Case Generator and Test Engineer reason about the same source-of-truth that the PR targets. Any in-place rebase would require a second LLM round-trip, invalidating the stateless-agent context-reset guarantee and the idempotency of the crawler output artifact.

The StaleBaseBranch failure is observable in Postgres (runs.status = 'failed' with a failure_report artifact) and should be surfaced by the operational alerting layer.

GitHub Branch-Protection Contract

Token Scope (Documented Assumption)

The GitHub App installation token used by CreatePullRequest is provisioned with the following permissions:

| Permission | Scope |
| --- | --- |
| contents | write; required to create trees, commits, and refs via the Git Trees API. |
| pull_requests | write; required to open PRs. |
| bypass_branch_protections | Not granted. |
| administration | Not granted. |
| workflows | Not granted. |
| secrets | Not granted. |

Branch-Protection Rules AMTP Honors

| Rule | AMTP behavior |
| --- | --- |
| Required reviews | AMTP opens the PR. Merge is out of scope. Review gate is delegated to humans and the approvals table (migrations/sql/V5__approvals.sql). |
| Required status checks | AMTP never bypasses them. AMTP contributes no CI checks of its own (it does not execute tests). |
| Linear history / squash-only | Irrelevant to AMTP; it does not merge PRs. |
| Block force-pushes | AMTP only creates a new head ref via the Trees API. No refs are force-updated. |
| Restrict who can push | The GitHub App identity must be allow-listed if this rule is enabled. Otherwise the activity fails with BranchProtectionViolation. |
| Signed commits required | Out of current scope. GitHub App commits via Trees API are attributed to the app; GPG signing requires separate key provisioning. Tracked as a future requirement. |

BranchProtectionViolation Handling

A GitHub 403 response classified as a branch-protection error is non-retryable. The workflow:

  1. Writes an approvals row with decision = 'pending', approver = 'branch-protection', and comment = <github error body>.
  2. Transitions the active stage to awaiting_approval.
  3. Pauses the Temporal workflow pending an external signal.
  4. Resumes only when an external signal updates the approvals row to approved (and the operator has resolved the underlying protection issue).

awaiting_approval is a valid stages.status CHECK constraint value. See migrations/sql/V3__stages.sql.