Why replayability matters for internal automations
Internal automations fail in boring ways: a worker restarts mid-run, a webhook retries, a queue redelivers, or a human clicks “run” twice. The hardest part is not making the job run. It’s making the job safe to run again.
“Exactly-once” is the goal: each logical business action happens once, even if the automation is executed multiple times. In practice, exactly-once is achieved with a set of patterns: deduplication keys, checkpointing, and a poison-message quarantine. Together, they let you replay confidently, recover quickly, and keep bad inputs from blocking good work.
Define the unit of work before you design the safeguards
Exactly-once semantics depend on what “once” means. Decide the smallest business effect you must not duplicate. Examples:
- “Create an invoice for order 123.”
- “Send the renewal email for contract ABC for the May 2026 cycle.”
- “Post a journal entry for payout batch X.”
This definition drives your dedup key, your checkpoints, and the data you need to store to make replay deterministic.
Dedup keys that survive retries and replays
A dedup key is a stable identifier for the logical action. If the automation sees the same key again, it must return the previous result or no-op safely.
What makes a good dedup key
- Deterministic: derived from business identifiers, not timestamps or random UUIDs.
- Scoped: includes the action type and any cycle boundaries (daily run, billing period, etc.).
- Collision-resistant: avoid “userId” alone; prefer compound keys like `renewal_email:contractId:2026-05` (sketched below).
- Queryable: you can look it up quickly in a durable store.
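These properties fit in a small helper. A minimal sketch, assuming the action name, business identifiers, and cycle value come straight from your unit-of-work definition; the exact format is illustrative, not a standard:

```python
def dedup_key(action: str, *ids: str, cycle: str | None = None) -> str:
    """Deterministic, scoped key: action type + business IDs + optional cycle boundary."""
    parts = [action, *ids]
    if cycle:
        parts.append(cycle)
    return ":".join(parts)

# Same business inputs always produce the same key:
# dedup_key("renewal_email", "contractABC", cycle="2026-05")
#   -> "renewal_email:contractABC:2026-05"
```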
Where to enforce dedup
Enforce dedup at the boundary where side effects begin. Typical options:
- Your database: a table with a unique constraint on `dedup_key` and a stored result snapshot.
- The target system: some APIs support idempotency keys; still keep your own record for observability and audits.
- The workflow engine: useful for run-level dedup, but you still need a durable record for true replayability.
In code-first platforms such as windmill.dev, you typically implement the business-level dedup in your database (unique index + transaction) and let the platform handle retries, scheduling, and run history.
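A minimal sketch of the database option, using SQLite for portability (swap in your real store; the `runs` table and its columns are illustrative, not a prescribed schema). The unique key on `dedup_key` plus a single transactional insert either claims the key for this run or surfaces the outcome of the previous one:

```python
import sqlite3
import time

def open_store(path: str = "automation.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            dedup_key    TEXT PRIMARY KEY,           -- unique constraint: one row per logical action
            status       TEXT NOT NULL,              -- started | succeeded | failed | quarantined
            attempts     INTEGER NOT NULL DEFAULT 0,
            result       TEXT,                       -- e.g. created invoice ID
            payload_hash TEXT,                       -- hash of key inputs, for drift detection
            started_at   REAL,
            finished_at  REAL
        )
    """)
    return conn

def claim_or_previous(conn: sqlite3.Connection, key: str):
    """Claim the key for this run, or return (status, result) from an earlier attempt."""
    with conn:  # transaction around the claim
        cur = conn.execute(
            "INSERT OR IGNORE INTO runs (dedup_key, status, started_at) VALUES (?, 'started', ?)",
            (key, time.time()),
        )
        if cur.rowcount == 1:
            return None  # we own this run; proceed to side effects
    return conn.execute(
        "SELECT status, result FROM runs WHERE dedup_key = ?", (key,)
    ).fetchone()
```

If the returned status is “succeeded,” return the stored result and stop; a “started” or “failed” row means an earlier attempt did not finish, and the replay can proceed under the same key.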
Store outcomes, not just “seen keys”
Dedup should not only block duplicates; it should let you answer “what happened last time?” Store:
- status (started/succeeded/failed/quarantined)
- timestamps and attempt count
- result reference (e.g., created invoice ID)
- hash of key inputs for auditing and change detection
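Continuing the same illustrative schema, recording these fields is one update on the run row; the payload hash is what lets a later replay notice that the inputs changed under an already-used key:

```python
import hashlib
import json
import sqlite3
import time

def record_success(conn: sqlite3.Connection, key: str, result_ref: str, inputs: dict) -> None:
    # Hash the canonicalized inputs so audits and replays can detect payload drift.
    payload_hash = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    with conn:
        conn.execute(
            "UPDATE runs SET status = 'succeeded', result = ?, payload_hash = ?, "
            "finished_at = ?, attempts = attempts + 1 WHERE dedup_key = ?",
            (result_ref, payload_hash, time.time(), key),
        )
```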
Checkpointing to make long jobs safely restartable
Dedup keys prevent duplicated business actions. Checkpointing prevents duplicated work inside a single long-running action. Without checkpoints, a retry may reprocess thousands of items and re-trigger side effects unless everything is perfectly idempotent.
Choose the checkpoint granularity
Common granularities:
- Batch-level: “processed pages 0–9” with a cursor.
- Item-level: a row per item keyed by `dedup_key + item_id`.
- Phase-level: “fetched inputs,” “validated,” “wrote to DB,” “notified downstream.”
The right choice depends on the cost of redoing work. Item-level checkpoints are the safest for large fan-out pipelines. Phase-level checkpoints are simpler for multi-step flows with a few expensive calls.
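For item-level granularity, one workable shape (continuing the SQLite sketch; table and column names are assumptions) is a row per completed item keyed by `dedup_key + item_id`. Marking an item is idempotent, so a replay simply skips work that already committed:

```python
import sqlite3

def ensure_item_checkpoints(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS item_checkpoints (
            dedup_key TEXT NOT NULL,
            item_id   TEXT NOT NULL,
            PRIMARY KEY (dedup_key, item_id)
        )
    """)

def item_done(conn: sqlite3.Connection, key: str, item_id: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM item_checkpoints WHERE dedup_key = ? AND item_id = ?",
        (key, item_id),
    ).fetchone()
    return row is not None

def mark_item_done(conn: sqlite3.Connection, key: str, item_id: str) -> None:
    # Call this only after the item's side effect has committed.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO item_checkpoints (dedup_key, item_id) VALUES (?, ?)",
            (key, item_id),
        )
```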
Make checkpoints durable and monotonic
A checkpoint must be written to a store that survives worker restarts and deploys. It must also be monotonic: you only move forward. If you store a cursor, treat it as “last confirmed processed,” not “next to process,” to avoid off-by-one duplication after a crash.
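For cursor-style checkpoints, the monotonic rule can be enforced in the write itself: the upsert only ever moves the cursor forward, and the stored value always means “last confirmed processed.” A sketch under the same assumptions:

```python
import sqlite3
import time

def ensure_cursors(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS cursors (
            dedup_key      TEXT PRIMARY KEY,
            last_processed INTEGER NOT NULL,  -- last confirmed processed, never "next to process"
            updated_at     REAL
        )
    """)

def advance_cursor(conn: sqlite3.Connection, key: str, confirmed: int) -> None:
    """Move the cursor forward only; stale or duplicate writes are ignored."""
    with conn:
        conn.execute(
            "INSERT INTO cursors (dedup_key, last_processed, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(dedup_key) DO UPDATE SET "
            "  last_processed = excluded.last_processed, updated_at = excluded.updated_at "
            "  WHERE excluded.last_processed > cursors.last_processed",
            (key, confirmed, time.time()),
        )
```

On restart, read the cursor and resume at the next unit; if the crash happened after processing but before the cursor write, the worst case is reprocessing one already-protected unit, never skipping one.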
Replay strategy: re-run from the last safe boundary
When a run restarts, you do not want to “resume” in memory. You want to re-derive state from the checkpoint and re-execute deterministically. That means:
- inputs are logged or re-fetchable
- side effects are protected by dedup
- each step is safe to execute again
For teams that struggle with scattered pings and ad-hoc requests, a single intake path and explicit run records help keep replays accountable. It is the same idea as an issue intake contract for a single prioritized backlog, applied here to automation events and failures.
Poison-message quarantines to protect throughput
Some inputs will never succeed: malformed payloads, missing permissions, unexpected schema changes, or external systems returning permanent errors. If your queue keeps retrying these forever, you create backlog pressure and hide real incidents.
Define “poison” precisely
Quarantine rules should be explicit. Examples:
- Permanent errors: 400/422 responses, validation failures, missing required fields.
- Repeated transient errors: timeouts or 5xx errors that exceed a retry budget.
- Invariant violations: dedup key already succeeded but the payload hash changed unexpectedly.
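A classifier can be a small, explicit function that maps an error to one of three decisions. The status codes and retry budget below are illustrative for an HTTP-based integration, not a universal rule set:

```python
from enum import Enum

class Outcome(Enum):
    OK = "ok"
    RETRY = "retry"            # transient: back off and try again
    QUARANTINE = "quarantine"  # permanent: stop retrying, park for a human

def classify(status_code: int | None, attempt: int, retry_budget: int = 5) -> Outcome:
    if status_code is not None and status_code in (400, 404, 422):
        return Outcome.QUARANTINE          # permanent: validation failure or missing resource
    if attempt >= retry_budget:
        return Outcome.QUARANTINE          # transient errors that exhausted the retry budget
    if status_code is None or status_code >= 500:
        return Outcome.RETRY               # timeout or server-side failure
    return Outcome.OK
```

The third rule type, an invariant violation such as a changed payload hash on a key that already succeeded, is best checked separately, before any side effect runs.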
What to store in quarantine
- original message/payload (or a secure reference if sensitive)
- error classification and stack trace
- dedup key and correlation IDs
- the exact checkpoint/phase where it failed
- next action: drop, fix-and-replay, manual review
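A quarantine row then needs to carry enough context to make fix-and-replay possible. Continuing the illustrative schema (store a reference to sensitive payloads, not the payload itself):

```python
import sqlite3
import time

def ensure_quarantine(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS quarantine (
            dedup_key      TEXT NOT NULL,
            payload_ref    TEXT,                -- pointer to the original message, not the secret itself
            error_class    TEXT NOT NULL,       -- e.g. validation, permission, retry_budget_exhausted
            error_detail   TEXT,                -- message and stack trace
            failed_phase   TEXT,                -- checkpoint or phase where it failed
            next_action    TEXT DEFAULT 'manual_review',  -- drop | fix_and_replay | manual_review
            quarantined_at REAL
        )
    """)

def quarantine_item(conn: sqlite3.Connection, key: str, payload_ref: str,
                    error_class: str, error_detail: str, failed_phase: str) -> None:
    with conn:
        conn.execute(
            "INSERT INTO quarantine (dedup_key, payload_ref, error_class, error_detail, "
            "failed_phase, quarantined_at) VALUES (?, ?, ?, ?, ?, ?)",
            (key, payload_ref, error_class, error_detail, failed_phase, time.time()),
        )
        conn.execute("UPDATE runs SET status = 'quarantined' WHERE dedup_key = ?", (key,))
```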
Quarantine is not a dead-letter queue you forget. It’s an operational inbox with ownership and a defined resolution workflow.
Exactly-once in the real world is “effectively once”
Most systems cannot guarantee global exactly-once delivery across network boundaries. What you can guarantee is exactly-once effects for the business action you defined. The practical recipe looks like this:
- Dedup key: ensures one business effect per logical action.
- Checkpointing: ensures safe restarts without repeating expensive or risky sub-steps.
- Quarantine: prevents bad inputs from consuming unlimited retries and blocking the queue.
Implementation sketch you can adapt
A minimal, durable implementation usually includes three tables:
- runs: `dedup_key` (unique), status, result metadata.
- checkpoints: `dedup_key`, phase/cursor, updated_at.
- quarantine: payload reference, error, classification, owner, resolution state.
Then structure the automation like a transaction around side effects:
- Attempt to `INSERT` the run record with the dedup key. If it already exists and succeeded, return the stored outcome.
- Read checkpoint. Process the next unit. Write checkpoint only after the unit is safely committed.
- Classify errors. Retry transient failures with backoff. Quarantine permanent failures with enough context to replay after fixing.
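Putting the pieces together, a run (and any replay of it) can be one function. This reuses the illustrative helpers sketched earlier (`claim_or_previous`, `item_done`, `mark_item_done`, `quarantine_item`, `record_success`); `fetch_items`, `process_item`, and the two exception classes are hypothetical stand-ins for your real fetch, side effect, and error classification:

```python
import sqlite3

class TransientError(Exception): pass   # timeouts, 5xx: worth retrying
class PermanentError(Exception): pass   # validation, permissions: never retry

def run(conn: sqlite3.Connection, key: str, inputs: dict) -> str | None:
    previous = claim_or_previous(conn, key)
    if previous and previous[0] == "succeeded":
        return previous[1]                      # replay of a finished run: return stored outcome

    poisoned = 0
    for item in fetch_items(inputs):            # inputs must be logged or re-fetchable for replays
        if item_done(conn, key, item["id"]):
            continue                            # checkpoint says this item already committed
        try:
            process_item(item)                  # the side effect; idempotent or dedup-protected itself
        except TransientError:
            raise                               # let the platform or queue retry with backoff
        except PermanentError as exc:
            quarantine_item(conn, key, payload_ref=item["id"],
                            error_class="validation", error_detail=str(exc),
                            failed_phase="process_item")
            poisoned += 1
            continue                            # park the poison item; keep processing good ones
        mark_item_done(conn, key, item["id"])   # write the checkpoint only after the commit

    if poisoned:
        return None                             # run stays parked until quarantine is triaged
    result_ref = f"batch:{key}"                 # stand-in for a real result reference
    record_success(conn, key, result_ref, inputs)
    return result_ref
```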
Operational details that make the patterns stick
- Idempotent notifications: treat Slack/email/webhooks as side effects with their own dedup keys.
- Observability: log the dedup key in every step so you can trace a replay end-to-end.
- Schema evolution: version your payloads; store a payload hash to detect unexpected changes.
- Runbooks: define who owns quarantine triage and how “fix-and-replay” is performed safely.
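Idempotent notifications are the same dedup pattern applied one level down: each message gets its own key derived from the parent run, so replaying the run does not re-send it. `send_slack_message` is a hypothetical stand-in for your notifier, and the helpers come from the sketches above:

```python
def notify_once(conn, parent_key: str, channel: str, text: str) -> None:
    key = f"{parent_key}:notify:{channel}"      # the notification is its own unit of work
    if claim_or_previous(conn, key):
        return                                  # already sent (or at least already claimed)
    send_slack_message(channel, text)           # hypothetical notifier
    record_success(conn, key, result_ref="sent", inputs={"channel": channel, "text": text})
```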
When these pieces are in place, replays stop being scary. They become a normal operational tool: safe, auditable, and fast to recover.



