Why replayability matters for internal automations
Internal automations fail in boring ways: a worker restarts mid-run, a webhook retries, a queue redelivers, or a human clicks “run” twice. The hardest part is not making the job run. It’s making the job safe to run again.
“Exactly-once” is the goal: each logical business action happens once, even if the automation is executed multiple times. In practice, exactly-once is achieved with a set of patterns: deduplication keys, checkpointing, and a poison-message quarantine. Together, they let you replay confidently, recover quickly, and keep bad inputs from blocking good work.
Define the unit of work before you design the safeguards
Exactly-once semantics depend on what “once” means. Decide the smallest business effect you must not duplicate. Examples:
- “Create an invoice for order 123.”
- “Send the renewal email for contract ABC for the May 2026 cycle.”
- “Post a journal entry for payout batch X.”
This definition drives your dedup key, your checkpoints, and the data you need to store to make replay deterministic.
Dedup keys that survive retries and replays
A dedup key is a stable identifier for the logical action. If the automation sees the same key again, it must return the previous result or no-op safely.
What makes a good dedup key
- Deterministic: derived from business identifiers, not timestamps or random UUIDs.
- Scoped: includes the action type and any cycle boundaries (daily run, billing period, etc.).
- Collision-resistant: avoid “userId” alone; prefer compound keys like `renewal_email:contractId:2026-05` (sketched below).
- Queryable: you can look it up quickly in a durable store.
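These properties fit in a small helper. A minimal sketch, assuming the action name, business identifiers, and cycle value come straight from your unit-of-work definition; the exact format is illustrative, not a standard:

```python
def dedup_key(action: str, *ids: str, cycle: str | None = None) -> str:
    """Deterministic, scoped key: action type + business IDs + optional cycle boundary."""
    parts = [action, *ids]
    if cycle:
        parts.append(cycle)
    return ":".join(parts)

# Same business inputs always produce the same key:
# dedup_key("renewal_email", "contractABC", cycle="2026-05")
#   -> "renewal_email:contractABC:2026-05"
```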
Where to enforce dedup
Enforce dedup at the boundary where side effects begin. Typical options:
- Your database: a table with a unique constraint on `dedup_key` and a stored result snapshot.
- The target system: some APIs support idempotency keys; still keep your own record for observability and audits.
- The workflow engine: useful for run-level dedup, but you still need a durable record for true replayability.
In code-first platforms such as windmill.dev, you typically implement the business-level dedup in your database (unique index + transaction) and let the platform handle retries, scheduling, and run history.
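A minimal sketch of the database option, using SQLite for portability (swap in your real store; the `runs` table and its columns are illustrative, not a prescribed schema). The unique key on `dedup_key` plus a single transactional insert either claims the key for this run or surfaces the outcome of the previous one:

```python
import sqlite3
import time

def open_store(path: str = "automation.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            dedup_key    TEXT PRIMARY KEY,           -- unique constraint: one row per logical action
            status       TEXT NOT NULL,              -- started | succeeded | failed | quarantined
            attempts     INTEGER NOT NULL DEFAULT 0,
            result       TEXT,                       -- e.g. created invoice ID
            payload_hash TEXT,                       -- hash of key inputs, for drift detection
            started_at   REAL,
            finished_at  REAL
        )
    """)
    return conn

def claim_or_previous(conn: sqlite3.Connection, key: str):
    """Claim the key for this run, or return (status, result) from an earlier attempt."""
    with conn:  # transaction around the claim
        cur = conn.execute(
            "INSERT OR IGNORE INTO runs (dedup_key, status, started_at) VALUES (?, 'started', ?)",
            (key, time.time()),
        )
        if cur.rowcount == 1:
            return None  # we own this run; proceed to side effects
    return conn.execute(
        "SELECT status, result FROM runs WHERE dedup_key = ?", (key,)
    ).fetchone()
```

If the returned status is “succeeded,” return the stored result and stop; a “started” or “failed” row means an earlier attempt did not finish, and the replay can proceed under the same key.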
Store outcomes, not just “seen keys”
Dedup should not only block duplicates; it should let you answer “what happened last time?” Store:
- status (started/succeeded/failed/quarantined)
- timestamps and attempt count
- result reference (e.g., created invoice ID)
- hash of key inputs for auditing and change detection
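Continuing the same illustrative schema, recording these fields is one update on the run row; the payload hash is what lets a later replay notice that the inputs changed under an already-used key:

```python
import hashlib
import json
import sqlite3
import time

def record_success(conn: sqlite3.Connection, key: str, result_ref: str, inputs: dict) -> None:
    # Hash the canonicalized inputs so audits and replays can detect payload drift.
    payload_hash = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    with conn:
        conn.execute(
            "UPDATE runs SET status = 'succeeded', result = ?, payload_hash = ?, "
            "finished_at = ?, attempts = attempts + 1 WHERE dedup_key = ?",
            (result_ref, payload_hash, time.time(), key),
        )
```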
Checkpointing to make long jobs safely restartable
Dedup keys prevent duplicated business actions. Checkpointing prevents duplicated work inside a single long-running action. Without checkpoints, a retry may reprocess thousands of items and re-trigger side effects unless everything is perfectly idempotent.
Choose the checkpoint granularity
Common granularities:
- Batch-level: “processed pages 0–9” with a cursor.
- Item-level: a row per item keyed by `dedup_key + item_id`.
- Phase-level: “fetched inputs,” “validated,” “wrote to DB,” “notified downstream.”
The right choice depends on the cost of redoing work. Item-level checkpoints are the safest for large fan-out pipelines. Phase-level checkpoints are simpler for multi-step flows with a few expensive calls.
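For item-level granularity, one workable shape (continuing the SQLite sketch; table and column names are assumptions) is a row per completed item keyed by `dedup_key + item_id`. Marking an item is idempotent, so a replay simply skips work that already committed:

```python
import sqlite3

def ensure_item_checkpoints(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS item_checkpoints (
            dedup_key TEXT NOT NULL,
            item_id   TEXT NOT NULL,
            PRIMARY KEY (dedup_key, item_id)
        )
    """)

def item_done(conn: sqlite3.Connection, key: str, item_id: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM item_checkpoints WHERE dedup_key = ? AND item_id = ?",
        (key, item_id),
    ).fetchone()
    return row is not None

def mark_item_done(conn: sqlite3.Connection, key: str, item_id: str) -> None:
    # Call this only after the item's side effect has committed.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO item_checkpoints (dedup_key, item_id) VALUES (?, ?)",
            (key, item_id),
        )
```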
Make checkpoints durable and monotonic
A checkpoint must be written to a store that survives worker restarts and deploys. It must also be monotonic: you only move forward. If you store a cursor, treat it as “last confirmed processed,” not “next to process,” to avoid off-by-one duplication after a crash.
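For cursor-style checkpoints, the monotonic rule can be enforced in the write itself: the upsert only ever moves the cursor forward, and the stored value always means “last confirmed processed.” A sketch under the same assumptions:

```python
import sqlite3
import time

def ensure_cursors(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS cursors (
            dedup_key      TEXT PRIMARY KEY,
            last_processed INTEGER NOT NULL,  -- last confirmed processed, never "next to process"
            updated_at     REAL
        )
    """)

def advance_cursor(conn: sqlite3.Connection, key: str, confirmed: int) -> None:
    """Move the cursor forward only; stale or duplicate writes are ignored."""
    with conn:
        conn.execute(
            "INSERT INTO cursors (dedup_key, last_processed, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(dedup_key) DO UPDATE SET "
            "  last_processed = excluded.last_processed, updated_at = excluded.updated_at "
            "  WHERE excluded.last_processed > cursors.last_processed",
            (key, confirmed, time.time()),
        )
```

On restart, read the cursor and resume at the next unit; if the crash happened after processing but before the cursor write, the worst case is reprocessing one already-protected unit, never skipping one.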
Replay strategy: re-run from the last safe boundary
When a run restarts, you do not want to “resume” in memory. You want to re-derive state from the checkpoint and re-execute deterministically. That means:
- inputs are logged or re-fetchable
- side effects are protected by dedup
- each step is safe to execute again
For teams that struggle with scattered pings and ad-hoc requests, a single intake path and explicit run records help keep replays accountable. It is the same idea as an issue intake contract for a single prioritized backlog, applied here to automation events and failures.
Poison-message quarantines to protect throughput
Some inputs will never succeed: malformed payloads, missing permissions, unexpected schema changes, or external systems returning permanent errors. If your queue keeps retrying these forever, you create backlog pressure and hide real incidents.
Define “poison” precisely
Quarantine rules should be explicit. Examples:
- Permanent errors: 400/422 responses, validation failures, missing required fields.
- Repeated transient errors: timeouts or 5xx errors that exceed a retry budget.
- Invariant violations: dedup key already succeeded but the payload hash changed unexpectedly.
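A classifier can be a small, explicit function that maps an error to one of three decisions. The status codes and retry budget below are illustrative for an HTTP-based integration, not a universal rule set:

```python
from enum import Enum

class Outcome(Enum):
    OK = "ok"
    RETRY = "retry"            # transient: back off and try again
    QUARANTINE = "quarantine"  # permanent: stop retrying, park for a human

def classify(status_code: int | None, attempt: int, retry_budget: int = 5) -> Outcome:
    if status_code is not None and status_code in (400, 404, 422):
        return Outcome.QUARANTINE          # permanent: validation failure or missing resource
    if attempt >= retry_budget:
        return Outcome.QUARANTINE          # transient errors that exhausted the retry budget
    if status_code is None or status_code >= 500:
        return Outcome.RETRY               # timeout or server-side failure
    return Outcome.OK
```

The third rule type, an invariant violation such as a changed payload hash on a key that already succeeded, is best checked separately, before any side effect runs.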
What to store in quarantine
- original message/payload (or a secure reference if sensitive)
- error classification and stack trace
- dedup key and correlation IDs
- the exact checkpoint/phase where it failed
- next action: drop, fix-and-replay, manual review
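A quarantine row then needs to carry enough context to make fix-and-replay possible. Continuing the illustrative schema (store a reference to sensitive payloads, not the payload itself):

```python
import sqlite3
import time

def ensure_quarantine(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS quarantine (
            dedup_key      TEXT NOT NULL,
            payload_ref    TEXT,                -- pointer to the original message, not the secret itself
            error_class    TEXT NOT NULL,       -- e.g. validation, permission, retry_budget_exhausted
            error_detail   TEXT,                -- message and stack trace
            failed_phase   TEXT,                -- checkpoint or phase where it failed
            next_action    TEXT DEFAULT 'manual_review',  -- drop | fix_and_replay | manual_review
            quarantined_at REAL
        )
    """)

def quarantine_item(conn: sqlite3.Connection, key: str, payload_ref: str,
                    error_class: str, error_detail: str, failed_phase: str) -> None:
    with conn:
        conn.execute(
            "INSERT INTO quarantine (dedup_key, payload_ref, error_class, error_detail, "
            "failed_phase, quarantined_at) VALUES (?, ?, ?, ?, ?, ?)",
            (key, payload_ref, error_class, error_detail, failed_phase, time.time()),
        )
        conn.execute("UPDATE runs SET status = 'quarantined' WHERE dedup_key = ?", (key,))
```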
Quarantine is not a dead-letter queue you forget. It’s an operational inbox with ownership and a defined resolution workflow.
Exactly-once in the real world is “effectively once”
Most systems cannot guarantee global exactly-once delivery across network boundaries. What you can guarantee is exactly-once effects for the business action you defined. The practical recipe looks like this:
- Dedup key: ensures one business effect per logical action.
- Checkpointing: ensures safe restarts without repeating expensive or risky sub-steps.
- Quarantine: prevents bad inputs from consuming unlimited retries and blocking the queue.
Implementation sketch you can adapt
A minimal, durable implementation usually includes three tables:
- runs: `dedup_key` (unique), status, result metadata.
- checkpoints: `dedup_key`, phase/cursor, updated_at.
- quarantine: payload reference, error, classification, owner, resolution state.
Then structure the automation like a transaction around side effects:
- Attempt to `INSERT` the run record with the dedup key. If it already exists and succeeded, return the stored outcome.
- Read checkpoint. Process the next unit. Write checkpoint only after the unit is safely committed.
- Classify errors. Retry transient failures with backoff. Quarantine permanent failures with enough context to replay after fixing.
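Putting the pieces together, a run (and any replay of it) can be one function. This reuses the illustrative helpers sketched earlier (`claim_or_previous`, `item_done`, `mark_item_done`, `quarantine_item`, `record_success`); `fetch_items`, `process_item`, and the two exception classes are hypothetical stand-ins for your real fetch, side effect, and error classification:

```python
import sqlite3

class TransientError(Exception): pass   # timeouts, 5xx: worth retrying
class PermanentError(Exception): pass   # validation, permissions: never retry

def run(conn: sqlite3.Connection, key: str, inputs: dict) -> str | None:
    previous = claim_or_previous(conn, key)
    if previous and previous[0] == "succeeded":
        return previous[1]                      # replay of a finished run: return stored outcome

    poisoned = 0
    for item in fetch_items(inputs):            # inputs must be logged or re-fetchable for replays
        if item_done(conn, key, item["id"]):
            continue                            # checkpoint says this item already committed
        try:
            process_item(item)                  # the side effect; idempotent or dedup-protected itself
        except TransientError:
            raise                               # let the platform or queue retry with backoff
        except PermanentError as exc:
            quarantine_item(conn, key, payload_ref=item["id"],
                            error_class="validation", error_detail=str(exc),
                            failed_phase="process_item")
            poisoned += 1
            continue                            # park the poison item; keep processing good ones
        mark_item_done(conn, key, item["id"])   # write the checkpoint only after the commit

    if poisoned:
        return None                             # run stays parked until quarantine is triaged
    result_ref = f"batch:{key}"                 # stand-in for a real result reference
    record_success(conn, key, result_ref, inputs)
    return result_ref
```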
Operational details that make the patterns stick
- Idempotent notifications: treat Slack/email/webhooks as side effects with their own dedup keys.
- Observability: log the dedup key in every step so you can trace a replay end-to-end.
- Schema evolution: version your payloads; store a payload hash to detect unexpected changes.
- Runbooks: define who owns quarantine triage and how “fix-and-replay” is performed safely.
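Idempotent notifications are the same dedup pattern applied one level down: each message gets its own key derived from the parent run, so replaying the run does not re-send it. `send_slack_message` is a hypothetical stand-in for your notifier, and the helpers come from the sketches above:

```python
def notify_once(conn, parent_key: str, channel: str, text: str) -> None:
    key = f"{parent_key}:notify:{channel}"      # the notification is its own unit of work
    if claim_or_previous(conn, key):
        return                                  # already sent (or at least already claimed)
    send_slack_message(channel, text)           # hypothetical notifier
    record_success(conn, key, result_ref="sent", inputs={"channel": channel, "text": text})
```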
When these pieces are in place, replays stop being scary. They become a normal operational tool: safe, auditable, and fast to recover.



