Skip to content

fix(tbtc/signer): resolve the state-encryption key under the ENGINE_STATE guard, in write order#4114

Merged
mswilkison merged 2 commits into
extraction/frost-signer-mirror-2026-05-26from
fix/signer-statekey-offlock
Jun 27, 2026
Merged

fix(tbtc/signer): resolve the state-encryption key under the ENGINE_STATE guard, in write order#4114
mswilkison merged 2 commits into
extraction/frost-signer-mirror-2026-05-26from
fix/signer-statekey-offlock

Conversation

@mswilkison

@mswilkison mswilkison commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Resolves the at-rest state-encryption key under the held ENGINE_STATE guard, at the persist site, so key selection is ordered with the writes the guard already serializes.

This PR began as a perf change (resolve the key off the lock to keep the command-provider KMS subprocess from stalling ENGINE_STATE). A Codex review found that approach introduced a P1 state-loss bug, so the PR now lands the correct ordering instead.

The bug the off-lock approach introduced

ENGINE_STATE already serializes the persisted-state writes, but resolving the key before the lock left key selection out-of-order with the writes: a caller that resolved an older key (before a provider key-rotation) could win the last write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key-id. decode_encrypted_state_envelope accepts only the currently-resolved key-id, so a restart then rejects the file as a key-id mismatch and the persisted state is unreadable.

Fix

Resolve the key under the held guard, at the persist site — the last writer always encrypts with the then-current key. Non-marker sites call persist_engine_state_to_storage(&guard) (resolves under the guard, then persists). The two interactive marker sites resolve via state_encryption_key_material()? before inserting the durable consumed/aggregated marker (preserving fail-closed-before-marker + marker-rollback-on-persist-failure), then persist_..._with_key. The deferred require_resolved_state_key helper is removed.

Invariants preserved: idempotent replays / pre-persist rejections still return before any key resolution, so a transient key-provider outage cannot fail a non-persisting call (idempotent_build_tx_replay_survives_state_key_outage still holds).

Verification

Clean cargo build (no warnings); all 329 crate tests pass. An independent adversarial review confirmed all four ordering invariants at every converted site (0 correctness findings).

Tradeoff (tracked in #4121)

This reverses the off-lock optimization: under the command/KMS provider the subprocess now runs under the guard during actual persists (a bounded head-of-line stall; env provider is unaffected). Because the production profile mandates the command provider, this is the production path. Recovering off-lock safely needs added structure (a cheap provider key-id signal, a coordinated rotation protocol, or a partial-recovery persist-lock) — analyzed and tracked in #4121, gated on measuring whether the stall actually bottlenecks throughput.

🤖 Generated with Claude Code

…he outcome

persist_engine_state_to_storage resolved the state-encryption key inside
encode_encrypted_state_envelope, i.e. while the caller held the global
ENGINE_STATE mutex. For the `command` (KMS/HSM) provider that forks and waits on
the key subprocess on every state mutation while the lock is held, so a slow or
hung key command stalled every concurrent signer operation for up to the
configured timeout (default 30s, max 300s).

Split persist into a thin wrapper (cold paths and tests) and
persist_engine_state_to_storage_with_key, which takes pre-resolved key material.
Every hot per-operation path (run_dkg, start/finalize_sign_round,
build_taproot_tx, trigger_emergency_rekey, promote/rollback_canary,
refresh_shares, interactive round2/aggregate) now resolves the key with
state_encryption_key_material() BEFORE acquiring the lock, so the subprocess
never runs under the lock. Only the in-memory serialize + encrypt + atomic write
stay under the lock (they must, to preserve write ordering and avoid lost
updates between concurrent ops).

The resolution outcome is DEFERRED, not propagated eagerly: the key Result is
consulted (via require_resolved_state_key) only at a site that actually
persists. Idempotent replays and pre-persist rejections return without ever
consulting it, so a transient key-provider outage cannot turn a cached read
(e.g. re-serving an already-computed signature on a ROAST retry) into a failure.
In the interactive round2/aggregate paths, the key is resolved into a local
BEFORE the consumption/aggregation marker is inserted, so a key failure returns
cleanly (no marker written) rather than escaping the marker-rollback block.

Semantics preserved: the key is still resolved fresh on every call (no caching),
so external KMS rotation and trigger_emergency_rekey keep picking up the current
key.

Tests: idempotent_build_tx_replay_survives_state_key_outage (a replay returns
its cached tx with a failing key command) and
interactive_round2_state_key_failure_does_not_burn_attempt (a Round2 that fails
at the key step leaves no consumption marker and the attempt still succeeds on
retry).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a137e131-7de9-4ac3-8003-5c582484ec78

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/signer-statekey-offlock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

mswilkison added a commit that referenced this pull request Jun 27, 2026
… not before it

#4114 resolved the state-encryption key OFF the ENGINE_STATE lock to keep the
command-provider KMS subprocess from stalling the lock. But ENGINE_STATE already
serializes the persisted-state WRITES, so resolving the key before the lock left
key selection out-of-order with the writes: a caller that resolved an older key
(before a provider key rotation) could win the LAST write after a concurrent
caller already persisted the rotated key, tagging the single state file with the
stale key id. decode_encrypted_state_envelope accepts only the currently-resolved
key id, so a restart then rejects the file as a key-id mismatch and the persisted
state is unreadable.

Resolve the key UNDER the held ENGINE_STATE guard, at the persist site, so key
selection is in the same serialized order as the writes -- the last writer always
encrypts with the then-current key. Non-marker sites call
persist_engine_state_to_storage(&guard) (resolves under the guard then persists);
the two interactive marker sites resolve via state_encryption_key_material()?
BEFORE inserting the durable marker (preserving fail-closed-before-marker and the
marker-rollback-on-persist-failure), then persist_with_key. The deferred
require_resolved_state_key helper is removed.

Idempotent replays and pre-persist rejections still return before any key
resolution, so a transient key-provider outage cannot fail a non-persisting call
(the idempotent_build_tx_replay_survives_state_key_outage regression still holds).

TRADEOFF: this reverses #4114's off-lock optimization -- under the `command`
provider the KMS subprocess now runs under the guard during actual persists (a
bounded head-of-line stall; a hung command is bounded by terminate_state_key_command).
A future off-lock-preserving design would need a cheap rotation signal from the
key provider; recovering that is left as a follow-up.

Addresses the Codex P1 review on #4114.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mswilkison

Copy link
Copy Markdown
Contributor Author

Addressed the P1 (stale-key write-ordering race) in 19a3d8d.

Root cause confirmed: ENGINE_STATE already serializes the persisted-state writes, but the key was resolved before the lock — so key selection was out-of-order with the writes. A caller that resolved an older key (before a provider rotation) could win the last write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key-id; decode_encrypted_state_envelope accepts only the current key-id, so restart rejects it.

Fix: resolve the key under the held ENGINE_STATE guard, at the persist site, so key selection is in the same serialized order as the writes (the last writer always encrypts with the then-current key). Non-marker sites now call persist_engine_state_to_storage(&guard); the two interactive marker sites resolve via state_encryption_key_material()? before the marker insert (fail-closed-before-marker + marker-rollback intact). The deferred require_resolved_state_key helper is removed.

Invariants preserved: idempotent replays / pre-persist rejections still return before any key resolution (a key-provider outage can't fail a non-persisting call — the idempotent_build_tx_replay_survives_state_key_outage regression still holds). All 329 crate tests pass.

Tradeoff (explicitly flagged): this reverses the off-lock optimization this PR introduced — under the command provider the KMS subprocess now runs under the guard during actual persists (a bounded head-of-line stall; a hung command is bounded by terminate_state_key_command, and the env provider has no subprocess so is unaffected). Resolving under the guard is required for correctness because there's no cheap "is my key still current" signal and decode can't be made key-tolerant without storing old keys. Recovering off-lock behavior safely would need a cheap rotation signal from the key provider — left as a follow-up.

Reviewed via an independent adversarial pass: all four ordering invariants confirmed across every converted site; the only finding was this perf tradeoff (above).

@mswilkison mswilkison changed the title perf(tbtc/signer): resolve state-encryption key off the engine lock fix(tbtc/signer): resolve the state-encryption key under the ENGINE_STATE guard, in write order Jun 27, 2026
… not before it

#4114 resolved the state-encryption key OFF the ENGINE_STATE lock to keep the
command-provider KMS subprocess from stalling the lock. But ENGINE_STATE already
serializes the persisted-state WRITES, so resolving the key before the lock left
key selection out-of-order with the writes: a caller that resolved an older key
(before a provider key rotation) could win the LAST write after a concurrent
caller already persisted the rotated key, tagging the single state file with the
stale key id. decode_encrypted_state_envelope accepts only the currently-resolved
key id, so a restart then rejects the file as a key-id mismatch and the persisted
state is unreadable.

Resolve the key UNDER the held ENGINE_STATE guard, at the persist site, so key
selection is in the same serialized order as the writes -- the last writer always
encrypts with the then-current key. Non-marker sites call
persist_engine_state_to_storage(&guard) (resolves under the guard then persists);
the two interactive marker sites resolve via state_encryption_key_material()?
BEFORE inserting the durable marker (preserving fail-closed-before-marker and the
marker-rollback-on-persist-failure), then persist_with_key. The deferred
require_resolved_state_key helper is removed.

Idempotent replays and pre-persist rejections still return before any key
resolution, so a transient key-provider outage cannot fail a non-persisting call
(the idempotent_build_tx_replay_survives_state_key_outage regression still holds).

TRADEOFF: this reverses #4114's off-lock optimization -- under the `command`
provider the KMS subprocess now runs under the guard during actual persists (a
bounded head-of-line stall; a hung command is bounded by terminate_state_key_command).
A future off-lock-preserving design would need a cheap rotation signal from the
key provider; recovering that is left as a follow-up.

Addresses the Codex P1 review on #4114.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mswilkison mswilkison force-pushed the fix/signer-statekey-offlock branch from 19a3d8d to 77488ee Compare June 27, 2026 21:19
@mswilkison mswilkison merged commit a1232c7 into extraction/frost-signer-mirror-2026-05-26 Jun 27, 2026
20 checks passed
@mswilkison mswilkison deleted the fix/signer-statekey-offlock branch June 27, 2026 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant