fix(tbtc/signer): resolve the state-encryption key under the ENGINE_STATE guard, in write order by mswilkison · Pull Request #4114 · threshold-network/keep-core

mswilkison · 2026-06-26T16:36:56Z

Summary

Resolves the at-rest state-encryption key under the held ENGINE_STATE guard, at the persist site, so key selection is ordered with the writes the guard already serializes.

This PR began as a perf change (resolve the key off the lock to keep the command-provider KMS subprocess from stalling ENGINE_STATE). A Codex review found that approach introduced a P1 state-loss bug, so the PR now lands the correct ordering instead.

The bug the off-lock approach introduced

ENGINE_STATE already serializes the persisted-state writes, but resolving the key before the lock left key selection out-of-order with the writes: a caller that resolved an older key (before a provider key-rotation) could win the last write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key-id. decode_encrypted_state_envelope accepts only the currently-resolved key-id, so a restart then rejects the file as a key-id mismatch and the persisted state is unreadable.

Fix

Resolve the key under the held guard, at the persist site — the last writer always encrypts with the then-current key. Non-marker sites call persist_engine_state_to_storage(&guard) (resolves under the guard, then persists). The two interactive marker sites resolve via state_encryption_key_material()? before inserting the durable consumed/aggregated marker (preserving fail-closed-before-marker + marker-rollback-on-persist-failure), then persist_..._with_key. The deferred require_resolved_state_key helper is removed.

Invariants preserved: idempotent replays / pre-persist rejections still return before any key resolution, so a transient key-provider outage cannot fail a non-persisting call (idempotent_build_tx_replay_survives_state_key_outage still holds).

Verification

Clean cargo build (no warnings); all 329 crate tests pass. An independent adversarial review confirmed all four ordering invariants at every converted site (0 correctness findings).

Tradeoff (tracked in #4121)

This reverses the off-lock optimization: under the command/KMS provider the subprocess now runs under the guard during actual persists (a bounded head-of-line stall; env provider is unaffected). Because the production profile mandates the command provider, this is the production path. Recovering off-lock safely needs added structure (a cheap provider key-id signal, a coordinated rotation protocol, or a partial-recovery persist-lock) — analyzed and tracked in #4121, gated on measuring whether the stall actually bottlenecks throughput.

🤖 Generated with Claude Code

…he outcome persist_engine_state_to_storage resolved the state-encryption key inside encode_encrypted_state_envelope, i.e. while the caller held the global ENGINE_STATE mutex. For the `command` (KMS/HSM) provider that forks and waits on the key subprocess on every state mutation while the lock is held, so a slow or hung key command stalled every concurrent signer operation for up to the configured timeout (default 30s, max 300s). Split persist into a thin wrapper (cold paths and tests) and persist_engine_state_to_storage_with_key, which takes pre-resolved key material. Every hot per-operation path (run_dkg, start/finalize_sign_round, build_taproot_tx, trigger_emergency_rekey, promote/rollback_canary, refresh_shares, interactive round2/aggregate) now resolves the key with state_encryption_key_material() BEFORE acquiring the lock, so the subprocess never runs under the lock. Only the in-memory serialize + encrypt + atomic write stay under the lock (they must, to preserve write ordering and avoid lost updates between concurrent ops). The resolution outcome is DEFERRED, not propagated eagerly: the key Result is consulted (via require_resolved_state_key) only at a site that actually persists. Idempotent replays and pre-persist rejections return without ever consulting it, so a transient key-provider outage cannot turn a cached read (e.g. re-serving an already-computed signature on a ROAST retry) into a failure. In the interactive round2/aggregate paths, the key is resolved into a local BEFORE the consumption/aggregation marker is inserted, so a key failure returns cleanly (no marker written) rather than escaping the marker-rollback block. Semantics preserved: the key is still resolved fresh on every call (no caching), so external KMS rotation and trigger_emergency_rekey keep picking up the current key. Tests: idempotent_build_tx_replay_survives_state_key_outage (a replay returns its cached tx with a failing key command) and interactive_round2_state_key_failure_does_not_burn_attempt (a Round2 that fails at the key step leaves no consumption marker and the attempt still succeeds on retry). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-26T16:39:00Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a137e131-7de9-4ac3-8003-5c582484ec78

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/signer-statekey-offlock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

… not before it #4114 resolved the state-encryption key OFF the ENGINE_STATE lock to keep the command-provider KMS subprocess from stalling the lock. But ENGINE_STATE already serializes the persisted-state WRITES, so resolving the key before the lock left key selection out-of-order with the writes: a caller that resolved an older key (before a provider key rotation) could win the LAST write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key id. decode_encrypted_state_envelope accepts only the currently-resolved key id, so a restart then rejects the file as a key-id mismatch and the persisted state is unreadable. Resolve the key UNDER the held ENGINE_STATE guard, at the persist site, so key selection is in the same serialized order as the writes -- the last writer always encrypts with the then-current key. Non-marker sites call persist_engine_state_to_storage(&guard) (resolves under the guard then persists); the two interactive marker sites resolve via state_encryption_key_material()? BEFORE inserting the durable marker (preserving fail-closed-before-marker and the marker-rollback-on-persist-failure), then persist_with_key. The deferred require_resolved_state_key helper is removed. Idempotent replays and pre-persist rejections still return before any key resolution, so a transient key-provider outage cannot fail a non-persisting call (the idempotent_build_tx_replay_survives_state_key_outage regression still holds). TRADEOFF: this reverses #4114's off-lock optimization -- under the `command` provider the KMS subprocess now runs under the guard during actual persists (a bounded head-of-line stall; a hung command is bounded by terminate_state_key_command). A future off-lock-preserving design would need a cheap rotation signal from the key provider; recovering that is left as a follow-up. Addresses the Codex P1 review on #4114. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mswilkison · 2026-06-27T00:55:08Z

Addressed the P1 (stale-key write-ordering race) in 19a3d8d.

Root cause confirmed: ENGINE_STATE already serializes the persisted-state writes, but the key was resolved before the lock — so key selection was out-of-order with the writes. A caller that resolved an older key (before a provider rotation) could win the last write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key-id; decode_encrypted_state_envelope accepts only the current key-id, so restart rejects it.

Fix: resolve the key under the held ENGINE_STATE guard, at the persist site, so key selection is in the same serialized order as the writes (the last writer always encrypts with the then-current key). Non-marker sites now call persist_engine_state_to_storage(&guard); the two interactive marker sites resolve via state_encryption_key_material()? before the marker insert (fail-closed-before-marker + marker-rollback intact). The deferred require_resolved_state_key helper is removed.

Invariants preserved: idempotent replays / pre-persist rejections still return before any key resolution (a key-provider outage can't fail a non-persisting call — the idempotent_build_tx_replay_survives_state_key_outage regression still holds). All 329 crate tests pass.

Tradeoff (explicitly flagged): this reverses the off-lock optimization this PR introduced — under the command provider the KMS subprocess now runs under the guard during actual persists (a bounded head-of-line stall; a hung command is bounded by terminate_state_key_command, and the env provider has no subprocess so is unaffected). Resolving under the guard is required for correctness because there's no cheap "is my key still current" signal and decode can't be made key-tolerant without storing old keys. Recovering off-lock behavior safely would need a cheap rotation signal from the key provider — left as a follow-up.

Reviewed via an independent adversarial pass: all four ordering invariants confirmed across every converted site; the only finding was this perf tradeoff (above).

… not before it #4114 resolved the state-encryption key OFF the ENGINE_STATE lock to keep the command-provider KMS subprocess from stalling the lock. But ENGINE_STATE already serializes the persisted-state WRITES, so resolving the key before the lock left key selection out-of-order with the writes: a caller that resolved an older key (before a provider key rotation) could win the LAST write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key id. decode_encrypted_state_envelope accepts only the currently-resolved key id, so a restart then rejects the file as a key-id mismatch and the persisted state is unreadable. Resolve the key UNDER the held ENGINE_STATE guard, at the persist site, so key selection is in the same serialized order as the writes -- the last writer always encrypts with the then-current key. Non-marker sites call persist_engine_state_to_storage(&guard) (resolves under the guard then persists); the two interactive marker sites resolve via state_encryption_key_material()? BEFORE inserting the durable marker (preserving fail-closed-before-marker and the marker-rollback-on-persist-failure), then persist_with_key. The deferred require_resolved_state_key helper is removed. Idempotent replays and pre-persist rejections still return before any key resolution, so a transient key-provider outage cannot fail a non-persisting call (the idempotent_build_tx_replay_survives_state_key_outage regression still holds). TRADEOFF: this reverses #4114's off-lock optimization -- under the `command` provider the KMS subprocess now runs under the guard during actual persists (a bounded head-of-line stall; a hung command is bounded by terminate_state_key_command). A future off-lock-preserving design would need a cheap rotation signal from the key provider; recovering that is left as a follow-up. Addresses the Codex P1 review on #4114. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mswilkison mentioned this pull request Jun 27, 2026

tbtc-signer: recover off-lock state-key resolution for the command/KMS provider (perf follow-up to #4114) #4121

Open

mswilkison changed the title ~~perf(tbtc/signer): resolve state-encryption key off the engine lock~~ fix(tbtc/signer): resolve the state-encryption key under the ENGINE_STATE guard, in write order Jun 27, 2026

mswilkison force-pushed the fix/signer-statekey-offlock branch from 19a3d8d to 77488ee Compare June 27, 2026 21:19

mswilkison merged commit a1232c7 into extraction/frost-signer-mirror-2026-05-26 Jun 27, 2026
20 checks passed

mswilkison deleted the fix/signer-statekey-offlock branch June 27, 2026 22:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(tbtc/signer): resolve the state-encryption key under the ENGINE_STATE guard, in write order#4114

fix(tbtc/signer): resolve the state-encryption key under the ENGINE_STATE guard, in write order#4114
mswilkison merged 2 commits into
extraction/frost-signer-mirror-2026-05-26from
fix/signer-statekey-offlock

mswilkison commented Jun 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Review skipped

Uh oh!

mswilkison commented Jun 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mswilkison commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The bug the off-lock approach introduced

Fix

Verification

Tradeoff (tracked in #4121)

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

mswilkison commented Jun 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mswilkison commented Jun 26, 2026 •

edited

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading