fix(tbtc/signer): resolve the state-encryption key under the ENGINE_STATE guard, in write order#4114
Conversation
…he outcome persist_engine_state_to_storage resolved the state-encryption key inside encode_encrypted_state_envelope, i.e. while the caller held the global ENGINE_STATE mutex. For the `command` (KMS/HSM) provider that forks and waits on the key subprocess on every state mutation while the lock is held, so a slow or hung key command stalled every concurrent signer operation for up to the configured timeout (default 30s, max 300s). Split persist into a thin wrapper (cold paths and tests) and persist_engine_state_to_storage_with_key, which takes pre-resolved key material. Every hot per-operation path (run_dkg, start/finalize_sign_round, build_taproot_tx, trigger_emergency_rekey, promote/rollback_canary, refresh_shares, interactive round2/aggregate) now resolves the key with state_encryption_key_material() BEFORE acquiring the lock, so the subprocess never runs under the lock. Only the in-memory serialize + encrypt + atomic write stay under the lock (they must, to preserve write ordering and avoid lost updates between concurrent ops). The resolution outcome is DEFERRED, not propagated eagerly: the key Result is consulted (via require_resolved_state_key) only at a site that actually persists. Idempotent replays and pre-persist rejections return without ever consulting it, so a transient key-provider outage cannot turn a cached read (e.g. re-serving an already-computed signature on a ROAST retry) into a failure. In the interactive round2/aggregate paths, the key is resolved into a local BEFORE the consumption/aggregation marker is inserted, so a key failure returns cleanly (no marker written) rather than escaping the marker-rollback block. Semantics preserved: the key is still resolved fresh on every call (no caching), so external KMS rotation and trigger_emergency_rekey keep picking up the current key. Tests: idempotent_build_tx_replay_survives_state_key_outage (a replay returns its cached tx with a failing key command) and interactive_round2_state_key_failure_does_not_burn_attempt (a Round2 that fails at the key step leaves no consumption marker and the attempt still succeeds on retry). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
… not before it #4114 resolved the state-encryption key OFF the ENGINE_STATE lock to keep the command-provider KMS subprocess from stalling the lock. But ENGINE_STATE already serializes the persisted-state WRITES, so resolving the key before the lock left key selection out-of-order with the writes: a caller that resolved an older key (before a provider key rotation) could win the LAST write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key id. decode_encrypted_state_envelope accepts only the currently-resolved key id, so a restart then rejects the file as a key-id mismatch and the persisted state is unreadable. Resolve the key UNDER the held ENGINE_STATE guard, at the persist site, so key selection is in the same serialized order as the writes -- the last writer always encrypts with the then-current key. Non-marker sites call persist_engine_state_to_storage(&guard) (resolves under the guard then persists); the two interactive marker sites resolve via state_encryption_key_material()? BEFORE inserting the durable marker (preserving fail-closed-before-marker and the marker-rollback-on-persist-failure), then persist_with_key. The deferred require_resolved_state_key helper is removed. Idempotent replays and pre-persist rejections still return before any key resolution, so a transient key-provider outage cannot fail a non-persisting call (the idempotent_build_tx_replay_survives_state_key_outage regression still holds). TRADEOFF: this reverses #4114's off-lock optimization -- under the `command` provider the KMS subprocess now runs under the guard during actual persists (a bounded head-of-line stall; a hung command is bounded by terminate_state_key_command). A future off-lock-preserving design would need a cheap rotation signal from the key provider; recovering that is left as a follow-up. Addresses the Codex P1 review on #4114. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the P1 (stale-key write-ordering race) in Root cause confirmed: Fix: resolve the key under the held Invariants preserved: idempotent replays / pre-persist rejections still return before any key resolution (a key-provider outage can't fail a non-persisting call — the Tradeoff (explicitly flagged): this reverses the off-lock optimization this PR introduced — under the Reviewed via an independent adversarial pass: all four ordering invariants confirmed across every converted site; the only finding was this perf tradeoff (above). |
… not before it #4114 resolved the state-encryption key OFF the ENGINE_STATE lock to keep the command-provider KMS subprocess from stalling the lock. But ENGINE_STATE already serializes the persisted-state WRITES, so resolving the key before the lock left key selection out-of-order with the writes: a caller that resolved an older key (before a provider key rotation) could win the LAST write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key id. decode_encrypted_state_envelope accepts only the currently-resolved key id, so a restart then rejects the file as a key-id mismatch and the persisted state is unreadable. Resolve the key UNDER the held ENGINE_STATE guard, at the persist site, so key selection is in the same serialized order as the writes -- the last writer always encrypts with the then-current key. Non-marker sites call persist_engine_state_to_storage(&guard) (resolves under the guard then persists); the two interactive marker sites resolve via state_encryption_key_material()? BEFORE inserting the durable marker (preserving fail-closed-before-marker and the marker-rollback-on-persist-failure), then persist_with_key. The deferred require_resolved_state_key helper is removed. Idempotent replays and pre-persist rejections still return before any key resolution, so a transient key-provider outage cannot fail a non-persisting call (the idempotent_build_tx_replay_survives_state_key_outage regression still holds). TRADEOFF: this reverses #4114's off-lock optimization -- under the `command` provider the KMS subprocess now runs under the guard during actual persists (a bounded head-of-line stall; a hung command is bounded by terminate_state_key_command). A future off-lock-preserving design would need a cheap rotation signal from the key provider; recovering that is left as a follow-up. Addresses the Codex P1 review on #4114. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
19a3d8d to
77488ee
Compare
a1232c7
into
extraction/frost-signer-mirror-2026-05-26
Summary
Resolves the at-rest state-encryption key under the held
ENGINE_STATEguard, at the persist site, so key selection is ordered with the writes the guard already serializes.This PR began as a perf change (resolve the key off the lock to keep the
command-provider KMS subprocess from stallingENGINE_STATE). A Codex review found that approach introduced a P1 state-loss bug, so the PR now lands the correct ordering instead.The bug the off-lock approach introduced
ENGINE_STATEalready serializes the persisted-state writes, but resolving the key before the lock left key selection out-of-order with the writes: a caller that resolved an older key (before a provider key-rotation) could win the last write after a concurrent caller already persisted the rotated key, tagging the single state file with the stale key-id.decode_encrypted_state_envelopeaccepts only the currently-resolved key-id, so a restart then rejects the file as a key-id mismatch and the persisted state is unreadable.Fix
Resolve the key under the held guard, at the persist site — the last writer always encrypts with the then-current key. Non-marker sites call
persist_engine_state_to_storage(&guard)(resolves under the guard, then persists). The two interactive marker sites resolve viastate_encryption_key_material()?before inserting the durable consumed/aggregated marker (preserving fail-closed-before-marker + marker-rollback-on-persist-failure), thenpersist_..._with_key. The deferredrequire_resolved_state_keyhelper is removed.Invariants preserved: idempotent replays / pre-persist rejections still return before any key resolution, so a transient key-provider outage cannot fail a non-persisting call (
idempotent_build_tx_replay_survives_state_key_outagestill holds).Verification
Clean
cargo build(no warnings); all 329 crate tests pass. An independent adversarial review confirmed all four ordering invariants at every converted site (0 correctness findings).Tradeoff (tracked in #4121)
This reverses the off-lock optimization: under the
command/KMS provider the subprocess now runs under the guard during actual persists (a bounded head-of-line stall;envprovider is unaffected). Because theproductionprofile mandates thecommandprovider, this is the production path. Recovering off-lock safely needs added structure (a cheap provider key-id signal, a coordinated rotation protocol, or a partial-recovery persist-lock) — analyzed and tracked in #4121, gated on measuring whether the stall actually bottlenecks throughput.🤖 Generated with Claude Code