fix(frost): prefer FFI error code over substring for replay detection#3961
Merged
mswilkison merged 1 commit intoMay 22, 2026
Conversation
The companion tbtc-signer PR routes the consumed-attempt replay path
through a dedicated `EngineError::ConsumedAttemptReplay` variant whose
`EngineError::code()` value is the contract-stable string
`consumed_attempt_replay`. This consumer change makes
`isBuildTaggedTBTCSignerConsumedAttemptReplayError` prefer that code
when it is reachable through the error chain so any future cosmetic
rewording of the Rust message on either side cannot silently break
replay detection.
The structured form is plumbed through by introducing a new
`buildTaggedTBTCSignerStructuredError` type that carries `Code` and
`Message` from the FFI envelope. `buildTaggedTBTCSignerResultStatusError`
wraps that struct via `%w` so callers can extract it through
`errors.As`. The existing `buildTaggedTBTCSignerErrorMessage` is
refactored to `buildTaggedTBTCSignerErrorPayload` returning the struct
directly; the rendered error-chain string still has the form
`"code: message"` so log readers see no change.
The detector then implements a small policy ladder:
1. If a structured envelope carries the new
`consumed_attempt_replay` code, return true.
2. If a structured envelope carries the legacy `validation_error`
code (pre-dedicated-variant signers route the replay path through
`EngineError::Validation`), substring-match only the Message
field so unrelated `validation_error`s carrying noise in their
wrapping chain are not mistaken for replays.
3. Any other recognized code is authoritative and not a replay.
4. If no structured envelope is reachable at all (pre-envelope
signer builds), fall back to substring-matching the whole
rendered string. The legacy wording is preserved by the current
tbtc-signer release so this branch continues to work during the
rolling upgrade window.
Two new test functions cover the new behavior:
- `TestIsBuildTaggedTBTCSignerConsumedAttemptReplayError` runs eight
cases through the detector: nil, structured code matches, structured
but different code rejects, structured empty code with legacy
wording in the message accepts, plain-wrapper string accepts, legacy
`validation_error` with replay wording accepts, `validation_error`
with unrelated message rejects, unrelated error rejects.
- `TestBuildTaggedTBTCSignerErrorPayload` exercises the new decoder
against structured envelopes, message-only envelopes, completely
empty envelopes, and malformed payloads.
Verification (local, GOCACHE under /private/tmp):
go test -tags 'frost_native frost_tbtc_signer' ./pkg/frost/... ./pkg/bitcoin
go test -tags 'frost_native frost_tbtc_signer' ./pkg/tbtc -run \
'TestConfigureFrostSigningBackend|TestNewNode_ConfiguresFrostSigningBackend|TestSigningExecutor_Sign|TestRegisterSignerMaterialResolverForBuild'
All pass.
This is the keep-core side of L5 from the independent PR #3866 review.
The Rust side is the matching tlabs-xyz/tbtc PR that introduces the
`consumed_attempt_replay` code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
Base automatically changed from
fix/frost-readiness-dkg-placeholder-pubkeys-2026-05-22
to
feat/frost-schnorr-migration-scaffold
May 22, 2026 22:48
e72867a
into
feat/frost-schnorr-migration-scaffold
15 checks passed
mswilkison
added a commit
that referenced
this pull request
May 22, 2026
…3962) ## Summary Adds **RFC-21** as the design doc that scopes the M4 (transition evidence) and M7 (ROAST-aware retry) findings from the independent review of #3866 into a single layered design and a phased, PR-sized implementation plan. This PR is **doc-only**. It introduces no behaviour change. Subsequent implementation PRs reference RFC-21 in their descriptions. Stacked on #3961. ## Why one design, not two M4 and M7 share the same notion of *attempt context* and *transition evidence*: - Fixing M4 alone produces evidence that no consumer reads. - Fixing M7 alone gives the consumer nothing to drive retry decisions on. The RFC treats them as one design split into linear phases. ## Phasing - **Phase 0** -- this RFC. - **Phase 1** -- `AttemptContext` type + canonical hash; protocol messages carry attempt-context binding (optional during migration). - **Phase 2** -- receiver overflow tracking (M4 layer A) plumbed through the three `select { default }` drop sites, default no-op. - **Phase 3** -- coordinator state machine: `BeginAttempt`, `RecordEvidence`, `NextAttempt`. Deterministic `(AttemptContext, TransitionEvidence) -> AttemptContext` map. - **Phase 4** -- wire receiver to coordinator behind `frost_roast_retry` build tag. - **Phase 5** -- retry adapter + `EvaluateRoastRetryForSigning`; migrate first call site behind the build tag with readiness-gate guard. - **Phase 6** -- migrate remaining call sites; delete the byte-identical-to-tECDSA shuffle once unused. - **Phase 7** -- flip the readiness manifest to `present` once Phase 6 ships and integration tests run against a real testnet (only then; no early flip). ## Open questions called out explicitly The RFC lists four open design questions that need cross-team review before Phase 3 lands: 1. Cross-process coordinator agreement -- gossip topic choice. 2. Persistence across signer restart. 3. FFI surface (Rust signer error-code style; follows the L5 pattern from #425 / #3961). 4. Backward-compat horizon for the `AttemptContextHash` field. ## Out of scope - DKG retry (separate RFC). - Bitcoin transaction-builder changes. - Operator UX changes (CLI, dashboards) -- land alongside Phase 5/6. - Cross-domain ROAST between keep-core and tbtc-signer. ## Test plan - [ ] Reviewer reads RFC end-to-end. - [ ] Reviewer flags any phase that should be split further or reordered before Phase 1 begins. - [ ] Reviewer answers the four open questions or marks them defer-to-Phase-3. No code change in this PR, so no CI test run is meaningful beyond asciidoc rendering.
3 tasks
mswilkison
added a commit
that referenced
this pull request
May 23, 2026
## Summary Promotes the Phase-3 design decisions settled in the **2026-05-22** cross-team review into a dedicated **Resolved Decisions** section of RFC-21. Doc-only; +180/-38. ## Why The previous draft listed Phase-3 questions under \"Open questions\" with a *recommended-entering-Phase-3* path that turned out, on review, to have a critical safety gap: the all-to-all signed-evidence gossip recommendation silently assumed gossip is synchronously consistent across the signer set. In practice gossip is eventually consistent, so two honest signers can hold divergent evidence sets at the moment the deterministic \`NextAttempt\` boundary triggers, producing divergent next-attempt contexts and fracturing the group. This PR locks the replacement design before Phase 3 implementation PRs begin landing. ## What the resolved-decisions section pins | Decision | Resolution | |---|---| | Cross-process coordinator agreement | **Coordinator-proposed aggregation** on a dedicated evidence topic, signed with operator key, receiver-side bundle verification for censorship detection. All-to-all gossip + local union is rejected with rationale. | | Source of \`DkgGroupPublicKey\` for seed | Extracted from FFI signer material at attempt construction time. No wallet-registry lookup on hot path. | | \`AttemptContext\` ↔ \`NativeExecutionFFISigningRequest\` | Field on request struct; Go-side orchestration only; does not cross CGO boundary. | | \`SelectCoordinator\` retention | Keep as helper; \`BeginAttempt\` bridges \`[32]byte\` seed to legacy \`int64\` via a sterile, named adapter. | | Evidence-signing key | Reuse existing operator key. | | Evidence message format | JSON wrapped in existing \`pkg/net/gen/pb\` envelope; routed via \`net.Message\`. | | Maximum evidence-message size | Single \`TransitionMessage\` per transition, ~10-20 KiB at 100-signer saturation. No chunking. | | Silence-parking transience (risk mitigation) | Strictly single-attempt skip, no escalation. A peer falsely labelled silent is reinstated by the very next attempt. | ## Layer-B exclusion-policy strengthening The exclusion-policy list in Layer B is extended with explicit \"no escalation\" wording for the silence/parking case. The risk Gemini's review surfaced (late-arriving evidence weaponised into permanent exclusion) is bounded by: - Silence parking ≤ 1 attempt. - Permanent exclusion only fires on overflow (transport-blamable) or non-transport reject (validation-blamable). Neither can trigger on a slow-but-honest peer. - Receiver-side bundle verification catches a coordinator that tries to censor an honest peer's signed snapshot. ## Open questions reduced to three What remains in the Open Questions section is genuinely open: - Persistence across signer restart (Phase 5+). - FFI surface for future exclusion-relevant errors (follows the L5 pattern from #425 / #3961). - \`AttemptContextHash\` backward-compat horizon (Phase 6+). ## Test plan - [ ] Reviewer reads the Resolved decisions section end-to-end. - [ ] Reviewer confirms the coordinator-aggregation flow as documented matches the agreed design. - [ ] AsciiDoc renders cleanly (CI step \`Publish contracts documentation\` covers this). No code change; no behaviour-test surface.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
`isBuildTaggedTBTCSignerConsumedAttemptReplayError` previously did a case-insensitive substring match on `err.Error()` looking for `attempt_id` plus `already consumed for sign attempt`. Any cosmetic rewording of the Rust-side format string on either side would silently disable replay detection. This PR makes the detector prefer the structured `code` field on the FFI error envelope when it is reachable, falling back to substring matching only for older signer builds.
Stacked on #3960.
The matching Rust-side change is a separate PR on `tlabs-xyz/tbtc` that introduces the dedicated `EngineError::ConsumedAttemptReplay` variant with stable code `consumed_attempt_replay`.
Changes
Verification