Skip to content

fix(frost): prefer FFI error code over substring for replay detection#3961

Merged
mswilkison merged 1 commit into
feat/frost-schnorr-migration-scaffoldfrom
fix/frost-readiness-replay-error-codes-2026-05-22
May 22, 2026
Merged

fix(frost): prefer FFI error code over substring for replay detection#3961
mswilkison merged 1 commit into
feat/frost-schnorr-migration-scaffoldfrom
fix/frost-readiness-replay-error-codes-2026-05-22

Conversation

@mswilkison

Copy link
Copy Markdown
Contributor

Summary

`isBuildTaggedTBTCSignerConsumedAttemptReplayError` previously did a case-insensitive substring match on `err.Error()` looking for `attempt_id` plus `already consumed for sign attempt`. Any cosmetic rewording of the Rust-side format string on either side would silently disable replay detection. This PR makes the detector prefer the structured `code` field on the FFI error envelope when it is reachable, falling back to substring matching only for older signer builds.

Stacked on #3960.

The matching Rust-side change is a separate PR on `tlabs-xyz/tbtc` that introduces the dedicated `EngineError::ConsumedAttemptReplay` variant with stable code `consumed_attempt_replay`.

Changes

  • New `buildTaggedTBTCSignerStructuredError` type carries `Code` and `Message` from the FFI envelope. `buildTaggedTBTCSignerResultStatusError` wraps it via `%w` so callers can extract via `errors.As`.
  • `buildTaggedTBTCSignerErrorMessage` refactored into `buildTaggedTBTCSignerErrorPayload` returning the struct. Rendered chain string still formats as `"code: message"` so log readers see no change.
  • Detector implements a small policy ladder:
    1. Structured envelope with `consumed_attempt_replay` code → replay.
    2. Structured envelope with legacy `validation_error` code → substring-match only the `Message` field (pre-dedicated-variant signers route the replay path through Validation).
    3. Any other recognized code → not a replay (authoritative).
    4. No structured envelope reachable → substring-match the whole rendered string (pre-envelope signers).
  • Two new tests:
    • `TestIsBuildTaggedTBTCSignerConsumedAttemptReplayError` — eight cases covering all four ladder branches.
    • `TestBuildTaggedTBTCSignerErrorPayload` — structured / message-only / empty / malformed payloads.

Verification

  • `go test -tags 'frost_native frost_tbtc_signer' ./pkg/frost/... ./pkg/bitcoin` — PASS
  • `go test -tags 'frost_native frost_tbtc_signer' ./pkg/tbtc -run 'TestConfigureFrostSigningBackend|TestNewNode_ConfiguresFrostSigningBackend|TestSigningExecutor_Sign|TestRegisterSignerMaterialResolverForBuild'` — PASS

The companion tbtc-signer PR routes the consumed-attempt replay path
through a dedicated `EngineError::ConsumedAttemptReplay` variant whose
`EngineError::code()` value is the contract-stable string
`consumed_attempt_replay`. This consumer change makes
`isBuildTaggedTBTCSignerConsumedAttemptReplayError` prefer that code
when it is reachable through the error chain so any future cosmetic
rewording of the Rust message on either side cannot silently break
replay detection.

The structured form is plumbed through by introducing a new
`buildTaggedTBTCSignerStructuredError` type that carries `Code` and
`Message` from the FFI envelope. `buildTaggedTBTCSignerResultStatusError`
wraps that struct via `%w` so callers can extract it through
`errors.As`. The existing `buildTaggedTBTCSignerErrorMessage` is
refactored to `buildTaggedTBTCSignerErrorPayload` returning the struct
directly; the rendered error-chain string still has the form
`"code: message"` so log readers see no change.

The detector then implements a small policy ladder:

  1. If a structured envelope carries the new
     `consumed_attempt_replay` code, return true.
  2. If a structured envelope carries the legacy `validation_error`
     code (pre-dedicated-variant signers route the replay path through
     `EngineError::Validation`), substring-match only the Message
     field so unrelated `validation_error`s carrying noise in their
     wrapping chain are not mistaken for replays.
  3. Any other recognized code is authoritative and not a replay.
  4. If no structured envelope is reachable at all (pre-envelope
     signer builds), fall back to substring-matching the whole
     rendered string. The legacy wording is preserved by the current
     tbtc-signer release so this branch continues to work during the
     rolling upgrade window.

Two new test functions cover the new behavior:

- `TestIsBuildTaggedTBTCSignerConsumedAttemptReplayError` runs eight
  cases through the detector: nil, structured code matches, structured
  but different code rejects, structured empty code with legacy
  wording in the message accepts, plain-wrapper string accepts, legacy
  `validation_error` with replay wording accepts, `validation_error`
  with unrelated message rejects, unrelated error rejects.
- `TestBuildTaggedTBTCSignerErrorPayload` exercises the new decoder
  against structured envelopes, message-only envelopes, completely
  empty envelopes, and malformed payloads.

Verification (local, GOCACHE under /private/tmp):

  go test -tags 'frost_native frost_tbtc_signer' ./pkg/frost/... ./pkg/bitcoin
  go test -tags 'frost_native frost_tbtc_signer' ./pkg/tbtc -run \
    'TestConfigureFrostSigningBackend|TestNewNode_ConfiguresFrostSigningBackend|TestSigningExecutor_Sign|TestRegisterSignerMaterialResolverForBuild'

All pass.

This is the keep-core side of L5 from the independent PR #3866 review.
The Rust side is the matching tlabs-xyz/tbtc PR that introduces the
`consumed_attempt_replay` code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from fix/frost-readiness-dkg-placeholder-pubkeys-2026-05-22 to feat/frost-schnorr-migration-scaffold May 22, 2026 22:48
@mswilkison mswilkison merged commit e72867a into feat/frost-schnorr-migration-scaffold May 22, 2026
15 checks passed
@mswilkison mswilkison deleted the fix/frost-readiness-replay-error-codes-2026-05-22 branch May 22, 2026 22:48
mswilkison added a commit that referenced this pull request May 22, 2026
…3962)

## Summary

Adds **RFC-21** as the design doc that scopes the M4 (transition
evidence) and M7 (ROAST-aware retry) findings from the independent
review of #3866 into a single layered design and a phased,
PR-sized implementation plan.

This PR is **doc-only**. It introduces no behaviour change. Subsequent
implementation PRs reference RFC-21 in their descriptions.

Stacked on #3961.

## Why one design, not two

M4 and M7 share the same notion of *attempt context* and *transition
evidence*:

- Fixing M4 alone produces evidence that no consumer reads.
- Fixing M7 alone gives the consumer nothing to drive retry decisions
on.

The RFC treats them as one design split into linear phases.

## Phasing

- **Phase 0** -- this RFC.
- **Phase 1** -- `AttemptContext` type + canonical hash; protocol
  messages carry attempt-context binding (optional during migration).
- **Phase 2** -- receiver overflow tracking (M4 layer A) plumbed
  through the three `select { default }` drop sites, default no-op.
- **Phase 3** -- coordinator state machine: `BeginAttempt`,
  `RecordEvidence`, `NextAttempt`. Deterministic
  `(AttemptContext, TransitionEvidence) -> AttemptContext` map.
- **Phase 4** -- wire receiver to coordinator behind
  `frost_roast_retry` build tag.
- **Phase 5** -- retry adapter +
  `EvaluateRoastRetryForSigning`; migrate first call site behind
  the build tag with readiness-gate guard.
- **Phase 6** -- migrate remaining call sites; delete the
  byte-identical-to-tECDSA shuffle once unused.
- **Phase 7** -- flip the readiness manifest to `present` once Phase
  6 ships and integration tests run against a real testnet (only
  then; no early flip).

## Open questions called out explicitly

The RFC lists four open design questions that need cross-team
review before Phase 3 lands:

1. Cross-process coordinator agreement -- gossip topic choice.
2. Persistence across signer restart.
3. FFI surface (Rust signer error-code style; follows the L5
   pattern from #425 / #3961).
4. Backward-compat horizon for the `AttemptContextHash` field.

## Out of scope

- DKG retry (separate RFC).
- Bitcoin transaction-builder changes.
- Operator UX changes (CLI, dashboards) -- land alongside Phase 5/6.
- Cross-domain ROAST between keep-core and tbtc-signer.

## Test plan

- [ ] Reviewer reads RFC end-to-end.
- [ ] Reviewer flags any phase that should be split further or
  reordered before Phase 1 begins.
- [ ] Reviewer answers the four open questions or marks them
  defer-to-Phase-3.

No code change in this PR, so no CI test run is meaningful beyond
asciidoc rendering.
mswilkison added a commit that referenced this pull request May 23, 2026
## Summary

Promotes the Phase-3 design decisions settled in the **2026-05-22**
cross-team review into a dedicated **Resolved Decisions** section
of RFC-21. Doc-only; +180/-38.

## Why

The previous draft listed Phase-3 questions under \"Open questions\"
with a *recommended-entering-Phase-3* path that turned out, on
review, to have a critical safety gap: the all-to-all signed-evidence
gossip recommendation silently assumed gossip is synchronously
consistent across the signer set. In practice gossip is eventually
consistent, so two honest signers can hold divergent evidence sets
at the moment the deterministic \`NextAttempt\` boundary triggers,
producing divergent next-attempt contexts and fracturing the group.

This PR locks the replacement design before Phase 3 implementation
PRs begin landing.

## What the resolved-decisions section pins

| Decision | Resolution |
|---|---|
| Cross-process coordinator agreement | **Coordinator-proposed
aggregation** on a dedicated evidence topic, signed with operator key,
receiver-side bundle verification for censorship detection. All-to-all
gossip + local union is rejected with rationale. |
| Source of \`DkgGroupPublicKey\` for seed | Extracted from FFI signer
material at attempt construction time. No wallet-registry lookup on hot
path. |
| \`AttemptContext\` ↔ \`NativeExecutionFFISigningRequest\` | Field on
request struct; Go-side orchestration only; does not cross CGO boundary.
|
| \`SelectCoordinator\` retention | Keep as helper; \`BeginAttempt\`
bridges \`[32]byte\` seed to legacy \`int64\` via a sterile, named
adapter. |
| Evidence-signing key | Reuse existing operator key. |
| Evidence message format | JSON wrapped in existing \`pkg/net/gen/pb\`
envelope; routed via \`net.Message\`. |
| Maximum evidence-message size | Single \`TransitionMessage\` per
transition, ~10-20 KiB at 100-signer saturation. No chunking. |
| Silence-parking transience (risk mitigation) | Strictly single-attempt
skip, no escalation. A peer falsely labelled silent is reinstated by the
very next attempt. |

## Layer-B exclusion-policy strengthening

The exclusion-policy list in Layer B is extended with explicit
\"no escalation\" wording for the silence/parking case. The risk
Gemini's review surfaced (late-arriving evidence weaponised into
permanent exclusion) is bounded by:

- Silence parking ≤ 1 attempt.
- Permanent exclusion only fires on overflow (transport-blamable)
  or non-transport reject (validation-blamable). Neither can
  trigger on a slow-but-honest peer.
- Receiver-side bundle verification catches a coordinator that
  tries to censor an honest peer's signed snapshot.

## Open questions reduced to three

What remains in the Open Questions section is genuinely open:

- Persistence across signer restart (Phase 5+).
- FFI surface for future exclusion-relevant errors (follows the
  L5 pattern from #425 / #3961).
- \`AttemptContextHash\` backward-compat horizon (Phase 6+).

## Test plan

- [ ] Reviewer reads the Resolved decisions section end-to-end.
- [ ] Reviewer confirms the coordinator-aggregation flow as
  documented matches the agreed design.
- [ ] AsciiDoc renders cleanly (CI step \`Publish contracts
  documentation\` covers this).

No code change; no behaviour-test surface.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant