feat(machine-validation): Implement M2 machine validation heartbeat recovery#2838
Conversation
Signed-off-by: Sunil Kumar <sunilkumar@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Summary by CodeRabbit
WalkthroughThis PR adds heartbeat tracking for machine validation runs, including a new database timestamp, RPC contract, server handler, persistence logic, stale reconciliation, client-side periodic heartbeats, and updated tests. ChangesMachine Validation Heartbeat Lifecycle
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
crates/machine-validation/src/machine_validation.rs (2)
413-417: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winDo not discard heartbeat task panics.
After
abort(), cancellation is expected, butlet _ = heartbeat_task.awaitalso hides a prior panic from the heartbeat task. Preserve cancellation handling while surfacing panics.As per path instructions, spawned Rust background work should be joined so panics propagate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-validation/src/machine_validation.rs` around lines 413 - 417, The heartbeat join logic is swallowing task panics by discarding the result of heartbeat_task.await after abort() in machine_validation::machine_validation; update the cleanup so cancellation from abort() is still treated as expected, but any JoinError panic from the spawned heartbeat task is surfaced instead of ignored. Use the existing heartbeat_task handling around the command_result match to explicitly inspect the await result and propagate panics while allowing cancellation.Source: Path instructions
171-178: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse structured heartbeat log fields.
These new logs interpolate errors and omit
validation_id/test_id, making heartbeat failures harder to correlate in logfmt output. Prefer tracing fields, e.g.error=%e,%validation_id,test_id=%test_id.As per path instructions, Rust logs should use structured tracing fields instead of interpolation.
Also applies to: 302-304
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-validation/src/machine_validation.rs` around lines 171 - 178, The heartbeat logging in the machine validation flow is using string interpolation instead of structured tracing fields, which makes failures harder to correlate. Update the logging in the machine validation heartbeat branch (including the error path and the rejected-heartbeat path) to use structured fields with the existing symbols like validation_id and test_id, and log the error as a field rather than formatting it into the message. Apply the same structured logging pattern to the other heartbeat log site mentioned in the diff so all machine_validation.rs heartbeat logs are consistent.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-core/src/machine_validation/mod.rs`:
- Around line 107-115: The new stale attempt reconciliation in
machine_validation::run is too aggressive for legacy runs that never recorded a
heartbeat. Update the loop over find_stale_active_attempts so
reconcile_stale_attempt(txn.as_pgconn(), stale_attempt, now) only runs for
heartbeat-capable attempts, and leave heartbeat-less attempts to be handled by
the existing stale_validations path with the expected-duration-plus-grace
timeout behavior.
In `@crates/api-db/src/machine_validation_execution.rs`:
- Around line 247-255: The targeted heartbeat flow in record_heartbeat() is
updating machine_validation.last_heartbeat_at before confirming the run
item/attempt heartbeat can be accepted, so a rejected child heartbeat can still
refresh the parent run. Rework the logic around update_run_heartbeat(),
update_run_item_heartbeat(), and update_attempt_heartbeat() so the parent
heartbeat is only persisted after the child update succeeds, or ensure the
transaction is aborted/rolled back whenever any targeted heartbeat returns
false. Keep the all-or-nothing behavior inside
machine_validation_execution::record_heartbeat() so machine_validation.rs does
not commit a partial heartbeat update.
In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 400-403: Start the heartbeat before any potentially long-running
setup in machine_validation::MachineValidation and keep it active for the entire
test execution path. Move the spawn_machine_validation_heartbeat call so it
begins before external config fetch, precondition execution, and container pull,
and make sure every early return in this flow still cancels the heartbeat task.
Use the existing validation_id/test.test_id flow and the heartbeat_task handle
to ensure the task is always started and cleaned up around all branches.
- Around line 297-305: The initial heartbeat handling in machine_validation.rs
should stop execution when heartbeat_machine_validation_run returns Ok(false)
instead of just logging it. Update the match in the initial heartbeat block so a
rejected heartbeat produces a terminal/skipped outcome and prevents the later
config fetch and command execution, while keeping the Ok(true) and Err(e) paths
unchanged. Use the existing heartbeat_machine_validation_run flow and
surrounding validation run/test context to short-circuit before the command path
continues.
- Around line 40-41: The fixed heartbeat interval in machine validation can
outlive a user-configured stale_run_timeout, causing healthy runs to be marked
stale. Update the heartbeat logic in machine_validation.rs around
MACHINE_VALIDATION_HEARTBEAT_INTERVAL and the code that uses stale_run_timeout
so the cadence is configurable or the timeout is clamped to a minimum above the
heartbeat interval. Make sure the relevant machine validation path enforces this
relationship wherever the heartbeat is scheduled.
In `@crates/rpc/proto/forge.proto`:
- Around line 6326-6330: The MachineValidationHeartbeatRequest target fields are
currently modeled as independent optionals, which allows conflicting identifiers
to be sent together and makes the heartbeat selector ambiguous. Update the
protobuf definition to group run_item_id, attempt_id, and test_id inside a oneof
in MachineValidationHeartbeatRequest, while preserving the ability to omit the
selector entirely for run-level heartbeats. Keep the existing field numbers and
ensure the generated client surface reflects the exclusive-choice contract so
callers cannot construct invalid combinations.
---
Nitpick comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 413-417: The heartbeat join logic is swallowing task panics by
discarding the result of heartbeat_task.await after abort() in
machine_validation::machine_validation; update the cleanup so cancellation from
abort() is still treated as expected, but any JoinError panic from the spawned
heartbeat task is surfaced instead of ignored. Use the existing heartbeat_task
handling around the command_result match to explicitly inspect the await result
and propagate panics while allowing cancellation.
- Around line 171-178: The heartbeat logging in the machine validation flow is
using string interpolation instead of structured tracing fields, which makes
failures harder to correlate. Update the logging in the machine validation
heartbeat branch (including the error path and the rejected-heartbeat path) to
use structured fields with the existing symbols like validation_id and test_id,
and log the error as a field rather than formatting it into the message. Apply
the same structured logging pattern to the other heartbeat log site mentioned in
the diff so all machine_validation.rs heartbeat logs are consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: badee611-89de-4527-bb88-8062a98719b6
📒 Files selected for processing (12)
crates/api-core/src/api.rscrates/api-core/src/auth/internal_rbac_rules.rscrates/api-core/src/handlers/machine_validation.rscrates/api-core/src/machine_validation/mod.rscrates/api-core/src/tests/machine_validation.rscrates/api-db/migrations/20260624120000_machine_validation_heartbeat_recovery.sqlcrates/api-db/src/machine_validation.rscrates/api-db/src/machine_validation_execution.rscrates/api-model/src/machine_validation.rscrates/machine-validation/src/machine_validation.rscrates/rpc/proto/forge.protocrates/rpc/src/model/machine_validation.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Line 430: The heartbeat is being stopped too early in
execute_machinevalidation_command, which can let the attempt go stale before the
terminal result is saved. Keep the heartbeat task running through the extra
output/error file reads and the run() persistence path, and only call
stop_machine_validation_heartbeat after persist_validation_result has succeeded
or after a definitive failure in the result persistence flow. Use the
execute_machinevalidation_command, stop_machine_validation_heartbeat, and
persist_validation_result flow to place the stop at the end of the full
validation lifecycle.
- Around line 43-45: The heartbeat cancellation path in
stop_machine_validation_heartbeat currently aborts and then discards the
JoinHandle result, and the handle can also be dropped implicitly on async
cancellation after the validation wait. Introduce a small guard around the
heartbeat_task that aborts on Drop so abandoned validations cannot keep
heartbeating, and update the explicit stop() cleanup to await the JoinHandle and
handle JoinError::is_panic() separately from expected cancellation. Keep the fix
localized around stop_machine_validation_heartbeat and the validation flow that
awaits the heartbeat task so panic visibility is preserved while normal aborts
stay quiet.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1e2cb6f0-dce2-4375-9b54-9d33e0571069
📒 Files selected for processing (5)
crates/api-core/src/handlers/machine_validation.rscrates/api-core/src/machine_validation/mod.rscrates/api-core/src/tests/machine_validation.rscrates/machine-validation/src/machine_validation.rscrates/rpc/proto/forge.proto
🚧 Files skipped from review as they are similar to previous changes (4)
- crates/api-core/src/handlers/machine_validation.rs
- crates/rpc/proto/forge.proto
- crates/api-core/src/machine_validation/mod.rs
- crates/api-core/src/tests/machine_validation.rs
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/machine-validation/src/machine_validation.rs (1)
597-610: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftAdd a run-level heartbeat around the full validation lifecycle.
The loop only manages the per-test heartbeat after
execute_machinevalidation_command()starts.get_container_auth_config()andget_container_images()run before this and can be long-running, leaving the validation run without a fresh run-level heartbeat even though Scout is alive. Start a run-level heartbeat before top-level preparation and keep it alive until all tests are persisted, while retaining the per-test heartbeat for active attempts. As per path instructions, review Rust code against STYLE_GUIDE.md with emphasis on “joined/cancellable background tasks” and “resource lifetimes”.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-validation/src/machine_validation.rs` around lines 597 - 610, The validation loop only starts the heartbeat after execute_machinevalidation_command(), so the prep work in get_container_auth_config() and get_container_images() can run without any run-level heartbeat. Update machine_validation.rs around the validation lifecycle in the validation loop to start a run-level heartbeat before any top-level preparation begins, keep it active through all tests and persist(Some(result)) calls, and stop it only after the full run is finished. Retain the existing per-test heartbeat inside execute_machinevalidation_command() for active attempts, and make sure the new heartbeat task follows the joined/cancellable background task and resource lifetime patterns used elsewhere.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 597-610: The validation loop only starts the heartbeat after
execute_machinevalidation_command(), so the prep work in
get_container_auth_config() and get_container_images() can run without any
run-level heartbeat. Update machine_validation.rs around the validation
lifecycle in the validation loop to start a run-level heartbeat before any
top-level preparation begins, keep it active through all tests and
persist(Some(result)) calls, and stop it only after the full run is finished.
Retain the existing per-test heartbeat inside
execute_machinevalidation_command() for active attempts, and make sure the new
heartbeat task follows the joined/cancellable background task and resource
lifetime patterns used elsewhere.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6d491e09-b8a1-4133-a989-559e80ea25aa
📒 Files selected for processing (1)
crates/machine-validation/src/machine_validation.rs
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/machine-controller/src/config/machine_validation.rs (1)
123-154: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPrefer table-driven tests for clamp cases.
These tests are correct, but this input→output behavior fits the repository’s preferred table-driven style (
value_scenarios!/check_values) for easier extension with additional boundary rows.As per coding guidelines, "Prefer table-driven tests using the carbide-test-support crate with
scenarios!for fallible operations andvalue_scenarios!for total operations." As per path instructions, "crates/**/*.rs: Review Rust code against STYLE_GUIDE.md ... missing tests over style-only comments."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/config/machine_validation.rs` around lines 123 - 154, The two `with_minimum_stale_run_timeout` tests are correct but should be rewritten in the repository’s table-driven style instead of separate cases. Update the `tests` module in `machine_validation.rs` to use `value_scenarios!`/`check_values` for the clamp behavior around `MachineValidationConfig::with_minimum_stale_run_timeout`, covering both the low-value clamp and the preserved-safe-value case in one extensible table. Keep the assertions against `MachineValidationConfig::MIN_STALE_RUN_TIMEOUT` and a safe configured timeout, but express them as scenario rows so additional boundary cases can be added easily.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/machine-controller/src/config/machine_validation.rs`:
- Around line 123-154: The two `with_minimum_stale_run_timeout` tests are
correct but should be rewritten in the repository’s table-driven style instead
of separate cases. Update the `tests` module in `machine_validation.rs` to use
`value_scenarios!`/`check_values` for the clamp behavior around
`MachineValidationConfig::with_minimum_stale_run_timeout`, covering both the
low-value clamp and the preserved-safe-value case in one extensible table. Keep
the assertions against `MachineValidationConfig::MIN_STALE_RUN_TIMEOUT` and a
safe configured timeout, but express them as scenario rows so additional
boundary cases can be added easily.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e73627fe-6be2-4bb5-9935-bf8d6f7b9ece
📒 Files selected for processing (2)
crates/api-core/src/machine_validation/mod.rscrates/machine-controller/src/config/machine_validation.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/api-core/src/machine_validation/mod.rs
|
Thanks @wminckler for the feedback and for the approval. I'll raise separate PR to address the feedback and will merge this one. |
…back (#2911) Addresses the review feedback on the machine validation stale-run changes (#2838). Summary - Simplified stale timeout clamping logic. - Kept the minimum stale timeout enforcement so healthy runs do not go stale between heartbeats. - Added the stale reconciliation error message to logs. - Added focused DB/API test coverage for stale-run detection and reconciliation. ## Related issues Fixes #454 ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Breaking Changes <!-- If checked, describe the breaking changes and migration steps --> <!-- Breaking changes are not generally permitted, please discuss on a GitHub discussion or with the development team if you believe you need to break a backward compatibility guarantee --> - [ ] **This PR contains breaking changes** ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [x] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes
Summary
This change makes machine validation more reliable when Scout or a validation test gets stuck, crashes, or stops reporting back.
Before this, the system mostly knew: “validation started” and later “validation finished.” If Scout died in the middle, or a test stopped running but never sent a final result, the API could be left waiting until a broader timeout path cleaned it up.
Now we added a “heartbeat” mechanism:
Why you should care:
Description
Implements the M2 machine validation reliability work from the design in #2070, covering run/attempt heartbeats and stale validation recovery.
This PR adds:
DB Changes
Adds migration:
crates/api-db/migrations/20260624120000_machine_validation_heartbeat_recovery.sqlSchema changes:
machine_validation.last_heartbeat_at.Compatibility
Related issues
Fixes #454
Type of Change
Breaking Changes
Testing
Additional Notes
Refs #454
Design: #2070
Deployment note: