Skip to content

feat(machine-validation): Implement M2 machine validation heartbeat recovery#2838

Merged
sunilkumar-nvidia merged 6 commits into
NVIDIA:mainfrom
sunilkumar-nvidia:MV_2.0_M2
Jun 26, 2026
Merged

feat(machine-validation): Implement M2 machine validation heartbeat recovery#2838
sunilkumar-nvidia merged 6 commits into
NVIDIA:mainfrom
sunilkumar-nvidia:MV_2.0_M2

Conversation

@sunilkumar-nvidia

@sunilkumar-nvidia sunilkumar-nvidia commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

This change makes machine validation more reliable when Scout or a validation test gets stuck, crashes, or stops reporting back.

Before this, the system mostly knew: “validation started” and later “validation finished.” If Scout died in the middle, or a test stopped running but never sent a final result, the API could be left waiting until a broader timeout path cleaned it up.

Now we added a “heartbeat” mechanism:

  • Scout tells the API periodically: “I am still running this validation test.”
  • The API stores the latest heartbeat time.
  • If heartbeats stop for too long, the API marks that specific validation attempt as failed.
  • The machine is then unblocked and moved to the correct failed state instead of hanging indefinitely.
  • If Scout already sent all test results but missed the final “run completed” call, the API can now complete the run by looking at the test item states.

Why you should care:

  • Machines should not get stuck in validation forever.
  • Failures become clearer: “this validation attempt stopped heartbeating.”
  • Operators can tell whether validation is active, stale, failed, or completed.
  • The system becomes safer for production workflows because it can recover from Scout crashes and missed completion events.
  • Old behavior still works for older Scouts, so this is not a risky all-or-nothing rollout.

Description

Implements the M2 machine validation reliability work from the design in #2070, covering run/attempt heartbeats and stale validation recovery.

This PR adds:

  • Run-level and active-attempt heartbeat support from Scout to API.
  • Heartbeat persistence on machine validation runs, run items, and attempts.
  • Reconciliation for stale active attempts when Scout stops heartbeating.
  • Reconciliation for runs where run items reached terminal state but final run completion was missed.
  • Compatibility fallback for older Scouts that do not send heartbeats, using the existing duration-based stale-run timeout path.
  • Tests for heartbeat acceptance, invalid heartbeat rejection, stale attempt recovery, stale run recovery, and missed completion reconciliation.

DB Changes

Adds migration:

  • crates/api-db/migrations/20260624120000_machine_validation_heartbeat_recovery.sql

Schema changes:

  • Adds nullable machine_validation.last_heartbeat_at.
  • Adds partial heartbeat indexes for active validation runs, run items, and attempts.

Compatibility

  • Proto changes are additive.
  • Old Scout + new API is supported.
  • New Scout + old API will log heartbeat RPC failures but still continues validation execution/result persistence.
  • Existing duration-based stale-run reconciliation remains as fallback.

Related issues

Fixes #454

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Refs #454
Design: #2070

Deployment note:

  • Apply the DB migration before rolling out the new API/Scout binaries.
  • The new column is nullable, so existing data and older Scout behavior remain compatible.
  • Older Scouts will continue using the existing result/completion flow; they simply will not emit heartbeats.

Signed-off-by: Sunil Kumar <sunilkumar@nvidia.com>
@sunilkumar-nvidia sunilkumar-nvidia self-assigned this Jun 24, 2026
@sunilkumar-nvidia sunilkumar-nvidia requested a review from a team as a code owner June 24, 2026 10:26
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Summary by CodeRabbit

  • New Features
    • Added a new machine-validation heartbeat RPC allowing targeted heartbeats (run item, attempt, or test) with an “accepted” response, including an ongoing periodic heartbeat sender for active validations.
    • Added last_heartbeat_at tracking for machine-validation runs and exposed it via the API.
  • Bug Fixes
    • Improved stale-attempt detection and reconciliation using heartbeat-aware timing, with minimum timeout clamping; stopped heartbeating now reliably transitions runs and hosts to failure.
    • Centralized and corrected terminal vs. stale completion flows.
  • Tests
    • Expanded heartbeat acceptance/rejection and added coverage for stale active-attempt reconciliation and timeout clamping behavior.

Walkthrough

This PR adds heartbeat tracking for machine validation runs, including a new database timestamp, RPC contract, server handler, persistence logic, stale reconciliation, client-side periodic heartbeats, and updated tests.

Changes

Machine Validation Heartbeat Lifecycle

Layer / File(s) Summary
Schema and RPC contracts
crates/api-db/migrations/20260624120000_machine_validation_heartbeat_recovery.sql, crates/rpc/proto/forge.proto, crates/api-model/src/machine_validation.rs, crates/rpc/src/model/machine_validation.rs
The schema adds last_heartbeat_at and related partial indexes; the proto adds HeartbeatMachineValidationRun, its request/response messages, and MachineValidationRun.last_heartbeat_at; the API model and RPC conversion map the new timestamp.
Heartbeat writes and stale-attempt queries
crates/api-db/src/machine_validation_execution.rs, crates/api-db/src/machine_validation.rs
Heartbeat recording updates validation, run-item, and attempt timestamps; stale active attempts are queried and marked failed; stale detection in mark_stale_if_active now branches on last_heartbeat_at.
gRPC handler and RBAC wiring
crates/api-core/src/handlers/machine_validation.rs, crates/api-core/src/auth/internal_rbac_rules.rs, crates/api-core/src/api.rs
The new handler parses heartbeat targets and returns acceptance, RBAC grants access to Scout and SiteAgent, and the API service dispatches the new RPC.
Stale reconciliation and terminal completion
crates/api-core/src/machine_validation/mod.rs, crates/machine-controller/src/config/machine_validation.rs
MachineValidationManager clamps stale timeouts, reconciles terminal run items and stale active attempts, stale timing uses last_heartbeat_at when present, and failure side effects are centralized through record_failed_validation_side_effects.
Client heartbeat task and command execution
crates/machine-validation/src/machine_validation.rs
The client sends an initial heartbeat, spawns a periodic 30-second heartbeat task, and aborts it around command execution and failure paths.
Heartbeat and stale-reconciliation tests
crates/api-core/src/tests/machine_validation.rs
The tests cover heartbeat accept/reject behavior, last_heartbeat_at updates, stale timeout handling, and stale active attempt reconciliation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.55% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly names the new machine-validation heartbeat recovery feature and matches the main change set.
Description check ✅ Passed The description accurately summarizes the heartbeat and stale-recovery work in the patch.
Linked Issues check ✅ Passed The changes implement the requested durable lifecycle, heartbeats, stale-run recovery, and compatibility fallback for issue #454.
Out of Scope Changes check ✅ Passed All file changes support heartbeat persistence, stale validation recovery, or related tests and schema updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@sunilkumar-nvidia

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
crates/machine-validation/src/machine_validation.rs (2)

413-417: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Do not discard heartbeat task panics.

After abort(), cancellation is expected, but let _ = heartbeat_task.await also hides a prior panic from the heartbeat task. Preserve cancellation handling while surfacing panics.

As per path instructions, spawned Rust background work should be joined so panics propagate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-validation/src/machine_validation.rs` around lines 413 - 417,
The heartbeat join logic is swallowing task panics by discarding the result of
heartbeat_task.await after abort() in machine_validation::machine_validation;
update the cleanup so cancellation from abort() is still treated as expected,
but any JoinError panic from the spawned heartbeat task is surfaced instead of
ignored. Use the existing heartbeat_task handling around the command_result
match to explicitly inspect the await result and propagate panics while allowing
cancellation.

Source: Path instructions


171-178: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use structured heartbeat log fields.

These new logs interpolate errors and omit validation_id / test_id, making heartbeat failures harder to correlate in logfmt output. Prefer tracing fields, e.g. error=%e, %validation_id, test_id=%test_id.

As per path instructions, Rust logs should use structured tracing fields instead of interpolation.

Also applies to: 302-304

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-validation/src/machine_validation.rs` around lines 171 - 178,
The heartbeat logging in the machine validation flow is using string
interpolation instead of structured tracing fields, which makes failures harder
to correlate. Update the logging in the machine validation heartbeat branch
(including the error path and the rejected-heartbeat path) to use structured
fields with the existing symbols like validation_id and test_id, and log the
error as a field rather than formatting it into the message. Apply the same
structured logging pattern to the other heartbeat log site mentioned in the diff
so all machine_validation.rs heartbeat logs are consistent.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-core/src/machine_validation/mod.rs`:
- Around line 107-115: The new stale attempt reconciliation in
machine_validation::run is too aggressive for legacy runs that never recorded a
heartbeat. Update the loop over find_stale_active_attempts so
reconcile_stale_attempt(txn.as_pgconn(), stale_attempt, now) only runs for
heartbeat-capable attempts, and leave heartbeat-less attempts to be handled by
the existing stale_validations path with the expected-duration-plus-grace
timeout behavior.

In `@crates/api-db/src/machine_validation_execution.rs`:
- Around line 247-255: The targeted heartbeat flow in record_heartbeat() is
updating machine_validation.last_heartbeat_at before confirming the run
item/attempt heartbeat can be accepted, so a rejected child heartbeat can still
refresh the parent run. Rework the logic around update_run_heartbeat(),
update_run_item_heartbeat(), and update_attempt_heartbeat() so the parent
heartbeat is only persisted after the child update succeeds, or ensure the
transaction is aborted/rolled back whenever any targeted heartbeat returns
false. Keep the all-or-nothing behavior inside
machine_validation_execution::record_heartbeat() so machine_validation.rs does
not commit a partial heartbeat update.

In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 400-403: Start the heartbeat before any potentially long-running
setup in machine_validation::MachineValidation and keep it active for the entire
test execution path. Move the spawn_machine_validation_heartbeat call so it
begins before external config fetch, precondition execution, and container pull,
and make sure every early return in this flow still cancels the heartbeat task.
Use the existing validation_id/test.test_id flow and the heartbeat_task handle
to ensure the task is always started and cleaned up around all branches.
- Around line 297-305: The initial heartbeat handling in machine_validation.rs
should stop execution when heartbeat_machine_validation_run returns Ok(false)
instead of just logging it. Update the match in the initial heartbeat block so a
rejected heartbeat produces a terminal/skipped outcome and prevents the later
config fetch and command execution, while keeping the Ok(true) and Err(e) paths
unchanged. Use the existing heartbeat_machine_validation_run flow and
surrounding validation run/test context to short-circuit before the command path
continues.
- Around line 40-41: The fixed heartbeat interval in machine validation can
outlive a user-configured stale_run_timeout, causing healthy runs to be marked
stale. Update the heartbeat logic in machine_validation.rs around
MACHINE_VALIDATION_HEARTBEAT_INTERVAL and the code that uses stale_run_timeout
so the cadence is configurable or the timeout is clamped to a minimum above the
heartbeat interval. Make sure the relevant machine validation path enforces this
relationship wherever the heartbeat is scheduled.

In `@crates/rpc/proto/forge.proto`:
- Around line 6326-6330: The MachineValidationHeartbeatRequest target fields are
currently modeled as independent optionals, which allows conflicting identifiers
to be sent together and makes the heartbeat selector ambiguous. Update the
protobuf definition to group run_item_id, attempt_id, and test_id inside a oneof
in MachineValidationHeartbeatRequest, while preserving the ability to omit the
selector entirely for run-level heartbeats. Keep the existing field numbers and
ensure the generated client surface reflects the exclusive-choice contract so
callers cannot construct invalid combinations.

---

Nitpick comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 413-417: The heartbeat join logic is swallowing task panics by
discarding the result of heartbeat_task.await after abort() in
machine_validation::machine_validation; update the cleanup so cancellation from
abort() is still treated as expected, but any JoinError panic from the spawned
heartbeat task is surfaced instead of ignored. Use the existing heartbeat_task
handling around the command_result match to explicitly inspect the await result
and propagate panics while allowing cancellation.
- Around line 171-178: The heartbeat logging in the machine validation flow is
using string interpolation instead of structured tracing fields, which makes
failures harder to correlate. Update the logging in the machine validation
heartbeat branch (including the error path and the rejected-heartbeat path) to
use structured fields with the existing symbols like validation_id and test_id,
and log the error as a field rather than formatting it into the message. Apply
the same structured logging pattern to the other heartbeat log site mentioned in
the diff so all machine_validation.rs heartbeat logs are consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: badee611-89de-4527-bb88-8062a98719b6

📥 Commits

Reviewing files that changed from the base of the PR and between 863314d and 9b29790.

📒 Files selected for processing (12)
  • crates/api-core/src/api.rs
  • crates/api-core/src/auth/internal_rbac_rules.rs
  • crates/api-core/src/handlers/machine_validation.rs
  • crates/api-core/src/machine_validation/mod.rs
  • crates/api-core/src/tests/machine_validation.rs
  • crates/api-db/migrations/20260624120000_machine_validation_heartbeat_recovery.sql
  • crates/api-db/src/machine_validation.rs
  • crates/api-db/src/machine_validation_execution.rs
  • crates/api-model/src/machine_validation.rs
  • crates/machine-validation/src/machine_validation.rs
  • crates/rpc/proto/forge.proto
  • crates/rpc/src/model/machine_validation.rs

Comment thread crates/api-core/src/machine_validation/mod.rs
Comment thread crates/api-db/src/machine_validation_execution.rs
Comment thread crates/machine-validation/src/machine_validation.rs
Comment thread crates/machine-validation/src/machine_validation.rs
Comment thread crates/machine-validation/src/machine_validation.rs Outdated
Comment thread crates/rpc/proto/forge.proto Outdated
@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 283 6 24 98 7 148
machine-validation-runner 744 32 188 267 36 221
machine_validation 744 32 188 267 36 221
machine_validation-aarch64 744 32 188 267 36 221
nvmetal-carbide 744 32 188 267 36 221
TOTAL 3265 134 776 1172 151 1032

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Line 430: The heartbeat is being stopped too early in
execute_machinevalidation_command, which can let the attempt go stale before the
terminal result is saved. Keep the heartbeat task running through the extra
output/error file reads and the run() persistence path, and only call
stop_machine_validation_heartbeat after persist_validation_result has succeeded
or after a definitive failure in the result persistence flow. Use the
execute_machinevalidation_command, stop_machine_validation_heartbeat, and
persist_validation_result flow to place the stop at the end of the full
validation lifecycle.
- Around line 43-45: The heartbeat cancellation path in
stop_machine_validation_heartbeat currently aborts and then discards the
JoinHandle result, and the handle can also be dropped implicitly on async
cancellation after the validation wait. Introduce a small guard around the
heartbeat_task that aborts on Drop so abandoned validations cannot keep
heartbeating, and update the explicit stop() cleanup to await the JoinHandle and
handle JoinError::is_panic() separately from expected cancellation. Keep the fix
localized around stop_machine_validation_heartbeat and the validation flow that
awaits the heartbeat task so panic visibility is preserved while normal aborts
stay quiet.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1e2cb6f0-dce2-4375-9b54-9d33e0571069

📥 Commits

Reviewing files that changed from the base of the PR and between 9b29790 and 9e7e63c.

📒 Files selected for processing (5)
  • crates/api-core/src/handlers/machine_validation.rs
  • crates/api-core/src/machine_validation/mod.rs
  • crates/api-core/src/tests/machine_validation.rs
  • crates/machine-validation/src/machine_validation.rs
  • crates/rpc/proto/forge.proto
🚧 Files skipped from review as they are similar to previous changes (4)
  • crates/api-core/src/handlers/machine_validation.rs
  • crates/rpc/proto/forge.proto
  • crates/api-core/src/machine_validation/mod.rs
  • crates/api-core/src/tests/machine_validation.rs

Comment thread crates/machine-validation/src/machine_validation.rs Outdated
Comment thread crates/machine-validation/src/machine_validation.rs Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/machine-validation/src/machine_validation.rs (1)

597-610: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Add a run-level heartbeat around the full validation lifecycle.

The loop only manages the per-test heartbeat after execute_machinevalidation_command() starts. get_container_auth_config() and get_container_images() run before this and can be long-running, leaving the validation run without a fresh run-level heartbeat even though Scout is alive. Start a run-level heartbeat before top-level preparation and keep it alive until all tests are persisted, while retaining the per-test heartbeat for active attempts. As per path instructions, review Rust code against STYLE_GUIDE.md with emphasis on “joined/cancellable background tasks” and “resource lifetimes”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-validation/src/machine_validation.rs` around lines 597 - 610,
The validation loop only starts the heartbeat after
execute_machinevalidation_command(), so the prep work in
get_container_auth_config() and get_container_images() can run without any
run-level heartbeat. Update machine_validation.rs around the validation
lifecycle in the validation loop to start a run-level heartbeat before any
top-level preparation begins, keep it active through all tests and
persist(Some(result)) calls, and stop it only after the full run is finished.
Retain the existing per-test heartbeat inside
execute_machinevalidation_command() for active attempts, and make sure the new
heartbeat task follows the joined/cancellable background task and resource
lifetime patterns used elsewhere.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 597-610: The validation loop only starts the heartbeat after
execute_machinevalidation_command(), so the prep work in
get_container_auth_config() and get_container_images() can run without any
run-level heartbeat. Update machine_validation.rs around the validation
lifecycle in the validation loop to start a run-level heartbeat before any
top-level preparation begins, keep it active through all tests and
persist(Some(result)) calls, and stop it only after the full run is finished.
Retain the existing per-test heartbeat inside
execute_machinevalidation_command() for active attempts, and make sure the new
heartbeat task follows the joined/cancellable background task and resource
lifetime patterns used elsewhere.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6d491e09-b8a1-4133-a989-559e80ea25aa

📥 Commits

Reviewing files that changed from the base of the PR and between 9e7e63c and 59b551f.

📒 Files selected for processing (1)
  • crates/machine-validation/src/machine_validation.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/machine-controller/src/config/machine_validation.rs (1)

123-154: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Prefer table-driven tests for clamp cases.

These tests are correct, but this input→output behavior fits the repository’s preferred table-driven style (value_scenarios!/check_values) for easier extension with additional boundary rows.

As per coding guidelines, "Prefer table-driven tests using the carbide-test-support crate with scenarios! for fallible operations and value_scenarios! for total operations." As per path instructions, "crates/**/*.rs: Review Rust code against STYLE_GUIDE.md ... missing tests over style-only comments."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/config/machine_validation.rs` around lines 123
- 154, The two `with_minimum_stale_run_timeout` tests are correct but should be
rewritten in the repository’s table-driven style instead of separate cases.
Update the `tests` module in `machine_validation.rs` to use
`value_scenarios!`/`check_values` for the clamp behavior around
`MachineValidationConfig::with_minimum_stale_run_timeout`, covering both the
low-value clamp and the preserved-safe-value case in one extensible table. Keep
the assertions against `MachineValidationConfig::MIN_STALE_RUN_TIMEOUT` and a
safe configured timeout, but express them as scenario rows so additional
boundary cases can be added easily.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/machine-controller/src/config/machine_validation.rs`:
- Around line 123-154: The two `with_minimum_stale_run_timeout` tests are
correct but should be rewritten in the repository’s table-driven style instead
of separate cases. Update the `tests` module in `machine_validation.rs` to use
`value_scenarios!`/`check_values` for the clamp behavior around
`MachineValidationConfig::with_minimum_stale_run_timeout`, covering both the
low-value clamp and the preserved-safe-value case in one extensible table. Keep
the assertions against `MachineValidationConfig::MIN_STALE_RUN_TIMEOUT` and a
safe configured timeout, but express them as scenario rows so additional
boundary cases can be added easily.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e73627fe-6be2-4bb5-9935-bf8d6f7b9ece

📥 Commits

Reviewing files that changed from the base of the PR and between da70cc5 and 1850bd3.

📒 Files selected for processing (2)
  • crates/api-core/src/machine_validation/mod.rs
  • crates/machine-controller/src/config/machine_validation.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/api-core/src/machine_validation/mod.rs

Comment thread crates/api-core/src/machine_validation/mod.rs
Comment thread crates/api-core/src/machine_validation/mod.rs
Comment thread crates/api-db/src/machine_validation.rs
Comment thread crates/api-db/src/machine_validation_execution.rs
@sunilkumar-nvidia

Copy link
Copy Markdown
Contributor Author

Thanks @wminckler for the feedback and for the approval. I'll raise separate PR to address the feedback and will merge this one.

@sunilkumar-nvidia sunilkumar-nvidia merged commit 54a2b7a into NVIDIA:main Jun 26, 2026
58 checks passed
ajf pushed a commit that referenced this pull request Jun 26, 2026
…back (#2911)

Addresses the review feedback on the machine validation stale-run
changes (#2838).

 Summary

  - Simplified stale timeout clamping logic.
- Kept the minimum stale timeout enforcement so healthy runs do not go
stale between heartbeats.
  - Added the stale reconciliation error message to logs.
- Added focused DB/API test coverage for stale-run detection and
reconciliation.
  

## Related issues
Fixes #454

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Breaking Changes
<!-- If checked, describe the breaking changes and migration steps -->
<!-- Breaking changes are not generally permitted, please discuss on a
GitHub discussion or with the development team if you believe you need
to break a backward compatibility guarantee -->
- [ ] **This PR contains breaking changes**

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Machine Validation - Reliable Machine Validation Run Lifecycle

2 participants