feat(machine-validation): Implement M2 machine validation heartbeat recovery by sunilkumar-nvidia · Pull Request #2838 · NVIDIA/infra-controller

sunilkumar-nvidia · 2026-06-24T10:26:33Z

Summary

This change makes machine validation more reliable when Scout or a validation test gets stuck, crashes, or stops reporting back.

Before this, the system mostly knew: “validation started” and later “validation finished.” If Scout died in the middle, or a test stopped running but never sent a final result, the API could be left waiting until a broader timeout path cleaned it up.

Now we added a “heartbeat” mechanism:

Scout tells the API periodically: “I am still running this validation test.”
The API stores the latest heartbeat time.
If heartbeats stop for too long, the API marks that specific validation attempt as failed.
The machine is then unblocked and moved to the correct failed state instead of hanging indefinitely.
If Scout already sent all test results but missed the final “run completed” call, the API can now complete the run by looking at the test item states.

Why you should care:

Machines should not get stuck in validation forever.
Failures become clearer: “this validation attempt stopped heartbeating.”
Operators can tell whether validation is active, stale, failed, or completed.
The system becomes safer for production workflows because it can recover from Scout crashes and missed completion events.
Old behavior still works for older Scouts, so this is not a risky all-or-nothing rollout.

Description

Implements the M2 machine validation reliability work from the design in #2070, covering run/attempt heartbeats and stale validation recovery.

This PR adds:

Run-level and active-attempt heartbeat support from Scout to API.
Heartbeat persistence on machine validation runs, run items, and attempts.
Reconciliation for stale active attempts when Scout stops heartbeating.
Reconciliation for runs where run items reached terminal state but final run completion was missed.
Compatibility fallback for older Scouts that do not send heartbeats, using the existing duration-based stale-run timeout path.
Tests for heartbeat acceptance, invalid heartbeat rejection, stale attempt recovery, stale run recovery, and missed completion reconciliation.

DB Changes

Adds migration:

crates/api-db/migrations/20260624120000_machine_validation_heartbeat_recovery.sql

Schema changes:

Adds nullable machine_validation.last_heartbeat_at.
Adds partial heartbeat indexes for active validation runs, run items, and attempts.

Compatibility

Proto changes are additive.
Old Scout + new API is supported.
New Scout + old API will log heartbeat RPC failures but still continues validation execution/result persistence.
Existing duration-based stale-run reconciliation remains as fallback.

Related issues

Fixes #454

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

Refs #454
Design: #2070

Deployment note:

Apply the DB migration before rolling out the new API/Scout binaries.
The new column is nullable, so existing data and older Scout behavior remain compatible.
Older Scouts will continue using the existing result/completion flow; they simply will not emit heartbeats.

Signed-off-by: Sunil Kumar <sunilkumar@nvidia.com>

coderabbitai · 2026-06-24T10:27:01Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Summary by CodeRabbit

New Features
- Added a new machine-validation heartbeat RPC allowing targeted heartbeats (run item, attempt, or test) with an “accepted” response, including an ongoing periodic heartbeat sender for active validations.
- Added last_heartbeat_at tracking for machine-validation runs and exposed it via the API.
Bug Fixes
- Improved stale-attempt detection and reconciliation using heartbeat-aware timing, with minimum timeout clamping; stopped heartbeating now reliably transitions runs and hosts to failure.
- Centralized and corrected terminal vs. stale completion flows.
Tests
- Expanded heartbeat acceptance/rejection and added coverage for stale active-attempt reconciliation and timeout clamping behavior.

Walkthrough

This PR adds heartbeat tracking for machine validation runs, including a new database timestamp, RPC contract, server handler, persistence logic, stale reconciliation, client-side periodic heartbeats, and updated tests.

Changes

Machine Validation Heartbeat Lifecycle

Layer / File(s)	Summary
Schema and RPC contracts `crates/api-db/migrations/20260624120000_machine_validation_heartbeat_recovery.sql`, `crates/rpc/proto/forge.proto`, `crates/api-model/src/machine_validation.rs`, `crates/rpc/src/model/machine_validation.rs`	The schema adds `last_heartbeat_at` and related partial indexes; the proto adds `HeartbeatMachineValidationRun`, its request/response messages, and `MachineValidationRun.last_heartbeat_at`; the API model and RPC conversion map the new timestamp.
Heartbeat writes and stale-attempt queries `crates/api-db/src/machine_validation_execution.rs`, `crates/api-db/src/machine_validation.rs`	Heartbeat recording updates validation, run-item, and attempt timestamps; stale active attempts are queried and marked failed; stale detection in `mark_stale_if_active` now branches on `last_heartbeat_at`.
gRPC handler and RBAC wiring `crates/api-core/src/handlers/machine_validation.rs`, `crates/api-core/src/auth/internal_rbac_rules.rs`, `crates/api-core/src/api.rs`	The new handler parses heartbeat targets and returns acceptance, RBAC grants access to `Scout` and `SiteAgent`, and the API service dispatches the new RPC.
Stale reconciliation and terminal completion `crates/api-core/src/machine_validation/mod.rs`, `crates/machine-controller/src/config/machine_validation.rs`	`MachineValidationManager` clamps stale timeouts, reconciles terminal run items and stale active attempts, stale timing uses `last_heartbeat_at` when present, and failure side effects are centralized through `record_failed_validation_side_effects`.
Client heartbeat task and command execution `crates/machine-validation/src/machine_validation.rs`	The client sends an initial heartbeat, spawns a periodic 30-second heartbeat task, and aborts it around command execution and failure paths.
Heartbeat and stale-reconciliation tests `crates/api-core/src/tests/machine_validation.rs`	The tests cover heartbeat accept/reject behavior, `last_heartbeat_at` updates, stale timeout handling, and stale active attempt reconciliation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.55% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly names the new machine-validation heartbeat recovery feature and matches the main change set.
Description check	✅ Passed	The description accurately summarizes the heartbeat and stale-recovery work in the patch.
Linked Issues check	✅ Passed	The changes implement the requested durable lifecycle, heartbeats, stale-run recovery, and compatibility fallback for issue `#454`.
Out of Scope Changes check	✅ Passed	All file changes support heartbeat persistence, stale validation recovery, or related tests and schema updates.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

sunilkumar-nvidia · 2026-06-24T10:30:46Z

@coderabbitai review

coderabbitai · 2026-06-24T10:30:51Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

crates/machine-validation/src/machine_validation.rs (2)
413-417: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Do not discard heartbeat task panics.

After abort(), cancellation is expected, but let _ = heartbeat_task.await also hides a prior panic from the heartbeat task. Preserve cancellation handling while surfacing panics.

As per path instructions, spawned Rust background work should be joined so panics propagate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-validation/src/machine_validation.rs` around lines 413 - 417,
The heartbeat join logic is swallowing task panics by discarding the result of
heartbeat_task.await after abort() in machine_validation::machine_validation;
update the cleanup so cancellation from abort() is still treated as expected,
but any JoinError panic from the spawned heartbeat task is surfaced instead of
ignored. Use the existing heartbeat_task handling around the command_result
match to explicitly inspect the await result and propagate panics while allowing
cancellation.
Source: Path instructions

171-178: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use structured heartbeat log fields.

These new logs interpolate errors and omit validation_id / test_id, making heartbeat failures harder to correlate in logfmt output. Prefer tracing fields, e.g. error=%e, %validation_id, test_id=%test_id.

As per path instructions, Rust logs should use structured tracing fields instead of interpolation.

Also applies to: 302-304
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-validation/src/machine_validation.rs` around lines 171 - 178,
The heartbeat logging in the machine validation flow is using string
interpolation instead of structured tracing fields, which makes failures harder
to correlate. Update the logging in the machine validation heartbeat branch
(including the error path and the rejected-heartbeat path) to use structured
fields with the existing symbols like validation_id and test_id, and log the
error as a field rather than formatting it into the message. Apply the same
structured logging pattern to the other heartbeat log site mentioned in the diff
so all machine_validation.rs heartbeat logs are consistent.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-core/src/machine_validation/mod.rs`:
- Around line 107-115: The new stale attempt reconciliation in
machine_validation::run is too aggressive for legacy runs that never recorded a
heartbeat. Update the loop over find_stale_active_attempts so
reconcile_stale_attempt(txn.as_pgconn(), stale_attempt, now) only runs for
heartbeat-capable attempts, and leave heartbeat-less attempts to be handled by
the existing stale_validations path with the expected-duration-plus-grace
timeout behavior.

In `@crates/api-db/src/machine_validation_execution.rs`:
- Around line 247-255: The targeted heartbeat flow in record_heartbeat() is
updating machine_validation.last_heartbeat_at before confirming the run
item/attempt heartbeat can be accepted, so a rejected child heartbeat can still
refresh the parent run. Rework the logic around update_run_heartbeat(),
update_run_item_heartbeat(), and update_attempt_heartbeat() so the parent
heartbeat is only persisted after the child update succeeds, or ensure the
transaction is aborted/rolled back whenever any targeted heartbeat returns
false. Keep the all-or-nothing behavior inside
machine_validation_execution::record_heartbeat() so machine_validation.rs does
not commit a partial heartbeat update.

In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 400-403: Start the heartbeat before any potentially long-running
setup in machine_validation::MachineValidation and keep it active for the entire
test execution path. Move the spawn_machine_validation_heartbeat call so it
begins before external config fetch, precondition execution, and container pull,
and make sure every early return in this flow still cancels the heartbeat task.
Use the existing validation_id/test.test_id flow and the heartbeat_task handle
to ensure the task is always started and cleaned up around all branches.
- Around line 297-305: The initial heartbeat handling in machine_validation.rs
should stop execution when heartbeat_machine_validation_run returns Ok(false)
instead of just logging it. Update the match in the initial heartbeat block so a
rejected heartbeat produces a terminal/skipped outcome and prevents the later
config fetch and command execution, while keeping the Ok(true) and Err(e) paths
unchanged. Use the existing heartbeat_machine_validation_run flow and
surrounding validation run/test context to short-circuit before the command path
continues.
- Around line 40-41: The fixed heartbeat interval in machine validation can
outlive a user-configured stale_run_timeout, causing healthy runs to be marked
stale. Update the heartbeat logic in machine_validation.rs around
MACHINE_VALIDATION_HEARTBEAT_INTERVAL and the code that uses stale_run_timeout
so the cadence is configurable or the timeout is clamped to a minimum above the
heartbeat interval. Make sure the relevant machine validation path enforces this
relationship wherever the heartbeat is scheduled.

In `@crates/rpc/proto/forge.proto`:
- Around line 6326-6330: The MachineValidationHeartbeatRequest target fields are
currently modeled as independent optionals, which allows conflicting identifiers
to be sent together and makes the heartbeat selector ambiguous. Update the
protobuf definition to group run_item_id, attempt_id, and test_id inside a oneof
in MachineValidationHeartbeatRequest, while preserving the ability to omit the
selector entirely for run-level heartbeats. Keep the existing field numbers and
ensure the generated client surface reflects the exclusive-choice contract so
callers cannot construct invalid combinations.

---

Nitpick comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 413-417: The heartbeat join logic is swallowing task panics by
discarding the result of heartbeat_task.await after abort() in
machine_validation::machine_validation; update the cleanup so cancellation from
abort() is still treated as expected, but any JoinError panic from the spawned
heartbeat task is surfaced instead of ignored. Use the existing heartbeat_task
handling around the command_result match to explicitly inspect the await result
and propagate panics while allowing cancellation.
- Around line 171-178: The heartbeat logging in the machine validation flow is
using string interpolation instead of structured tracing fields, which makes
failures harder to correlate. Update the logging in the machine validation
heartbeat branch (including the error path and the rejected-heartbeat path) to
use structured fields with the existing symbols like validation_id and test_id,
and log the error as a field rather than formatting it into the message. Apply
the same structured logging pattern to the other heartbeat log site mentioned in
the diff so all machine_validation.rs heartbeat logs are consistent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: badee611-89de-4527-bb88-8062a98719b6

📥 Commits

Reviewing files that changed from the base of the PR and between 863314d and 9b29790.

📒 Files selected for processing (12)

crates/api-core/src/api.rs
crates/api-core/src/auth/internal_rbac_rules.rs
crates/api-core/src/handlers/machine_validation.rs
crates/api-core/src/machine_validation/mod.rs
crates/api-core/src/tests/machine_validation.rs
crates/api-db/migrations/20260624120000_machine_validation_heartbeat_recovery.sql
crates/api-db/src/machine_validation.rs
crates/api-db/src/machine_validation_execution.rs
crates/api-model/src/machine_validation.rs
crates/machine-validation/src/machine_validation.rs
crates/rpc/proto/forge.proto
crates/rpc/src/model/machine_validation.rs

github-actions · 2026-06-24T11:55:45Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	283	6	24	98	7	148
machine-validation-runner	744	32	188	267	36	221
machine_validation	744	32	188	267	36	221
machine_validation-aarch64	744	32	188	267	36	221
nvmetal-carbide	744	32	188	267	36	221
TOTAL	3265	134	776	1172	151	1032

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Line 430: The heartbeat is being stopped too early in
execute_machinevalidation_command, which can let the attempt go stale before the
terminal result is saved. Keep the heartbeat task running through the extra
output/error file reads and the run() persistence path, and only call
stop_machine_validation_heartbeat after persist_validation_result has succeeded
or after a definitive failure in the result persistence flow. Use the
execute_machinevalidation_command, stop_machine_validation_heartbeat, and
persist_validation_result flow to place the stop at the end of the full
validation lifecycle.
- Around line 43-45: The heartbeat cancellation path in
stop_machine_validation_heartbeat currently aborts and then discards the
JoinHandle result, and the handle can also be dropped implicitly on async
cancellation after the validation wait. Introduce a small guard around the
heartbeat_task that aborts on Drop so abandoned validations cannot keep
heartbeating, and update the explicit stop() cleanup to await the JoinHandle and
handle JoinError::is_panic() separately from expected cancellation. Keep the fix
localized around stop_machine_validation_heartbeat and the validation flow that
awaits the heartbeat task so panic visibility is preserved while normal aborts
stay quiet.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1e2cb6f0-dce2-4375-9b54-9d33e0571069

📥 Commits

Reviewing files that changed from the base of the PR and between 9b29790 and 9e7e63c.

📒 Files selected for processing (5)

crates/api-core/src/handlers/machine_validation.rs
crates/api-core/src/machine_validation/mod.rs
crates/api-core/src/tests/machine_validation.rs
crates/machine-validation/src/machine_validation.rs
crates/rpc/proto/forge.proto

🚧 Files skipped from review as they are similar to previous changes (4)

crates/api-core/src/handlers/machine_validation.rs
crates/rpc/proto/forge.proto
crates/api-core/src/machine_validation/mod.rs
crates/api-core/src/tests/machine_validation.rs

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/machine-validation/src/machine_validation.rs (1)
597-610: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Add a run-level heartbeat around the full validation lifecycle.

The loop only manages the per-test heartbeat after execute_machinevalidation_command() starts. get_container_auth_config() and get_container_images() run before this and can be long-running, leaving the validation run without a fresh run-level heartbeat even though Scout is alive. Start a run-level heartbeat before top-level preparation and keep it alive until all tests are persisted, while retaining the per-test heartbeat for active attempts. As per path instructions, review Rust code against STYLE_GUIDE.md with emphasis on “joined/cancellable background tasks” and “resource lifetimes”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-validation/src/machine_validation.rs` around lines 597 - 610,
The validation loop only starts the heartbeat after
execute_machinevalidation_command(), so the prep work in
get_container_auth_config() and get_container_images() can run without any
run-level heartbeat. Update machine_validation.rs around the validation
lifecycle in the validation loop to start a run-level heartbeat before any
top-level preparation begins, keep it active through all tests and
persist(Some(result)) calls, and stop it only after the full run is finished.
Retain the existing per-test heartbeat inside
execute_machinevalidation_command() for active attempts, and make sure the new
heartbeat task follows the joined/cancellable background task and resource
lifetime patterns used elsewhere.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/machine-validation/src/machine_validation.rs`:
- Around line 597-610: The validation loop only starts the heartbeat after
execute_machinevalidation_command(), so the prep work in
get_container_auth_config() and get_container_images() can run without any
run-level heartbeat. Update machine_validation.rs around the validation
lifecycle in the validation loop to start a run-level heartbeat before any
top-level preparation begins, keep it active through all tests and
persist(Some(result)) calls, and stop it only after the full run is finished.
Retain the existing per-test heartbeat inside
execute_machinevalidation_command() for active attempts, and make sure the new
heartbeat task follows the joined/cancellable background task and resource
lifetime patterns used elsewhere.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6d491e09-b8a1-4133-a989-559e80ea25aa

📥 Commits

Reviewing files that changed from the base of the PR and between 9e7e63c and 59b551f.

📒 Files selected for processing (1)

crates/machine-validation/src/machine_validation.rs

coderabbitai

🧹 Nitpick comments (1)

crates/machine-controller/src/config/machine_validation.rs (1)
123-154: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Prefer table-driven tests for clamp cases.

These tests are correct, but this input→output behavior fits the repository’s preferred table-driven style (value_scenarios!/check_values) for easier extension with additional boundary rows.

As per coding guidelines, "Prefer table-driven tests using the carbide-test-support crate with scenarios! for fallible operations and value_scenarios! for total operations." As per path instructions, "crates/**/*.rs: Review Rust code against STYLE_GUIDE.md ... missing tests over style-only comments."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/config/machine_validation.rs` around lines 123
- 154, The two `with_minimum_stale_run_timeout` tests are correct but should be
rewritten in the repository’s table-driven style instead of separate cases.
Update the `tests` module in `machine_validation.rs` to use
`value_scenarios!`/`check_values` for the clamp behavior around
`MachineValidationConfig::with_minimum_stale_run_timeout`, covering both the
low-value clamp and the preserved-safe-value case in one extensible table. Keep
the assertions against `MachineValidationConfig::MIN_STALE_RUN_TIMEOUT` and a
safe configured timeout, but express them as scenario rows so additional
boundary cases can be added easily.
Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/machine-controller/src/config/machine_validation.rs`:
- Around line 123-154: The two `with_minimum_stale_run_timeout` tests are
correct but should be rewritten in the repository’s table-driven style instead
of separate cases. Update the `tests` module in `machine_validation.rs` to use
`value_scenarios!`/`check_values` for the clamp behavior around
`MachineValidationConfig::with_minimum_stale_run_timeout`, covering both the
low-value clamp and the preserved-safe-value case in one extensible table. Keep
the assertions against `MachineValidationConfig::MIN_STALE_RUN_TIMEOUT` and a
safe configured timeout, but express them as scenario rows so additional
boundary cases can be added easily.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e73627fe-6be2-4bb5-9935-bf8d6f7b9ece

📥 Commits

Reviewing files that changed from the base of the PR and between da70cc5 and 1850bd3.

📒 Files selected for processing (2)

crates/api-core/src/machine_validation/mod.rs
crates/machine-controller/src/config/machine_validation.rs

🚧 Files skipped from review as they are similar to previous changes (1)

crates/api-core/src/machine_validation/mod.rs

sunilkumar-nvidia · 2026-06-26T04:39:22Z

Thanks @wminckler for the feedback and for the approval. I'll raise separate PR to address the feedback and will merge this one.

…back (#2911) Addresses the review feedback on the machine validation stale-run changes (#2838). Summary - Simplified stale timeout clamping logic. - Kept the minimum stale timeout enforcement so healthy runs do not go stale between heartbeats. - Added the stale reconciliation error message to logs. - Added focused DB/API test coverage for stale-run detection and reconciliation. ## Related issues Fixes #454 ## Type of Change  - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Breaking Changes   - [ ] **This PR contains breaking changes** ## Testing  - [x] Unit tests added/updated - [x] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes

feat: add machine validation heartbeat recovery

9b29790

Signed-off-by: Sunil Kumar <sunilkumar@nvidia.com>

sunilkumar-nvidia self-assigned this Jun 24, 2026

sunilkumar-nvidia requested a review from a team as a code owner June 24, 2026 10:26

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

sunilkumar-nvidia added 2 commits June 24, 2026 12:04

fix: address machine validation heartbeat review feedback

9e7e63c

Merge remote-tracking branch 'nvidia/main' into MV_2.0_M2

f6d984a

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread crates/machine-validation/src/machine_validation.rs Outdated

Comment thread crates/machine-validation/src/machine_validation.rs Outdated

fix: harden machine validation heartbeat task lifecycle

59b551f

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

fix: document machine validation stale timeout floor

da70cc5

sunilkumar-nvidia mentioned this pull request Jun 24, 2026

feat: Machine Validation - Reliable Machine Validation Run Lifecycle #454

Open

2 tasks

fix: enforce machine validation stale timeout floor

1850bd3

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

wminckler reviewed Jun 25, 2026

View reviewed changes

Comment thread crates/api-core/src/machine_validation/mod.rs

wminckler reviewed Jun 25, 2026

View reviewed changes

Comment thread crates/api-core/src/machine_validation/mod.rs

wminckler reviewed Jun 25, 2026

View reviewed changes

Comment thread crates/api-db/src/machine_validation.rs

wminckler reviewed Jun 25, 2026

View reviewed changes

Comment thread crates/api-db/src/machine_validation_execution.rs

wminckler approved these changes Jun 25, 2026

View reviewed changes

sunilkumar-nvidia merged commit 54a2b7a into NVIDIA:main Jun 26, 2026
58 checks passed

sunilkumar-nvidia mentioned this pull request Jun 26, 2026

fix(machine-validation): address machine validation stale review feedback #2911

Merged

10 tasks

coderabbitai Bot mentioned this pull request Jun 26, 2026

CI: Prevent PR's that result in uncommitted changes #2884

Merged

10 tasks

Uh oh!

Conversation

sunilkumar-nvidia commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Description

DB Changes

Compatibility

Related issues

Type of Change

Breaking Changes

Testing

Additional Notes

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

sunilkumar-nvidia commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Container Scan Summary

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sunilkumar-nvidia commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sunilkumar-nvidia commented Jun 24, 2026 •

edited

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

github-actions Bot commented Jun 24, 2026 •

edited

Loading