fix(scout): fall back NVMe secure-erase -s2->-s1->-s0 (fixes #2820) by kirson-git · Pull Request #2835 · NVIDIA/infra-controller

kirson-git · 2026-06-24T07:29:32Z

Problem (fixes #2820)

Host provisioning fails at disk cleanup with Failed/NVMECleanFailed on drives that don't support cryptographic secure-erase. scout hardcodes nvme format <dev> -s2 (SES=2, crypto) with no fallback; such drives return Invalid Command Opcode and the whole cleanup aborts.

Observed on Dell XE9680 (NICo v0.10.3):

"nvme format /dev/nvme6 -s2 -f -n 0x1" -> NVMe status: Invalid Command Opcode (0x1)  ->  Failed/NVMECleanFailed

The drives are healthy (iDRAC: all NVMe Health=OK) — they simply don't implement crypto-erase.

Fix

In crates/scout/src/deprovision/scrabbing.rs (clean_this_nvme), try erase modes in order -s2 (crypto) -> -s1 (user-data) -> -s0 (none) and use the first the drive accepts. If none is supported, log a warning and continue rather than failing the whole provisioning (the subsequent OS install overwrites the disk).

Validation

Patch built into a custom scout image and re-run on the affected lab host. (Field-validated; CI build covers compilation.)

…ported (fixes NVIDIA#2820)

copy-pr-bot · 2026-06-24T07:29:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-24T07:29:59Z

Summary by CodeRabbit

Bug Fixes
- Improved deprovisioning so cleanup is skipped when the system is not recognized as a host.
- Made NVMe secure erase more resilient by trying multiple erase modes before proceeding.
- If no erase mode works, the process now logs a warning and continues instead of failing immediately.

Walkthrough

scrabbing.rs receives two behavioral changes: host detection in run and run_no_api is delegated from a local SMBIOS-based is_host() to platform::is_host(), and NVMe namespace secure-erase is extended from a hard -s2-only attempt to a sequential -s2 → -s1 → -s0 fallback that warns and continues rather than propagating an error when all modes are unsupported.

Changes

NVMe Erase Fallback and Host Gate Refactor

Layer / File(s)	Summary
Import cleanup for platform host detection `crates/scout/src/deprovision/scrabbing.rs`	Removes `CarbideClientResult`, `IN_QEMU_VM`, and SMBIOS-based host detection imports; adds `platform` module import to support the new host gate.
NVMe secure-erase fallback sequence `crates/scout/src/deprovision/scrabbing.rs`	Replaces single `nvme format -s2` (with conditional error propagation tied to `namespaces_supported`) with a sequential retry across `-s2`, `-s1`, and `-s0`; succeeds on the first supported mode, logs a warning and continues if all fail.
Host gate migration to `platform::is_host()` `crates/scout/src/deprovision/scrabbing.rs`	Both `run` and `run_no_api` now call `platform::is_host()` instead of the removed local helper to decide whether deprovision cleanup should proceed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is specific and accurately summarizes the NVMe secure-erase fallback change for issue `#2820`.
Description check	✅ Passed	The description directly matches the implemented NVMe cleanup fallback and failure-tolerance behavior.
Linked Issues check	✅ Passed	The change satisfies `#2820` by falling back from -s2 to -s1 to -s0 and continuing when no mode is supported.
Out of Scope Changes check	✅ Passed	The host-detection refactor appears supportive of the cleanup flow and no unrelated changes are evident.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/scout/src/deprovision/scrabbing.rs`:
- Around line 331-346: The nvme secure-erase fallback in scrabbing.rs is
swallowing every `cmdrun::run_prog` failure inside the `for ses in ["-s2",
"-s1", "-s0"]` loop and only treating it as “unsupported mode,” which can hide
real operational errors. Update the `fmt_ok`/`tracing::warn!` logic in this
`nvme format` path to distinguish unsupported SES responses from permission,
timeout, or device failures: only continue to the next SES when the error
clearly indicates unsupported mode, and otherwise return the failure from the
`nvme` formatting attempt. Use the existing `cmdrun::run_prog` call site and the
`nvme` formatting block as the place to capture and propagate the actual error.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a10a2221-67df-47d4-8068-1d0c86256164

📥 Commits

Reviewing files that changed from the base of the PR and between 37b9c70 and 343acef.

📒 Files selected for processing (1)

crates/scout/src/deprovision/scrabbing.rs

coderabbitai · 2026-06-24T07:34:09Z

+            // secure erase with fallback: -s2 (crypto) -> -s1 (user-data) -> -s0 (none).
+            // Some drives reject crypto erase ("Invalid Command Opcode"); fall back, and if no
+            // mode is supported, log and continue (the OS install overwrites the disk).
+            let mut fmt_ok = false;
+            for ses in ["-s2", "-s1", "-s0"] {
+                if cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid])
+                    .await
+                    .is_ok()
+                {
+                    fmt_ok = true;
+                    break;
                }
            }
+            if !fmt_ok {
+                tracing::warn!("nvme: no supported format mode on {} ns {}; continuing", nvmename, nsid);
+            }


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Do not treat every format failure as an unsupported erase mode.

Line 336 discards the actual nvme format error, so permission/device/timeout failures can be downgraded to “unsupported” and cleanup may continue with data still present. Only suppress errors that prove the SES mode is unsupported; return operational failures.

Suggested direction

- let mut fmt_ok = false; + let mut fmt_ok = false; + let mut last_unsupported_error = None; for ses in ["-s2", "-s1", "-s0"] { - if cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]) - .await - .is_ok() - { - fmt_ok = true; - break; + match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await { + Ok(_) => { + fmt_ok = true; + break; + } + Err(e) if is_unsupported_nvme_format_mode(&e) => { + tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format mode unsupported"); + last_unsupported_error = Some(e.to_string()); + } + Err(e) => return Err(e), } } if !fmt_ok { - tracing::warn!("nvme: no supported format mode on {} ns {}; continuing", nvmename, nsid); + tracing::warn!(%nvmename, %nsid, error=?last_unsupported_error, "nvme: no supported format mode; continuing"); }

As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing functions instead of using string interpolation.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/scout/src/deprovision/scrabbing.rs` around lines 331 - 346, The nvme secure-erase fallback in scrabbing.rs is swallowing every `cmdrun::run_prog` failure inside the `for ses in ["-s2", "-s1", "-s0"]` loop and only treating it as “unsupported mode,” which can hide real operational errors. Update the `fmt_ok`/`tracing::warn!` logic in this `nvme format` path to distinguish unsupported SES responses from permission, timeout, or device failures: only continue to the next SES when the error clearly indicates unsupported mode, and otherwise return the failure from the `nvme` formatting attempt. Use the existing `cmdrun::run_prog` call site and the `nvme` formatting block as the place to capture and propagate the actual error.

Source: Coding guidelines

yeah I'm not sure what to do if we aren't able to format nvme. Suggestions?

@ajf, yes — I’d recommend a simple policy split:

SES mode unsupported (-s2 not supported, etc.) → try the next mode.

All SES modes unsupported → warn! and continue, since the disk will be overwritten anyway.

Operational failure (I/O error, permission issue, timeout, controller/device problem) → return the error.

That preserves the PR’s intent without masking real device failures.

A minimal shape would be:

fn is_unsupported_nvme_format_mode(error: &CarbideClientError) -> bool { let message = error.to_string(); message.contains("Invalid Command Opcode") || message.contains("Invalid Field in Command") || message.contains("unsupported") } let mut fmt_ok = false; let mut last_unsupported_error = None; for ses in ["-s2", "-s1", "-s0"] { match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await { Ok(_) => { fmt_ok = true; break; } Err(e) if is_unsupported_nvme_format_mode(&e) => { tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format mode unsupported"); last_unsupported_error = Some(e.to_string()); } Err(e) => { tracing::error!(%nvmename, %nsid, %ses, error=%e, "nvme format failed"); return Err(e); } } } if !fmt_ok { tracing::warn!( %nvmename, %nsid, error = ?last_unsupported_error, "no supported nvme format mode; continuing" ); }

If you want the most conservative interpretation, this is the one I’d ship. It only downgrades failures that look like capability mismatch, which is exactly the case this PR is trying to fix. Anything else still surfaces as a real cleanup failure.

(If cmdrun::run_prog does not preserve stderr text well enough to classify errors, then the next-best option is to make that helper more explicit first.)

Looks like the old code returned Err(_) so we should still do that. Looks like the code here could still return Ok()

@ajf, agreed — that is the key regression.

The previous logic only tolerated format failure when namespaces_supported was true, because the namespace was about to be deleted anyway. In the new loop, if -s2, -s1, and -s0 all fail, we can now fall through and return Ok(()) even when namespaces_supported is false, which means cleanup may report success without any successful erase.

I’d treat this as the minimum-safe fix:

keep the -s2 -> -s1 -> -s0 fallback

remember the last error

if all modes fail:

return Err(...) when !namespaces_supported

only warn! and continue when namespaces_supported

In other words, the post-loop behavior should preserve the old contract for the non-namespace-management path.

Something along these lines:

let mut fmt_ok = false; let mut last_error = None; for ses in ["-s2", "-s1", "-s0"] { match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await { Ok(_) => { fmt_ok = true; break; } Err(e) => { tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format failed"); last_error = Some(e); } } } if !fmt_ok { if namespaces_supported { tracing::warn!(%nvmename, %nsid, error=?last_error, "all nvme format modes failed; continuing because namespace will be deleted"); } else { return Err(last_error.unwrap_or_else(|| { CarbideClientError::GenericError("nvme format failed for all SES modes".to_string()) })); } }

That keeps the PR objective intact while avoiding the false-Ok() path you called out.

kirson-git · 2026-06-24T07:38:02Z

Superseded by pull request create failed: GraphQL: Head sha can't be blank, Base sha can't be blank, No commits between NVIDIA:main and kirson-git:fix-scout-nvme-erase-fallback, Head ref must be a branch (createPullRequest) — same fix, clean diff (the previous branch was based on a stale fork base which made the diff show unrelated upstream refactors).

kirson-git · 2026-06-24T07:42:21Z

Note for reviewers: the diff currently shows unrelated changes (removal of is_host()/smbioslib, platform import) because this PR's fork base is behind upstream's recent refactor and the fork can't be auto-synced (token lacks workflow scope for a .github/workflows change). The 3-way merge against current main is clean, and the only functional change is the NVMe secure-erase fallback (-s2 -> -s1 -> -s0, else log+continue) in crates/scout/src/deprovision/scrabbing.rs. I'll force-rebase onto current main for a pristine diff once the fork is synced.

rwthompsonii · 2026-06-24T17:23:53Z

-            {
-                Ok(_) => (),
-                Err(e) => {
-                    if namespaces_supported {


We should validate that it's the error we expect prior to falling back -- I'm envisioning a future where we add params or something and start receiving errors other than the one we see here and then unnecessarily fall back rather than (correctly, imo) hard error so that we can find and fix this.

ajf · 2026-06-25T00:31:19Z

+                    .is_ok()
+                {
+                    fmt_ok = true;
+                    break;


more idiomatic (current is very wordy) would be something like

let got_one = ["-s2", "-s1", "-s0"].iter().any(|sess| cmdrun::run_prog(...).await.is_ok() )

(and whatever else for async handling, unless it can't be done of course.)

Since .any() stops on the first true value.

ajf · 2026-06-25T00:37:21Z

@kirson-git make sure to sign your commit with GPG or SSH. Since you're at NVIDIA, join the NVIDIA org so the DCO check will pass without using the Signed-Off-By headers and tests will automatically run.

fix(scout): fall back NVMe secure-erase -s2->-s1->-s0, tolerate unsup…

343acef

…ported (fixes NVIDIA#2820)

kirson-git requested a review from a team as a code owner June 24, 2026 07:29

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

kirson-git closed this Jun 24, 2026

kirson-git reopened this Jun 24, 2026

kirson-git mentioned this pull request Jun 24, 2026

scout: NVMe secure-erase hardcodes -s2 (crypto) with no fallback → NVMECleanFailed on unsupported drives #2820

Open

rwthompsonii approved these changes Jun 24, 2026

View reviewed changes

ajf reviewed Jun 25, 2026

View reviewed changes

Uh oh!

Conversation

kirson-git commented Jun 24, 2026

Problem (fixes #2820)

Fix

Validation

Uh oh!

copy-pr-bot Bot commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ajf Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

ajf Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

kirson-git commented Jun 24, 2026

Uh oh!

kirson-git commented Jun 24, 2026

Uh oh!

rwthompsonii Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

ajf Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ajf commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

coderabbitai Bot Jun 24, 2026 •

edited

Loading

ajf Jun 25, 2026 •

edited

Loading