
fix(scout): fall back NVMe secure-erase -s2->-s1->-s0 (fixes #2820) #2835

Open
kirson-git wants to merge 1 commit into NVIDIA:main from kirson-git:fix/scout-nvme-secure-erase-fallback


@kirson-git

Contributor

Problem (fixes #2820)

Host provisioning fails at disk cleanup with Failed/NVMECleanFailed on drives that don't support cryptographic secure-erase. scout hardcodes nvme format <dev> -s2 (SES=2, crypto) with no fallback; such drives return Invalid Command Opcode and the whole cleanup aborts.

Observed on Dell XE9680 (NICo v0.10.3):

"nvme format /dev/nvme6 -s2 -f -n 0x1" -> NVMe status: Invalid Command Opcode (0x1)  ->  Failed/NVMECleanFailed

The drives are healthy (iDRAC: all NVMe Health=OK) — they simply don't implement crypto-erase.

Fix

In crates/scout/src/deprovision/scrabbing.rs (clean_this_nvme), try erase modes in order -s2 (crypto) -> -s1 (user-data) -> -s0 (none) and use the first the drive accepts. If none is supported, log a warning and continue rather than failing the whole provisioning (the subsequent OS install overwrites the disk).

Validation

Patch built into a custom scout image and re-run on the affected lab host. (Field-validated; CI build covers compilation.)

@kirson-git kirson-git requested a review from a team as a code owner June 24, 2026 07:29
@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Contributor


Summary by CodeRabbit

  • Bug Fixes
    • Improved deprovisioning so cleanup is skipped when the system is not recognized as a host.
    • Made NVMe secure erase more resilient by trying multiple erase modes before proceeding.
    • If no erase mode works, the process now logs a warning and continues instead of failing immediately.

Walkthrough

scrabbing.rs receives two behavioral changes: host detection in run and run_no_api is delegated from a local SMBIOS-based is_host() to platform::is_host(), and NVMe namespace secure-erase is extended from a hard -s2-only attempt to a sequential -s2 → -s1 → -s0 fallback that warns and continues rather than propagating an error when all modes are unsupported.

Changes

NVMe Erase Fallback and Host Gate Refactor

| Layer / File(s) | Summary |
| --- | --- |
| Import cleanup for platform host detection (crates/scout/src/deprovision/scrabbing.rs) | Removes CarbideClientResult, IN_QEMU_VM, and SMBIOS-based host detection imports; adds platform module import to support the new host gate. |
| NVMe secure-erase fallback sequence (crates/scout/src/deprovision/scrabbing.rs) | Replaces single nvme format -s2 (with conditional error propagation tied to namespaces_supported) with a sequential retry across -s2, -s1, and -s0; succeeds on the first supported mode, logs a warning and continues if all fail. |
| Host gate migration to platform::is_host() (crates/scout/src/deprovision/scrabbing.rs) | Both run and run_no_api now call platform::is_host() instead of the removed local helper to decide whether deprovision cleanup should proceed. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is specific and accurately summarizes the NVMe secure-erase fallback change for issue #2820. |
| Description check | ✅ Passed | The description directly matches the implemented NVMe cleanup fallback and failure-tolerance behavior. |
| Linked Issues check | ✅ Passed | The change satisfies #2820 by falling back from -s2 to -s1 to -s0 and continuing when no mode is supported. |
| Out of Scope Changes check | ✅ Passed | The host-detection refactor appears supportive of the cleanup flow and no unrelated changes are evident. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/scout/src/deprovision/scrabbing.rs`:
- Around line 331-346: The nvme secure-erase fallback in scrabbing.rs is
swallowing every `cmdrun::run_prog` failure inside the `for ses in ["-s2",
"-s1", "-s0"]` loop and only treating it as “unsupported mode,” which can hide
real operational errors. Update the `fmt_ok`/`tracing::warn!` logic in this
`nvme format` path to distinguish unsupported SES responses from permission,
timeout, or device failures: only continue to the next SES when the error
clearly indicates unsupported mode, and otherwise return the failure from the
`nvme` formatting attempt. Use the existing `cmdrun::run_prog` call site and the
`nvme` formatting block as the place to capture and propagate the actual error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a10a2221-67df-47d4-8068-1d0c86256164

📥 Commits

Reviewing files that changed from the base of the PR and between 37b9c70 and 343acef.

📒 Files selected for processing (1)
  • crates/scout/src/deprovision/scrabbing.rs

Comment on lines +331 to +346
// secure erase with fallback: -s2 (crypto) -> -s1 (user-data) -> -s0 (none).
// Some drives reject crypto erase ("Invalid Command Opcode"); fall back, and if no
// mode is supported, log and continue (the OS install overwrites the disk).
let mut fmt_ok = false;
for ses in ["-s2", "-s1", "-s0"] {
    if cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid])
        .await
        .is_ok()
    {
        fmt_ok = true;
        break;
    }
}
if !fmt_ok {
    tracing::warn!("nvme: no supported format mode on {} ns {}; continuing", nvmename, nsid);
}

@coderabbitai coderabbitai Bot Jun 24, 2026

Contributor


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Do not treat every format failure as an unsupported erase mode.

Line 336 discards the actual nvme format error, so permission/device/timeout failures can be downgraded to “unsupported” and cleanup may continue with data still present. Only suppress errors that prove the SES mode is unsupported; return operational failures.

Suggested direction
-            let mut fmt_ok = false;
+            let mut fmt_ok = false;
+            let mut last_unsupported_error = None;
             for ses in ["-s2", "-s1", "-s0"] {
-                if cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid])
-                    .await
-                    .is_ok()
-                {
-                    fmt_ok = true;
-                    break;
+                match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await {
+                    Ok(_) => {
+                        fmt_ok = true;
+                        break;
+                    }
+                    Err(e) if is_unsupported_nvme_format_mode(&e) => {
+                        tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format mode unsupported");
+                        last_unsupported_error = Some(e.to_string());
+                    }
+                    Err(e) => return Err(e),
                 }
             }
             if !fmt_ok {
-                tracing::warn!("nvme: no supported format mode on {} ns {}; continuing", nvmename, nsid);
+                tracing::warn!(%nvmename, %nsid, error=?last_unsupported_error, "nvme: no supported format mode; continuing");
             }

As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing functions instead of using string interpolation.”


Source: Coding guidelines

Collaborator


yeah I'm not sure what to do if we aren't able to format nvme. Suggestions?

Contributor


@ajf, yes — I’d recommend a simple policy split:

  1. SES mode unsupported (-s2 not supported, etc.) → try the next mode.
  2. All SES modes unsupported → warn! and continue, since the disk will be overwritten anyway.
  3. Operational failure (I/O error, permission issue, timeout, controller/device problem) → return the error.

That preserves the PR’s intent without masking real device failures.

A minimal shape would be:

fn is_unsupported_nvme_format_mode(error: &CarbideClientError) -> bool {
    let message = error.to_string();

    message.contains("Invalid Command Opcode")
        || message.contains("Invalid Field in Command")
        || message.contains("unsupported")
}

let mut fmt_ok = false;
let mut last_unsupported_error = None;

for ses in ["-s2", "-s1", "-s0"] {
    match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await {
        Ok(_) => {
            fmt_ok = true;
            break;
        }
        Err(e) if is_unsupported_nvme_format_mode(&e) => {
            tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format mode unsupported");
            last_unsupported_error = Some(e.to_string());
        }
        Err(e) => {
            tracing::error!(%nvmename, %nsid, %ses, error=%e, "nvme format failed");
            return Err(e);
        }
    }
}

if !fmt_ok {
    tracing::warn!(
        %nvmename,
        %nsid,
        error = ?last_unsupported_error,
        "no supported nvme format mode; continuing"
    );
}

If you want the most conservative interpretation, this is the one I’d ship. It only downgrades failures that look like capability mismatch, which is exactly the case this PR is trying to fix. Anything else still surfaces as a real cleanup failure.

(If cmdrun::run_prog does not preserve stderr text well enough to classify errors, then the next-best option is to make that helper more explicit first.)
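One way the helper could be made more explicit is sketched below. It assumes a hypothetical `CmdError` type and `from_parts` constructor; neither is part of scout's `cmdrun` module, whose real signature is not shown in this thread. The idea is only to keep the exit code and stderr separate so that a classifier never has to guess from a formatted message.

```rust
use std::fmt;

/// Hypothetical structured command error (an assumption for illustration,
/// not scout's real `CarbideClientError`).
#[derive(Debug, Clone, PartialEq)]
pub enum CmdError {
    /// The program could not be started at all (missing binary, permissions).
    Spawn(String),
    /// The program ran but exited unsuccessfully; stderr is kept verbatim.
    Exit { code: Option<i32>, stderr: String },
}

impl fmt::Display for CmdError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CmdError::Spawn(why) => write!(f, "failed to start command: {why}"),
            CmdError::Exit { code, stderr } => {
                write!(f, "command exited with {code:?}: {}", stderr.trim())
            }
        }
    }
}

/// Builds a result from the pieces of a finished process.
/// Exit code 0 is treated as success; anything else keeps stderr for later classification.
pub fn from_parts(code: Option<i32>, stdout: &str, stderr: &str) -> Result<String, CmdError> {
    if code == Some(0) {
        Ok(stdout.to_string())
    } else {
        Err(CmdError::Exit { code, stderr: stderr.to_string() })
    }
}

/// Returns true only for failures whose stderr names one of the two NVMe
/// status strings quoted in this thread. Spawn failures are never "unsupported".
pub fn is_unsupported_mode(err: &CmdError) -> bool {
    match err {
        CmdError::Exit { stderr, .. } => {
            stderr.contains("Invalid Command Opcode")
                || stderr.contains("Invalid Field in Command")
        }
        CmdError::Spawn(_) => false,
    }
}
```

With a shape like this, a permission or spawn failure can never be mistaken for a capability mismatch, because the classifier only inspects stderr from a process that actually ran.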

Collaborator


Looks like the old code returned Err(_) so we should still do that. Looks like the code here could still return Ok()

Contributor


@ajf, agreed — that is the key regression.

The previous logic only tolerated format failure when namespaces_supported was true, because the namespace was about to be deleted anyway. In the new loop, if -s2, -s1, and -s0 all fail, we can now fall through and return Ok(()) even when namespaces_supported is false, which means cleanup may report success without any successful erase.

I’d treat this as the minimum-safe fix:

  • keep the -s2 -> -s1 -> -s0 fallback
  • remember the last error
  • if all modes fail:
    • return Err(...) when !namespaces_supported
    • only warn! and continue when namespaces_supported

In other words, the post-loop behavior should preserve the old contract for the non-namespace-management path.

Something along these lines:

let mut fmt_ok = false;
let mut last_error = None;

for ses in ["-s2", "-s1", "-s0"] {
    match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await {
        Ok(_) => {
            fmt_ok = true;
            break;
        }
        Err(e) => {
            tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format failed");
            last_error = Some(e);
        }
    }
}

if !fmt_ok {
    if namespaces_supported {
        tracing::warn!(%nvmename, %nsid, error=?last_error, "all nvme format modes failed; continuing because namespace will be deleted");
    } else {
        return Err(last_error.unwrap_or_else(|| {
            CarbideClientError::GenericError("nvme format failed for all SES modes".to_string())
        }));
    }
}

That keeps the PR objective intact while avoiding the false-Ok() path you called out.
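The post-loop contract described above can also be pulled out into a small pure function, which makes it easy to unit test without touching a device. This is only a sketch: `EraseDecision` and `decide_after_fallback` are hypothetical names, not identifiers from scrabbing.rs.

```rust
/// Hypothetical outcome of the fallback loop (illustrative only).
#[derive(Debug, PartialEq)]
pub enum EraseDecision {
    /// Some SES mode succeeded.
    Proceed,
    /// Every mode failed, but the namespace is about to be deleted anyway.
    ProceedWithWarning,
    /// Every mode failed and nothing else will remove the data.
    Fail,
}

/// Maps the two flags discussed in this thread to a decision.
/// `fmt_ok`: whether any `nvme format` attempt succeeded.
/// `namespaces_supported`: whether the namespace-management path will run.
pub fn decide_after_fallback(fmt_ok: bool, namespaces_supported: bool) -> EraseDecision {
    match (fmt_ok, namespaces_supported) {
        (true, _) => EraseDecision::Proceed,
        (false, true) => EraseDecision::ProceedWithWarning,
        (false, false) => EraseDecision::Fail,
    }
}
```

The caller would then log a warning or return the last error depending on the variant, so the false-Ok() case becomes an explicit branch rather than a fall-through.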

@kirson-git

Contributor Author

Superseded by pull request create failed: GraphQL: Head sha can't be blank, Base sha can't be blank, No commits between NVIDIA:main and kirson-git:fix-scout-nvme-erase-fallback, Head ref must be a branch (createPullRequest) — same fix, clean diff (the previous branch was based on a stale fork base which made the diff show unrelated upstream refactors).

@kirson-git kirson-git closed this Jun 24, 2026
@kirson-git kirson-git reopened this Jun 24, 2026
@kirson-git

Contributor Author

Note for reviewers: the diff currently shows unrelated changes (removal of is_host()/smbioslib, platform import) because this PR's fork base is behind upstream's recent refactor and the fork can't be auto-synced (token lacks workflow scope for a .github/workflows change). The 3-way merge against current main is clean, and the only functional change is the NVMe secure-erase fallback (-s2 -> -s1 -> -s0, else log+continue) in crates/scout/src/deprovision/scrabbing.rs. I'll force-rebase onto current main for a pristine diff once the fork is synced.

{
Ok(_) => (),
Err(e) => {
if namespaces_supported {

Contributor


We should validate that it's the error we expect prior to falling back -- I'm envisioning a future where we add params or something and start receiving errors other than the one we see here and then unnecessarily fall back rather than (correctly, imo) hard error so that we can find and fix this.
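A possible way to validate the expected error before falling back is to match on the NVMe status code rather than on free text. The sketch below assumes the stderr line has the form quoted in the PR description, `NVMe status: Invalid Command Opcode (0x1)`; other nvme-cli versions may print it differently. `parse_nvme_status` and `is_expected_unsupported` are hypothetical helpers, not existing scout functions.

```rust
/// Parses a line such as "NVMe status: Invalid Command Opcode (0x1)" into
/// (status name, status code). Returns None if the line has another shape.
pub fn parse_nvme_status(line: &str) -> Option<(String, u16)> {
    // Everything after the "NVMe status:" marker.
    let rest = line.split_once("NVMe status:")?.1.trim();
    // Name is before the first '(', code is between '(' and the next ')'.
    let (name, tail) = rest.split_once('(')?;
    let code = tail.split_once(')')?.0.trim();
    let hex = code.strip_prefix("0x").or_else(|| code.strip_prefix("0X"))?;
    let value = u16::from_str_radix(hex, 16).ok()?;
    Some((name.trim().to_string(), value))
}

/// Assumed status code for "Invalid Command Opcode", as shown in the observed output.
pub const INVALID_COMMAND_OPCODE: u16 = 0x1;

/// True only when some stderr line carries exactly the status seen in issue #2820.
/// Any other failure (or unparseable output) should be treated as a hard error.
pub fn is_expected_unsupported(stderr: &str) -> bool {
    stderr
        .lines()
        .filter_map(parse_nvme_status)
        .any(|(_, code)| code == INVALID_COMMAND_OPCODE)
}
```

Because unknown output does not match, a future change in parameters that produces a different error would surface as a failure instead of silently triggering the fallback.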

.is_ok()
{
fmt_ok = true;
break;

@ajf ajf Jun 25, 2026

Collaborator


more idiomatic (current is very wordy) would be something like

let got_one = ["-s2", "-s1", "-s0"].iter().any(|sess| cmdrun::run_prog(...).await.is_ok() )

(and whatever else for async handling, unless it can't be done of course.)

Since .any() stops on the first true value.
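For reference, the short-circuit behaviour can be sketched with a synchronous stand-in for the command call. The closure passed to `Iterator::any` or `Iterator::find` is not async, so `.await` cannot appear inside it; the real `cmdrun::run_prog` call would still need the explicit loop or a stream-based equivalent. `first_accepted` and its `try_format` parameter are hypothetical names used only for this illustration.

```rust
/// Tries each mode in order and returns the first one `try_format` accepts.
/// `find` stops at the first match, just like `any`, but also reports which
/// mode succeeded.
pub fn first_accepted<'a, F>(modes: &[&'a str], mut try_format: F) -> Option<&'a str>
where
    F: FnMut(&str) -> bool,
{
    modes.iter().copied().find(|mode| try_format(*mode))
}
```

Called with `["-s2", "-s1", "-s0"]` and a closure that accepts only `-s1`, this would try `-s2`, then `-s1`, and never reach `-s0`.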

@ajf

ajf commented Jun 25, 2026

Collaborator

@kirson-git make sure to sign your commit with GPG or SSH. Since you're at NVIDIA, join the NVIDIA org so the DCO check will pass without using the Signed-Off-By headers and tests will automatically run.



Development

Successfully merging this pull request may close these issues.

scout: NVMe secure-erase hardcodes -s2 (crypto) with no fallback → NVMECleanFailed on unsupported drives

3 participants