fix(scout): fall back NVMe secure-erase -s2->-s1->-s0 (fixes #2820)#2835
fix(scout): fall back NVMe secure-erase -s2->-s1->-s0 (fixes #2820)#2835kirson-git wants to merge 1 commit into
Conversation
Summary by CodeRabbit
Walkthrough
ChangesNVMe Erase Fallback and Host Gate Refactor
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/scout/src/deprovision/scrabbing.rs`:
- Around line 331-346: The nvme secure-erase fallback in scrabbing.rs is
swallowing every `cmdrun::run_prog` failure inside the `for ses in ["-s2",
"-s1", "-s0"]` loop and only treating it as “unsupported mode,” which can hide
real operational errors. Update the `fmt_ok`/`tracing::warn!` logic in this
`nvme format` path to distinguish unsupported SES responses from permission,
timeout, or device failures: only continue to the next SES when the error
clearly indicates unsupported mode, and otherwise return the failure from the
`nvme` formatting attempt. Use the existing `cmdrun::run_prog` call site and the
`nvme` formatting block as the place to capture and propagate the actual error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a10a2221-67df-47d4-8068-1d0c86256164
📒 Files selected for processing (1)
crates/scout/src/deprovision/scrabbing.rs
| // secure erase with fallback: -s2 (crypto) -> -s1 (user-data) -> -s0 (none). | ||
| // Some drives reject crypto erase ("Invalid Command Opcode"); fall back, and if no | ||
| // mode is supported, log and continue (the OS install overwrites the disk). | ||
| let mut fmt_ok = false; | ||
| for ses in ["-s2", "-s1", "-s0"] { | ||
| if cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]) | ||
| .await | ||
| .is_ok() | ||
| { | ||
| fmt_ok = true; | ||
| break; | ||
| } | ||
| } | ||
| if !fmt_ok { | ||
| tracing::warn!("nvme: no supported format mode on {} ns {}; continuing", nvmename, nsid); | ||
| } |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Do not treat every format failure as an unsupported erase mode.
Line 336 discards the actual nvme format error, so permission/device/timeout failures can be downgraded to “unsupported” and cleanup may continue with data still present. Only suppress errors that prove the SES mode is unsupported; return operational failures.
Suggested direction
- let mut fmt_ok = false;
+ let mut fmt_ok = false;
+ let mut last_unsupported_error = None;
for ses in ["-s2", "-s1", "-s0"] {
- if cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid])
- .await
- .is_ok()
- {
- fmt_ok = true;
- break;
+ match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await {
+ Ok(_) => {
+ fmt_ok = true;
+ break;
+ }
+ Err(e) if is_unsupported_nvme_format_mode(&e) => {
+ tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format mode unsupported");
+ last_unsupported_error = Some(e.to_string());
+ }
+ Err(e) => return Err(e),
}
}
if !fmt_ok {
- tracing::warn!("nvme: no supported format mode on {} ns {}; continuing", nvmename, nsid);
+ tracing::warn!(%nvmename, %nsid, error=?last_unsupported_error, "nvme: no supported format mode; continuing");
}As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing functions instead of using string interpolation.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/scout/src/deprovision/scrabbing.rs` around lines 331 - 346, The nvme
secure-erase fallback in scrabbing.rs is swallowing every `cmdrun::run_prog`
failure inside the `for ses in ["-s2", "-s1", "-s0"]` loop and only treating it
as “unsupported mode,” which can hide real operational errors. Update the
`fmt_ok`/`tracing::warn!` logic in this `nvme format` path to distinguish
unsupported SES responses from permission, timeout, or device failures: only
continue to the next SES when the error clearly indicates unsupported mode, and
otherwise return the failure from the `nvme` formatting attempt. Use the
existing `cmdrun::run_prog` call site and the `nvme` formatting block as the
place to capture and propagate the actual error.
Source: Coding guidelines
There was a problem hiding this comment.
yeah I'm not sure what to do if we aren't able to format nvme. Suggestions?
There was a problem hiding this comment.
@ajf, yes — I’d recommend a simple policy split:
- SES mode unsupported (
-s2not supported, etc.) → try the next mode. - All SES modes unsupported →
warn!and continue, since the disk will be overwritten anyway. - Operational failure (I/O error, permission issue, timeout, controller/device problem) → return the error.
That preserves the PR’s intent without masking real device failures.
A minimal shape would be:
fn is_unsupported_nvme_format_mode(error: &CarbideClientError) -> bool {
let message = error.to_string();
message.contains("Invalid Command Opcode")
|| message.contains("Invalid Field in Command")
|| message.contains("unsupported")
}
let mut fmt_ok = false;
let mut last_unsupported_error = None;
for ses in ["-s2", "-s1", "-s0"] {
match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await {
Ok(_) => {
fmt_ok = true;
break;
}
Err(e) if is_unsupported_nvme_format_mode(&e) => {
tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format mode unsupported");
last_unsupported_error = Some(e.to_string());
}
Err(e) => {
tracing::error!(%nvmename, %nsid, %ses, error=%e, "nvme format failed");
return Err(e);
}
}
}
if !fmt_ok {
tracing::warn!(
%nvmename,
%nsid,
error = ?last_unsupported_error,
"no supported nvme format mode; continuing"
);
}If you want the most conservative interpretation, this is the one I’d ship. It only downgrades failures that look like capability mismatch, which is exactly the case this PR is trying to fix. Anything else still surfaces as a real cleanup failure.
(If cmdrun::run_prog does not preserve stderr text well enough to classify errors, then the next-best option is to make that helper more explicit first.)
There was a problem hiding this comment.
Looks like the old code returned Err(_) so we should still do that. Looks like the code here could still return Ok()
There was a problem hiding this comment.
@ajf, agreed — that is the key regression.
The previous logic only tolerated format failure when namespaces_supported was true, because the namespace was about to be deleted anyway. In the new loop, if -s2, -s1, and -s0 all fail, we can now fall through and return Ok(()) even when namespaces_supported is false, which means cleanup may report success without any successful erase.
I’d treat this as the minimum-safe fix:
- keep the
-s2 -> -s1 -> -s0fallback - remember the last error
- if all modes fail:
return Err(...)when!namespaces_supported- only
warn!and continue whennamespaces_supported
In other words, the post-loop behavior should preserve the old contract for the non-namespace-management path.
Something along these lines:
let mut fmt_ok = false;
let mut last_error = None;
for ses in ["-s2", "-s1", "-s0"] {
match cmdrun::run_prog(NVME_CLI_PROG, ["format", nvmename, ses, "-f", "-n", nsid]).await {
Ok(_) => {
fmt_ok = true;
break;
}
Err(e) => {
tracing::debug!(%nvmename, %nsid, %ses, error=%e, "nvme format failed");
last_error = Some(e);
}
}
}
if !fmt_ok {
if namespaces_supported {
tracing::warn!(%nvmename, %nsid, error=?last_error, "all nvme format modes failed; continuing because namespace will be deleted");
} else {
return Err(last_error.unwrap_or_else(|| {
CarbideClientError::GenericError("nvme format failed for all SES modes".to_string())
}));
}
}That keeps the PR objective intact while avoiding the false-Ok() path you called out.
|
Superseded by pull request create failed: GraphQL: Head sha can't be blank, Base sha can't be blank, No commits between NVIDIA:main and kirson-git:fix-scout-nvme-erase-fallback, Head ref must be a branch (createPullRequest) — same fix, clean diff (the previous branch was based on a stale fork base which made the diff show unrelated upstream refactors). |
|
Note for reviewers: the diff currently shows unrelated changes (removal of |
| { | ||
| Ok(_) => (), | ||
| Err(e) => { | ||
| if namespaces_supported { |
There was a problem hiding this comment.
We should validate that it's the error we expect prior to falling back -- I'm envisioning a future where we add params or something and start receiving errors other than the one we see here and then unnecessarily fall back rather than (correctly, imo) hard error so that we can find and fix this.
| .is_ok() | ||
| { | ||
| fmt_ok = true; | ||
| break; |
There was a problem hiding this comment.
more idiomatic (current is very wordy) would be something like
let got_one = ["-s2", "-s1", "-s0"].iter().any(|sess| cmdrun::run_prog(...).await.is_ok() )(and whatever else for async handling, unless it can't be done of course.)
Since .any() stops on the first true value.
|
@kirson-git make sure to sign your commit with GPG or SSH. Since you're at NVIDIA, join the NVIDIA org so the DCO check will pass without using the |
Problem (fixes #2820)
Host provisioning fails at disk cleanup with
Failed/NVMECleanFailedon drives that don't support cryptographic secure-erase. scout hardcodesnvme format <dev> -s2(SES=2, crypto) with no fallback; such drives returnInvalid Command Opcodeand the whole cleanup aborts.Observed on Dell XE9680 (NICo v0.10.3):
The drives are healthy (iDRAC: all NVMe Health=OK) — they simply don't implement crypto-erase.
Fix
In
crates/scout/src/deprovision/scrabbing.rs(clean_this_nvme), try erase modes in order-s2(crypto) ->-s1(user-data) ->-s0(none) and use the first the drive accepts. If none is supported, log a warning and continue rather than failing the whole provisioning (the subsequent OS install overwrites the disk).Validation
Patch built into a custom scout image and re-run on the affected lab host. (Field-validated; CI build covers compilation.)