Skip to content

feat: Add machine admin operation endpoints#2849

Draft
kfelternv wants to merge 2 commits into
NVIDIA:mainfrom
kfelternv:admin-core-ops
Draft

feat: Add machine admin operation endpoints#2849
kfelternv wants to merge 2 commits into
NVIDIA:mainfrom
kfelternv:admin-core-ops

Conversation

@kfelternv

@kfelternv kfelternv commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Resolves #2218

@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 950159f3-8eb5-4265-a74d-149dd0983813

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
rest-api/openapi/spec.yaml (3)

9268-9298: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Align the upsert PUT with the CreateOrUpdate naming convention.

This is a PUT whose own description states it will "Add or update a Machine health report override" — i.e. a create-or-replace operation — yet it is named insert-machine-health-report with summary "Insert Machine health report". This diverges from the repository convention and from sibling PUTs in this very file (create-or-update-bmc-credential, create-or-update-tenant-identity-config). The operationId flows into the generated SDK method name, so the inconsistency surfaces to clients.

♻️ Proposed naming alignment
     put:
-      summary: Insert Machine health report
+      summary: Create Or Update Machine health report
       tags:
         - Health Report
-      operationId: insert-machine-health-report
+      operationId: create-or-update-machine-health-report

As per path instructions: "PUT endpoints that create or replace a resource should use CreateOrUpdate naming consistently across handlers, summaries, operation IDs, and generated SDK methods."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/openapi/spec.yaml` around lines 9268 - 9298, The PUT operation for
the machine health report is using insert-style naming even though it is a
create-or-replace endpoint. Update the OpenAPI entry for the machine health
report PUT so its summary and operationId follow the repository’s CreateOrUpdate
convention, matching sibling endpoints like create-or-update-bmc-credential and
create-or-update-tenant-identity-config. Keep the naming consistent across the
spec so the generated SDK method name reflects a create-or-update action rather
than insert-machine-health-report.

Source: Path instructions


9111-9114: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Declare the new operation tags (or fold them into Machine).

These endpoints introduce four new tags — Admin Power (Line 9114), BMC Reset (Line 9158), DPU Reprovision (Line 9202), and Health Report (Lines 9249/9271/9322) — none of which are declared in the top-level tags array (Lines 18-283). The established pattern in this spec groups resource-scoped operations under a single descriptive tag (rack power/firmware/bring-up all sit under Rack; the other machine/{machineId}/... sub-resources sit under Machine). Undeclared tags render without descriptions in Redoc and fragment these closely-related machine admin operations across four orphan groups. Prefer reusing Machine for this cohort, or, if distinct grouping is intended, declare each new tag with a description in the top-level tags array.

As per path instructions: "Review OpenAPI docs and examples for accuracy, deprecation clarity, client-facing compatibility, spelling, and consistency with spec.yaml."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/openapi/spec.yaml` around lines 9111 - 9114, The new machine admin
operations are using undeclared tags, which will create orphan groups in the
docs. Update the affected operations in the OpenAPI spec to reuse the existing
Machine tag, or declare Admin Power, BMC Reset, DPU Reprovision, and Health
Report in the top-level tags array with descriptions so Redoc groups them
consistently. Make the change in the relevant operation entries and the shared
tags section of spec.yaml.

Source: Path instructions


13283-13491: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add request/response examples to the new DTOs.

The newly introduced schemas (AdminPowerControlRequest/Response, BmcResetRequest/Response, DpuReprovisionRequest/Response, and the MachineHealthReport* family) carry no examples, whereas the overwhelming majority of request/response schemas in this spec do. Examples materially improve the Redoc reference and the generated SDK docs; their absence here is a visible quality regression for these admin endpoints.

As per path instructions: verify the OpenAPI spec is updated "with matching schema, required/nullable semantics, and examples when applicable."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/openapi/spec.yaml` around lines 13283 - 13491, The new DTO schemas
for admin power, BMC reset, DPU reprovision, and machine health reports are
missing OpenAPI examples, which makes the generated docs inconsistent with the
rest of the spec. Update the schema definitions in the components section for
AdminPowerControlRequest/AdminPowerControlResponse,
BmcResetRequest/BmcResetResponse, DpuReprovisionRequest/DpuReprovisionResponse,
and the MachineHealthReport* types to include representative examples that match
each schema’s required fields and enum values. Keep the examples aligned with
the existing required/nullable semantics and use the schema names themselves as
the anchor points when editing.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rest-api/api/pkg/api/handler/adminmachine.go`:
- Around line 82-88: Treat failures from common.GetSiteFromIDString in
adminmachine.go as server-side integrity issues when the source is
machine.SiteID from the stored Machine row, not request input. Update the error
handling around GetSiteFromIDString so only true request-derived validation
errors return a bad-request response; if machine.SiteID is missing, invalid, or
the site record is absent, log it as an internal inconsistency and return an
internal-server-error response from the existing handler path.

In `@rest-api/api/pkg/api/handler/bmcreset.go`:
- Around line 52-58: The BmcReset handler currently accepts an empty request
body even though APIBmcResetRequest is documented as required, so align the
runtime behavior with the OpenAPI contract. Update the logic in bmcreset.go’s
request-binding path to reject omitted bodies with a BadRequest, or explicitly
mark the body optional in the API definition if zero-value requests are
intended; ensure the behavior in the handler matches the documented contract and
uses the existing c.Bind/apiReq flow consistently.

In `@rest-api/api/pkg/api/model/healthoverride.go`:
- Around line 231-249: Blank timestamp strings are being treated as missing
values, which lets invalid date-time input pass through. Update
validateOptionalRFC3339 and stringToProtoTime so empty strings are rejected
during validation rather than converted to nil; only nil/omitted values or valid
RFC3339 timestamps should be accepted. Keep the behavior consistent for
observedAt and alerts[].inAlertSince so spec.yaml request semantics stay aligned
with the documented date-time contract.

---

Nitpick comments:
In `@rest-api/openapi/spec.yaml`:
- Around line 9268-9298: The PUT operation for the machine health report is
using insert-style naming even though it is a create-or-replace endpoint. Update
the OpenAPI entry for the machine health report PUT so its summary and
operationId follow the repository’s CreateOrUpdate convention, matching sibling
endpoints like create-or-update-bmc-credential and
create-or-update-tenant-identity-config. Keep the naming consistent across the
spec so the generated SDK method name reflects a create-or-update action rather
than insert-machine-health-report.
- Around line 9111-9114: The new machine admin operations are using undeclared
tags, which will create orphan groups in the docs. Update the affected
operations in the OpenAPI spec to reuse the existing Machine tag, or declare
Admin Power, BMC Reset, DPU Reprovision, and Health Report in the top-level tags
array with descriptions so Redoc groups them consistently. Make the change in
the relevant operation entries and the shared tags section of spec.yaml.
- Around line 13283-13491: The new DTO schemas for admin power, BMC reset, DPU
reprovision, and machine health reports are missing OpenAPI examples, which
makes the generated docs inconsistent with the rest of the spec. Update the
schema definitions in the components section for
AdminPowerControlRequest/AdminPowerControlResponse,
BmcResetRequest/BmcResetResponse, DpuReprovisionRequest/DpuReprovisionResponse,
and the MachineHealthReport* types to include representative examples that match
each schema’s required fields and enum values. Keep the examples aligned with
the existing required/nullable semantics and use the schema names themselves as
the anchor points when editing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bf6fe322-f8b3-4d51-93da-6b5406404c4d

📥 Commits

Reviewing files that changed from the base of the PR and between ea94336 and 950140d.

⛔ Files ignored due to path filters (17)
  • rest-api/sdk/standard/api_admin_power.go is excluded by !rest-api/sdk/standard/api_*.go
  • rest-api/sdk/standard/api_bmc_reset.go is excluded by !rest-api/sdk/standard/api_*.go
  • rest-api/sdk/standard/api_dpu_reprovision.go is excluded by !rest-api/sdk/standard/api_*.go
  • rest-api/sdk/standard/api_health_report.go is excluded by !rest-api/sdk/standard/api_*.go
  • rest-api/sdk/standard/client.go is excluded by !rest-api/sdk/standard/client.go
  • rest-api/sdk/standard/model_admin_power_control_request.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_admin_power_control_response.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_bmc_reset_request.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_bmc_reset_response.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_dpu_reprovision_request.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_dpu_reprovision_response.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_machine_health_report_entry.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_machine_health_report_list_response.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_machine_health_report_probe_alert.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_machine_health_report_probe_success.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_machine_health_report_remove_response.go is excluded by !rest-api/sdk/standard/model_*.go
  • rest-api/sdk/standard/model_machine_health_report_response.go is excluded by !rest-api/sdk/standard/model_*.go
📒 Files selected for processing (14)
  • rest-api/api/pkg/api/handler/adminmachine.go
  • rest-api/api/pkg/api/handler/adminops_test.go
  • rest-api/api/pkg/api/handler/adminpower.go
  • rest-api/api/pkg/api/handler/bmcreset.go
  • rest-api/api/pkg/api/handler/dpureprovision.go
  • rest-api/api/pkg/api/handler/healthoverride.go
  • rest-api/api/pkg/api/model/adminops_test.go
  • rest-api/api/pkg/api/model/adminpower.go
  • rest-api/api/pkg/api/model/bmcreset.go
  • rest-api/api/pkg/api/model/dpureprovision.go
  • rest-api/api/pkg/api/model/healthoverride.go
  • rest-api/api/pkg/api/routes.go
  • rest-api/api/pkg/api/routes_test.go
  • rest-api/openapi/spec.yaml

Comment on lines +82 to +88
site, err := common.GetSiteFromIDString(ctx, nil, machine.SiteID.String(), b.dbSession)
if err != nil {
if errors.Is(err, cdb.ErrDoesNotExist) || errors.Is(err, common.ErrInvalidID) {
return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Machine Site does not exist", nil)
}
logger.Error().Err(err).Msg("error retrieving Machine Site from DB")
return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to retrieve Machine Site due to DB error", nil)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Treat an invalid machine.SiteID as a server-side integrity failure.

machine.SiteID comes from the stored Machine row, not from the request. If that site is missing or unparsable, the machine record is inconsistent; returning 400 misclassifies a broken DB relationship as caller error and hides an internal data-integrity problem.

Suggested fix
 	site, err := common.GetSiteFromIDString(ctx, nil, machine.SiteID.String(), b.dbSession)
 	if err != nil {
 		if errors.Is(err, cdb.ErrDoesNotExist) || errors.Is(err, common.ErrInvalidID) {
-			return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Machine Site does not exist", nil)
+			logger.Error().Err(err).
+				Str("machineID", machineID).
+				Str("siteID", machine.SiteID.String()).
+				Msg("machine references an invalid Site")
+			return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Machine has an invalid Site association", nil)
 		}
 		logger.Error().Err(err).Msg("error retrieving Machine Site from DB")
 		return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to retrieve Machine Site due to DB error", nil)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
site, err := common.GetSiteFromIDString(ctx, nil, machine.SiteID.String(), b.dbSession)
if err != nil {
if errors.Is(err, cdb.ErrDoesNotExist) || errors.Is(err, common.ErrInvalidID) {
return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Machine Site does not exist", nil)
}
logger.Error().Err(err).Msg("error retrieving Machine Site from DB")
return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to retrieve Machine Site due to DB error", nil)
site, err := common.GetSiteFromIDString(ctx, nil, machine.SiteID.String(), b.dbSession)
if err != nil {
if errors.Is(err, cdb.ErrDoesNotExist) || errors.Is(err, common.ErrInvalidID) {
logger.Error().Err(err).
Str("machineID", machineID).
Str("siteID", machine.SiteID.String()).
Msg("machine references an invalid Site")
return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Machine has an invalid Site association", nil)
}
logger.Error().Err(err).Msg("error retrieving Machine Site from DB")
return nil, "", nil, cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to retrieve Machine Site due to DB error", nil)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/api/pkg/api/handler/adminmachine.go` around lines 82 - 88, Treat
failures from common.GetSiteFromIDString in adminmachine.go as server-side
integrity issues when the source is machine.SiteID from the stored Machine row,
not request input. Update the error handling around GetSiteFromIDString so only
true request-derived validation errors return a bad-request response; if
machine.SiteID is missing, invalid, or the site record is absent, log it as an
internal inconsistency and return an internal-server-error response from the
existing handler path.

Comment on lines +52 to +58
if c.Request().ContentLength != 0 {
if err := c.Bind(&apiReq); err != nil {
if !errors.Is(err, io.EOF) {
return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Invalid request body", nil)
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject an omitted body or mark the request body optional in the contract.

This endpoint documents APIBmcResetRequest as required, but the current branch accepts an empty body and treats it as the zero-value request. That makes the wire behavior looser than the advertised REST contract.

Suggested fix
 	var apiReq model.APIBmcResetRequest
-	if c.Request().ContentLength != 0 {
-		if err := c.Bind(&apiReq); err != nil {
-			if !errors.Is(err, io.EOF) {
-				return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Invalid request body", nil)
-			}
-		}
+	if err := c.Bind(&apiReq); err != nil {
+		if errors.Is(err, io.EOF) {
+			return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Request body is required", nil)
+		}
+		return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Invalid request body", nil)
 	}
As per path instructions, review REST API server changes for validation and consistency with the OpenAPI specification.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if c.Request().ContentLength != 0 {
if err := c.Bind(&apiReq); err != nil {
if !errors.Is(err, io.EOF) {
return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Invalid request body", nil)
}
}
}
var apiReq model.APIBmcResetRequest
if err := c.Bind(&apiReq); err != nil {
if errors.Is(err, io.EOF) {
return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Request body is required", nil)
}
return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Invalid request body", nil)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/api/pkg/api/handler/bmcreset.go` around lines 52 - 58, The BmcReset
handler currently accepts an empty request body even though APIBmcResetRequest
is documented as required, so align the runtime behavior with the OpenAPI
contract. Update the logic in bmcreset.go’s request-binding path to reject
omitted bodies with a BadRequest, or explicitly mark the body optional in the
API definition if zero-value requests are intended; ensure the behavior in the
handler matches the documented contract and uses the existing c.Bind/apiReq flow
consistently.

Source: Path instructions

Comment on lines +231 to +249
func validateOptionalRFC3339(value *string, field string) error {
if value == nil || *value == "" {
return nil
}
if _, err := time.Parse(time.RFC3339Nano, *value); err != nil {
return fmt.Errorf("%s must be RFC3339 timestamp: %w", field, err)
}
return nil
}

func stringToProtoTime(value *string) *timestamppb.Timestamp {
if value == nil || *value == "" {
return nil
}
t, err := time.Parse(time.RFC3339Nano, *value)
if err != nil {
return nil
}
return timestamppb.New(t)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Reject blank timestamp strings instead of silently dropping them.

validateOptionalRFC3339 currently accepts "", and stringToProtoTime then turns that same value into nil. A client can therefore send observedAt: "" or alerts[].inAlertSince: "", get a 200, and still be outside the documented date-time contract while Core never receives the field. Please make blank strings fail validation so the only accepted states are omitted/null or a real RFC3339 timestamp.

Suggested fix
 func validateOptionalRFC3339(value *string, field string) error {
-	if value == nil || *value == "" {
+	if value == nil {
 		return nil
 	}
+	if *value == "" {
+		return fmt.Errorf("%s must be RFC3339 timestamp", field)
+	}
 	if _, err := time.Parse(time.RFC3339Nano, *value); err != nil {
 		return fmt.Errorf("%s must be RFC3339 timestamp: %w", field, err)
 	}
 	return nil
 }
 
 func stringToProtoTime(value *string) *timestamppb.Timestamp {
-	if value == nil || *value == "" {
+	if value == nil {
 		return nil
 	}
 	t, err := time.Parse(time.RFC3339Nano, *value)
 	if err != nil {
 		return nil

As per path instructions, model changes must keep spec.yaml request semantics aligned with the documented RFC3339/date-time fields.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func validateOptionalRFC3339(value *string, field string) error {
if value == nil || *value == "" {
return nil
}
if _, err := time.Parse(time.RFC3339Nano, *value); err != nil {
return fmt.Errorf("%s must be RFC3339 timestamp: %w", field, err)
}
return nil
}
func stringToProtoTime(value *string) *timestamppb.Timestamp {
if value == nil || *value == "" {
return nil
}
t, err := time.Parse(time.RFC3339Nano, *value)
if err != nil {
return nil
}
return timestamppb.New(t)
func validateOptionalRFC3339(value *string, field string) error {
if value == nil {
return nil
}
if *value == "" {
return fmt.Errorf("%s must be RFC3339 timestamp", field)
}
if _, err := time.Parse(time.RFC3339Nano, *value); err != nil {
return fmt.Errorf("%s must be RFC3339 timestamp: %w", field, err)
}
return nil
}
func stringToProtoTime(value *string) *timestamppb.Timestamp {
if value == nil {
return nil
}
t, err := time.Parse(time.RFC3339Nano, *value)
if err != nil {
return nil
}
return timestamppb.New(t)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/api/pkg/api/model/healthoverride.go` around lines 231 - 249, Blank
timestamp strings are being treated as missing values, which lets invalid
date-time input pass through. Update validateOptionalRFC3339 and
stringToProtoTime so empty strings are rejected during validation rather than
converted to nil; only nil/omitted values or valid RFC3339 timestamps should be
accepted. Keep the behavior consistent for observedAt and alerts[].inAlertSince
so spec.yaml request semantics stay aligned with the documented date-time
contract.

Source: Path instructions

@kfelternv kfelternv mentioned this pull request Jun 24, 2026
2 tasks
Co-authored-by: Kyle Felter <kfelternv@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add REST API support for state-changing admin ops: power, bmc-reset, dpu-reprovision, health-override

1 participant