Skip to content

feat: helm chart for nico machine-a-tron#2764

Draft
akorobkov-nvda wants to merge 13 commits into
NVIDIA:mainfrom
akorobkov-nvda:feat/helm-nico-machine-a-tron
Draft

feat: helm chart for nico machine-a-tron#2764
akorobkov-nvda wants to merge 13 commits into
NVIDIA:mainfrom
akorobkov-nvda:feat/helm-nico-machine-a-tron

Conversation

@akorobkov-nvda

@akorobkov-nvda akorobkov-nvda commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Add a Helm chart for deploying NICo machine-a-tron as a K8s service. The machine-a-tron simulates bare-metal machines for NICo development and testing environments.

Features:

  • Deploy machine-a-tron as a K8s deployment
  • Support multiple named machine groups with different hardware configurations (GB300, DGX H100, power shelves, switches)
  • Graceful shutdown with configurable termination grace period (60s default)
  • TLS/SPIFFE certificate integration via cert-manager
  • Prometheus metrics endpoint and ServiceMonitor support
  • Optional persistent storage for machine state across restarts

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Testing Details

  • helm lint passes
  • helm template renders valid Kubernetes manifests
  • ConfigMap generates correct mat.toml configuration
  • Multiple machine groups render correctly with different hardware types

Additional Notes

Cleanup of deprecated config options

Removes deprecated options from dev/deployment/devspace/machine-a-tron.yaml:

  • use_pxe_api
  • bmc_mock_host_tar
  • bmc_mock_dpu_tar

NICo Site Configuration Required

For the mock to work, NICo must be configured with:

[site_explorer]
override_target_port = 1266
override_target_host = "nico-machine-a-tron-bmc-mock"

Example Usage

helm install mat ./helm/charts/nico-machine-a-tron \
  --set machines.dell-hosts.hostCount=10 \
  --set machines.dell-hosts.hwType=supermicro_gb300_nvl

@akorobkov-nvda akorobkov-nvda requested review from a team as code owners June 22, 2026 20:02
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 802082c2-1b84-4b0f-ac91-5038619c2798

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

A new nico-machine-a-tron Helm chart is introduced, providing all Kubernetes resources (Deployment, ConfigMap, Services, Certificate, RBAC, PVC, ServiceMonitor) and a complete values.yaml and helper templates to deploy a BMC mock machine simulator. The devspace machine-a-tron.yaml ConfigMap is separately updated to replace PXE/tar-file mock configuration with BMC mock port, single-mock mode, SSH mock server, persistence, and expected-machine registration settings.

Changes

nico-machine-a-tron Helm Chart

Layer / File(s) Summary
Chart metadata, values, and helper templates
helm/charts/nico-machine-a-tron/Chart.yaml, helm/charts/nico-machine-a-tron/values.yaml, helm/charts/nico-machine-a-tron/templates/_helpers.tpl
Establishes chart identity, all deployment/runtime defaults (image, ports, probes, Vault env, machineATron options, machine groups, persistence, external service), and reusable helpers for naming, image selection, labels, certificate spec, and ServiceMonitor spec.
ConfigMap and mat.toml generation
helm/charts/nico-machine-a-tron/templates/configmap.yaml, helm/charts/nico-machine-a-tron/tests/configmap_test.yaml
Defines a ConfigMap whose mat.toml is either user-supplied via configFiles.matConfig or rendered from values with conditional fields and a range loop over machine groups; three Helm unit tests cover defaults, value overrides, and the override passthrough.
ServiceAccount, RBAC, and cert-manager Certificate
helm/charts/nico-machine-a-tron/templates/service-account.yaml, helm/charts/nico-machine-a-tron/templates/rbac.yaml, helm/charts/nico-machine-a-tron/templates/certificate.yaml, helm/charts/nico-machine-a-tron/tests/certificate_test.yaml
Adds ServiceAccount, a namespace-scoped Role/RoleBinding granting cert-manager certificaterequest creation, and a Certificate with SPIFFE URI and generated DNS names; two Helm unit tests validate default and custom certificate rendering.
Deployment
helm/charts/nico-machine-a-tron/templates/deployment.yaml
Defines the Deployment with Recreate strategy, container image/command/environment wiring (including Vault credentials and extra env), named ports, TCP liveness/readiness probes, volumeMounts for config/SPIFFE/secrets/optional PVC, and matching pod-level volumes.
Services and PersistentVolumeClaim
helm/charts/nico-machine-a-tron/templates/service.yaml, helm/charts/nico-machine-a-tron/templates/metrics-service.yaml, helm/charts/nico-machine-a-tron/templates/external-service.yaml, helm/charts/nico-machine-a-tron/templates/pvc.yaml
Adds the bmc-mock Service (always redfish, conditionally ssh), metrics Service, conditionally rendered external LoadBalancer Service, and conditionally rendered PVC for persisted machine state.
ServiceMonitor and documentation
helm/charts/nico-machine-a-tron/templates/service-monitor.yaml, helm/charts/nico-machine-a-tron/README.md
Adds a conditionally emitted ServiceMonitor resource wired to the metrics Service; the README covers prerequisites, installation, configuration options, machine group setup, hwType reference, NICo network/site config, graceful shutdown, persistence, external access, and monitoring.

DevSpace ConfigMap Update

Layer / File(s) Summary
DevSpace mat.toml mock config update
dev/deployment/devspace/machine-a-tron.yaml
Replaces use_pxe_api and bmc_mock_*_tar fields with bmc_mock_port, use_single_bmc_mock, SSH mock server settings, persist_dir, and register_expected_machines in the devspace ConfigMap.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the primary change: introducing a Helm chart for NICo machine-a-tron deployment.
Description check ✅ Passed The description comprehensively explains the feature, testing approach, breaking changes, and provides contextual requirements.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@akorobkov-nvda akorobkov-nvda requested review from poroh and yoks June 22, 2026 20:03
@akorobkov-nvda akorobkov-nvda marked this pull request as draft June 22, 2026 20:03
@copy-pr-bot

copy-pr-bot Bot commented Jun 22, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
helm/charts/nico-machine-a-tron/tests/certificate_test.yaml (1)

13-22: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Custom-certificate test is missing behavior assertions.

This test sets custom values but does not assert rendered spec fields, so contract regressions pass unnoticed.

Add focused assertions for custom identity/DNS rendering
   - it: should use custom certificate settings
+    release:
+      namespace: nico-system
     set:
       certificate:
         serviceName: custom-service
         extraDnsNames:
           - extra.example.com
     asserts:
       - isKind:
           of: Certificate
+      - equal:
+          path: spec.commonName
+          value: custom-service.nico-system.svc.cluster.local
+      - contains:
+          path: spec.dnsNames
+          content: custom-service.nico-system.svc.cluster.local
+      - contains:
+          path: spec.dnsNames
+          content: extra.example.com
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm/charts/nico-machine-a-tron/tests/certificate_test.yaml` around lines 13
- 22, The test "should use custom certificate settings" only asserts the
resource kind but does not verify that the custom certificate values
(serviceName and extraDnsNames) are actually rendered in the spec. Add
assertions to check the rendered spec fields such as dnsNames and commonName to
confirm that the custom serviceName and extraDnsNames values are properly
included in the generated Certificate resource. This will ensure that changes to
certificate rendering logic are caught by the test.
🧹 Nitpick comments (3)
helm/charts/nico-machine-a-tron/README.md (2)

119-119: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Remove or contextualize internal crate path reference.

Line 119 references crates/bmc-mock/src/lib.rs as the source of the HostHardwareType enum. This is an implementation detail and exposes tight coupling to the internal crate structure in user-facing documentation. The hwType values are part of the public API contract, but the source location should not be documented here.

Either:

  1. Remove the reference entirely — the list of valid values is self-contained and sufficient.
  2. Reference official API or configuration documentation instead of the crate path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm/charts/nico-machine-a-tron/README.md` at line 119, The documentation at
line 119 in the Helm chart README exposes an internal implementation detail by
referencing the crate path `crates/bmc-mock/src/lib.rs` for the
`HostHardwareType` enum. Since the supported hwType values are part of the
public API contract, either remove the parenthetical reference to the crate path
entirely (as the list of valid values is self-contained and sufficient for
users), or replace it with a reference to official API or configuration
documentation. Do not reference internal crate structure paths in user-facing
Helm chart documentation.

Source: Path instructions


38-48: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Document missing configuration options in the table.

The configuration table omits several important options defined in values.yaml and rendered in the ConfigMap (e.g., configmap.yaml snippet shows cleanup_on_quit, persist_dir, register_expected_machines):

  • machineATron.mockBmcSshServer — Enable/disable mock SSH server (boolean)
  • machineATron.mockBmcSshPort — Port for mock SSH (default: 2222)
  • machineATron.registerExpectedMachines — Auto-register machines as expected (default: true, used in configmap.yaml)
  • machineATron.configureBmcProxyHost — Optional BMC proxy host configuration
  • machineATron.hostBmcPassword / machineATron.dpuBmcPassword — Optional password overrides
  • persistence.storageClass and persistence.size (referenced at lines 165–166 but not documented in table)

Add these to the table for completeness, or move optional/advanced options to a separate "Advanced Configuration" section to keep the primary table focused.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm/charts/nico-machine-a-tron/README.md` around lines 38 - 48, The
configuration parameters table is incomplete and missing several important
options that are defined in values.yaml and used in the ConfigMap. Add the
following missing parameters to the documentation table:
machineATron.mockBmcSshServer, machineATron.mockBmcSshPort,
machineATron.registerExpectedMachines, machineATron.configureBmcProxyHost,
machineATron.hostBmcPassword, machineATron.dpuBmcPassword,
persistence.storageClass, and persistence.size. For each option, include the
Parameter name, a clear Description, and the Default value. Alternatively, you
can create a separate "Advanced Configuration" section in the README to separate
optional or less commonly used parameters from the core configuration table to
maintain focus.
helm/charts/nico-machine-a-tron/Chart.yaml (1)

6-6: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Pin appVersion to an immutable release value.

appVersion: "latest" on Line 6 makes release metadata non-deterministic and harder to audit across upgrades.

Suggested change
-appVersion: "latest"
+appVersion: "0.1.0"

As per path instructions, "Review chart metadata for dependency/version consistency, appVersion correctness, and compatibility with related values and templates."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm/charts/nico-machine-a-tron/Chart.yaml` at line 6, The appVersion field
in the Chart.yaml file is currently set to "latest" which is non-deterministic
and makes release auditing difficult. Replace the appVersion value from "latest"
to a specific immutable release version number (such as a semantic version like
"1.0.0" or the actual release version being packaged) to ensure consistent and
auditable deployments across upgrades.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm/charts/nico-machine-a-tron/README.md`:
- Around line 26-28: The helm install example in the README.md file uses
incorrect values path syntax for the helm command. Update the --set flags to use
the correct chart values structure which follows the pattern
machines.<name>.<property> where <name> is a user-defined machine group name
(such as dell-hosts or dgx-hosts). Replace machines.config.hostCount with
machines.<groupname>.hostCount and machines.config.dpuPerHostCount with
machines.<groupname>.dpuPerHostCount. Either provide a concrete example with a
specific group name or add a clarifying comment explaining that users should
replace the placeholder group name according to their configuration.
- Line 139: The hardcoded service name `nico-machine-a-tron-bmc-mock` in the
override_target_host configuration at line 139 does not account for users who
customize the nameOverride value in their Helm values. Add documentation
explaining that the service name follows the pattern
`<chart-name-or-nameOverride>-bmc-mock` where the chart name is generated by the
name helper template that respects nameOverride, and include guidance
instructing users to verify their actual service name by running `kubectl get
svc` if they customize the deployment parameters.

In `@helm/charts/nico-machine-a-tron/templates/_helpers.tpl`:
- Around line 27-29: The conditional block that checks `.Values.global.image`
does not validate that both the `repository` and `tag` subfields are present
before composing the image reference. Add additional validation within the else
if block to ensure both `.Values.global.image.repository` and
`.Values.global.image.tag` are defined and non-empty before rendering the image
string on lines 27-28. If either field is missing, the template should fall
through to the else clause or handle it appropriately to avoid rendering an
invalid image reference that could cause deployment failures.

In `@helm/charts/nico-machine-a-tron/templates/certificate.yaml`:
- Around line 9-31: The certificate.yaml template is hardcoding DNS names and
SPIFFE URIs instead of using the exposed configuration values from the values
contract. Replace the hardcoded printf patterns in the commonName, dnsNames (the
main entries before the extraDnsNames loop), and uris (the main entry before the
extraUris loop) sections with references to the exposed configuration values:
serviceName, spiffeServiceName, identityNamespace, dnsNames, and uris from
.Values.certificate. If a shared helper function exists for certificate
configuration, use it to construct these values consistently across the template
rather than duplicating the logic inline.

In `@helm/charts/nico-machine-a-tron/templates/deployment.yaml`:
- Around line 31-107: The deployment template lacks explicit securityContext
configurations for both pod and container levels, relying on permissive runtime
defaults. Add a podSecurityContext section at the spec level (before the
containers array) and a containerSecurityContext section within the container
definition for nico-machine-a-tron (after imagePullPolicy or before command).
Reference these from .Values.podSecurityContext and
.Values.containerSecurityContext respectively to allow overrides, and ensure
they include hardened defaults such as runAsNonRoot, runAsUser,
readOnlyRootFilesystem, allowPrivilegeEscalation set to false, and appropriate
capabilities drops.

In `@helm/charts/nico-machine-a-tron/templates/external-service.yaml`:
- Around line 19-21: The external Service port configuration for the bmc-mock
port is missing an explicit targetPort specification, which creates an implicit
coupling between the service port value and the pod's actual listening port.
This diverges from the pattern used in the internal Service (service.yaml) which
specifies targetPort as a named port. Add a targetPort field to the bmc-mock
port configuration in external-service.yaml that explicitly references the named
pod port (redfish) to maintain consistency with the internal Service pattern and
ensure traffic routing remains correct if the service port values are ever
changed.

In `@helm/charts/nico-machine-a-tron/templates/rbac.yaml`:
- Around line 9-11: Review whether the application in the Deployment (lines
24-27) actually requires cert-manager certificate request creation capabilities.
If cert-manager integration is not needed, remove the entire apiGroups entry for
cert-manager.io with resources certificaterequests and verbs create from the
Role definition. If cert-manager is required, implement explicit opt-in controls
or gating mechanisms to restrict this capability rather than granting blanket
create permissions, and ensure the pod token mount in the Deployment is
protected with appropriate security contexts and network policies to mitigate
the risk of a compromised pod requesting arbitrary certificates.

In `@helm/charts/nico-machine-a-tron/tests/configmap_test.yaml`:
- Around line 12-15: The matchRegex assertion in the configmap test expects a
pattern of "host_count = 3" but the chart defaults actually render "host_count =
10". Update the pattern value in the matchRegex block where path is
data["mat.toml"] to expect "host_count = 10" instead of "host_count = 3" to
align the test with the actual default behavior of the chart.

In `@helm/charts/nico-machine-a-tron/values.yaml`:
- Around line 47-50: The BMC mock port configuration is duplicated in two
locations: service.bmcMock.port and machineATron.bmcMockPort, which can cause
drift when one is overridden. Consolidate to use service.bmcMock.port as the
single source of truth and update the TOML generation logic to read the port
value from service.bmcMock.port instead of reading from
machineATron.bmcMockPort. Remove the machineATron.bmcMockPort definition to
prevent future inconsistencies between service wiring and config generation.
- Line 30: The automountServiceAccountToken field in the Helm values.yaml file
is currently set to true, which automatically mounts Kubernetes API credentials
into the pod and unnecessarily increases the security exposure. Change
automountServiceAccountToken from true to false to follow the principle of least
privilege. This requires pods to explicitly opt-in to mounting service account
tokens rather than having them mounted by default, reducing the risk of
credential exposure.

---

Outside diff comments:
In `@helm/charts/nico-machine-a-tron/tests/certificate_test.yaml`:
- Around line 13-22: The test "should use custom certificate settings" only
asserts the resource kind but does not verify that the custom certificate values
(serviceName and extraDnsNames) are actually rendered in the spec. Add
assertions to check the rendered spec fields such as dnsNames and commonName to
confirm that the custom serviceName and extraDnsNames values are properly
included in the generated Certificate resource. This will ensure that changes to
certificate rendering logic are caught by the test.

---

Nitpick comments:
In `@helm/charts/nico-machine-a-tron/Chart.yaml`:
- Line 6: The appVersion field in the Chart.yaml file is currently set to
"latest" which is non-deterministic and makes release auditing difficult.
Replace the appVersion value from "latest" to a specific immutable release
version number (such as a semantic version like "1.0.0" or the actual release
version being packaged) to ensure consistent and auditable deployments across
upgrades.

In `@helm/charts/nico-machine-a-tron/README.md`:
- Line 119: The documentation at line 119 in the Helm chart README exposes an
internal implementation detail by referencing the crate path
`crates/bmc-mock/src/lib.rs` for the `HostHardwareType` enum. Since the
supported hwType values are part of the public API contract, either remove the
parenthetical reference to the crate path entirely (as the list of valid values
is self-contained and sufficient for users), or replace it with a reference to
official API or configuration documentation. Do not reference internal crate
structure paths in user-facing Helm chart documentation.
- Around line 38-48: The configuration parameters table is incomplete and
missing several important options that are defined in values.yaml and used in
the ConfigMap. Add the following missing parameters to the documentation table:
machineATron.mockBmcSshServer, machineATron.mockBmcSshPort,
machineATron.registerExpectedMachines, machineATron.configureBmcProxyHost,
machineATron.hostBmcPassword, machineATron.dpuBmcPassword,
persistence.storageClass, and persistence.size. For each option, include the
Parameter name, a clear Description, and the Default value. Alternatively, you
can create a separate "Advanced Configuration" section in the README to separate
optional or less commonly used parameters from the core configuration table to
maintain focus.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eb4a082a-caf1-4431-813b-b13ff963d2df

📥 Commits

Reviewing files that changed from the base of the PR and between 81ce98a and 525efe4.

📒 Files selected for processing (17)
  • dev/deployment/devspace/machine-a-tron.yaml
  • helm/charts/nico-machine-a-tron/Chart.yaml
  • helm/charts/nico-machine-a-tron/README.md
  • helm/charts/nico-machine-a-tron/templates/_helpers.tpl
  • helm/charts/nico-machine-a-tron/templates/certificate.yaml
  • helm/charts/nico-machine-a-tron/templates/configmap.yaml
  • helm/charts/nico-machine-a-tron/templates/deployment.yaml
  • helm/charts/nico-machine-a-tron/templates/external-service.yaml
  • helm/charts/nico-machine-a-tron/templates/metrics-service.yaml
  • helm/charts/nico-machine-a-tron/templates/pvc.yaml
  • helm/charts/nico-machine-a-tron/templates/rbac.yaml
  • helm/charts/nico-machine-a-tron/templates/service-account.yaml
  • helm/charts/nico-machine-a-tron/templates/service-monitor.yaml
  • helm/charts/nico-machine-a-tron/templates/service.yaml
  • helm/charts/nico-machine-a-tron/tests/certificate_test.yaml
  • helm/charts/nico-machine-a-tron/tests/configmap_test.yaml
  • helm/charts/nico-machine-a-tron/values.yaml
💤 Files with no reviewable changes (1)
  • dev/deployment/devspace/machine-a-tron.yaml

Comment thread helm/charts/nico-machine-a-tron/README.md Outdated
Comment thread helm/charts/nico-machine-a-tron/README.md Outdated
Comment thread helm/charts/nico-machine-a-tron/templates/_helpers.tpl Outdated
Comment thread helm/charts/nico-machine-a-tron/templates/certificate.yaml Outdated
Comment thread helm/charts/nico-machine-a-tron/templates/deployment.yaml
Comment thread helm/charts/nico-machine-a-tron/templates/external-service.yaml
Comment thread helm/charts/nico-machine-a-tron/templates/rbac.yaml
Comment thread helm/charts/nico-machine-a-tron/tests/configmap_test.yaml
Comment thread helm/charts/nico-machine-a-tron/values.yaml
Comment thread helm/charts/nico-machine-a-tron/values.yaml
@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 264 6 23 99 6 130
machine-validation-runner 714 34 183 266 35 196
machine_validation 714 34 183 266 35 196
nvmetal-carbide 714 34 183 266 35 196
TOTAL 2412 108 572 903 111 718

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@akorobkov-nvda akorobkov-nvda force-pushed the feat/helm-nico-machine-a-tron branch from bf620a9 to 6c688ab Compare June 22, 2026 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant