CI: tidy nightly test-matrix + bump torch to 2.12.1 by leofang · Pull Request #2272 · NVIDIA/cuda-python

leofang · 2026-06-27T03:58:06Z

Summary

Collapse per-row MODE / TORCH_VER / TORCH_CUDA of nightly entries into the ENV: map so they ride the existing matrix-env injection step in test-wheel-{linux,windows}.yml. Workflow selectors (ci-nightly.yml) and job-name strings updated accordingly.
Bump latest-PyTorch rows from 2.11.0 → 2.12.1; 2.9.1 rows unchanged.
Job names now also show the torch CUDA suffix, e.g. ", 2.12.1+cu126".
Align nightly section columns with the pull-request rows for readability.
Add a nightly-standard arm64 gh200 row, but comment it out for now: the gh200 runner currently hangs on stream-ordered memory allocator (cudaMallocAsync) calls. The row is left in place (with a TODO) so it can be re-enabled once the runner-side issue is resolved.
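
As a rough illustration of the first item, the sketch below flattens a row's nested ENV map into `KEY=value` lines, the format a GitHub Actions step appends to the file named by `$GITHUB_ENV`. The field names mirror the ones above. The helper name `env_lines` and the sample row values are assumptions. The real injection step in test-wheel-{linux,windows}.yml is not shown in this description and may differ.

```python
# Hypothetical matrix row after the refactor: per-row settings live under ENV
# instead of being top-level MODE / TORCH_VER / TORCH_CUDA keys.
row = {
    "ARCH": "amd64",   # illustrative value only
    "PY_VER": "3.12",  # illustrative value only
    "ENV": {
        "MODE": "nightly-pytorch",
        "TORCH_VER": "2.12.1",
        "TORCH_CUDA": "cu126",
    },
}


def env_lines(row):
    """Render a row's ENV map as KEY=value lines, sorted by key."""
    env = row.get("ENV") or {}
    return [f"{key}={value}" for key, value in sorted(env.items())]


# A workflow step could append these lines to the file named by $GITHUB_ENV.
print("\n".join(env_lines(row)))
```

Because every per-row setting travels through one map, a new variable needs only a new key in ENV rather than a new top-level matrix column.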

Test plan

Verify nightly matrix expansion across modes (nightly-pytorch, nightly-numba-cuda, nightly-standard) via a workflow run.
PyTorch 2.12.1 wheels (cu126 / cu130) install cleanly.
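
A minimal sketch of how a TORCH_VER / TORCH_CUDA pair could map to an install command, assuming the wheels are pulled from PyTorch's per-CUDA wheel index (`https://download.pytorch.org/whl/<cuda tag>`). The helper name `torch_install_cmd` is hypothetical, and the command the workflow actually runs is not shown in this PR.

```python
def torch_install_cmd(torch_ver, torch_cuda):
    """Build a pip command for a pinned torch wheel from a CUDA-specific index.

    Assumption: the CUDA tag (e.g. "cu126") selects the wheel index directory.
    """
    index_url = f"https://download.pytorch.org/whl/{torch_cuda}"
    return ["pip", "install", f"torch=={torch_ver}", "--index-url", index_url]


# The two CUDA variants named in the test plan.
for cuda_tag in ("cu126", "cu130"):
    print(" ".join(torch_install_cmd("2.12.1", cuda_tag)))
```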

- ci/test-matrix.yml: move per-row MODE/TORCH_VER/TORCH_CUDA into the ENV map (rides the existing matrix-env injection step). Add a nightly-standard arm64 gh200 row. Bump latest-PyTorch rows from 2.11.0 to 2.12.1; 2.9.1 rows untouched.
- .github/workflows/ci-nightly.yml: matrix_filter selectors now key on .ENV.MODE.
- .github/workflows/test-wheel-{linux,windows}.yml: job-name format strings read TORCH_VER/MODE from matrix.ENV; TORCH_CUDA also rendered in the name (e.g. ", 2.12.1+cu126"). Drop the now-redundant TORCH_VER/TORCH_CUDA lines from the pytorch step's env block.
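
The selector and job-name changes can be pictured with a small Python stand-in. It assumes rows shaped like those in ci/test-matrix.yml, filters on the nested ENV.MODE key (the Python analogue of a `.ENV.MODE` selector), and builds the torch suffix in the ", 2.12.1+cu126" form quoted in the commit message. The function names and sample rows are hypothetical. The real workflows use matrix_filter expressions and GitHub Actions format strings, not this code.

```python
# Hypothetical nightly rows; only the ENV layout matters for this sketch.
rows = [
    {"PY_VER": "3.12", "GPU": "l4", "ENV": {"MODE": "nightly-numba-cuda"}},
    {"PY_VER": "3.13", "GPU": "l4",
     "ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.12.1", "TORCH_CUDA": "cu126"}},
    {"PY_VER": "3.13", "GPU": "l4",
     "ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.9.1", "TORCH_CUDA": "cu126"}},
]


def select_mode(rows, mode):
    """Keep rows whose nested ENV.MODE equals the requested mode."""
    return [r for r in rows if (r.get("ENV") or {}).get("MODE") == mode]


def torch_suffix(row):
    """Return ', <TORCH_VER>+<TORCH_CUDA>' for torch rows, else an empty string."""
    env = row.get("ENV") or {}
    if "TORCH_VER" not in env:
        return ""
    cuda = env.get("TORCH_CUDA")
    version = env["TORCH_VER"] + (f"+{cuda}" if cuda else "")
    return f", {version}"


for r in select_mode(rows, "nightly-pytorch"):
    print(f"py{r['PY_VER']}, {r['GPU']}{torch_suffix(r)}")
```

Rows without torch settings get an empty suffix, so the same name template can serve all nightly modes.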

copy-pr-bot · 2026-06-27T03:58:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


leofang · 2026-06-27T04:19:49Z

/ok to test d29bc34

leofang · 2026-06-27T04:54:42Z

Killed the hanging G+H pipeline:
https://github.com/NVIDIA/cuda-python/actions/runs/28278398318/job/83789726071?pr=2272

@bdice also saw the same issue in RMM: https://github.com/rapidsai/rmm/actions/runs/28270219058/job/83767744891?pr=2457.

Will revisit later...


leofang · 2026-06-28T03:08:29Z

/ok to test 7e002a6

leofang · 2026-06-30T04:15:00Z

> Will revisit later...

I think Keith has been debugging this with Justin today (I did not follow closely). In the meantime I've commented out the G+H runner. We can merge and revisit in a separate PR once Justin fixes it on the infra side.

lijinf2 · 2026-06-30T23:21:58Z

+    - { ARCH: 'arm64', PY_VER: '3.12',  CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4',         GPU_COUNT: '1', DRIVER: 'latest',     ENV: { MODE: 'nightly-numba-cuda' } }
+    # nightly-standard (arm64 nightly-only runners — per runner team request)
+    # TODO: gh200 row disabled — currently hangs on stream-ordered memory
+    #       allocator (cudaMallocAsync); runner pool needs fixing first.


PR LGTM!

Spent some time understanding this TODO. It reads as a bit ambiguous: it is not clear to me whether we're waiting on a library fix or a runner infra fix. Either way, this does not block an approval.

github-actions · 2026-07-01T00:39:53Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

github-actions Bot added the CI/CD (CI/CD infrastructure) label Jun 27, 2026

leofang added 2 commits June 27, 2026 04:18

ci/test-matrix.yml: align nightly columns with pull-request rows

97ee1cf

Pad PY_VER and GPU columns in the nightly section to match the widths used by the pull-request rows above (17-char PY_VER, 19-char GPU). Purely cosmetic; YAML parse and matrix expansion unchanged.
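
The column alignment can be sketched as fixed-width left justification of each `KEY: 'value',` field. This assumes the stated widths count the key, the quoted value, the trailing comma, and the padding, which appears consistent with the nightly row quoted in the review comment. The helper name `pad_field` is hypothetical, and the actual edit may well have been done by hand.

```python
def pad_field(key, value, width):
    """Render "KEY: 'value'," left-justified to a fixed column width.

    Fields longer than the width are returned unpadded, not truncated.
    """
    return f"{key}: '{value}',".ljust(width)


# Widths taken from the commit message: 17 for PY_VER, 19 for GPU.
py_ver = pad_field("PY_VER", "3.12", 17)
gpu = pad_field("GPU", "l4", 19)
print(repr(py_ver + "..." + gpu))
```

Since only whitespace between flow-mapping entries changes, the parsed YAML is the same before and after padding.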

Temporarily add push trigger to ci-nightly.yml for testing

d29bc34

Remove before merging.


leofang added 2 commits June 28, 2026 02:55

ci/test-matrix.yml: temporarily comment out gh200 nightly row

4c70cfa

The gh200 runner currently hangs on stream-ordered memory allocator calls (cudaMallocAsync). Disabling until the runner-side issue is resolved.

Revert "Temporarily add push trigger to ci-nightly.yml for testing"

7e002a6

This reverts commit d29bc34.

leofang changed the title ~~CI: tidy nightly test-matrix + add arm64 gh200 + bump torch 2.12.1~~ CI: tidy nightly test-matrix + bump torch to 2.12.1 Jun 28, 2026

leofang marked this pull request as ready for review June 28, 2026 03:02

leofang self-assigned this Jun 28, 2026

leofang added this to the cuda.core next milestone Jun 28, 2026

leofang added the enhancement (Any code-related improvements) and P1 (Medium priority - Should do) labels Jun 28, 2026

leofang requested a review from mdboom June 30, 2026 04:11

leofang linked an issue Jun 30, 2026 that may be closed by this pull request

Bump tensor bridge version cap for PyTorch 2.12 #2089

Closed

leofang requested a review from rwgk June 30, 2026 19:51

lijinf2 approved these changes Jun 30, 2026


leofang merged commit f9f3849 into NVIDIA:main Jul 1, 2026
112 of 114 checks passed

leofang deleted the leofang/ci-test-matrix-env-refactor branch July 1, 2026 00:10

leofang merged 5 commits into NVIDIA:main from leofang:leofang/ci-test-matrix-env-refactor
