CI: tidy nightly test-matrix + bump torch to 2.12.1 #2272
- ci/test-matrix.yml: move per-row MODE/TORCH_VER/TORCH_CUDA into the
ENV map (rides the existing matrix-env injection step). Add a
nightly-standard arm64 gh200 row. Bump latest-PyTorch rows from
2.11.0 to 2.12.1; 2.9.1 rows untouched.
- .github/workflows/ci-nightly.yml: matrix_filter selectors now key on
.ENV.MODE.
- .github/workflows/test-wheel-{linux,windows}.yml: job-name format
strings read TORCH_VER/MODE from matrix.ENV; TORCH_CUDA also rendered
in the name (e.g. ", 2.12.1+cu126"). Drop the now-redundant
TORCH_VER/TORCH_CUDA lines from the pytorch step's env block.
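A rough Python sketch of the two behaviours described above: selecting rows by `ENV.MODE` and rendering the torch suffix in the job name. This is not the workflow code itself; the real selectors and name strings live in the YAML workflows. The row values, the helper names `select_by_mode` and `job_name_suffix`, and the assumption that the suffix is `TORCH_VER` joined to `TORCH_CUDA` with `+` are all illustrative.

```python
# Hypothetical rows modelled on the nightly section of ci/test-matrix.yml.
# The key names (ARCH, PY_VER, ENV, MODE, TORCH_VER, TORCH_CUDA) come from
# the PR text; the concrete values of the first row are made up.
rows = [
    {"ARCH": "amd64", "PY_VER": "3.12",
     "ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.12.1", "TORCH_CUDA": "cu126"}},
    {"ARCH": "arm64", "PY_VER": "3.12",
     "ENV": {"MODE": "nightly-numba-cuda"}},
]


def select_by_mode(matrix_rows, mode):
    """Keep rows whose ENV.MODE equals `mode` (stand-in for the matrix_filter selector)."""
    return [r for r in matrix_rows if r.get("ENV", {}).get("MODE") == mode]


def job_name_suffix(row):
    """Build the torch part of a job name from the row's ENV map.

    Assumes the rendered form is ", <TORCH_VER>+<TORCH_CUDA>"; rows without
    TORCH_VER contribute nothing to the name.
    """
    env = row.get("ENV", {})
    torch_ver = env.get("TORCH_VER")
    if not torch_ver:
        return ""
    torch_cuda = env.get("TORCH_CUDA")
    return f", {torch_ver}+{torch_cuda}" if torch_cuda else f", {torch_ver}"


pytorch_rows = select_by_mode(rows, "nightly-pytorch")
print([job_name_suffix(r) for r in pytorch_rows])
```

Keeping the per-row values inside `ENV` means a selector and a name formatter only need to look in one place, which is the point of the move.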
Pad PY_VER and GPU columns in the nightly section to match the widths used by the pull-request rows above (17-char PY_VER, 19-char GPU). Purely cosmetic; YAML parse and matrix expansion unchanged.
Remove before merging.
/ok to test d29bc34
Killed the hanging G+H pipeline. @bdice also saw the same issue in RMM: https://github.com/rapidsai/rmm/actions/runs/28270219058/job/83767744891?pr=2457. Will revisit later...
The gh200 runner currently hangs on stream-ordered memory allocator calls (cudaMallocAsync). Disabling until the runner-side issue is resolved.
This reverts commit d29bc34.
/ok to test 7e002a6
I think Keith has been debugging this with Justin today (I did not follow closely). In the meantime I've commented out the G+H runner. We can merge and revisit in a separate PR once Justin fixes it on the infra side.
- { ARCH: 'arm64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', ENV: { MODE: 'nightly-numba-cuda' } }
# nightly-standard (arm64 nightly-only runners — per runner team request)
# TODO: gh200 row disabled — currently hangs on stream-ordered memory
# allocator (cudaMallocAsync); runner pool needs fixing first.
PR LGTM!
Spent some time understanding this TODO. It reads as a bit ambiguous: it's not clear to me whether we're waiting on a library fix or a runner infra fix. Either way, this does not block an approval.
Summary
- Move MODE/TORCH_VER/TORCH_CUDA of nightly entries into the ENV: map so they ride the existing matrix-env injection step in test-wheel-{linux,windows}.yml. Workflow selectors (ci-nightly.yml) and job-name strings updated accordingly.
- Bump latest-PyTorch rows 2.11.0 → 2.12.1; 2.9.1 rows unchanged.
- Job names now also render TORCH_CUDA, e.g. ", 2.12.1+cu126".
- Add a nightly-standard arm64 gh200 row, but comment it out for now: the gh200 runner currently hangs on stream-ordered memory allocator (cudaMallocAsync) calls. The row is left in place (with a TODO) so it can be re-enabled once the runner-side issue is resolved.
Test plan
- Exercise the nightly modes (nightly-pytorch, nightly-numba-cuda, nightly-standard) via a workflow run.