[Common] Use specialized unfused MXFP8 cast kernels by default #2958

Open
Oleg-Goncharov wants to merge 11 commits into NVIDIA:main from Oleg-Goncharov:pr_fast_default_mxfp8_kernels

Conversation

@Oleg-Goncharov
Collaborator

Description

This PR enables the fast unfused MXFP8 cast kernels by default.

Previously, these kernels were gated behind an environment variable and therefore were not used unless explicitly enabled. This change makes the specialized cast-only path the default behavior.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Removed the ENABLE_CAST_ONLY environment variable gate

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR promotes the specialized unfused MXFP8 cast kernels from opt-in (via ENABLE_CAST_ONLY env var) to the default path, while adding the runtime eligibility guards that make the promotion safe.

  • specialized/quantize_mxfp8.cuh: Removes is_cast_only_enabled() and the ENABLE_CAST_ONLY environment variable; all four hasSpec specializations now return true unconditionally. The commented-out FIXME debugging block is also cleaned up.
  • quantize_mxfp8.cuh: Adds scaling_type_has_specialized_support — guarding ROWWISE on cols % 128 == 0 (to prevent out-of-bounds reads on partial 128-element row tails) and both ROWWISE/BIDIMENSIONAL on grid-dim-Y ≤ 65535 — so the fast path is taken only when it is correct to do so. Removes the previously dead COLWISE case from the specialized switch (addressed from a prior review), wraps cudaFuncSetAttribute calls with NVTE_CHECK_CUDA, and consolidates per-case cudaGetLastError checks into a single post-switch call in the generic path.
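The eligibility guard described above can be sketched as follows. This is a host-side illustration, not the actual TransformerEngine code: the function name mirrors the summary, but the signature, enum, and the `kMaxGridDimY` constant are illustrative stand-ins.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch of the runtime eligibility guard described above.
// The real TE signature and types may differ.
enum class ScalingType { ROWWISE, COLWISE, BIDIMENSIONAL };

constexpr std::size_t kMaxGridDimY = 65535;  // CUDA limit on gridDim.y

bool scaling_type_has_specialized_support(ScalingType type,
                                          std::size_t cols,
                                          std::size_t grid_dim_y) {
  switch (type) {
    case ScalingType::ROWWISE:
      // A partial 128-element row tail would read out of bounds,
      // so the fast path requires cols to be a multiple of 128.
      return (cols % 128 == 0) && (grid_dim_y <= kMaxGridDimY);
    case ScalingType::BIDIMENSIONAL:
      // Only the launch-grid limit applies here.
      return grid_dim_y <= kMaxGridDimY;
    case ScalingType::COLWISE:
    default:
      return false;  // COLWISE always takes the generic path
  }
}
```

Any shape that fails the predicate simply falls through to the generic kernel path, so the guard narrows where the fast path runs without changing results.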

Confidence Score: 5/5

Safe to merge — the specialized kernels are guarded by correct runtime eligibility checks before being invoked, and error detection is strengthened throughout.

The behavioral change is narrow and well-defended: the cols % 128 == 0 guard prevents the known out-of-bounds access on partial row tails, the grid-dim-Y checks prevent exceeding CUDA launch limits, and COLWISE is correctly excluded from the specialized path. The generic fallback is preserved for all shapes that do not meet the criteria. cudaFuncSetAttribute and cudaGetLastError are now properly checked. The only finding is a formatting nit where the error-check call shares a line with the switch's closing brace.

No files require special attention.

Important Files Changed

Filename: Overview
transformer_engine/common/cast/mxfp8/specialized/quantize_mxfp8.cuh: Removes the is_cast_only_enabled() env-var gate and makes all four hasSpec specializations return true unconditionally, enabling the fast cast-only path by default.
transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh: Adds runtime eligibility guards (cols % 128 == 0, grid-dim-Y limits) before taking the specialized path; removes the now-unreachable COLWISE dead case; wraps cudaFuncSetAttribute and kernel launches with NVTE_CHECK_CUDA; consolidates per-case cudaGetLastError checks into one post-switch call. Minor style concern on line 839, where the check sits on the same line as the closing brace.
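The hasSpec promotion in the first file can be sketched as a compile-time trait. In the real code there are four concrete specializations; the primary template and the type names below are hypothetical stand-ins used only to show the shape of the change.

```cpp
#include <cassert>

// Illustrative sketch: previously each hasSpec specialization consulted the
// ENABLE_CAST_ONLY env var at runtime; now the trait answers true
// unconditionally, and correctness is instead enforced by the runtime
// eligibility guards at the dispatch site.
// (Template parameters and types here are stand-ins, not real TE types.)
struct bf16 {};
struct fp8e4m3 {};

template <typename InputT, typename OutputT>
struct hasSpec {
  static constexpr bool value = true;  // fast cast-only path on by default
};
```

Moving the decision from a process-wide env var to per-launch shape checks keeps the opt-out implicit: ineligible shapes fall back to the generic kernels automatically.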

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[quantize called] --> B{hasSpec AND\nnot swizzled scales?}
    B -- No --> G[Generic kernel path]
    B -- Yes --> C{scaling_type_has_specialized_support?}
    C -- No --> G
    C -- Yes --> D{scaling_type?}
    D -- ROWWISE\ncols%128==0 AND\ngrid fits --> E[specialized rowwise\ncast-only kernel]
    D -- BIDIMENSIONAL\ngrid fits --> F[specialized bidimensional\ncast-only kernel]
    E --> CUDA_CHECK[NVTE_CHECK_CUDA\ncudaGetLastError]
    F --> CUDA_CHECK
    CUDA_CHECK --> RETURN[return]
    G --> SW{scaling_type?}
    SW -- ROWWISE --> GR[generic ROWWISE kernel]
    SW -- COLWISE --> GC[generic COLWISE kernel]
    SW -- BIDIMENSIONAL --> GB[generic BIDIMENSIONAL kernel]
    GR & GC & GB --> GE[NVTE_CHECK_CUDA\ncudaGetLastError]

Reviews (6): Last reviewed commit: "Merge branch 'main' into pr_fast_default..."

ksivaman previously approved these changes May 5, 2026
Member

@ksivaman left a comment


LGTM

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@Oleg-Goncharov
Collaborator Author

/te-ci

ksivaman previously approved these changes May 5, 2026
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@Oleg-Goncharov
Collaborator Author

/te-ci

Oleg-Goncharov and others added 2 commits May 7, 2026 14:25
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@Oleg-Goncharov
Collaborator Author

/te-ci

@Oleg-Goncharov
Collaborator Author

/te-ci

@ptrendx
Member

ptrendx commented May 11, 2026

For future work we could consider adding swizzling support to that kernel, but I am not sure how needed it really is.

@ptrendx
Member

ptrendx commented May 11, 2026

/te-ci

3 participants