Implement 4over6 NVFP4 recipe by zianglih · Pull Request #2972 · NVIDIA/TransformerEngine

zianglih · 2026-05-09T03:50:20Z

Description

Implement 4over6 nvfp4 from:

Paper: https://arxiv.org/abs/2512.02010
Code: https://github.com/mit-han-lab/fouroversix

FlashInfer PR:

Support 4over6 nvfp4 for quantizer and fused MoE flashinfer-ai/flashinfer#3264

Enable per-block map-to-4 versus map-to-6 candidate selection for NVFP4 1D quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT, stochastic rounding, and 2D quantization to be disabled. Both original per-tensor scaling and row-scaling NVFP4 introduced by #2931 are supported.

This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Adds scoped NVFP4 4over6 control through NVTE_NVFP4_4OVER6=weights|activations|all, with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and C++ tensor/config APIs.
Implements 1D NVFP4 4over6 quantization in the existing NVFP4 CUDA paths by comparing TE-style map-to-4 and map-to-6 FP4 candidates with the original 4over6 MSE rule, choosing map-to-6 on ties, honoring NVTE_USE_FAST_MATH, and rejecting unsupported combinations such as 2D quantization, stochastic rounding, grouped tensors, and RHT.
Updates dequantization and NVFP4 GEMM scaling to respect per-tensor 4over6 metadata, using 256-based normalization for 4over6 tensors and 448-based normalization for regular NVFP4 tensors without requiring callers to do hidden rescaling.
Extends the Python reference implementation to mirror the intended ground truth, meaning TE-style candidate quantization plus original 4over6 MSE/compare logic, and uses this reference for bitwise exact tests where fast math is disabled.
Expands C++ and Python coverage across exact NVFP4 quantization, GEMM, dequantization, recipe scope resolution, quantized tensor handling, numerics, sanity, CUDA graph, torch compile, CPU offload, fusible ops, and backward override paths, while documenting the new environment variable and known unsupported modes.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-09T03:55:44Z

Greptile Summary

This PR introduces NVFP4 4over6 block-scale selection across quantization, dequantization, GEMM scaling, and Python reference paths. For each 1×16 block, both a map-to-4 (1.5× expanded E4M3 scale) and map-to-6 (standard scale) candidate are quantized; the lower-MSE candidate is stored, with ties resolved in favour of map-to-6. The global E4M3 bound changes from 448 to 256 in 4over6 mode, and the new mode is gated behind NVTE_NVFP4_4OVER6={weights,activations,all}.

Core CUDA paths: Adds quantize_4over6_nvfp4.cuh with GPU MSE comparison primitives; threads kUse4Over6/kUseFastMath template parameters through the tuned-1D and vector-blockwise kernels; updates the dequantize kernel and nvfp4_compute_per_tensor_scale to use 256 instead of 448 for 4over6 tensors.
Python / C++ plumbing: Propagates use_4over6 through NVFP4Quantizer, NVFP4Tensor/GroupedTensor storage, recipe state, split-quantize helpers, all-gather metadata, and the C++ TensorWrapper/QuantizationConfigWrapper APIs.
Reference implementation & tests: Extends NVFP4QuantizerRef._quantize_blockwise_4over6_reference to mirror the CUDA MSE logic in Python for exact-match tests; expands the test matrix across quantize-exact, GEMM-exact, dequantize, sanity, recipe, CPU-offload, CUDA-graphs, torch-compile, and backward-override suites.

Confidence Score: 5/5

The 4over6 feature is well-contained: new CUDA kernels add template parameters rather than changing existing code paths, incompatible modes are rejected at multiple layers, and the dequantize/GEMM scale changes are conditioned on the per-tensor nvfp4_4over6 flag so existing non-4over6 paths are unaffected.

The CUDA math is internally consistent with the Python reference: fp8-quantized block scales, the 1.5x map-to-4 expansion, the 256-based MSE denominator, and the tie-breaking rule all agree across kernels and reference. The plumbing is propagated consistently through every code path. The one fragility found — a double-getattr fallback in reference GEMM scaling — affects only the Python reference used in tests, not production quantization.

The reference GEMM scaling in transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py uses a nested getattr with a silent fallback; no production paths require special attention.

Important Files Changed

Filename	Overview
transformer_engine/common/cast/nvfp4/quantize_4over6_nvfp4.cuh	New file implementing 4over6 GPU primitives: MSE computation with FP16 dequant intermediate, candidate selection, and packing helpers; all paths guarded behind FP4_TYPE_SUPPORTED and ARCH_BLACKWELL_FAMILY
transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh	Adds USE_4OVER6 template parameter to both rowwise_scaling and colwise_scaling, threads global_amax_colwise into the colwise path, and switches the kernel dispatch macro nest to include 4over6 x fast-math combinations
transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu	Adds kUse4Over6/kUseFastMath template params to block_scaled_1d_cast_transpose_kernel; replaces local ComputeGlobalEncodeScaleFP4 with the shared compute_global_encode_scaling_factor_FP4 template; adds use_fast_math and use_4over6 to the function signature
transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh	Promotes row_scaled_nvfp4 and use_4over6 from runtime booleans to compile-time template parameters; correctly switches the factor_inv denominator to 6x256 for 4over6 tensors
transformer_engine/common/recipe/nvfp4.cu	Bug fix: compute_nvfp4_per_tensor_scale_kernel now receives fp8_max_A/fp8_max_B parameters instead of the hardcoded 448 constant, correctly using 256 for 4over6 tensors
transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py	Adds _quantize_blockwise_4over6_reference mirroring the CUDA MSE logic; updates GEMM scaling to use 256/448 based on per-operand use_4over6; shape arithmetic across block/row dimensions appears correct
transformer_engine/pytorch/csrc/extensions/cast.cpp	Threads use_4over6 into bulk-alloc tensor construction, split-quantize helpers, and group-quantize guard
transformer_engine/pytorch/tensor/nvfp4_tensor.py	Consistently carries use_4over6 through new, clone, _view/_reshape autograd functions, reduce_ex, and the all-gather rowwise collect/reconstruct path
transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py	Adds use_4over6 property alongside the row_scaled_nvfp4 property refactor; reset in clear() and propagated through from_quantizer and split_into_quantized_tensors

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input tensor x] --> B[compute global amax]
    B --> C{use_4over6?}
    C -->|No| D[Standard NVFP4 S_enc = 448x6 / amax]
    D --> E[block scale S_dec, quantize and store]
    C -->|Yes| F[4over6 NVFP4 S_enc = 256x6 / amax]
    F --> G[map-to-6 candidate]
    F --> H[map-to-4 candidate with 1.5x scale]
    G --> I[MSE_map6 using 6x256 denom]
    H --> J[MSE_map4 using 6x256 denom]
    I --> K{MSE_map4 < MSE_map6?}
    J --> K
    K -->|Yes| L[Store map-to-4 data]
    K -->|No or tie| M[Store map-to-6 data]
    L --> N[Output: nvfp4_4over6=True, global bound=256]
    M --> N
    E --> O[Output: nvfp4_4over6=False, global bound=448]

_{Reviews (5): Last reviewed commit: "Allow separate recipe 4over6 config" | Re-trigger Greptile}

zianglih · 2026-05-11T07:16:24Z

Functionality has been verified by internal RL experiments.
We may want to allow separate 4over6 config for weights and activations, maybe NVTE_NVFP4_ENABLE_4OVER6=weights|activations|all.

zianglih · 2026-05-11T21:17:24Z

Need to rebase.

Signed-off-by: Ziang Li <ziangli@umich.edu>

The following tests passed: `NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 python3 -m pytest --tb=auto tests/pytorch/test_sanity.py ` `NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 NVTE_TEST_NVINSPECT_ENABLED=1 NVTE_TEST_NVINSPECT_CONFIG_FILE=tests/pytorch/debug/test_configs/dummy_feature.yaml NVTE_TEST_NVINSPECT_FEATURE_DIRS=transformer_engine/debug/features PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 python3 -m pytest --tb=auto tests/pytorch/test_sanity.py ` Signed-off-by: Ziang Li <ziangli@umich.edu>

Signed-off-by: Ziang Li <ziangli@umich.edu>

timmoon10 · 2026-05-11T23:44:32Z

   *  its values are populated during quantization.
   */
  kNVTERowScaledNVFP4 = 8,
+  kNVTENVFP44Over6 = 9, /*!< Whether an NVFP4 tensor uses 4over6 scaling */


We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.

4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.

timmoon10 · 2026-05-12T00:04:07Z

  using namespace detail;
-  constexpr float fp8_max = TypeExtrema<fp8e4m3>::max;  // 448.0f;
-  constexpr float fp4_max = TypeExtrema<fp4e2m1>::max;  // 6.0f;
+  constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max;  // 448.0f;


How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way keep can keep this logic confined to quantization, and downstream consumers are completely unaffected.

If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.

From the original paper:

Finally, we make one modification to the computation of the tensor scale α (Equation 1) when
quantizing to NVFP4 with 4/6. When MFP4 ×MFP8 is used to compute the tensor scale, it ensures
that all quantized values will be less than 6 ×448. However, this makes it impossible to select a scale
of 4 for the blocks that contain a tensor’s largest values, because the block’s scale would need to be
448 × 6/4 = 672, which would overflow since 448 is the maximum value that can be represented by
E4M3. As a result, when computing the tensor scale, we replace MFP8 to 256 in Equation 1, since
256 is the largest E4M3 that can be multiplied by 6/4 and represented without error in E4M3, as 384.

Also:

In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8
E4M3 value rather than the default of 448, as this allows blocks with a tensor’s largest value to have
the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit
over using the standard tensor scale calculation. Even though this adjustment only affects a small
number of large values, this performance gain may come from the fact that larger activation values
can have an outsize impact on model performance. This adjustment is incorporated into the remaining
experiments in this section.

timmoon10 · 2026-05-12T00:25:11Z

This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.

Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.

Signed-off-by: Ziang Li <ziangli@umich.edu>

zianglih marked this pull request as draft May 9, 2026 03:50

zianglih changed the title ~~Implement 4over6 nvfp4~~ Implement 4over6 nvfp4 recipe May 9, 2026

zianglih mentioned this pull request May 9, 2026

[Roadmap] Blackwell MXFP8 and NVFP4 RL training radixark/miles#615

Open

30 tasks

zianglih changed the title ~~Implement 4over6 nvfp4 recipe~~ Implement 4over6 NVFP4 recipe May 9, 2026

greptile-apps Bot reviewed May 9, 2026

View reviewed changes

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp

Comment thread transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu Outdated

Comment thread transformer_engine/common/recipe/__init__.py

ziang-and force-pushed the 4over6 branch from f3f4127 to 9ff4c3a Compare May 9, 2026 08:53

zianglih commented May 9, 2026

View reviewed changes

Comment thread tests/pytorch/test_sanity.py Outdated

zianglih mentioned this pull request May 9, 2026

Support 4over6 nvfp4 for quantizer and fused MoE flashinfer-ai/flashinfer#3264

Open

5 tasks

ziang-and force-pushed the 4over6 branch from a989400 to 097c7aa Compare May 10, 2026 09:11

zianglih marked this pull request as ready for review May 10, 2026 09:36

ptrendx assigned Oleg-Goncharov May 11, 2026

ptrendx requested a review from negvet May 11, 2026 17:12

ptrendx added community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. fp4 labels May 11, 2026

zianglih marked this pull request as draft May 11, 2026 21:17

zianglih added 12 commits May 11, 2026 14:24

Initial implementation

88c8414

Signed-off-by: Ziang Li <ziangli@umich.edu>

Make 4over6 compile time for dequant

c4cc629

Signed-off-by: Ziang Li <ziangli@umich.edu>

Expand 1d fwd+bwd test

f9e29fb

Signed-off-by: Ziang Li <ziangli@umich.edu>

Refactor

0d17c4f

Signed-off-by: Ziang Li <ziangli@umich.edu>

Clean up

ef3544d

Signed-off-by: Ziang Li <ziangli@umich.edu>

Clean up

ab0d982

Signed-off-by: Ziang Li <ziangli@umich.edu>

Add gemm test

9adcf01

Signed-off-by: Ziang Li <ziangli@umich.edu>

Add more tests and fix offload

badd325

Signed-off-by: Ziang Li <ziangli@umich.edu>

Fix offload

9e41bd9

Signed-off-by: Ziang Li <ziangli@umich.edu>

Clean up arg

035d3ba

Signed-off-by: Ziang Li <ziangli@umich.edu>

Add more test

8738373

Signed-off-by: Ziang Li <ziangli@umich.edu>

Add more tests

2068d25

Signed-off-by: Ziang Li <ziangli@umich.edu>

zianglih added 11 commits May 11, 2026 14:24

Clean up test

cf20b7e

Signed-off-by: Ziang Li <ziangli@umich.edu>

Refactor cuh kernel impl

097bd27

Signed-off-by: Ziang Li <ziangli@umich.edu>

Further extract

cc8f47b

Signed-off-by: Ziang Li <ziangli@umich.edu>

Clean up

6275597

Signed-off-by: Ziang Li <ziangli@umich.edu>

Add recipe_id

9e0affe

Signed-off-by: Ziang Li <ziangli@umich.edu>

Fix failing unit tests

51c23b1

Signed-off-by: Ziang Li <ziangli@umich.edu>

Clean up test

42eaa35

Signed-off-by: Ziang Li <ziangli@umich.edu>

Clean up

a96904d

Signed-off-by: Ziang Li <ziangli@umich.edu>

Refactor ref

0e52cb2

Signed-off-by: Ziang Li <ziangli@umich.edu>

Update comments and docs

5990b71

Signed-off-by: Ziang Li <ziangli@umich.edu>

ziang-and force-pushed the 4over6 branch from 53aad5e to 1dcd003 Compare May 11, 2026 21:41

zianglih added 2 commits May 11, 2026 14:51

Refactor QuantizerRole

69b7df5

Signed-off-by: Ziang Li <ziangli@umich.edu>

Allow separate recipe 4over6 config

98fdeee

Signed-off-by: Ziang Li <ziangli@umich.edu>

zianglih marked this pull request as ready for review May 11, 2026 22:36

timmoon10 requested changes May 12, 2026

View reviewed changes

timmoon10 reviewed May 12, 2026

View reviewed changes

zianglih marked this pull request as draft May 12, 2026 02:01

zianglih added 8 commits May 11, 2026 19:45

Support 2d

3bb42b1

Signed-off-by: Ziang Li <ziangli@umich.edu>

Refactor 2d

888e91c

Signed-off-by: Ziang Li <ziangli@umich.edu>

Clean up anti pattern

bdc961a

Signed-off-by: Ziang Li <ziangli@umich.edu>

Enforce 4over6 consistency

1181823

Signed-off-by: Ziang Li <ziangli@umich.edu>

Update comments

c80399f

Signed-off-by: Ziang Li <ziangli@umich.edu>

Update docs

a14c59f

Signed-off-by: Ziang Li <ziangli@umich.edu>

Fix test

3f6ad8f

Signed-off-by: Ziang Li <ziangli@umich.edu>

Drop test_fusible_ops

69f9ccc

Signed-off-by: Ziang Li <ziangli@umich.edu>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement 4over6 NVFP4 recipe#2972

Implement 4over6 NVFP4 recipe#2972
zianglih wants to merge 33 commits into
NVIDIA:mainfrom
zianglih:4over6

zianglih commented May 9, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zianglih commented May 11, 2026 •

edited

Loading

Uh oh!

zianglih commented May 11, 2026

Uh oh!

timmoon10 May 11, 2026 •

edited

Loading

Uh oh!

zianglih May 12, 2026

Uh oh!

timmoon10 May 12, 2026

Uh oh!

zianglih May 12, 2026

Uh oh!

Uh oh!

Uh oh!

timmoon10 May 12, 2026

Uh oh!

zianglih May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

zianglih commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Changes

Checklist:

Uh oh!

greptile-apps Bot commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zianglih commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zianglih commented May 11, 2026

Uh oh!

timmoon10 May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zianglih May 12, 2026

Choose a reason for hiding this comment

Uh oh!

timmoon10 May 12, 2026

Choose a reason for hiding this comment

Uh oh!

zianglih May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

timmoon10 May 12, 2026

Choose a reason for hiding this comment

Uh oh!

zianglih May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

zianglih commented May 9, 2026 •

edited

Loading

greptile-apps Bot commented May 9, 2026 •

edited

Loading

zianglih commented May 11, 2026 •

edited

Loading

timmoon10 May 11, 2026 •

edited

Loading