Implement 4over6 NVFP4 recipe #2972
Conversation
Greptile Summary

This PR introduces NVFP4 4over6 block-scale selection across quantization, dequantization, GEMM scaling, and Python reference paths. For each 1×16 block, both a map-to-4 (1.5× expanded E4M3 scale) and a map-to-6 (standard scale) candidate are quantized; the lower-MSE candidate is stored, with ties resolved in favour of map-to-6. The global E4M3 bound changes from 448 to 256 in 4over6 mode, and the new mode is gated behind the `NVTE_NVFP4_4OVER6` environment variable.
Confidence Score: 5/5

The 4over6 feature is well-contained: new CUDA kernels add template parameters rather than changing existing code paths, incompatible modes are rejected at multiple layers, and the dequantize/GEMM scale changes are conditioned on the per-tensor nvfp4_4over6 flag, so existing non-4over6 paths are unaffected. The CUDA math is internally consistent with the Python reference: the fp8-quantized block scales, the 1.5× map-to-4 expansion, the 256-based MSE denominator, and the tie-breaking rule all agree across kernels and reference. The plumbing is propagated consistently through every code path. The one fragility found is in the reference GEMM scaling in transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py, which uses a nested getattr with a silent fallback; it affects only the Python reference used in tests, not production quantization, so no production paths require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input tensor x] --> B[compute global amax]
    B --> C{use_4over6?}
    C -->|No| D[Standard NVFP4 S_enc = 448x6 / amax]
    D --> E[block scale S_dec, quantize and store]
    C -->|Yes| F[4over6 NVFP4 S_enc = 256x6 / amax]
    F --> G[map-to-6 candidate]
    F --> H[map-to-4 candidate with 1.5x scale]
    G --> I[MSE_map6 using 6x256 denom]
    H --> J[MSE_map4 using 6x256 denom]
    I --> K{MSE_map4 < MSE_map6?}
    J --> K
    K -->|Yes| L[Store map-to-4 data]
    K -->|No or tie| M[Store map-to-6 data]
    L --> N[Output: nvfp4_4over6=True, global bound=256]
    M --> N
    E --> O[Output: nvfp4_4over6=False, global bound=448]
```
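To make the flow above concrete, here is a minimal NumPy sketch of the per-block candidate selection, written from the summary and flowchart rather than from the actual kernels or the Python reference in transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py. All names are illustrative, `round_to_e4m3` is deliberately simplified to a clamp, and the `(6 * 256)^2` normalization is a constant that does not change which candidate wins:

```python
import numpy as np

# E2M1 (FP4) representable magnitudes.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_fp4(x):
    """Round each element to the nearest E2M1 magnitude, keeping the sign."""
    idx = np.argmin(np.abs(np.abs(x)[..., None] - FP4_VALUES), axis=-1)
    return np.sign(x) * FP4_VALUES[idx]

def round_to_e4m3(s):
    """Stand-in for E4M3 block-scale rounding: only clamps to the E4M3 max."""
    return min(s, 448.0)

def quantize_block_4over6(block):
    """block: 16 values already multiplied by the global encode scale
    S_enc = (256 * 6) / global_amax, per the 4over6 branch of the flowchart."""
    amax = float(np.abs(block).max())
    best = None
    for fp4_target in (6.0, 4.0):          # map-to-6 first, so a tie keeps map-to-6
        s_blk = round_to_e4m3(amax / fp4_target)  # map-to-4 scale is 1.5x map-to-6
        q = np.zeros_like(block) if s_blk == 0.0 else round_to_fp4(block / s_blk)
        # Constant (6 * 256)^2 normalization mirrors the "6x256 denom" above;
        # it rescales both MSEs identically, so the comparison is unchanged.
        mse = float(np.mean((q * s_blk - block) ** 2)) / (6.0 * 256.0) ** 2
        if best is None or mse < best[0]:  # strict '<': map-to-4 needs strictly lower MSE
            best = (mse, q, s_blk)
    return best[1], best[2]                # FP4 values and E4M3 block scale to store
```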
Reviews (5): Last reviewed commit: "Allow separate recipe 4over6 config"

Functionality has been verified by internal RL experiments.

Need to rebase.
The following tests passed:
`NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 python3 -m pytest --tb=auto tests/pytorch/test_sanity.py`
`NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 NVTE_TEST_NVINSPECT_ENABLED=1 NVTE_TEST_NVINSPECT_CONFIG_FILE=tests/pytorch/debug/test_configs/dummy_feature.yaml NVTE_TEST_NVINSPECT_FEATURE_DIRS=transformer_engine/debug/features PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 python3 -m pytest --tb=auto tests/pytorch/test_sanity.py`
```cpp
 * its values are populated during quantization.
 */
kNVTERowScaledNVFP4 = 8,
kNVTENVFP44Over6 = 9, /*!< Whether an NVFP4 tensor uses 4over6 scaling */
```
We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.
4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.
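That contract fits in a few lines; the following is a hedged sketch of the decode factors described above, with hypothetical names rather than the real API:

```python
def dequantize_elem(q_fp4, block_scale_e4m3, global_amax, nvfp4_4over6):
    """The same stored bits decode differently, so the consumer must know the mode."""
    fp8_bound = 256.0 if nvfp4_4over6 else 448.0
    global_dec = global_amax / (6.0 * fp8_bound)  # inverse of S_enc = (fp8_bound * 6) / amax
    return q_fp4 * block_scale_e4m3 * global_dec
```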
```diff
 using namespace detail;
-constexpr float fp8_max = TypeExtrema<fp8e4m3>::max;  // 448.0f;
+constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max;  // 448.0f;
 constexpr float fp4_max = TypeExtrema<fp4e2m1>::max;  // 6.0f;
```
How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way we can keep this logic confined to quantization, and downstream consumers are completely unaffected.
If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.
From the original paper:

> Finally, we make one modification to the computation of the tensor scale α (Equation 1) when quantizing to NVFP4 with 4/6. When M_FP4 × M_FP8 is used to compute the tensor scale, it ensures that all quantized values will be less than 6 × 448. However, this makes it impossible to select a scale of 4 for the blocks that contain a tensor's largest values, because the block's scale would need to be 448 × 6/4 = 672, which would overflow since 448 is the maximum value that can be represented by E4M3. As a result, when computing the tensor scale, we replace M_FP8 with 256 in Equation 1, since 256 is the largest E4M3 value that can be multiplied by 6/4 and represented without error in E4M3, as 384.
Also:

> In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8 E4M3 value rather than the default of 448, as this allows blocks with a tensor's largest value to have the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit over using the standard tensor scale calculation. Even though this adjustment only affects a small number of large values, this performance gain may come from the fact that larger activation values can have an outsize impact on model performance. This adjustment is incorporated into the remaining experiments in this section.
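The arithmetic in the quoted passages is easy to check; the snippet below is plain Python restating the paper's numbers, nothing TE-specific:

```python
E4M3_MAX = 448.0

# With the default tensor scale, a map-to-4 block scale needs a 1.5x expansion of 448:
assert 448.0 * (6.0 / 4.0) == 672.0  # 672 > 448, so it overflows E4M3
# With the 4over6 tensor scale based on 256 instead:
assert 256.0 * (6.0 / 4.0) == 384.0  # 384 = 1.5 * 2^8 is exactly representable in E4M3
assert 384.0 <= E4M3_MAX
```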
This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.
Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.
Description
Implement 4over6 NVFP4 from:
FlashInfer PR:
Enable per-block map-to-4 versus map-to-6 candidate selection for NVFP4 1D quantization in the `NVFP4BlockScaling` recipe. This mode currently requires RHT, stochastic rounding, and 2D quantization to be disabled. Both original per-tensor scaling and row-scaled NVFP4 introduced by #2931 are supported. This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.
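For reference, enabling the mode might look like the sketch below. The `NVTE_NVFP4_4OVER6` variable and the `NVFP4BlockScaling` recipe come from this PR, while the import path, autocast usage, and module/shape choices are assumptions based on the usual TE patterns:

```python
import os
# Scope is "weights", "activations", or "all"; leaving it unset keeps existing behavior.
# Set it before quantization runs.
os.environ["NVTE_NVFP4_4OVER6"] = "weights"

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import NVFP4BlockScaling  # assumed import path

recipe = NVFP4BlockScaling()                  # 4over6 scope is taken from the env var
layer = te.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```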
Type of change
Changes
Please list the changes introduced in this PR:
- Adds `NVTE_NVFP4_4OVER6=weights|activations|all`, with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and the C++ tensor/config APIs.
- Handles `NVTE_USE_FAST_MATH` and rejects unsupported combinations such as 2D quantization, stochastic rounding, grouped tensors, and RHT.

Checklist: