
[Common/PyTorch/JAX] make offset of ClampedSwiGLU configurable #2938

Open

hxbai wants to merge 11 commits into NVIDIA:main from hxbai:swiglu_offset

Conversation

@hxbai (Contributor) commented Apr 28, 2026

Description

The previous ClampedSwiGLU follows GPT-OSS and hard-codes the offset to 1.0.
DeepSeek-V4 uses ClampedSwiGLU without the alpha and offset terms.
This PR makes the offset of ClampedSwiGLU configurable so that DeepSeek-V4 can be supported.
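
For reference, a minimal NumPy sketch of the activation being parameterized (the function name and default values here are illustrative, not the TE API; the formula follows the GPT-OSS-style clamping described above):

import numpy as np

def clamped_swiglu_reference(x_gate, x_linear, limit=7.0, alpha=1.702, glu_linear_offset=1.0):
    # Gate branch is clamped from above only; linear branch is clamped on both sides.
    gate = np.minimum(x_gate, limit)
    linear = np.clip(x_linear, -limit, limit)
    # SiLU on the gate with slope alpha, multiplied by the shifted linear branch.
    silu = gate / (1.0 + np.exp(-alpha * gate))
    # glu_linear_offset was previously hard-coded to 1.0 (GPT-OSS); 0.0 gives the DeepSeek-V4 variant.
    return silu * (linear + glu_linear_offset)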

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

greptile-apps bot (Contributor) commented Apr 28, 2026

Greptile Summary

This PR makes the glu_linear_offset parameter of ClampedSwiGLU configurable (default 1.0, matching existing GPT-OSS behavior) to support DeepSeek-V4, which uses glu_linear_offset=0.0. The approach introduces nvte_clamped_swiglu_v2 / nvte_clamped_dswiglu_v2 C API functions to avoid breaking the public ABI while adding the new parameter across all CUDA kernels, PyTorch, and JAX code paths.

  • C layer: Adds glu_linear_offset to ClampedSwiGLUParam; all kernels (vectorized_pointwise.h, gated_fp8.cuh, gated_mxfp8.cuh) now read p.glu_linear_offset instead of the hardcoded 1.0f/1.
  • Python/JAX: ClampedSwiGLU, ScaledClampedQGeGLU, LayerNormMLP, and JAX extension params all accept and propagate the new argument with the correct backward-compatible default (see the usage sketch after this list).
  • Fusion guard: fuse_grouped_mlp_ops correctly skips the grouped-MLP fusion for non-default glu_linear_offset values (consistent with the pre-existing alpha guard).
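
For the Python/JAX bullet above, a usage sketch for the PyTorch ops path; the import location, constructor arguments, and numeric values are assumptions inferred from the changed-files list, not verified against the TE source:

from transformer_engine.pytorch.ops import ClampedSwiGLU  # assumed export location

# Unchanged defaults reproduce the GPT-OSS behavior (offset 1.0).
act_gpt_oss = ClampedSwiGLU(limit=7.0, alpha=1.702)
# Per the PR description, DeepSeek-V4 drops the alpha scaling and the offset.
act_deepseek = ClampedSwiGLU(limit=7.0, alpha=1.0, glu_linear_offset=0.0)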

Confidence Score: 5/5

Safe to merge; the change is purely additive with a correct backward-compatible default of 1.0.

All CUDA kernels, PyTorch bindings, and JAX FFI handlers correctly propagate the new parameter. The old public C symbols are preserved unchanged. The fusion guard is consistent with the pre-existing alpha guard.

No files require special attention; all changed files are internally consistent.

Important Files Changed

transformer_engine/common/util/math.h: Adds glu_linear_offset field (default 1.0f) to the ClampedSwiGLUParam struct; backward-compatible and correctly scoped.
transformer_engine/common/util/vectorized_pointwise.h: Both forward and backward kernels now use p.glu_linear_offset instead of the hardcoded 1.0f; mathematically correct since the offset is a constant in the gradient.
transformer_engine/common/cast/fp8/gated_fp8.cuh: Uses p.glu_linear_offset for the linear gate shift; the derivative of clamp is offset-independent, so the dgate_elt logic is unchanged and correct.
transformer_engine/common/cast/mxfp8/gated_mxfp8.cuh: Both MXFP8 kernel variants updated to use p.glu_linear_offset; consistent with the FP8 changes.
transformer_engine/common/include/transformer_engine/activation.h: Adds nvte_clamped_swiglu_v2 and nvte_clamped_dswiglu_v2 with the offset parameter; old symbols deprecated but preserved, avoiding ABI breakage.
transformer_engine/common/activation/swiglu.cu: Old functions hard-code glu_linear_offset=1.0f for backward compatibility; new _v2 functions pass the configurable value.
transformer_engine/pytorch/ops/_common.py: Extends the fusion guard to skip grouped-MLP fusion when glu_linear_offset differs from the default; consistent with the pre-existing alpha guard.
transformer_engine/pytorch/ops/basic/swiglu.py: Adds glu_linear_offset to ClampedSwiGLU and ScaledClampedQGeGLU; correctly threads it through the forward and backward tex calls.
transformer_engine/jax/cpp_extensions/activation.py: Updates the ClampedSwigluParams dataclass: adds the field, updates __hash__ and to_ffi_lowering_dict; the clamped_linear lambda uses the configurable offset correctly.
tests/pytorch/test_fusible_ops.py: Parametrizes glu_linear_offset over (1.0, 0.0) for both test functions, directly covering the DeepSeek-V4 use case (see the parametrization sketch after this list).
tests/jax/test_custom_call_compute.py: Uses glu_linear_offset=0.5; exercises a non-default path but does not test the specific 0.0 value that is the stated DeepSeek-V4 use case.
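
A rough sketch of the parametrization described for tests/pytorch/test_fusible_ops.py, checked against a plain NumPy reference rather than the real TE fixtures (test and constant choices here are illustrative):

import numpy as np
import pytest

@pytest.mark.parametrize("glu_linear_offset", (1.0, 0.0))  # 0.0 is the DeepSeek-V4 case
def test_clamped_swiglu_offset(glu_linear_offset):
    rng = np.random.default_rng(0)
    x_gate, x_linear = rng.normal(size=64), rng.normal(size=64)
    gate = np.minimum(x_gate, 7.0)
    linear = np.clip(x_linear, -7.0, 7.0)
    out = gate / (1.0 + np.exp(-1.702 * gate)) * (linear + glu_linear_offset)
    baseline = gate / (1.0 + np.exp(-1.702 * gate)) * (linear + 1.0)
    # Only the default offset reproduces the old hard-coded behavior.
    assert np.allclose(out, baseline) == (glu_linear_offset == 1.0)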

Reviews (11): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

Comment on lines 339 to 341
* \param[in] glu_linear_offset Offset added to the linear component after clamping (default 1.0).
* \param[in] stream CUDA stream used for the operation.
*/
Contributor:

P1 Breaking public C API change

nvte_clamped_swiglu and nvte_clamped_dswiglu are public symbols declared in a versioned public header. Inserting glu_linear_offset before cudaStream_t is an ABI-breaking change: any external binary or shared library compiled against the old header will silently pass the stream pointer as the offset and a garbage value as the stream, producing undefined behavior at runtime rather than a clean compile error when called through a pre-compiled library. This should be acknowledged as a breaking change in the PR checklist, and, if this library follows semantic versioning or a compatibility guarantee, a deprecation/transition path or version bump is needed.

@timmoon10 (Collaborator) left a comment

The fused op for grouped MLP is hard-coded for GPT-OSS, so we should make sure not to fuse if glu_linear_offset != 1:

elif isinstance(window[1], ScaledClampedQGeGLU) and (
    abs(window[1]._clamped.alpha - 1.702) > 0.001
    or not _nvidia_cudnn_frontend_supports_scaled_clamped_qgeglu()
):
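
The added condition, pulled out as a self-contained sketch (the attribute and threshold mirror the existing alpha guard; this is not the actual _common.py code):

def _should_skip_clamped_swiglu_fusion(alpha: float, glu_linear_offset: float) -> bool:
    # The grouped-MLP fused path assumes the GPT-OSS constants (alpha=1.702, offset=1.0),
    # so fall back to unfused ops for any other configuration.
    return abs(alpha - 1.702) > 0.001 or abs(glu_linear_offset - 1.0) > 0.001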

@timmoon10 (Collaborator)

/te-ci

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
@hxbai hxbai marked this pull request as draft April 29, 2026 00:28
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
@hxbai hxbai marked this pull request as ready for review April 29, 2026 01:01

 void nvte_clamped_swiglu(const NVTETensor input, NVTETensor output, float limit, float alpha,
-                         cudaStream_t stream) {
+                         float glu_linear_offset, cudaStream_t stream) {
Collaborator:

Can we define new APIs named nvte_clamped_swiglu_v2 and nvte_clamped_dswiglu_v2
and deprecate this API here to not break backward compatibility?

hxbai (author):

Rewrote this part.

vthumbe1503 and others added 3 commits May 6, 2026 11:38
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
@vthumbe1503 (Collaborator)

/te-ci

hxbai added 2 commits May 12, 2026 15:13
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
@vthumbe1503 (Collaborator)

/te-ci

@jberchtold-nvidia (Collaborator) left a comment

Overall looks pretty good from the JAX side, thanks for adding the JAX changes too! Left a couple small comments

 ::xla::ffi::StructMember<float>("limit"),
-::xla::ffi::StructMember<float>("alpha"));
+::xla::ffi::StructMember<float>("alpha"),
+::xla::ffi::StructMember<float>("glu_linear_offset"));
Collaborator:

Can we add a default value for users whose HLO was generated by a previous version? Would glu_linear_offset=1 be the same as the current behavior on main?

hxbai (author):

Yes, glu_linear_offset=1 is consistent with the current behavior.

Could you point me to how to add the default value on HLO? Thanks.

Collaborator:

@hxbai So I had thought this was easy to add a default value for, but I realized it's a different case where it's a function argument attribute, not a struct field, where we have supported default values in XLA FFIs in TE/JAX previously.

I reached out to the XLA team and heard using std::optional may be supported. Can you try this?

struct XXXX {
  ...
  std::optional<float> glu_linear_offset;
};

Then, when using the value: glu_linear_offset_value = glu_linear_offset.value_or(1.0f)

If it doesn't work, then let me know we can keep it without a default and I'll approve the PR from the JAX side. Thanks!

hxbai (author):

Added the fix

hxbai (author):

Tests failed due to these changes. It seems std::optional is not supported here, so I reverted. Is that OK?

Comment thread: transformer_engine/jax/cpp_extensions/activation.py
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
@jberchtold-nvidia (Collaborator)

/te-ci

hxbai and others added 2 commits May 16, 2026 06:38
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
