Feat/gemma4 adapters #1385

Open
huseyincavusbi wants to merge 10 commits into
TransformerLensOrg:dev from
huseyincavusbi:feat/gemma4-adapters

Conversation

@huseyincavusbi

Contributor

Description

This PR adds TransformerBridge support for the Gemma 4 model family (E2B, E4B, 26B-A4B, and 31B) through a single unified Gemma4ArchitectureAdapter.

Key Implementation Details

  • Unified Adapter (gemma4.py): Dynamically handles all 4 variants by evaluating initialization configuration flags:
    • MoE Blocks: Submodules conditionally spin up only when enable_moe_block=True (specifically for the 26B variant).
    • KV-Sharing: K/V projection submodules are left out of the mapping when num_kv_shared_layers > 0 (for E2B/E4B).
    • PLE Embeddings: Surfaced dynamically when hidden_size_per_layer_input > 0.
    • Weight Processing: Maps and converts Gemma 4's joint QKV layout, dual RoPE configurations, alternating sliding/full attention mechanisms, logit softcapping, and RMSNorm.
    • Includes 45 dedicated unit tests verifying config attributes, MoE behavior, and weight conversions.
  • Shared-Library Updates (3 files, fully opt-in, zero regressions on existing adapter tests):
    1. position_embeddings_attention.py: Applies V norm post-reshape (Gemma 4 is the first architecture featuring per-head value normalization). Handles KV-sharing delegation to Hugging Face's original attention implementation when K/V submodules are omitted. Caches computed KV states in shared_kv_states post-RoPE for structural layer reuse.
    2. bridge.py: Introduces a use_native_generate opt-in flag. This bypasses a current Hugging Face transformers dev-version issue where eager attention causes a KV-cache dimension mismatch during generation. Setting this flag (scoped strictly to this adapter) delegates processing to HF's native generate() utilizing SDPA.
    3. main_benchmark.py: Fixes pad_token_id assignment when eos_token_id is a list (Gemma4 uses [1, 106]), taking the first element.
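
To make the pad_token_id fix in item 3 concrete, here is a minimal sketch of the normalization it describes. The helper name `resolve_pad_token_id` is hypothetical, and keeping an already-set pad token is an assumption; neither is taken from main_benchmark.py.

```python
from typing import Optional, Sequence, Union

EosTokenId = Union[int, Sequence[int], None]


def resolve_pad_token_id(pad_token_id: Optional[int], eos_token_id: EosTokenId) -> Optional[int]:
    """Pick a single integer pad token id (hypothetical helper, not the PR's code)."""
    # Assumption: an explicitly configured pad token is left untouched.
    if pad_token_id is not None:
        return pad_token_id
    # Gemma4-style configs may list several EOS ids, e.g. [1, 106]; use the first.
    if isinstance(eos_token_id, (list, tuple)):
        return eos_token_id[0] if eos_token_id else None
    # Scalar (or missing) EOS id: pass it through unchanged.
    return eos_token_id
```

For the Gemma4 value quoted above, `resolve_pad_token_id(None, [1, 106])` yields the first listed id, while a scalar `eos_token_id` is returned unchanged.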

Verification & Performance

All models have been validated.

Fixes #1297

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Screenshots

Please attach before and after screenshots of the change if applicable.

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

Adds a text-only adapter covering both Gemma4ForConditionalGeneration
(E2B/E4B/31B/26B-A4B) and Gemma4UnifiedForConditionalGeneration (12B),
addressing TransformerLensOrg#1297.

Gemma 4 layers are heterogeneous: KV-shared layers drop k/v projections,
K==V layers drop v_proj, and per-layer-embedding / MoE submodules appear
only on some variants -- all mapped optional and delegated to HF. Unlike
Gemma 1-3, Gemma4RMSNorm has no (1+weight) offset.
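
The RMSNorm difference noted above matters for weight processing, since the same stored weight means different things under the two conventions. A minimal pure-Python sketch of both follows; the function names and the `eps` default are illustrative assumptions, not the adapter's or Hugging Face's actual code.

```python
import math
from typing import List, Sequence


def _rms_normalize(x: Sequence[float], eps: float) -> List[float]:
    """Divide x by its root-mean-square (with eps added for stability)."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms for v in x]


def rmsnorm_with_offset(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Gemma 1-3 convention: scale by (1 + weight), so weight == 0 is the identity scale."""
    return [n * (1.0 + w) for n, w in zip(_rms_normalize(x, eps), weight)]


def rmsnorm_plain(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Gemma 4 convention as described above: scale by weight directly, no offset."""
    return [n * w for n, w in zip(_rms_normalize(x, eps), weight)]
```

With an all-zero weight, the offset form returns the normalized input while the plain form returns zeros, so applying the Gemma 1-3 convention to Gemma 4 weights would scale every channel incorrectly.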

Adds DelegatedAttentionBlockBridge (drops the split-QKV fork aliases, as
MLABlockBridge does) so hook-alias resolution stays clean when attention
is delegated wholesale to HF.

google/gemma-4-E2B-it passes verification (P1 100%, P2 100%, P4 94.7%).

- New adapter + four-place registration + gemma4/gemma4_unified model_type mappings
- 10 checkpoints added to the model registry
- Unit + integration tests (logit parity vs HF on all three structural variants)
@huseyincavusbi huseyincavusbi marked this pull request as draft June 14, 2026 10:49
@jlarson4 jlarson4 changed the base branch from main to dev June 15, 2026 15:47

@jlarson4 jlarson4 left a comment

Collaborator

Hey @huseyincavusbi, glad to finally see this come through. I have left a couple of comments below. Take a look when you have a moment and let me know what you think.

Additionally, @punishell has recently opened #1377, which is a parallel implementation of Gemma4. I'd like to include bits of both your implementations where it makes sense and is relevant. They came up with a very straightforward solution for the KV-cache issue that might be of use to you, if you want to try rebasing your work onto theirs as an extension point. I am thinking there may be a way to use their DelegatedAttentionBlockBridge in combination with the work you spent on adding Gemma4 support to position_embeddings_attention to provide even better overall support.

There are more moving parts here than anticipated. If you have questions, please feel free to ask.

Comment thread tests/unit/model_bridge/test_gemma4.py Outdated
Comment thread transformer_lens/model_bridge/bridge.py Outdated
Comment thread transformer_lens/model_bridge/bridge.py Outdated
…Bridge expects positional 'vision_features' but Gemma4's Gemma4MultimodalEmbedder.forward() takes 'inputs_embeds' kwarg
…ma4's boi_token is a marker, image_token is the expandable placeholder
@huseyincavusbi huseyincavusbi marked this pull request as ready for review June 22, 2026 14:26
@huseyincavusbi

Contributor Author

Hi @jlarson4. Thanks for the review. I rebased onto #1377's DelegatedAttentionBlockBridge as suggested, dropped use_native_generate, and added multimodal vision support. All Gemma4 variants are verified: E2B/E4B on P1+P4+P7, 26B/31B on P1+P4+P7 with --no-hf-reference.

@jlarson4

Collaborator

Awesome, thank you @huseyincavusbi! I will review this today and let you know if I have any additional comments.

@jlarson4

jlarson4 commented Jun 22, 2026

Collaborator

@huseyincavusbi A couple things we probably want to look into before we merge this:

  1. The verification run is showing degenerate generation in P4 and P7: we are getting loops such as "expensive and expensive" and "image image image" on simple prompts. Can you run HF-native generation on the same 31B model to see if it produces coherent text? If HF is coherent and the Bridge isn't, that's a real Bridge bug in the delegated path (RoPE / logit-softcapping / attention / generation config). If both loop, it's a decoding-config issue. Either we discover a bug that needs fixing, or we add a note to the verification results that there is a decoding issue.
  2. There is an unrelated bug in verify_models.py that it would be very helpful for you to fix. At line 1119, update the guard to torch.backends.mps.is_available() instead of hasattr(torch.mps, "empty_cache"), mirroring the torch.cuda.is_available() branch above. As-is, it crashes every CUDA run after scoring, making it look like the run failed.
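
As a rough illustration of the guard change requested in item 2, the sketch below checks device availability rather than attribute presence. It is not the code from verify_models.py: the function name, the injected `torch_module` parameter, and the stub are all hypothetical, and the stub's RuntimeError is only an assumption about how the real call fails on a host without MPS. The two independent `if` branches are also an assumption about the surrounding structure.

```python
from types import SimpleNamespace
from typing import List


def clear_accelerator_caches(torch_module) -> List[str]:
    """Release cached accelerator memory; return the backends that were cleared.

    `torch_module` is expected to be the imported `torch` package. It is passed
    in here only so the guard can be exercised without GPU hardware.
    """
    cleared = []
    if torch_module.cuda.is_available():
        torch_module.cuda.empty_cache()
        cleared.append("cuda")
    # Check that an MPS device is usable, not merely that the function exists.
    if torch_module.backends.mps.is_available():
        torch_module.mps.empty_cache()
        cleared.append("mps")
    return cleared


def make_stub_torch(cuda: bool, mps: bool) -> SimpleNamespace:
    """Hypothetical stand-in for `torch` with configurable device availability."""

    def mps_empty_cache() -> None:
        # Assumed failure mode when no MPS device is present.
        if not mps:
            raise RuntimeError("MPS is not available on this host")

    return SimpleNamespace(
        cuda=SimpleNamespace(is_available=lambda: cuda, empty_cache=lambda: None),
        mps=SimpleNamespace(empty_cache=mps_empty_cache),
        backends=SimpleNamespace(mps=SimpleNamespace(is_available=lambda: mps)),
    )
```

With the real library this would be called as `clear_accelerator_caches(torch)` after `import torch`. On the CUDA-only stub, `hasattr(stub.mps, "empty_cache")` is still true, which is why an attribute check cannot stand in for an availability check.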

There are a couple other benchmark-related bugs that I will file separately after this PR is merged, but none that will be blocking.

Thank you for providing the logs for your verification runs; they were crucial in putting together this review. Being able to see the 31B model output was invaluable.

EDIT: I should also clarify that I will resolve the merge conflict myself before merging the PR, once your final changes are in. Don't worry about it for the time being.

Development

Successfully merging this pull request may close these issues.

[Proposal] Gemma4 Architecture Adapter

3 participants