Gemma 4 Quick Reference

Gemma 4 is Google DeepMind's Apache 2.0 open-weights multimodal LLM family, available in four sizes spanning mobile-edge to server-grade deployment.

Getting Started

Quick Start

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user",   "content": "Explain MoE briefly."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(
    text=text, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(
    outputs[0][input_len:], skip_special_tokens=False
)
result = processor.parse_response(response)
# result["thinking"]  → internal reasoning chain
# result["response"] → final answer shown to user

Introduction

Released 2026-03-31 · Apache 2.0 · Built from Gemini 3 research · 140+ languages

Install

$ pip install -U transformers torch accelerate

All models: google/<model-id>

Size | Model ID
--- | ---
E2B | gemma-4-E2B-it
E4B | gemma-4-E4B-it
31B | gemma-4-31b-it
26B A4B | gemma-4-26b-a4b-it

Sampling Params

Parameter | Value
--- | ---
temperature | 1.0
top_p | 0.95
top_k | 64
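The recommended defaults above can be collected into a single kwargs dict for reuse (a sketch; parameter names follow the standard Transformers generate API):

```python
# Recommended sampling defaults from the table above.
GEN_KWARGS = dict(
    do_sample=True,   # sampling must be on for temperature/top_p/top_k to apply
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)

# Usage: model.generate(**inputs, max_new_tokens=512, **GEN_KWARGS)
```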

Best Practices

  • Place images/audio before text in any prompt
  • Multi-turn: omit thought blocks from conversation history
  • Include 75%+ CoT data to preserve reasoning capability
  • Freeze vision layers when text-only fine-tuning

Model Family

Model Specs

Model | Arch | Total Params | Active/Token | Layers | Context | Modalities
--- | --- | --- | --- | --- | --- | ---
E2B | Dense+PLE | 5.1B (2.3B eff) | 2.3B | 35 | 128K | Text+Image+Audio
E4B | Dense+PLE | 8B (4.5B eff) | 4.5B | 42 | 128K | Text+Image+Audio
31B | Dense | 30.7B | 30.7B | 60 | 256K | Text+Image
26B A4B | MoE | 25.2B | 3.8B | 30 | 256K | Text+Image

Sliding window: 512 tokens (E2B/E4B) · 1024 tokens (31B/26B) · Vocab: 262K (all models)

Architecture Features

Hybrid Attention

  • Local layers: sliding window (512 or 1024 tokens)
  • Global layers: full context, interleaved with local
  • Final layer is always a global attention layer
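The local/global interleave ratio is not stated in this reference, so the layout below uses a hypothetical every-6th-layer-global pattern purely for illustration:

```python
def layer_kinds(n_layers: int, global_every: int = 6) -> list:
    """Illustrative local/global layer layout. The 6:1 interleave
    ratio is an assumption, not a published spec; only the rule that
    the final layer is always global comes from the notes above."""
    kinds = ["global" if i % global_every == 0 else "local"
             for i in range(1, n_layers + 1)]
    kinds[-1] = "global"   # final layer is always a global attention layer
    return kinds

pattern = layer_kinds(12)
# Local layers use the 512/1024-token sliding window;
# global layers attend over the full context.
```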

PLE — Edge Models (E2B/E4B)

  • Per-Layer Embeddings: each decoder layer has its own small embedding table
  • Large static tables use fast lookups — not dense matrix multiply
  • Effective compute parameters ≪ total loaded parameters
  • Enables inference under 1.5 GB VRAM (with 4-bit quantization)
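A back-of-envelope check of that VRAM figure: if only the effective parameters need to be resident on the accelerator (the PLE tables being lookup-based), the 4-bit weight footprint is roughly params × 0.5 bytes. KV cache and activation overhead are ignored here, a simplifying assumption:

```python
def resident_gb(effective_params: float, bytes_per_param: float) -> float:
    """Rough accelerator-resident weight size, ignoring KV cache
    and activation overhead (simplifying assumption)."""
    return effective_params * bytes_per_param / 1e9

# E2B: 2.3B effective params at 4-bit (~0.5 bytes/param)
e2b_4bit = resident_gb(2.3e9, 0.5)
# ~1.15 GB of weights, consistent with the "< 1.5 GB VRAM" figure above
```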

p-RoPE & Shared KV Cache

  • Proportional RoPE (p-RoPE) on global layers for long-range coherence
  • Shared KV Cache across global layers reduces peak memory usage
  • Supports stable 256K context without performance degradation

MoE — 26B A4B

  • 128 total experts + 1 always-active shared expert
  • 8 experts activated per token at inference time
  • Speed similar to a 4B dense model, quality near 30B
  • Vision encoder: ~550M params (same as 31B)
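The speed claim follows from simple arithmetic on the spec table: only a small fraction of the total weights participate in each token's forward pass.

```python
total_params  = 25.2e9   # 26B A4B total parameters
active_params = 3.8e9    # activated per token (shared expert + 8 of 128)

active_fraction = active_params / total_params
# Roughly 15% of weights run per token, which is why per-token
# compute cost tracks a ~4B dense model while quality tracks ~30B.
```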

Memory Requirements

Model | BF16 | 8-bit | 4-bit
--- | --- | --- | ---
E2B | 9.6 GB | 4.6 GB | 3.2 GB
E4B | 15 GB | 7.5 GB | 5 GB
31B | 58.3 GB | 30.4 GB | 17.4 GB
26B A4B | 48 GB | 25 GB | 15.6 GB

Base weights only — add VRAM for KV cache.

Pick a Model

Available VRAM | Recommended
--- | ---
< 5 GB | E2B (4-bit)
5–8 GB | E4B (4-bit)
15–20 GB | E4B (BF16)
24–32 GB | 31B (4-bit)
48–80 GB | 31B (BF16)
High throughput | 26B A4B
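The recommendation table can be expressed as a small helper. The thresholds come from the table; the function itself is illustrative, and VRAM amounts falling between table rows are resolved toward the smaller model (an assumption):

```python
def pick_model(vram_gb: float) -> str:
    """Map available VRAM to the recommendation table above."""
    if vram_gb < 5:
        return "E2B (4-bit)"
    if vram_gb < 8:
        return "E4B (4-bit)"
    if vram_gb < 20:
        return "E4B (BF16)"
    if vram_gb < 32:
        return "31B (4-bit)"
    return "31B (BF16)"
```

For throughput-bound serving rather than a VRAM budget, the table points to 26B A4B instead.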

Benchmarks

Core Benchmarks

Benchmark | 31B | 26B A4B | E4B | E2B | Gemma 3 27B
--- | --- | --- | --- | --- | ---
MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6%
MMMLU (multilingual) | 88.4% | 86.3% | 76.6% | 67.4% | 70.7%
AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8%
GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4%
LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1%
Codeforces ELO | 2150 | 1718 | 940 | 633 | 110
BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3%
Tau2 avg (agentic) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2%
HLE no tools | 19.5% | 8.7% | - | - | -
HLE with search | 26.5% | 17.2% | - | - | -

All results for instruction-tuned models with thinking mode enabled.

Vision Benchmarks

Benchmark | 31B | 26B A4B | E4B | E2B
--- | --- | --- | --- | ---
MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2%
MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4%
MedXPertQA MM | 61.3% | 58.1% | 28.7% | 23.5%
OmniDocBench ↓ | 0.131 | 0.149 | 0.181 | 0.290

OmniDocBench = document edit distance (lower is better).

Long Context

Benchmark | 31B | 26B A4B | E4B | E2B
--- | --- | --- | --- | ---
MRCR v2 128K | 66.4% | 44.1% | 25.4% | 19.1%

Arena AI (LMSYS ELO)

Model | ELO | Open Rank
--- | --- | ---
Gemma 4 31B | 1452 | #3
Gemma 4 26B A4B | 1441 | #6

Thinking Mode

Thinking Control

Add <|think|> to the start of the system prompt:

messages = [
    {
        "role": "system",
        "content": "<|think|>You are a math expert."
    },
    {"role": "user", "content": "Solve: 3x + 7 = 22"}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
inputs = processor(
    text=text, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=2048)
response = processor.decode(
    outputs[0][input_len:], skip_special_tokens=False
)
result = processor.parse_response(response)
# result["thinking"]  → step-by-step reasoning
# result["response"]  → final answer

Thinking output structure

<|channel>thought
[Internal step-by-step reasoning — hidden from user]

[Final answer visible to user]

Disabled thinking behavior

For 31B/26B A4B with enable_thinking=False, empty thought tags are still emitted:

<|channel>thought

[Final answer]

E2B/E4B variants skip empty tags entirely when disabled.
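A minimal sketch of what parse_response-style splitting might look like, assuming the delimiters listed in the Control Tokens table below; this is an illustration, not the library implementation:

```python
def split_thinking(raw: str) -> dict:
    """Split raw decoded output into thinking/response parts using the
    <|channel>thought ... <channel|> delimiters (assumed per the
    Control Tokens table; not the actual processor code)."""
    open_tag, close_tag = "<|channel>thought\n", "<channel|>"
    if open_tag in raw:
        head, _, rest = raw.partition(open_tag)
        thought, _, answer = rest.partition(close_tag)
        return {"thinking": thought.strip(),
                "response": (head + answer).strip()}
    # No thought block (or empty tags stripped): everything is the answer
    return {"thinking": "", "response": raw.strip()}
```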

Control Tokens

Token | Purpose
--- | ---
<|think|> | Enable thinking in system prompt
<|channel>thought\n | Open internal thought block
<channel|> | Close thought block
<|turn> | Open conversation turn
<turn|> | Close conversation turn

Multi-turn Rules

  • Do not include thought content in conversation history
  • Pass only result["response"] as the model turn
  • Longer max_new_tokens for complex reasoning tasks
  • Use enable_thinking=True for AIME, proofs, debugging
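The history rules above in code form (a sketch; the assistant-turn role name is an assumption, as only system/user roles appear in this document's examples):

```python
def append_model_turn(messages: list, result: dict) -> list:
    """Carry only result['response'] forward into history,
    never the thought content ("assistant" role name is assumed)."""
    messages.append({"role": "assistant", "content": result["response"]})
    return messages

history = [{"role": "user", "content": "Solve: 3x + 7 = 22"}]
result = {"thinking": "3x = 15, so x = 5", "response": "x = 5"}
append_model_turn(history, result)
# history now holds the visible answer only; the reasoning chain is dropped
```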

Multimodal

Image Inputs

Variable aspect ratio + configurable visual token budget:

Budget | Best For
--- | ---
70 | Fast classification, video frames
140 | Captions, thumbnails
280 | General image understanding
560 | Charts, diagrams
1120 | OCR, PDF parsing, fine detail

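The budget table as a lookup helper. The processor parameter that consumes this value is not shown in this reference, and the task labels are illustrative, not an official API:

```python
TOKEN_BUDGETS = {
    "classification": 70,    # fast classification, video frames
    "caption": 140,          # captions, thumbnails
    "general": 280,          # general image understanding
    "chart": 560,            # charts, diagrams
    "ocr": 1120,             # OCR, PDF parsing, fine detail
}

def visual_token_budget(task: str) -> int:
    """Map an illustrative task label to the budget table above,
    defaulting to general image understanding."""
    return TOKEN_BUDGETS.get(task, 280)
```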
# Image MUST come before text
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": "Describe this chart."},
]}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    text=text,
    images=image,
    return_tensors="pt"
).to(model.device)

Vision encoder: ~150M params (E2B/E4B) · ~550M params (31B/26B)

Audio Inputs

E2B and E4B only (~300M audio encoder)

  • Max audio: 30 seconds
  • Tasks: ASR and speech translation

ASR prompt

Transcribe the following speech in
{LANGUAGE} into {LANGUAGE} text.

Translation prompt

Transcribe in {SRC_LANG}, then
translate to {TARGET_LANG}.

Video Inputs

Processed as sequential image frames:

  • Max: 60 seconds at 1 fps = 60 frames
  • Use low token budget (70–140) per frame
  • E2B/E4B: also processes audio track simultaneously

# Pass frames as list of images
inputs = processor(
    text=text,
    images=[frame1, frame2, ..., frame60],
    return_tensors="pt"
)
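The 1 fps / 60-frame cap above can be turned into index math for picking which source frames to decode (decoding itself, e.g. via PIL or a video library, is out of scope here):

```python
def frame_indices(duration_s: float, video_fps: float,
                  sample_fps: float = 1.0, max_frames: int = 60) -> list:
    """Source-frame indices to sample at `sample_fps` (1 fps per the
    notes above), capped at `max_frames` (60 per the notes above)."""
    n = min(int(duration_s * sample_fps), max_frames)
    step = video_fps / sample_fps
    return [round(i * step) for i in range(n)]

idx = frame_indices(duration_s=90, video_fps=30)
# A 90 s clip is capped at 60 frames: one frame every 30 source frames
```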

Modality Order

Always place image/audio before text in content:

# ✅ Correct order
content = [
    {"type": "image", "image": img},
    {"type": "text",  "text": "Describe it."},
]

# ❌ Text-first breaks alignment
content = [
    {"type": "text",  "text": "Describe it."},
    {"type": "image", "image": img},
]

Deployment

Transformers

$ pip install -U transformers accelerate

BF16 (default)

from transformers import (
    AutoProcessor, AutoModelForCausalLM
)
import torch

mid = "google/gemma-4-31b-it"
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

4-bit quantized

from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    mid,
    quantization_config=bnb,
    device_map="auto"
)

Ollama

$ ollama pull gemma4           # 31B (default)
$ ollama pull gemma4:e4b       # edge 4B variant
$ ollama pull gemma4:e2b       # edge 2B variant
$ ollama run  gemma4           # interactive chat

Custom GGUF (Modelfile)

FROM /path/to/fine-tuned.gguf
SYSTEM "You are a coding assistant."

$ ollama create mygemma -f Modelfile
$ ollama run mygemma

vLLM Server

$ vllm serve google/gemma-4-31b-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

OpenAI-compatible API at http://localhost:8000/v1
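A request body for that endpoint, following the standard OpenAI chat-completions schema (sketch; POST it with any HTTP client, or point the official openai SDK at base_url="http://localhost:8000/v1"):

```python
import json

# Request body for http://localhost:8000/v1/chat/completions
payload = {
    "model": "google/gemma-4-31b-it",
    "messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Explain MoE briefly."},
    ],
    # Recommended sampling defaults from the Sampling Params table
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 512,
}
body = json.dumps(payload)
```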

Cloud & API

Platform | Notes
--- | ---
Gemini API | gemma-4-31b-it
AI Studio | Browser playground
Vertex AI | Custom endpoint
Cloud Run | Serverless GPU
GKE + vLLM | Auto-scale

Edge & Mobile

Runtime | Best For
--- | ---
AICore (Android) | System-level API
LiteRT-LM | IoT, Raspberry Pi
AI Edge Gallery | On-device eval
LM Studio | Desktop GUI
llama.cpp | CPU/GPU hybrid

Fine-Tuning

QLoRA Setup

Single 16 GB GPU (T4/free Colab or Kaggle):

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "google/gemma-4-E4B-it",
    load_in_4bit=True,
    max_seq_length=4096
)

model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj",
        "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)

Vision fine-tuning (E2B / E4B)

from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "google/gemma-4-E4B-it",
    finetune_vision_layers=False,   # freeze encoder
    finetune_language_layers=True,
    load_in_4bit=True,
)

MoE Fine-tuning

For 26B A4B — full fine-tune breaks expert routing:

  • Use LoRA in BF16 only; never full fine-tuning (FFT)
  • Start with r=16, lora_alpha=16
  • Increase context length gradually after loss converges
  • Avoid targeting expert router weight matrices
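The "avoid router weights" rule can be enforced with a simple filter over module names. The name patterns below are an assumption; inspect your model's named_modules() output to confirm what the router layers are actually called:

```python
# Attention/MLP projection targets from the QLoRA setup above
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj",
           "gate_proj", "up_proj", "down_proj"]

def lora_targets(module_names: list) -> list:
    """Keep attention/MLP projections; skip anything that looks like an
    expert router ("router" substring is a heuristic assumption)."""
    return [m for m in module_names
            if any(m.endswith(t) for t in TARGETS) and "router" not in m]

mods = ["layers.0.self_attn.q_proj",
        "layers.0.mlp.router.gate",
        "layers.0.mlp.experts.3.up_proj"]
```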

Data Requirements

Requirement | Value
--- | ---
CoT data ratio | ≥ 75% of training set
Thinking format | Include <|think|> trigger
Multimodal order | Images/audio before text
Chat template | ShareGPT or OpenAI format
RL reward | Verifiable final answer
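The CoT-ratio requirement is easy to check before training; a quick heuristic that counts examples carrying the <|think|> trigger (simple substring check over raw example text):

```python
def cot_ratio(examples: list) -> float:
    """Fraction of training examples whose text contains the <|think|>
    trigger (substring heuristic, not a full template parse)."""
    flagged = sum(1 for ex in examples if "<|think|>" in ex)
    return flagged / len(examples)

data = ["<|think|>step 1 ... answer",
        "<|think|>reason ... x = 5",
        "<|think|>derive ... done",
        "plain answer, no reasoning"]
# 3 of 4 examples carry CoT, exactly meeting the 75% floor
```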

Vision Fine-tune Tips

  1. Freeze vision first: finetune_vision_layers=False
  2. Fine-tune LLM + attention + MLP projectors only
  3. Validate text-domain quality before unfreezing
  4. Unfreeze vision only for domain-specific images
  5. Low token budget (70–280) saves VRAM during training

Gemmaverse

Specialist Models

Model | Domain | Description
--- | --- | ---
MedGemma 4B | Medical imaging | Multimodal X-ray/MRI analysis
MedGemma 27B | Clinical text | EHR + medical report reasoning
CodeGemma | Coding | Code completion & refactoring
PaliGemma 2 | Visual-language | Fine-grained VLM & visual reasoning
ShieldGemma | Safety | LLM output safety classifier
DataGemma | Factual data | Grounded via Google Data Commons
FunctionGemma | Tool calling | Low-resource function call parsing

Ecosystem

Libraries & Frameworks

  • ADK (Agent Development Kit)
  • JAX Gemma Library
  • Gemma Cookbook
  • google/adk-samples starter agents

Community Variants

  • 100K+ community fine-tuned derivatives
  • RecurrentGemma (Griffin architecture)
  • EmbeddingGemma, T5Gemma 2
  • VaultGemma (differential privacy)

See Also