Gemma 4 Installation Tutorial

Step-by-step instructions to install and run Gemma 4 on your machine — from setting up a Python environment to running your first inference. Covers both the Python SDK and Ollama.


Prerequisites

System Requirements

OS: Linux, macOS, or Windows (WSL2)
Python: 3.9 or higher (3.11 recommended)
GPU: NVIDIA with 6 GB+ VRAM (optional but recommended)
CUDA: 12.1+ (if using GPU)
RAM: 16 GB+ system RAM
Disk: 20–60 GB free space per model

Hugging Face Account

  • Create a free account at huggingface.co
  • Visit the model page (e.g. google/gemma-4-E4B-it)
  • Click "Access repository" and accept the license
  • Generate a read token at Settings → Access Tokens

Model access is free; Google only requires that you accept the license agreement.

Step 1 — Set Up Python Environment

Using Conda (Recommended)

# Create a dedicated Python environment
conda create -n gemma4 python=3.11 -y
conda activate gemma4

Using venv

python -m venv gemma4-env
# Linux/macOS:
source gemma4-env/bin/activate
# Windows:
gemma4-env\Scripts\activate

Step 2 — Install Dependencies

# Core dependencies
pip install -U transformers torch accelerate

# Optional: quantization support
pip install bitsandbytes

# Optional: faster inference
pip install flash-attn --no-build-isolation

For GPU users: PyTorch uses CUDA automatically when a CUDA-enabled build and driver are present. Verify with python -c "import torch; print(torch.cuda.is_available())".
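If VRAM is tight, the optional bitsandbytes package installed above lets you load the model in 4-bit. A minimal sketch of the quantization config using transformers' BitsAndBytesConfig; the exact VRAM savings depend on the model:

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization: cuts weight memory roughly 4x vs bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Pass this to from_pretrained in Step 6, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_ID, quantization_config=bnb_config, device_map="auto"
# )
```

Quantized loading requires a CUDA GPU; on CPU-only machines, skip this and use the bfloat16 load shown in Step 6 (or Ollama, which quantizes for you).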

Step 3 — Authenticate with Hugging Face

# Install Hugging Face CLI
pip install huggingface_hub

# Authenticate (get your token at huggingface.co/settings/tokens)
huggingface-cli login
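For scripts and CI, where the interactive prompt is inconvenient, you can authenticate programmatically instead. A sketch that reads the token from an HF_TOKEN environment variable (the variable name is a common convention, not a requirement) and calls huggingface_hub's login function:

```python
import os

# Read the token from the environment instead of the interactive prompt
token = os.environ.get("HF_TOKEN")

if token:
    from huggingface_hub import login
    login(token=token)  # stores the token for later downloads
    print("Authenticated with Hugging Face")
else:
    print("HF_TOKEN not set; run `huggingface-cli login` interactively instead")
```

Either route stores the token locally, so downloads in the following steps work without further prompts.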

Step 4 — Download the Model

Download via Python

# Download the E4B model (recommended for 8-16 GB VRAM)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-4-E4B-it",
    local_dir="./models/gemma4-e4b"
)
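If you prefer the command line, the same download can be done with the huggingface-cli tool installed in Step 3 (same repo id and target directory as above):

```shell
# CLI equivalent of the snapshot_download call above
huggingface-cli download google/gemma-4-E4B-it --local-dir ./models/gemma4-e4b
```

Interrupted downloads resume automatically on the next run with either method.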

Model Size Reference

Model: Download Size
E2B: ~4.6 GB
E4B: ~8.0 GB
31B: ~58 GB
26B A4B: ~48 GB
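As a rough rule of thumb, you can map available VRAM to a variant from the table above. The thresholds below are assumptions based on the download sizes, not official requirements, and ignore the 26B A4B variant for simplicity:

```python
def pick_variant(vram_gb: float) -> str:
    """Rough model pick by available VRAM (assumed thresholds, not official)."""
    if vram_gb >= 48:
        return "31B"   # ~58 GB download; needs quantization below ~64 GB VRAM
    if vram_gb >= 8:
        return "E4B"   # ~8.0 GB download; comfortable on 8-16 GB VRAM
    return "E2B"       # ~4.6 GB download; also usable on CPU

print(pick_variant(6))   # small GPUs -> E2B
print(pick_variant(12))  # mid-range GPUs -> E4B
```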

Step 5 — Verify Installation

import torch
from transformers import AutoProcessor, AutoModelForCausalLM

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM:", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1), "GB")

# Quick load test
processor = AutoProcessor.from_pretrained("google/gemma-4-E4B-it")
print("Processor loaded OK")

Step 6 — Run Your First Inference

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Explain what Gemma 4 is in 2 sentences."}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
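For longer answers it is nicer to see tokens as they are generated rather than waiting for the full output. A sketch using transformers' TextStreamer, reusing the processor, model, and inputs from the snippet above, and assuming processor.tokenizer exposes the underlying tokenizer:

```python
from transformers import TextStreamer

# Prints tokens to stdout as they are generated, skipping the echoed prompt
streamer = TextStreamer(
    processor.tokenizer, skip_prompt=True, skip_special_tokens=True
)
outputs = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```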

Alternative — Ollama (No Python Required)

Prefer a simpler setup? Ollama handles everything automatically:

# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com
# Then in terminal:
ollama pull gemma4:e4b
ollama run gemma4:e4b
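Once the Ollama server is running, it also exposes a local REST API on port 11434, which you can call from any language. A minimal Python sketch using only the standard library; the model tag matches the pull above and is an assumption about how the Gemma 4 tags are named:

```python
import json
import urllib.error
import urllib.request

# Request body for Ollama's /api/generate endpoint
payload = {
    "model": "gemma4:e4b",
    "prompt": "Explain what Gemma 4 is in 2 sentences.",
    "stream": False,  # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.loads(resp.read())["response"])
except urllib.error.URLError:
    print("Ollama server not reachable; start it with `ollama serve`")
```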
