# Gemma 4 Installation Tutorial
Step-by-step instructions to install and run Gemma 4 on your machine — from setting up a Python environment to running your first inference. Covers both the Python SDK and Ollama.
## Prerequisites

### System Requirements
| Component | Requirement |
|---|---|
| OS | Linux, macOS, or Windows (WSL2) |
| Python | 3.9 or higher (3.11 recommended) |
| GPU | NVIDIA with 6 GB+ VRAM (optional but recommended) |
| CUDA | 12.1+ (if using GPU) |
| RAM | 16 GB+ system RAM |
| Disk | 20–60 GB of free space per model |
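Before installing anything, a quick pre-flight check can confirm the basics. This is a stdlib-only sketch that mirrors the Python and disk rows of the table above (GPU and CUDA checks need PyTorch, covered in Step 2):

```python
import platform
import shutil
import sys

# Pre-flight sketch: checks Python version and free disk space
# against the requirements table above.
print("OS:", platform.system())
print("Python:", sys.version.split()[0])
assert sys.version_info >= (3, 9), "Gemma 4 needs Python 3.9+"

free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk: {free_gb:.0f} GB")  # plan for 20-60 GB per model
```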
### Hugging Face Account
1. Create a free account at huggingface.co
2. Visit the model page (e.g. google/gemma-4-E4B-it)
3. Click "Access repository" and accept the license
4. Generate a read token at Settings → Access Tokens
Model access is free — Google requires only that you accept the license agreement.
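After logging in with the CLI (Step 3), you can check that credentials are actually in place. This stdlib sketch assumes the client's default lookup locations, the `HF_TOKEN` environment variable and the token file saved by `huggingface-cli login`:

```python
import os
from pathlib import Path

# Two places the Hub client looks for credentials (default paths;
# assumption: no HF_HOME override is set).
env_token = os.environ.get("HF_TOKEN")
token_file = Path.home() / ".cache" / "huggingface" / "token"

print("HF_TOKEN env var:", "set" if env_token else "not set")
print("Saved CLI token:", "found" if token_file.exists() else "not found")
```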
## Step 1 — Set Up Python Environment

### Using Conda (Recommended)
```bash
# Create a dedicated Python environment
conda create -n gemma4 python=3.11 -y
conda activate gemma4
```

### Using venv
```bash
python -m venv gemma4-env
# Linux/macOS:
source gemma4-env/bin/activate
# Windows:
gemma4-env\Scripts\activate
```

## Step 2 — Install Dependencies
```bash
# Core dependencies
pip install -U transformers torch accelerate

# Optional: quantization support
pip install bitsandbytes

# Optional: faster inference
pip install flash-attn --no-build-isolation
```

For GPU users: PyTorch automatically uses CUDA if it is available. Verify with `python -c "import torch; print(torch.cuda.is_available())"`.
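To double-check what actually got installed, here is a small stdlib-only version report. The distribution names match the pip commands above; optional packages simply report "not installed" rather than crashing:

```python
from importlib import metadata

def pkg_version(name: str) -> str:
    """Installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

for pkg in ("transformers", "torch", "accelerate", "bitsandbytes", "flash-attn"):
    print(f"{pkg}: {pkg_version(pkg)}")
```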
## Step 3 — Authenticate with Hugging Face
```bash
# Install Hugging Face CLI
pip install huggingface_hub
# Authenticate (get your token at huggingface.co/settings/tokens)
huggingface-cli login
```

## Step 4 — Download the Model
### Download via Python
```python
# Download the E4B model (recommended for 8-16 GB VRAM)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-4-E4B-it",
    local_dir="./models/gemma4-e4b"
)
```

### Model Size Reference
| Model | Download Size |
|---|---|
| E2B | ~4.6 GB |
| E4B | ~8.0 GB |
| 31B | ~58 GB |
| 26B A4B | ~48 GB |
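The download sizes above roughly track weight precision (about 2 bytes per parameter at bf16, roughly 0.5 at 4-bit). A back-of-envelope sketch; the bytes-per-parameter figures are rules of thumb, not official specs:

```python
def approx_weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough weight-only footprint in GB; ignores activations and KV cache."""
    return round(params_billions * bytes_per_param, 1)

# bf16 ~= 2 bytes/param; 4-bit quantization ~= 0.5 bytes/param
print(approx_weights_gb(4, 2.0))   # a 4B-parameter model in bf16 -> 8.0
print(approx_weights_gb(31, 2.0))  # a 31B-parameter model in bf16 -> 62.0
```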
## Step 5 — Verify Installation
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM:", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1), "GB")

# Quick load test
processor = AutoProcessor.from_pretrained("google/gemma-4-E4B-it")
print("Processor loaded OK")
```

## Step 6 — Run Your First Inference
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Explain what Gemma 4 is in 2 sentences."}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)

# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Alternative — Ollama (No Python Required)
Prefer a simpler setup? Ollama handles everything automatically:
```bash
# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com

# Then in terminal:
ollama pull gemma4:e4b
ollama run gemma4:e4b
```
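Ollama also exposes a local HTTP API (default port 11434), so you can script it without the Python SDK. Below is a minimal sketch of the JSON body for its `/api/generate` endpoint; actually sending it requires a running Ollama server:

```python
import json

def generate_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body for Ollama's POST /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

body = generate_payload("gemma4:e4b", "Explain Gemma 4 in 2 sentences.")
# With the server running, POST `body` to http://localhost:11434/api/generate
# (e.g. via urllib.request) and read the "response" field of the JSON reply.
print(body[:40])
```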