Gemma 4 Local Deployment Guide
Run Google's Gemma 4 entirely on your own hardware — no API keys, no usage fees, full privacy. This guide covers four methods: Hugging Face Transformers, Ollama, vLLM server, and LM Studio.
Hardware Requirements
Minimum GPU VRAM
| Model | BF16 | 4-bit Quant |
|---|---|---|
| Gemma 4 E2B | 9.6 GB | 3.2 GB |
| Gemma 4 E4B | 15 GB | 5 GB |
| Gemma 4 31B | 58 GB | 17 GB |
| Gemma 4 26B A4B | 48 GB | 15 GB |
For most users with a single consumer GPU (RTX 3060–4090), the E4B 4-bit model is the sweet spot.
Which Method to Choose
| Method | Best For |
|---|---|
| Ollama | Quickest start, no Python needed |
| LM Studio | GUI, non-technical users |
| Transformers | Python apps, full API control |
| vLLM | Production server, OpenAI API |
Method 1 — Ollama (Easiest)
Ollama is the simplest way to run Gemma 4 locally. It handles model downloading, quantization, and serving automatically.
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 models
ollama pull gemma4       # 31B default
ollama pull gemma4:e4b   # Edge 4B (lighter)
ollama pull gemma4:e2b   # Edge 2B (lightest)

# Run interactive chat
ollama run gemma4
```

Tip: After running `ollama run gemma4`, you can send requests to the Ollama REST API at http://localhost:11434. It is also compatible with OpenAI client libraries via the /v1/ endpoint.
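To call that REST API from Python, a minimal standard-library sketch looks like this. It uses Ollama's native `/api/chat` endpoint; `stream=False` makes the server return one JSON object instead of a token stream.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's native chat endpoint


def build_payload(model: str, prompt: str) -> dict:
    """One-turn chat request; stream=False returns a single JSON object."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(model: str, prompt: str) -> str:
    """Send one prompt to a locally running Ollama server."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server from `ollama run gemma4` up, `chat("gemma4", "Hello, what can you do?")` returns the model's reply as a string.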
Method 2 — Hugging Face Transformers
Install
```shell
pip install -U transformers torch accelerate bitsandbytes
```

You will also need a Hugging Face account and model access granted at huggingface.co. Run `huggingface-cli login` first.
4-bit Quantization (saves VRAM)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E4B-it"

# Load weights in 4-bit to cut VRAM usage roughly 4x vs. BF16
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Use this when you have limited VRAM. The E4B model runs in ~5 GB with 4-bit quantization.
Full BF16 Inference Example
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

Method 3 — vLLM (Production Server)
vLLM is ideal when you need high throughput or want an OpenAI-compatible API that multiple clients can query simultaneously.
```shell
# Install vLLM
pip install vllm

# Start the server
vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
```
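Once the server is up, any OpenAI-style client can query it. As a standard-library sketch (the response shape is the standard OpenAI chat-completions format that vLLM emits):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-format chat completion."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, model: str = "google/gemma-4-31B-it") -> str:
    """One chat turn against the local vLLM server started above."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.loads(resp.read()))
```

The official `openai` Python package works the same way: point its `base_url` at http://localhost:8000/v1 with any placeholder API key.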
```shell
# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Method 4 — LM Studio (GUI)
For users who prefer a graphical interface without any terminal commands:
1. Download LM Studio from lmstudio.ai
2. Search "gemma4" in the model browser
3. Download your preferred variant (E4B recommended for 8GB VRAM)
4. Click "Load Model" then use the chat interface

LM Studio also exposes a local OpenAI-compatible server at http://localhost:1234/v1.
Common Issues & Fixes
CUDA Out of Memory
- Switch to a smaller model (E4B instead of 31B)
- Use 4-bit quantization (`load_in_4bit=True`)
- Reduce `max_new_tokens`
- Close other GPU-using applications
Slow Generation Speed
- Enable Flash Attention 2: `attn_implementation="flash_attention_2"`
- Use `torch.compile(model)` on PyTorch 2+
- Switch to vLLM for continuous batching
- Try GGUF quantized models in Ollama/llama.cpp
Access Denied / 403 Error
- Accept the model license on Hugging Face
- Run `huggingface-cli login` with your token
- Check that your HF account has model access
CPU-only Machine
- Use Ollama with GGUF quantized models (Q4_K_M)
- E2B model is viable on modern CPUs (slow but functional)
- Consider Google AI Studio for free cloud inference