Gemma 4 Local Deployment Guide
Run Google's Gemma 4 entirely on your own hardware — no API keys, no usage fees, full privacy. This guide covers four methods: Hugging Face Transformers, Ollama, vLLM server, and LM Studio.
Hardware Requirements
Minimum GPU VRAM
| Model | BF16 | 4-bit Quant |
|---|---|---|
| Gemma 4 E2B | 9.6 GB | 3.2 GB |
| Gemma 4 E4B | 15 GB | 5 GB |
| Gemma 4 31B | 58 GB | 17 GB |
| Gemma 4 26B A4B | 48 GB | 15 GB |
For most users with a single consumer GPU (RTX 3060–4090), the E4B 4-bit model is the sweet spot.
Which Method to Choose
| Method | Best For |
|---|---|
| Ollama | Quickest start, no Python needed |
| LM Studio | GUI, non-technical users |
| Transformers | Python apps, full API control |
| vLLM | Production server, OpenAI API |
Method 1 — Ollama (Easiest)
Ollama is the simplest way to run Gemma 4 locally. It handles model downloading, quantization, and serving automatically.
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 models
ollama pull gemma4       # 31B default
ollama pull gemma4:e4b   # Edge 4B (lighter)
ollama pull gemma4:e2b   # Edge 2B (lightest)

# Run interactive chat
ollama run gemma4
```

Tip: After running `ollama run gemma4`, you can send requests to the Ollama REST API at http://localhost:11434. It is also compatible with OpenAI client libraries via the /v1/ endpoint.
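To call that REST API from Python, a minimal standard-library sketch looks like this. It uses Ollama's native `/api/chat` endpoint; `stream=False` makes the server return one JSON object instead of a token stream.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's native chat endpoint


def build_payload(model: str, prompt: str) -> dict:
    """One-turn chat request; stream=False returns a single JSON object."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(model: str, prompt: str) -> str:
    """Send one prompt to a locally running Ollama server."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server from `ollama run gemma4` up, `chat("gemma4", "Hello, what can you do?")` returns the model's reply as a string.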
Method 2 — Hugging Face Transformers
Install
```shell
pip install -U transformers torch accelerate bitsandbytes
```

You will also need a Hugging Face account and model access granted at huggingface.co. Run `huggingface-cli login` first.
4-bit Quantization (saves VRAM)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E4B-it"

# Load weights in 4-bit to cut VRAM usage roughly 4x vs. BF16
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Use this when you have limited VRAM. The E4B model runs in ~5 GB with 4-bit quantization.
Full BF16 Inference Example
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

Method 3 — vLLM (Production Server)
vLLM is ideal when you need high throughput or want an OpenAI-compatible API that multiple clients can query simultaneously.
```shell
# Install vLLM
pip install vllm

# Start the server
vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
```
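Once the server is up, any OpenAI-style client can query it. As a standard-library sketch (the response shape is the standard OpenAI chat-completions format that vLLM emits):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-format chat completion."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, model: str = "google/gemma-4-31B-it") -> str:
    """One chat turn against the local vLLM server started above."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.loads(resp.read()))
```

The official `openai` Python package works the same way: point its `base_url` at http://localhost:8000/v1 with any placeholder API key.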
```shell
# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Method 4 — LM Studio (GUI)
For users who prefer a graphical interface without any terminal commands:
1. Download LM Studio from lmstudio.ai
2. Search "gemma4" in the model browser
3. Download your preferred variant (E4B recommended for 8GB VRAM)
4. Click "Load Model" then use the chat interface

LM Studio also exposes a local OpenAI-compatible server at http://localhost:1234/v1.
Common Issues & Fixes
CUDA Out of Memory
- Switch to a smaller model (E4B instead of 31B)
- Use 4-bit quantization (`load_in_4bit=True`)
- Reduce `max_new_tokens`
- Close other GPU-using applications
Slow Generation Speed
- Enable Flash Attention 2: `attn_implementation="flash_attention_2"`
- Use `torch.compile(model)` on PyTorch 2+
- Switch to vLLM for continuous batching
- Try GGUF quantized models in Ollama/llama.cpp
Access Denied / 403 Error
- Accept the model license on Hugging Face
- Run `huggingface-cli login` with your token
- Check that your HF account has model access
CPU-only Machine
- Use Ollama with GGUF quantized models (Q4_K_M)
- E2B model is viable on modern CPUs (slow but functional)
- Consider Google AI Studio for free cloud inference