

Gemma 4 Local Deployment Guide

Run Google's Gemma 4 entirely on your own hardware — no API keys, no usage fees, full privacy. This guide covers four methods: Hugging Face Transformers, Ollama, vLLM server, and LM Studio.


Hardware Requirements

Minimum GPU VRAM

Model              BF16      4-bit Quant
Gemma 4 E2B        9.6 GB    3.2 GB
Gemma 4 E4B        15 GB     5 GB
Gemma 4 31B        58 GB     17 GB
Gemma 4 26B A4B    48 GB     15 GB

For most users with a single consumer GPU (RTX 3060–4090), the E4B 4-bit model is the sweet spot.
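As a rough rule of thumb, weight memory is parameter count times bytes per parameter, plus some headroom for the KV cache and activations. A quick back-of-the-envelope helper (the 20% overhead figure is an assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bits_per_param: int,
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate: weights plus a fixed overhead fraction
    for KV cache and activations (the 20% default is a guess)."""
    weights_gb = params_billion * bits_per_param / 8  # 1B params at 8 bits ~ 1 GB
    return round(weights_gb * (1 + overhead), 1)

# An 8B-parameter model in BF16 (16 bits) vs. 4-bit quantized:
print(estimate_vram_gb(8, 16))  # 19.2
print(estimate_vram_gb(8, 4))   # 4.8
```

This is only a sizing sanity check; the table above reflects the actual footprints, which also depend on context length and runtime.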

Which Method to Choose

Method         Best For
Ollama         Quickest start, no Python needed
LM Studio      GUI, non-technical users
Transformers   Python apps, full API control
vLLM           Production server, OpenAI API

Method 1 — Ollama (Easiest)

Ollama is the simplest way to run Gemma 4 locally. It handles model downloading, quantization, and serving automatically.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 models
ollama pull gemma4          # 31B default
ollama pull gemma4:e4b      # Edge 4B (lighter)
ollama pull gemma4:e2b      # Edge 2B (lightest)

# Run interactive chat
ollama run gemma4

Tip: Once the Ollama service is running, you can also send requests to its REST API at http://localhost:11434. It is compatible with OpenAI client libraries as well, via the /v1/ endpoint.
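The same chat can be driven from Python over Ollama's native /api/chat endpoint. A minimal sketch using only the standard library (it builds the request up front; urlopen only succeeds while Ollama is running):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a request for Ollama's /api/chat endpoint, with streaming
    disabled so the reply arrives as a single JSON object."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    return urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("gemma4", "Hello, what can you do?")
# With Ollama running:
#   reply = json.load(urllib.request.urlopen(req))
#   print(reply["message"]["content"])
```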

Method 2 — Hugging Face Transformers

Install

pip install -U transformers torch accelerate bitsandbytes

You will also need a Hugging Face account and model access granted at huggingface.co. Run huggingface-cli login first.

4-bit Quantization (saves VRAM)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E4B-it"

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

Use this when you have limited VRAM. The E4B model runs in ~5 GB with 4-bit quantization.

Full BF16 Inference Example

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))

Method 3 — vLLM (Production Server)

vLLM is ideal when you need high throughput or want an OpenAI-compatible API that multiple clients can query simultaneously.

# Install vLLM
pip install vllm

# Start the server
vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
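The JSON that comes back follows the standard OpenAI chat-completions shape, so extracting the reply is the same whether the backend is vLLM, Ollama's /v1/ endpoint, or LM Studio. A small helper (the sample response below is illustrative, not real model output):

```python
import json

def extract_reply(response_json: str) -> str:
    """Pull the assistant message out of an OpenAI-style
    chat completions response."""
    data = json.loads(response_json)
    return data["choices"][0]["message"]["content"]

# Illustrative response in the standard OpenAI shape:
sample = json.dumps({
    "model": "google/gemma-4-31B-it",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Hello! How can I help?"},
        "finish_reason": "stop",
    }],
})
print(extract_reply(sample))  # Hello! How can I help?
```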

Method 4 — LM Studio (GUI)

For users who prefer a graphical interface without any terminal commands:

1. Download LM Studio from lmstudio.ai
2. Search "gemma4" in the model browser
3. Download your preferred variant (E4B recommended for 8 GB VRAM)
4. Click "Load Model" then use the chat interface

LM Studio also exposes a local OpenAI-compatible server at http://localhost:1234/v1.

Common Issues & Fixes

CUDA Out of Memory

  • Switch to a smaller model (E4B instead of 31B)
  • Use 4-bit quantization (load_in_4bit=True)
  • Reduce max_new_tokens
  • Close other GPU-using applications

Slow Generation Speed

  • Enable Flash Attention 2: attn_implementation="flash_attention_2"
  • Use torch.compile(model) on PyTorch 2+
  • Switch to vLLM for continuous batching
  • Try GGUF quantized models in Ollama/llama.cpp

Access Denied / 403 Error

  • Accept the model license on Hugging Face
  • Run huggingface-cli login with your token
  • Check your HF account has model access

CPU-only Machine

  • Use Ollama with GGUF quantized models (Q4_K_M)
  • E2B model is viable on modern CPUs (slow but functional)
  • Consider Google AI Studio for free cloud inference

Next Steps