
Gemma 4 Local Deployment Guide

Run Google's Gemma 4 entirely on your own hardware: no API keys, no usage fees, and complete privacy. This guide covers four approaches: Hugging Face Transformers, Ollama, a vLLM server, and LM Studio.


Hardware Requirements

Minimum GPU VRAM

Model            BF16      4-bit Quant
Gemma 4 E2B      9.6 GB    3.2 GB
Gemma 4 E4B      15 GB     5 GB
Gemma 4 31B      58 GB     17 GB
Gemma 4 26B A4B  48 GB     15 GB

For most users with a single consumer GPU (RTX 3060–4090), the E4B model with 4-bit quantization is the sweet spot.
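As a rough rule of thumb, weight memory is parameter count times bytes per parameter. The back-of-envelope helper below is my own sketch, not from the guide; it counts weights only, which is why the table's figures (which include activations, KV cache, and framework overhead) run higher.

```python
# Back-of-envelope VRAM estimate for model weights alone.
# Real usage is higher: activations, the KV cache, and framework
# overhead add several GB on top of this figure.

def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Gigabytes needed just to hold the weights."""
    return params_billions * bits_per_param / 8

print(weight_gb(4, 16))  # 4B params in BF16  -> 8.0 GB
print(weight_gb(4, 4))   # 4B params at 4-bit -> 2.0 GB
```

This is why quantizing from BF16 (16 bits) down to 4 bits cuts weight memory to roughly a quarter.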

Which Method to Choose

Method        Best For
Ollama        Quickest start, no Python needed
LM Studio     GUI, non-technical users
Transformers  Python apps, full API control
vLLM          Production server, OpenAI API

Method 1 — Ollama (Easiest)

Ollama is the simplest way to run Gemma 4 locally; it handles model downloads, quantization, and server configuration automatically.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 models
ollama pull gemma4          # 31B default
ollama pull gemma4:e4b      # Edge 4B (lighter)
ollama pull gemma4:e2b      # Edge 2B (lightest)

# Run interactive chat
ollama run gemma4

Tip: once `ollama run gemma4` is active, you can send requests to the Ollama REST API at http://localhost:11434. It is also compatible with OpenAI client libraries via the /v1/ endpoint.
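That REST endpoint can be called with nothing but the Python standard library. A minimal sketch, assuming a local Ollama server is running (the /api/chat route and response shape follow Ollama's documented API):

```python
# Minimal Ollama REST client using only the standard library.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_payload(prompt: str, model: str = "gemma4") -> dict:
    """Assemble the JSON body Ollama's /api/chat endpoint expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON reply
    }

def ask_gemma(prompt: str) -> str:
    """POST a chat request and return the assistant's text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# ask_gemma("Hello, what can you do?")  # requires the server running
```

Setting "stream": False keeps the example simple; with streaming enabled, Ollama returns one JSON object per generated chunk instead.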

Method 2 — Hugging Face Transformers

Installation

pip install -U transformers torch accelerate bitsandbytes

You also need a Hugging Face account with access granted to the model on huggingface.co. Run huggingface-cli login first.

4-bit Quantization (Saves VRAM)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E4B-it"

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

Use this when VRAM is limited. With 4-bit quantization, the E4B model needs roughly 5 GB of VRAM.

Full BF16 Inference Example

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
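Output length and style can be tuned through the sampling arguments of `generate`. The values below are illustrative starting points of my own, not recommendations from the model card:

```python
# Illustrative sampling settings for model.generate; tune per task.
gen_kwargs = {
    "max_new_tokens": 512,      # cap on generated tokens
    "do_sample": True,          # sample instead of greedy decoding
    "temperature": 0.7,         # lower = more deterministic
    "top_p": 0.95,              # nucleus sampling cutoff
    "repetition_penalty": 1.1,  # discourage repeated phrases
}
# outputs = model.generate(**inputs, **gen_kwargs)
```

For deterministic output (useful when debugging), set "do_sample": False and drop the temperature and top_p entries.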

Method 3 — vLLM (Production Server)

vLLM is the right choice when you need high throughput, or an OpenAI-compatible API that multiple clients can query simultaneously.

# Install vLLM
pip install vllm

# Start the server
vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
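Because the server above enables automatic tool choice, clients can also pass an OpenAI-style `tools` array. A sketch of such a request body (the `get_weather` function and its schema are hypothetical, purely for illustration):

```python
# OpenAI-style chat request body with one illustrative tool.
# The get_weather function is hypothetical; the server forwards its
# schema to the model, which may respond with a tool call.

def build_request(prompt: str) -> dict:
    """Build a /v1/chat/completions body with a sample tool schema."""
    return {
        "model": "google/gemma-4-31B-it",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                    },
                    "required": ["city"],
                },
            },
        }],
    }

# POST this as JSON to http://localhost:8000/v1/chat/completions
```

If the model decides to call the tool, the response contains a tool_calls entry instead of plain text; your client executes the function and sends the result back in a follow-up message.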

Method 4 — LM Studio (GUI)

For users who prefer a graphical interface over the command line:

1. Download LM Studio from lmstudio.ai
2. Search "gemma4" in the model browser
3. Download your preferred variant (E4B recommended for 8GB VRAM)
4. Click "Load Model" then use the chat interface

LM Studio also exposes a local OpenAI-compatible server at http://localhost:1234/v1.

Common Issues and Solutions

CUDA Out of Memory

  • Switch to a smaller model (E4B instead of 31B)
  • Use 4-bit quantization (load_in_4bit=True)
  • Reduce max_new_tokens
  • Close other GPU-using applications

Slow Generation

  • Enable Flash Attention 2: attn_implementation="flash_attention_2"
  • Use torch.compile(model) on PyTorch 2+
  • Switch to vLLM for continuous batching
  • Try GGUF quantized models in Ollama/llama.cpp

Access Denied / 403 Errors

  • Accept the model license on Hugging Face
  • Run huggingface-cli login with your token
  • Check your HF account has model access

CPU-Only Machines

  • Use Ollama with GGUF quantized models (Q4_K_M)
  • E2B model is viable on modern CPUs (slow but functional)
  • Consider Google AI Studio for free cloud inference

Next Steps