Gemma 4 Local Deployment Guide
Run Google Gemma 4 entirely on your own hardware: no API keys, no usage fees, and full privacy. This guide covers four approaches: Hugging Face Transformers, Ollama, a vLLM server, and LM Studio.
Hardware Requirements
Minimum GPU VRAM
| Model | BF16 | 4-bit Quant |
|---|---|---|
| Gemma 4 E2B | 9.6 GB | 3.2 GB |
| Gemma 4 E4B | 15 GB | 5 GB |
| Gemma 4 31B | 58 GB | 17 GB |
| Gemma 4 26B A4B | 48 GB | 15 GB |
For most users with a single consumer GPU (RTX 3060–4090), the 4-bit quantized E4B model is the best balance of quality and VRAM.
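The table above roughly follows the usual rule of thumb: BF16 weights take 2 bytes per parameter and 4-bit weights about 0.5 bytes. A minimal sketch of that arithmetic (weights only; activations and KV cache add more, and the E-series rows won't match exactly because "E2B"/"E4B" name effective rather than total parameters):

```python
def estimate_weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """VRAM needed just for the model weights, in GB.

    Rule of thumb only: bits_per_weight / 8 gives bytes per parameter.
    Activations and KV cache are not included.
    """
    return params_billions * bits_per_weight / 8

# Dense 31B model:
print(estimate_weight_vram_gb(31, 16))  # -> 62.0 (table says 58 GB: same ballpark)
print(estimate_weight_vram_gb(31, 4))   # -> 15.5 (table says 17 GB: same ballpark)
```

The small gaps versus the table come from quantization metadata, runtime overhead, and layers that are kept in higher precision.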
Which Method to Choose
| Method | Best For |
|---|---|
| Ollama | Quickest start, no Python needed |
| LM Studio | GUI, non-technical users |
| Transformers | Python apps, full API control |
| vLLM | Production server, OpenAI API |
Method 1: Ollama (Easiest)
Ollama is the simplest way to run Gemma 4 locally; it handles model download, quantization, and serving automatically.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Gemma 4 models
ollama pull gemma4 # 31B default
ollama pull gemma4:e4b # Edge 4B (lighter)
ollama pull gemma4:e2b # Edge 2B (lightest)
# Run interactive chat
ollama run gemma4

Tip: once the Ollama server is running, you can send requests to its REST API at http://localhost:11434. It is also compatible with OpenAI client libraries via the /v1/ endpoint.
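The REST API mentioned above can be called from any HTTP client. A minimal stdlib-only sketch against Ollama's native /api/chat endpoint (assumes the Ollama server is running locally with the gemma4 model pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete response instead of streamed chunks
    }

def chat(model: str, prompt: str) -> str:
    """POST a single chat turn and return the model's reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# To try it (requires a running local Ollama server):
# print(chat("gemma4", "Hello, what can you do?"))
```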
Method 2: Hugging Face Transformers
Installation
pip install -U transformers torch accelerate bitsandbytes

You will also need a Hugging Face account with access to the model granted on huggingface.co. Run huggingface-cli login first.
4-bit Quantization (Saves VRAM)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E4B-it"

# Load the weights in 4-bit via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

Use this when VRAM is limited: the E4B model needs about 5 GB of VRAM in 4-bit.
Full BF16 Inference Example
from transformers import AutoProcessor, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E4B-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))

Method 3: vLLM (Production Server)
vLLM is the best fit when you need high throughput or an OpenAI-compatible API that multiple clients can query concurrently.
# Install vLLM
pip install vllm
# Start the server
vllm serve google/gemma-4-31B-it \
--max-model-len 8192 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4
# Test with curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-31B-it",
"messages": [{"role": "user", "content": "Hello!"}]
  }'

Method 4: LM Studio (GUI)
For users who prefer a graphical interface to the command line:
1. Download LM Studio from lmstudio.ai
2. Search "gemma4" in the model browser
3. Download your preferred variant (E4B recommended for 8GB VRAM)
4. Click "Load Model", then use the chat interface.

LM Studio also exposes a local OpenAI-compatible server at http://localhost:1234/v1.
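All three servers in this guide speak the same OpenAI-compatible chat protocol (LM Studio at localhost:1234/v1, vLLM at localhost:8000/v1, Ollama at localhost:11434/v1), so one client works against any of them. A stdlib-only sketch; the model names below are the ones used earlier in this guide, but pass whatever your local server actually reports:

```python
import json
import urllib.request

def build_body(model: str, prompt: str) -> dict:
    """OpenAI-style chat request body; works for vLLM, Ollama, and LM Studio."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat_completion(base_url: str, model: str, prompt: str) -> str:
    """POST to an OpenAI-compatible /chat/completions endpoint and return the reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_body(model, prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # local servers ignore the key
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# LM Studio: chat_completion("http://localhost:1234/v1", "gemma4", "Hi")
# vLLM:      chat_completion("http://localhost:8000/v1", "google/gemma-4-31B-it", "Hi")
# Ollama:    chat_completion("http://localhost:11434/v1", "gemma4", "Hi")
```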
Common Problems and Fixes
CUDA Out of Memory
- Switch to a smaller model (E4B instead of 31B)
- Use 4-bit quantization (load_in_4bit=True)
- Reduce max_new_tokens
- Close other applications using the GPU
Slow Generation
- Enable Flash Attention 2: attn_implementation="flash_attention_2"
- Use torch.compile(model) on PyTorch 2+
- Switch to vLLM for continuous batching
- Try GGUF quantized models in Ollama/llama.cpp
Access Denied / 403 Errors
- Accept the model license on Hugging Face
- Run huggingface-cli login with your token
- Check that your HF account has been granted model access
CPU-Only Machines
- Use Ollama with GGUF quantized models (Q4_K_M)
- E2B model is viable on modern CPUs (slow but functional)
- Consider Google AI Studio for free cloud inference