Tutorial

Gemma 4 Local Deployment Guide

Run Google Gemma 4 entirely on your own hardware: no API key, no usage fees, full privacy. This guide covers four methods: Ollama, Hugging Face Transformers, a vLLM server, and LM Studio.


Hardware Requirements

Minimum GPU VRAM

Model	BF16	4-bit Quant
Gemma 4 E2B	9.6 GB	3.2 GB
Gemma 4 E4B	15 GB	5 GB
Gemma 4 31B	58 GB	17 GB
Gemma 4 26B A4B	48 GB	15 GB

For most users with a single consumer GPU (RTX 3060 to 4090), the 4-bit E4B model offers the best balance between capability and VRAM use.
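To make the table above easier to apply, here is a minimal sketch that lists which models fit a given amount of VRAM. The figures are copied from the table; the function name `options_for` and the 1 GB `headroom_gb` margin (left for the KV cache and other overhead) are assumptions for illustration, not part of any official tool.

```python
# VRAM figures in GB, copied from the table above: (model, BF16, 4-bit).
VRAM_TABLE = [
    ("Gemma 4 E2B", 9.6, 3.2),
    ("Gemma 4 E4B", 15.0, 5.0),
    ("Gemma 4 26B A4B", 48.0, 15.0),
    ("Gemma 4 31B", 58.0, 17.0),
]


def options_for(vram_gb, headroom_gb=1.0):
    """Return (model, precision) pairs whose listed VRAM fits the budget.

    headroom_gb is a hypothetical safety margin, not a measured value.
    BF16 is preferred when it fits; otherwise 4-bit is tried.
    """
    budget = vram_gb - headroom_gb
    fits = []
    for name, bf16_gb, q4_gb in VRAM_TABLE:
        if bf16_gb <= budget:
            fits.append((name, "BF16"))
        elif q4_gb <= budget:
            fits.append((name, "4-bit"))
    return fits


for vram in (8, 12, 24):
    print(f"{vram} GB:", options_for(vram))
```

On an 8 GB card this lists only the 4-bit E2B and E4B variants, which is consistent with the recommendation above.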

Which Method to Choose

Method	Best For
Ollama	Quickest start, no Python needed
LM Studio	GUI, non-technical users
Transformers	Python apps, full API control
vLLM	Production server, OpenAI API

Method 1: Ollama (Easiest)

Ollama is the easiest way to run Gemma 4 locally. It handles model download, quantization, and serving automatically.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 models
ollama pull gemma4          # 31B default
ollama pull gemma4:e4b      # Edge 4B (lighter)
ollama pull gemma4:e2b      # Edge 2B (lightest)

# Run interactive chat
ollama run gemma4

Tip: After running ollama run gemma4, you can send requests to the Ollama REST API at http://localhost:11434. It is also compatible with OpenAI client libraries via the /v1/ endpoint.
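As an illustration of the REST API mentioned in the tip, the following sketch sends one chat turn to Ollama's native /api/chat endpoint using only the Python standard library. The helper names `build_chat_request` and `chat` are hypothetical, and the sketch assumes an Ollama server is already running on the default port with the gemma4 model pulled.

```python
import json
import urllib.request

# Default local Ollama address from the tip above, plus the native chat route.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt, model="gemma4"):
    """Build the JSON body for a single, non-streaming chat turn."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # request one JSON object instead of streamed chunks
    }


def chat(prompt, model="gemma4", url=OLLAMA_URL, timeout=120):
    """POST the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]


# Example (requires a running Ollama server):
# print(chat("Hello, what can you do?"))
```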

Method 2: Hugging Face Transformers

Installation

pip install -U transformers torch accelerate bitsandbytes

You also need a Hugging Face account and access to the model granted on huggingface.co. Run huggingface-cli login first.

4-bit Quantization (Saves VRAM)

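from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E4B-it"  # model ID used elsewhere in this guide

# Load the weights in 4-bit precision via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available device(s) automatically
)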

Use this when VRAM is limited. With 4-bit quantization, the E4B model runs in about 5 GB.

Full BF16 Inference Example

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
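For causal language models, generate() returns the prompt tokens followed by the newly generated ones, so the print call above echoes the prompt as well as the answer. A small sketch of one way to keep only the reply is shown below; the helper name `new_tokens_only` is hypothetical, and the commented usage lines assume the `inputs`, `outputs`, and `processor` variables from the example above.

```python
def new_tokens_only(output_ids, prompt_len):
    """Drop the first prompt_len token IDs, keeping only the generated part.

    Works on any sliceable sequence: a Python list or a 1-D tensor.
    """
    return output_ids[prompt_len:]


# Usage with the example above (not run here, since it needs the model):
# prompt_len = inputs["input_ids"].shape[-1]
# reply_ids = new_tokens_only(outputs[0], prompt_len)
# print(processor.decode(reply_ids, skip_special_tokens=True))
```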

Method 3: vLLM (Production Server)

vLLM is the best fit when you need high throughput or an OpenAI-compatible API that multiple clients can query concurrently.

# Install vLLM
pip install vllm

# Start the server
vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
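The curl test above can also be written in Python. The sketch below uses only the standard library and the OpenAI-style chat completions format; the helper names (`build_payload`, `extract_reply`, `chat_completion`) and the `max_tokens` default are assumptions for illustration. Because the request shape is the generic OpenAI one, pointing `base_url` at another OpenAI-compatible server should work in the same way, though that is not verified here.

```python
import json
import urllib.request


def build_payload(prompt, model="google/gemma-4-31B-it", max_tokens=256):
    """Build an OpenAI-style chat completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Pull the assistant text out of a chat completions response dict."""
    return response["choices"][0]["message"]["content"]


def chat_completion(prompt, base_url="http://localhost:8000/v1", **kwargs):
    """POST to <base_url>/chat/completions on a running server."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))


# Example (requires the vLLM server started above):
# print(chat_completion("Hello!"))
```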

Method 4: LM Studio (GUI)

For users who prefer a graphical interface over terminal commands:

1. Download LM Studio from lmstudio.ai
2. Search "gemma4" in the model browser
3. Download your preferred variant (E4B recommended for 8GB VRAM)
4. Click "Load Model" then use the chat interface

LM Studio also exposes a local OpenAI-compatible server at http://localhost:1234/v1.

Common Problems and Solutions

CUDA Out of Memory

Switch to a smaller model (E4B instead of 31B)
Use 4-bit quantization (load_in_4bit=True)
Reduce max_new_tokens
Close other GPU-using applications
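The first two suggestions can be combined into a simple fallback: try the preferred configuration, and move to a lighter one if loading fails. This is only a sketch; `load_with_fallback` is a hypothetical helper, and it catches RuntimeError broadly on the assumption that PyTorch reports CUDA out-of-memory conditions as a RuntimeError (or a subclass of it), so unrelated runtime errors would also trigger the fallback.

```python
def load_with_fallback(loaders):
    """Try each (label, loader) pair in order.

    Returns (label, result) for the first loader that succeeds.
    Raises RuntimeError if every configuration fails.
    """
    errors = []
    for label, loader in loaders:
        try:
            return label, loader()
        except (RuntimeError, MemoryError) as exc:
            # Record the failure and move on to the next, lighter option.
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all configurations failed:\n" + "\n".join(errors))


# Usage sketch: each loader would wrap a from_pretrained(...) call, e.g.
# label, model = load_with_fallback([
#     ("E4B BF16", load_e4b_bf16),    # hypothetical loader functions
#     ("E4B 4-bit", load_e4b_4bit),
# ])
```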

Slow Generation

Enable Flash Attention 2: attn_implementation="flash_attention_2"
Use torch.compile(model) on PyTorch 2+
Switch to vLLM for continuous batching
Try GGUF quantized models in Ollama/llama.cpp
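Before and after trying any of these options, it helps to measure throughput so the effect is visible. The following is a minimal timing sketch, not a rigorous benchmark; `tokens_per_second` is a hypothetical helper, and it assumes the callable you pass in runs one generation and returns the number of new tokens produced.

```python
import time


def tokens_per_second(generate, n_runs=3):
    """Call generate() n_runs times and return the best tokens/second seen.

    generate must return the count of newly generated tokens for that run.
    Taking the best run reduces the effect of one-off warm-up costs.
    """
    best = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, n_tokens / elapsed)
    return best


# Usage sketch with the Transformers example (not run here):
# def run_once():
#     out = model.generate(**inputs, max_new_tokens=128)
#     return out.shape[-1] - inputs["input_ids"].shape[-1]
# print(f"{tokens_per_second(run_once):.1f} tokens/s")
```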

Access Denied / 403 Error

Accept the model license on Hugging Face
Run huggingface-cli login with your token
Check your HF account has model access

CPU-Only Machines

Use Ollama with GGUF quantized models (Q4_K_M)
E2B model is viable on modern CPUs (slow but functional)
Consider Google AI Studio for free cloud inference