Tutorial

Gemma 4 Local Deployment Guide

Run Google Gemma 4 entirely on your own hardware: no API key, no usage fees, and full privacy. This guide covers four methods: Hugging Face Transformers, Ollama, a vLLM server, and LM Studio.

Hardware Requirements

Minimum GPU VRAM

Model	BF16	4-bit Quant
Gemma 4 E2B	9.6 GB	3.2 GB
Gemma 4 E4B	15 GB	5 GB
Gemma 4 31B	58 GB	17 GB
Gemma 4 26B A4B	48 GB	15 GB

For most users with a single consumer GPU (RTX 3060 through 4090), the 4-bit E4B model offers the best balance of quality and memory use.
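
The VRAM table above can be turned into a quick lookup. The following is a minimal sketch that only uses the figures from the table; the helper name `pick_model` and the `headroom_gb` safety margin (for the KV cache and activations) are assumptions made for illustration, not part of any library.

```python
# VRAM figures in GB copied from the table above, ordered largest model first:
# (name, BF16 requirement, 4-bit requirement)
MODELS = [
    ("Gemma 4 31B", 58.0, 17.0),
    ("Gemma 4 26B A4B", 48.0, 15.0),
    ("Gemma 4 E4B", 15.0, 5.0),
    ("Gemma 4 E2B", 9.6, 3.2),
]


def pick_model(available_gb, headroom_gb=1.0):
    """Return (model, precision) for the largest model that fits, or None.

    headroom_gb is a hypothetical margin reserved for the KV cache and
    activations; tune it for your context length and batch size.
    """
    budget = available_gb - headroom_gb
    for name, bf16_gb, q4_gb in MODELS:
        if bf16_gb <= budget:
            return name, "BF16"   # full precision fits
        if q4_gb <= budget:
            return name, "4-bit"  # only the quantized variant fits
    return None
```

With this ordering, a larger model in 4-bit is preferred over a smaller model in BF16, which matches the recommendation of E4B in 4-bit for typical consumer GPUs.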

Which Method to Choose

Method	Best For
Ollama	Quickest start, no Python needed
LM Studio	GUI, non-technical users
Transformers	Python apps, full API control
vLLM	Production server, OpenAI API

Method 1: Ollama (Easiest)

Ollama is the simplest way to run Gemma 4 locally. It handles model download, quantization, and serving automatically.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 models
ollama pull gemma4          # 31B default
ollama pull gemma4:e4b      # Edge 4B (lighter)
ollama pull gemma4:e2b      # Edge 2B (lightest)

# Run interactive chat
ollama run gemma4

Tip: Once ollama run gemma4 is working, you can send requests to the Ollama REST API at http://localhost:11434. It is also compatible with OpenAI client libraries via the /v1/ endpoint.
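
As an illustration of the REST API mentioned in the tip, here is a minimal sketch using only the Python standard library. It assumes Ollama's native /api/chat endpoint on the default port and the gemma4 model tag pulled earlier; the function names `build_chat_request` and `chat` are hypothetical helpers, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from the tip above


def build_chat_request(prompt, model="gemma4"):
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="gemma4"):
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the Ollama server running, chat("Hello, what can you do?") should return the model's reply as a string. Splitting request construction from sending keeps the payload easy to inspect without a live server.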

Method 2: Hugging Face Transformers

Installation

pip install -U transformers torch accelerate bitsandbytes

You also need a Hugging Face account and access to the model on huggingface.co. Run huggingface-cli login first.

4-bit Quantization (Saves VRAM)

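from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E4B-it"

# Load the weights in 4-bit via bitsandbytes to reduce VRAM use
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"  # place layers on the available devices automatically
)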

Use this when VRAM is limited. With 4-bit quantization, the E4B model runs in about 5 GB.

Full BF16 Inference Example

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
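inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens so the prompt is not echoed back
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))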

Method 3: vLLM (Production Server)

vLLM is ideal when you need high throughput or an OpenAI-compatible API that multiple clients can query concurrently.

# Install vLLM
pip install vllm

# Start the server
vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
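
The curl test above can also be written in Python. This is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on its default address http://localhost:8000; the helper names `build_completion_request` and `complete` are hypothetical. Because it targets the OpenAI-style /chat/completions route, the same code should work against other OpenAI-compatible servers by changing `base_url`.

```python
import json
import urllib.request


def build_completion_request(prompt,
                             model="google/gemma-4-31B-it",
                             base_url="http://localhost:8000/v1"):
    """Build a POST request for an OpenAI-compatible chat completions route."""
    payload = {
        "model": model,  # must match the model name the server was started with
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to a running server and return the first reply text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, complete("Hello!") should return the assistant message from the first choice in the response.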

Method 4: LM Studio (GUI)

For users who prefer a graphical interface over terminal commands:

1. Download LM Studio from lmstudio.ai
2. Search "gemma4" in the model browser
3. Download your preferred variant (E4B recommended for 8GB VRAM)
4. Click "Load Model" then use the chat interface

LM Studio also exposes a local OpenAI-compatible server at http://localhost:1234/v1.

Common Problems and Solutions

CUDA Out of Memory

Switch to a smaller model (E4B instead of 31B)
Use 4-bit quantization (load_in_4bit=True)
Reduce max_new_tokens
Close other GPU-using applications

Slow Generation Speed

Enable Flash Attention 2: attn_implementation="flash_attention_2"
Use torch.compile(model) on PyTorch 2+
Switch to vLLM for continuous batching
Try GGUF quantized models in Ollama/llama.cpp

Access Denied / 403 Errors

Accept the model license on Hugging Face
Run huggingface-cli login with your token
Check your HF account has model access

CPU-Only Machines

Use Ollama with GGUF quantized models (Q4_K_M)
E2B model is viable on modern CPUs (slow but functional)
Consider Google AI Studio for free cloud inference