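Tutorial

Gemma 4 Installation Tutorial

A step-by-step guide to installing and running Gemma 4 on your machine, from setting up a Python environment to running your first inference. Covers both the Python SDK and Ollama.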

Python pip Ollama CUDA

Prerequisites

System requirements

OS	Linux, macOS, or Windows (WSL2)
Python	3.9 or higher (3.11 recommended)
GPU	NVIDIA with 6 GB+ VRAM (optional but recommended)
CUDA	12.1+ (if using GPU)
RAM	16 GB+ system RAM
Disk	20–60 GB free space per model
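The Python and disk rows of the table above can be checked before installing anything. The following is a minimal sketch using only the standard library; the helper names `python_ok` and `free_disk_gb` are hypothetical, and the 20 GB threshold is simply the lower bound of the range in the table.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)      # minimum version from the requirements table
MIN_FREE_DISK_GB = 20    # lower bound of the 20-60 GB range (assumed threshold)


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= tuple(minimum)


def free_disk_gb(path="."):
    """Free space on the filesystem containing `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("Free disk (GB):", round(free_disk_gb(), 1))
    print("Enough disk for the smallest case:", free_disk_gb() >= MIN_FREE_DISK_GB)
```

GPU and CUDA checks are left out here because they need PyTorch, which is installed in a later step.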

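Hugging Face account

Create a free account at huggingface.co
Visit the model page (for example, google/gemma-4-E4B-it)
Click "Access repository" and accept the license
Create a read token under Settings → Access Tokens

Model access is free; Google only requires that you accept the license.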

Step 1: Set up the Python environment

Using Conda (recommended)

# Create a dedicated Python environment
conda create -n gemma4 python=3.11 -y
conda activate gemma4

Using venv

python -m venv gemma4-env
# Linux/macOS:
source gemma4-env/bin/activate
# Windows:
gemma4-env\Scripts\activate
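Before installing packages, it can help to confirm that the environment is actually active. The sketch below relies on two standard behaviours: a venv makes `sys.prefix` differ from `sys.base_prefix`, and `conda activate` sets the `CONDA_DEFAULT_ENV` variable. The function name `active_environment` is hypothetical.

```python
import os
import sys


def active_environment(prefix=None, base_prefix=None, env=None):
    """Return 'venv', 'conda:<name>', or None for the given interpreter state."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    env = os.environ if env is None else env
    # Inside a venv, sys.prefix points at the venv, not the base interpreter.
    if prefix != base_prefix:
        return "venv"
    # conda activate exports the active environment's name.
    name = env.get("CONDA_DEFAULT_ENV")
    if name:
        return "conda:" + name
    return None


if __name__ == "__main__":
    print("Active environment:", active_environment())
```

If this prints `None`, or `conda:base` rather than `conda:gemma4`, the environment created above is probably not activated in the current shell.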

Step 2: Install dependencies

# Core dependencies
pip install -U transformers torch accelerate

# Optional: quantization support
pip install bitsandbytes

# Optional: faster inference
pip install flash-attn --no-build-isolation

For GPU users: PyTorch uses CUDA automatically when a CUDA-enabled build is installed and a compatible GPU is available. Verify with: python -c "import torch; print(torch.cuda.is_available())".
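To see which of the packages from this step were actually installed, and at which versions, the standard library's `importlib.metadata` can be queried without importing the packages themselves. This is a small sketch; the package list mirrors the pip commands above, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names taken from the pip commands in this step.
PACKAGES = ["transformers", "torch", "accelerate", "bitsandbytes", "flash-attn"]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

The optional packages (bitsandbytes, flash-attn) showing as not installed is expected if those commands were skipped.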

Step 3: Authenticate with Hugging Face

# Install Hugging Face CLI
pip install huggingface_hub

# Authenticate (get your token at huggingface.co/settings/tokens)
huggingface-cli login
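On servers or in CI, where an interactive prompt is inconvenient, the same login can be done from Python. The sketch below assumes the token is stored in the `HF_TOKEN` environment variable and uses `huggingface_hub.login`; the helper names `read_token` and `login_from_env` are hypothetical.

```python
import os


def read_token(env=None):
    """Return the token from HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    return token or None


def login_from_env():
    """Log in with huggingface_hub if HF_TOKEN is set; return whether it ran."""
    token = read_token()
    if token is None:
        return False
    # Imported lazily so read_token() works even without the package installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` once at the start of a script is enough; it returns `False` without contacting the Hub when no token is set.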

Step 4: Download the model

Download with Python

# Download the E4B model (recommended for 8-16 GB VRAM)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-4-E4B-it",
    local_dir="./models/gemma4-e4b"
)

Model size reference

Model	Download Size
E2B	~4.6 GB
E4B	~8.0 GB
31B	~58 GB
26B A4B	~48 GB
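The sizes in the table can be compared against free disk space before starting a long download. The following sketch copies the approximate sizes from the table; the 1.2 headroom factor is an assumption chosen for illustration rather than a documented requirement, and `models_that_fit` is a hypothetical helper.

```python
import shutil

# Approximate download sizes in GB, copied from the table above.
DOWNLOAD_SIZE_GB = {"E2B": 4.6, "E4B": 8.0, "26B A4B": 48.0, "31B": 58.0}


def models_that_fit(free_gb, headroom=1.2):
    """Return the models whose size times `headroom` fits in `free_gb`."""
    return [
        name
        for name, size in DOWNLOAD_SIZE_GB.items()
        if size * headroom <= free_gb
    ]


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 1e9
    print(f"Free disk: {free_gb:.1f} GB")
    print("Models that fit:", models_that_fit(free_gb))
```

Fitting on disk says nothing about fitting in VRAM; the check only guards against a download failing partway through.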

Step 5: Verify the installation

import torch
from transformers import AutoProcessor, AutoModelForCausalLM

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM:", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1), "GB")

# Quick load test
processor = AutoProcessor.from_pretrained("google/gemma-4-E4B-it")
print("Processor loaded OK")

Step 6: Run your first inference

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

MODEL_ID = "google/gemma-4-E4B-it"

# Load the processor and the model weights
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
    device_map="auto"            # place layers on the GPU when one is available
)

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Explain what Gemma 4 is in 2 sentences."}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Tokenize, generate, and decode only the newly generated tokens
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
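The slice in the final line is there because, for decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so decoding the full sequence would repeat the prompt. The idea can be illustrated with plain lists; the token IDs below are made up for the example and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(output_ids, prompt_length):
    """Drop the first `prompt_length` tokens, keeping only the generated ones."""
    return output_ids[prompt_length:]


# Made-up token IDs, for illustration only.
prompt_ids = [2, 105, 2364, 107]
generated_ids = prompt_ids + [818, 563, 1]  # prompt followed by new tokens

new_ids = strip_prompt(generated_ids, len(prompt_ids))
print(new_ids)  # [818, 563, 1]
```

In the script above, `inputs["input_ids"].shape[-1]` plays the role of `len(prompt_ids)`.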

Alternative: Ollama (no Python required)

Want a simpler setup? Ollama downloads the model and runs it for you:

# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com
# Then in terminal:
ollama pull gemma4:e4b
ollama run gemma4:e4b
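Once the model has been pulled, it can also be called from a script through Ollama's local REST API. The sketch below uses only the standard library and assumes the Ollama server is running on its default address (`http://localhost:11434`) and that the `gemma4:e4b` tag from the commands above is available; `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(prompt, model="gemma4:e4b"):
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma4:e4b", timeout=120):
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(generate("Explain what Gemma 4 is in 2 sentences."))` should print the model's reply; if the server is not running, `urlopen` raises a `URLError`.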