Tutorial

Gemma 4 Local Deployment Guide

Run Google Gemma 4 entirely on your own hardware, with no API keys, no usage fees, and full privacy. This guide covers four methods: Hugging Face Transformers, Ollama, a vLLM server, and LM Studio.

Hardware Requirements

Minimum GPU VRAM

Model	BF16	4-bit Quant
Gemma 4 E2B	9.6 GB	3.2 GB
Gemma 4 E4B	15 GB	5 GB
Gemma 4 31B	58 GB	17 GB
Gemma 4 26B A4B	48 GB	15 GB

For most users with a single consumer GPU (RTX 3060–4090), the 4-bit E4B model is the sweet spot.
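To check a specific GPU against the table above, the lookup can be sketched in a few lines of Python. The figures are copied from the table; the `options_that_fit` helper and its `headroom_gb` margin are illustrative assumptions, not part of any Gemma tooling.

```python
# Minimum GPU VRAM in GB, copied from the table above.
VRAM_GB = {
    "Gemma 4 E2B": {"bf16": 9.6, "4bit": 3.2},
    "Gemma 4 E4B": {"bf16": 15.0, "4bit": 5.0},
    "Gemma 4 31B": {"bf16": 58.0, "4bit": 17.0},
    "Gemma 4 26B A4B": {"bf16": 48.0, "4bit": 15.0},
}


def options_that_fit(available_gb, headroom_gb=1.0):
    """Return (model, precision, required_gb) tuples that fit, largest first.

    headroom_gb is a hypothetical safety margin left free for the KV cache
    and activations; tune it for your own workload.
    """
    budget = available_gb - headroom_gb
    fits = [
        (model, precision, need)
        for model, row in VRAM_GB.items()
        for precision, need in row.items()
        if need <= budget
    ]
    # Largest requirement first, so the most capable option leads the list.
    return sorted(fits, key=lambda item: item[2], reverse=True)
```

On an 8 GB card, `options_that_fit(8.0)` puts the 4-bit E4B model first, which matches the recommendation above.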

Which Method to Choose

Method	Best For
Ollama	Quickest start, no Python needed
LM Studio	GUI, non-technical users
Transformers	Python apps, full API control
vLLM	Production server, OpenAI API

Method 1: Ollama (Easiest)

Ollama is the simplest way to run Gemma 4 locally. It handles model download, quantization, and serving automatically.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 models
ollama pull gemma4          # 31B default
ollama pull gemma4:e4b      # Edge 4B (lighter)
ollama pull gemma4:e2b      # Edge 2B (lightest)

# Run interactive chat
ollama run gemma4

Tip: After running ollama run gemma4, you can send requests to the Ollama REST API at http://localhost:11434. It is also compatible with OpenAI client libraries via the /v1/ endpoint.
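As a minimal sketch of calling that REST API from Python, the snippet below uses only the standard library. It assumes Ollama's native /api/chat endpoint and its non-streaming response shape (a JSON object with a message.content field). Neither is spelled out in this guide, so check them against the Ollama API documentation for your version. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address mentioned in the tip above


def build_chat_request(prompt, model="gemma4"):
    """Build a non-streaming POST request for Ollama's /api/chat endpoint (assumed path)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON object instead of a chunk stream
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="gemma4", timeout=120):
    """Send the prompt to a running Ollama server and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With the server running, `print(chat("Hello, what can you do?"))` should print the model's reply. Pass `model="gemma4:e4b"` to target the lighter variant pulled earlier.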

Method 2: Hugging Face Transformers

Install

pip install -U transformers torch accelerate bitsandbytes

You will also need a Hugging Face account and access to the model on huggingface.co. Run huggingface-cli login first.

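4-bit Quantization (saves VRAM)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E4B-it"

# Load the weights in 4-bit precision via bitsandbytes to reduce VRAM use
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"  # place layers on the available GPU(s) automatically
)

Use this when VRAM is limited. With 4-bit quantization, the E4B model runs in about 5 GB.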

Full BF16 Inference Example

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))

Method 3: vLLM (Production Server)

vLLM is ideal when you need high throughput or an OpenAI-compatible API that multiple clients can query simultaneously.

# Install vLLM
pip install vllm

# Start the server
vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
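The same request as the curl test can be sent from Python. This is a standard-library sketch that assumes the server started above is listening on http://localhost:8000 and returns the usual OpenAI-style chat completion JSON. The helper names, the `base_url` parameter, and the `max_tokens` default are illustrative choices.

```python
import json
import urllib.request


def build_completion_request(prompt, model="google/gemma-4-31B-it",
                             base_url="http://localhost:8000/v1", max_tokens=256):
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,  # must match the model name the server was started with
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt, timeout=120, **request_options):
    """Send the prompt to the server and return the assistant's reply text."""
    request = build_completion_request(prompt, **request_options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the server up, `print(ask("Hello!"))` mirrors the curl test. The official openai client package pointed at the same base URL is a common alternative when you want streaming or tool-call helpers.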

Method 4: LM Studio (GUI)

For users who prefer a graphical interface with no terminal commands:

1. Download LM Studio from lmstudio.ai
2. Search "gemma4" in the model browser
3. Download your preferred variant (E4B recommended for 8 GB of VRAM)
4. Click "Load Model" then use the chat interface

LM Studio also exposes a local OpenAI-compatible server at http://localhost:1234/v1.

Common Problems and Solutions

CUDA Out of Memory

Switch to a smaller model (E4B instead of 31B)
Use 4-bit quantization (load_in_4bit=True)
Reduce max_new_tokens
Close other GPU-using applications
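Before trying the fixes above, it helps to see how much VRAM is actually in use. The sketch below shells out to nvidia-smi with its CSV query flags and parses the result. It assumes an NVIDIA GPU with nvidia-smi on the PATH, and the function names are illustrative.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows in MiB (one row per GPU) into a list of dicts."""
    gpus = []
    for index, line in enumerate(csv_text.strip().splitlines()):
        used_mib, total_mib = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "gpu": index,
            "used_mib": used_mib,
            "total_mib": total_mib,
            "free_mib": total_mib - used_mib,
        })
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)
```

If `free_mib` is already low before the model loads, another process is holding VRAM. Closing it may be enough to avoid switching to a smaller model.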

Slow Generation Speed

Enable Flash Attention 2: attn_implementation="flash_attention_2"
Use torch.compile(model) on PyTorch 2+
Switch to vLLM for continuous batching
Try GGUF quantized models in Ollama/llama.cpp
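To tell whether any of these changes helps, measure tokens per second before and after. A minimal timing sketch follows. `measure_throughput` is a hypothetical helper, and the injectable `clock` argument exists only so the function can be tested without real waiting.

```python
import time


def measure_throughput(generate_fn, clock=time.perf_counter):
    """Time one call to generate_fn and return new tokens per second.

    generate_fn takes no arguments and returns the number of newly
    generated tokens (an assumption of this sketch).
    """
    start = clock()
    new_tokens = generate_fn()
    elapsed = clock() - start
    return new_tokens / elapsed if elapsed > 0 else float("inf")


# Example wiring for the Transformers setup shown earlier (not run here):
#   def run_once():
#       out = model.generate(**inputs, max_new_tokens=128)
#       return out.shape[1] - inputs["input_ids"].shape[1]
#   print(f"{measure_throughput(run_once):.1f} tokens/s")
```

Run one warm-up generation before timing. The first call often includes one-time costs such as kernel compilation, which would skew the comparison.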

Access Denied / 403 Error

Accept the model license on Hugging Face
Run huggingface-cli login with your token
Check your HF account has model access

CPU-Only Machine

Use Ollama with GGUF quantized models (Q4_K_M)
E2B model is viable on modern CPUs (slow but functional)
Consider Google AI Studio for free cloud inference