Mobile Guide
Gemma 4 for iPhone
Run Google's Gemma 4 directly on your iPhone — no cloud, no subscription, full privacy. The E2B edge model fits in under 5 GB and delivers real-time inference on any iPhone 15 Pro or newer.
Why Run Gemma 4 on iPhone?
- Complete privacy — your data never leaves the device
- Works fully offline, no internet required
- Zero API costs — runs free once downloaded
- Sub-second first-token latency (~0.25–0.4 s) on A17 Pro and A18 Pro chips
Which Model to Use
| Model | Download | Active RAM | Speed (A17 Pro) | Recommendation |
|---|---|---|---|---|
| Gemma 4 E2B Q4_K_M | ~2.4 GB | ~2.5 GB | 12–18 tok/s | Best for iPhone |
| Gemma 4 E2B BF16 | ~4.6 GB | ~4.8 GB | 6–10 tok/s | Max quality |
| Gemma 4 E4B Q4_K_M | ~4.2 GB | ~4.5 GB | 7–11 tok/s | Higher quality |
The E2B model is the top recommendation for iPhone. The Q4_K_M build is a ~2.4 GB download (~4.6 GB for BF16), so it fits comfortably in iPhone storage, and 4-bit quantization keeps active RAM at ~2.5 GB, well within the memory budget of the iPhone 15/16 series.
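To make the selection rule concrete, here is a minimal sketch that filters the table's models by a RAM budget. The `ModelOption` type, the `modelsThatFit` function, and the 3 GB budget are illustrative assumptions, not part of any SDK; the figures are copied from the table above.

```swift
import Foundation

// Figures copied from the model table; names here are hypothetical helpers.
struct ModelOption {
    let name: String
    let downloadGB: Double
    let activeRAMGB: Double
}

let options = [
    ModelOption(name: "Gemma 4 E2B Q4_K_M", downloadGB: 2.4, activeRAMGB: 2.5),
    ModelOption(name: "Gemma 4 E2B BF16", downloadGB: 4.6, activeRAMGB: 4.8),
    ModelOption(name: "Gemma 4 E4B Q4_K_M", downloadGB: 4.2, activeRAMGB: 4.5),
]

/// Returns the models whose active RAM fits within `budgetGB`, smallest first.
func modelsThatFit(budgetGB: Double, from options: [ModelOption]) -> [ModelOption] {
    options
        .filter { $0.activeRAMGB <= budgetGB }
        .sorted { $0.activeRAMGB < $1.activeRAMGB }
}

// Assumed budget: 3 GB of RAM available to the app (illustrative only).
let fits = modelsThatFit(budgetGB: 3.0, from: options)
print(fits.map(\.name))
```

With a 3 GB budget only the E2B Q4_K_M build passes the filter, which matches the recommendation above.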
Method 1 — Google AI Edge SDK (Swift)
The official path. Google AI Edge provides a native Swift package that runs Gemma 4 using Metal GPU acceleration.
- Add the Google AI Edge Swift package to your Xcode project
- Download the E2B model weights in TensorFlow Lite / LiteRT format
- Initialize the session and call generate()
Package.swift
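// Package.swift
// Note: google/generative-ai-swift is Google's Gemini cloud API client, not an on-device runtime.
// Confirm the current package URL and version in the Google AI Edge documentation before use.
dependencies: [
    .package(
        url: "https://github.com/google/generative-ai-swift",
        from: "0.5.0"
    )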
]
Swift Inference
import GoogleAIEdge

// Load the E2B model (place the .task file in the app bundle)
guard let modelPath = Bundle.main.path(forResource: "gemma4-e2b-it-q4", ofType: "task") else {
    fatalError("gemma4-e2b-it-q4.task is missing from the app bundle")
}
let session = try LlmInference(modelPath: modelPath)

// Generate a response
let response = try await session.generateResponse(
    inputText: "Explain quantum computing in simple terms."
)
print(response)
Method 2 — GGUF via llama.cpp (LLM Farm / Offline Chat)
For users who prefer a ready-made iOS app. LLM Farm and Offline Chat on the App Store both use llama.cpp under the hood and support Gemma GGUF models. Download the Q4_K_M quantized GGUF from Hugging Face, import it in the app, and start chatting. No code required.
LLM Farm
Free, open-source, supports custom GGUF import. Available on App Store.
Offline Chat
Simple UI, built-in model browser, supports Gemma GGUF natively.
Method 3 — Ollama on Mac, Access from iPhone
Already running Ollama on a Mac? Expose the API on your local network and connect from the iPhone using any OpenAI-compatible app (e.g. Enchanted, OllamaChat). Set OLLAMA_HOST=0.0.0.0 on the Mac, then point the app to http://your-mac-ip:11434.
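If you would rather call the Mac from your own Swift code than from a ready-made app, the request can be sketched as below. It targets Ollama's native `/api/generate` endpoint with streaming turned off. The IP address is a placeholder, and `GenerateRequest` and `makeGenerateRequest` are hypothetical helper names for illustration.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

// JSON body for Ollama's /api/generate endpoint.
struct GenerateRequest: Codable {
    let model: String
    let prompt: String
    let stream: Bool
}

/// Builds a POST request for /api/generate with streaming disabled.
func makeGenerateRequest(baseURL: URL, model: String, prompt: String) throws -> URLRequest {
    let url = baseURL
        .appendingPathComponent("api")
        .appendingPathComponent("generate")
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(
        GenerateRequest(model: model, prompt: prompt, stream: false)
    )
    return request
}

// 192.168.1.20 is a placeholder; substitute your Mac's local IP.
let base = URL(string: "http://192.168.1.20:11434")!
let request = try makeGenerateRequest(baseURL: base, model: "gemma4:e2b", prompt: "Hello")
print(request.url!.absoluteString)
```

Sending it is then a matter of `URLSession.shared.data(for: request)` and reading the `response` field of the returned JSON. The Mac-side setup is: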
# On your Mac — allow LAN access
export OLLAMA_HOST=0.0.0.0
ollama serve
# Pull Gemma 4 E2B if not already done
ollama pull gemma4:e2b
# iPhone app settings:
# Base URL: http://192.168.x.x:11434 (your Mac's local IP)
# Model: gemma4:e2b
iPhone Hardware Requirements
| iPhone | Chip | GPU Acceleration | E2B Q4 Speed |
|---|---|---|---|
| iPhone 16 Pro / Max | A18 Pro | Metal — Full | 18–24 tok/s |
| iPhone 16 / Plus | A18 | Metal — Full | 15–20 tok/s |
| iPhone 15 Pro / Max | A17 Pro | Metal — Full | 12–18 tok/s |
| iPhone 15 / Plus | A16 | Metal — Partial | 6–10 tok/s |
| iPhone 14 / Plus | A15 | CPU fallback | 3–5 tok/s |
| iPhone 13 and older | A15 and earlier | CPU only | 2–3 tok/s |
Older iPhones can run the E2B model via CPU inference but will be slow (2–5 tokens/sec). iPhone 15 Pro and later (A17 Pro, A18, A18 Pro) get full Metal GPU acceleration and deliver practical real-time speed.
Performance Benchmarks
| Task | iPhone 15 Pro (A17 Pro) | iPhone 16 Pro (A18 Pro) |
|---|---|---|
| Text generation (tok/s) | 14 | 21 |
| First token latency | ~0.4s | ~0.25s |
| 512→512 token throughput | 11 tok/s | 17 tok/s |
| RAM usage (peak) | 2.8 GB | 2.6 GB |
| Battery drain (per hour) | ~18% | ~14% |
Tokens/sec measured with Gemma 4 E2B Q4_K_M, 512-token prompt, using Metal GPU. CPU-only fallback is 3–5× slower.
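To translate these figures into wall-clock time, a rough estimate adds the first-token latency to the decoding time at the measured generation rate. The sketch below assumes this simple two-term model and a 256-token reply; both are illustrative assumptions, and real runs will vary with prompt length and thermal state.

```swift
import Foundation

/// Rough end-to-end latency: time to first token plus steady-state decoding.
func estimatedSeconds(outputTokens: Int, firstTokenLatency: Double, tokensPerSecond: Double) -> Double {
    firstTokenLatency + Double(outputTokens) / tokensPerSecond
}

// Latency and tok/s values are taken from the benchmark table above.
let a17 = estimatedSeconds(outputTokens: 256, firstTokenLatency: 0.4, tokensPerSecond: 14)
let a18 = estimatedSeconds(outputTokens: 256, firstTokenLatency: 0.25, tokensPerSecond: 21)
print(String(format: "A17 Pro: %.1f s, A18 Pro: %.1f s", a17, a18))
// prints "A17 Pro: 18.7 s, A18 Pro: 12.4 s"
```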
Tips for Best Performance
- Use Q4_K_M quantization — best quality/speed trade-off for mobile
- Close other apps before running inference to free RAM
- Keep prompts under 2K tokens for the fastest responses on a phone
- Plug in charger during extended sessions — inference is GPU-intensive
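The prompt-length tip can be enforced with a small guard before inference. The sketch below uses a rough ratio of 4 characters per token, which is an assumption for English text and not Gemma's real tokenizer. The function names are hypothetical.

```swift
import Foundation

/// Very rough token estimate: about 4 characters per token (an assumption,
/// not the model's actual tokenizer).
func estimatedTokenCount(_ text: String, charactersPerToken: Double = 4.0) -> Int {
    Int((Double(text.count) / charactersPerToken).rounded(.up))
}

/// Trims the prompt from the front so the most recent text fits the budget.
func trimmedPrompt(_ prompt: String, maxTokens: Int = 2_000, charactersPerToken: Double = 4.0) -> String {
    let maxCharacters = Int(Double(maxTokens) * charactersPerToken)
    guard prompt.count > maxCharacters else { return prompt }
    return String(prompt.suffix(maxCharacters))
}

let longPrompt = String(repeating: "word ", count: 3_000)  // 15,000 characters
let trimmed = trimmedPrompt(longPrompt)
print(estimatedTokenCount(longPrompt), estimatedTokenCount(trimmed))
// prints "3750 2000"
```

Keeping the tail rather than the head preserves the latest conversation turns. For exact limits, count tokens with the runtime's own tokenizer instead.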