Mobile Guide
Gemma 4 for Android
Run Google's Gemma 4 directly on your Android phone: fully offline, with zero cloud costs and complete data privacy. The E2B edge model generates roughly 10–25 tokens per second on flagship Android phones with a Snapdragon 8 Gen 2 or newer.
Why Run Gemma 4 on Android?
- Full privacy — prompts and responses never leave your device
- Works offline — no Wi-Fi or mobile data needed
- Free after download — no API bills, no subscriptions
- Fast responses on Snapdragon 8 Gen 3 / Dimensity 9300 (first token in roughly 0.3 s on a Galaxy S24)
Which Model to Use
| Model | Storage | Active RAM | Speed (SD 8 Gen 3) | Recommendation |
|---|---|---|---|---|
| Gemma 4 E2B Q4_K_M | ~2.4 GB | ~2.5 GB | 14–20 tok/s | Best for Android |
| Gemma 4 E2B Q8 | ~4.4 GB | ~4.6 GB | 9–13 tok/s | Higher quality |
| Gemma 4 E4B Q4_K_M | ~4.2 GB | ~4.5 GB | 8–12 tok/s | Larger model |
E2B Q4_K_M is the recommended starting point for Android. It fits in ~2.4 GB of storage, uses ~2.5 GB RAM during inference, and runs at practical speeds on any flagship chipset from 2023 onward.
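Before downloading a variant, it can help to compare its active-RAM figure with what the phone actually has free. The sketch below is one way to do that from Termux or any Linux shell. The variant keys (`e2b-q4`, `e2b-q8`, `e4b-q4`), the helper names, and the 512 MB headroom are illustrative assumptions; only the RAM figures come from the table above.

```shell
#!/bin/sh
# Assumed safety margin for the OS and other apps; not a measured value.
HEADROOM_MB=512

# Approximate active RAM in MB for each variant, from the table above.
required_mb() {
    case "$1" in
        e2b-q4) echo 2500 ;;
        e2b-q8) echo 4600 ;;
        e4b-q4) echo 4500 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the variant ($1) fits into the available RAM in MB ($2).
# Returns 2 for an unknown variant key.
fits_in_ram() {
    need=$(required_mb "$1") || return 2
    [ "$2" -ge $((need + HEADROOM_MB)) ]
}

# MemAvailable in /proc/meminfo is reported in kB.
available_mb() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

if [ -r /proc/meminfo ]; then
    avail=$(available_mb)
    for variant in e2b-q4 e4b-q4 e2b-q8; do
        if fits_in_ram "$variant" "${avail:-0}"; then
            echo "$variant: fits (${avail:-0} MB available)"
        else
            echo "$variant: not enough free RAM"
        fi
    done
fi
```

Free memory on Android fluctuates as background apps are killed and restarted, so treat the result as a rough guide rather than a guarantee.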
Method 1 — Google AI Edge / MediaPipe (Kotlin)
The official Google path. The AI Edge SDK integrates directly with Android Studio and uses the GPU delegate for hardware-accelerated inference. It is ideal for building your own app.
build.gradle
    // build.gradle (app)
    dependencies {
        // MediaPipe LLM Inference API from Google AI Edge.
        // The version is an example; use the latest release that supports your model.
        implementation("com.google.mediapipe:tasks-genai:0.10.24")
    }
Kotlin Inference
    import com.google.mediapipe.tasks.genai.llminference.LlmInference

    // Point the engine at the E2B model file on device storage.
    // The path and file name are placeholders for wherever you copied the model.
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma4-e2b-it-q4.task")
        .setPreferredBackend(LlmInference.Backend.GPU)
        .build()
    val llm = LlmInference.createFromOptions(context, options)

    // Generate response
    val result = llm.generateResponse("What is Gemma 4?")
    println(result)
Method 2 — MLC-LLM Android App
MLC-LLM provides a pre-built Android APK that runs Gemma 4 with GPU acceleration. MLC-LLM loads models compiled to its own format rather than GGUF files, so use a build of the model packaged for MLC. Download the APK from the MLC-LLM GitHub releases, install it, then load the Gemma 4 E2B Q4 model from within the app. No coding is required, and it works on any Android 10+ device with a modern GPU.
- Step 1: Download the MLC-LLM APK from the project's GitHub Releases page
- Step 2: Install the APK (allow "Install unknown apps" for your browser or file manager in Settings)
- Step 3: Open the app → Add Model → select Gemma 4 E2B Q4
Method 3 — Termux + llama.cpp (Power Users)
For developers who want full control. Install Termux from F-Droid, compile llama.cpp with OpenCL or Vulkan support, download the GGUF model, and run inference from the terminal. Slower setup but gives you direct access to all llama.cpp flags including context length, temperature, and batch size.
    # Install Termux from F-Droid (not Play Store)
    # Then inside Termux:
    pkg update && pkg install clang cmake git

    # Grant Termux access to shared storage so /sdcard is readable
    termux-setup-storage

    # Clone and build llama.cpp with Vulkan
    # (the Vulkan backend also needs the Vulkan headers and the glslc shader compiler installed)
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j4

    # Download Gemma 4 E2B Q4_K_M GGUF
    # (copy to phone storage or wget from Hugging Face)

    # Run inference
    ./build/bin/llama-cli \
        -m /sdcard/gemma4-e2b-q4_k_m.gguf \
        -p "Explain machine learning simply" \
        -n 200 --gpu-layers 99
Android Hardware Requirements
| Chipset | GPU API | E2B Q4 Speed | Devices |
|---|---|---|---|
| Snapdragon 8 Elite | Vulkan — Full | 18–25 tok/s | Galaxy S25, Xiaomi 15 |
| Snapdragon 8 Gen 3 | Vulkan — Full | 14–20 tok/s | Galaxy S24, OnePlus 12 |
| Snapdragon 8 Gen 2 | Vulkan — Full | 10–15 tok/s | Galaxy S23, OnePlus 11 |
| Dimensity 9300 / 9400 | Vulkan — Full | 12–18 tok/s | Xiaomi 14, vivo X100 |
| Snapdragon 7s Gen 2 | CPU fallback | 3–5 tok/s | Mid-range devices |
| Older chipsets | CPU only | 1–3 tok/s | Not recommended |
Mid-range chipsets (Snapdragon 7s Gen 2 and below) can run E2B Q4 on the CPU, but expect only 1–5 tokens/sec. Flagship chips with Vulkan/OpenCL GPU compute deliver 10–25 tokens/sec.
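To check whether a particular phone advertises Vulkan at all, one option is to read the `android.hardware.vulkan.version` feature reported by `pm list features` and decode its packed version number. The sketch below assumes that output is piped in (for example from `adb shell pm list features` on a connected computer); the helper names are illustrative, and a sample line stands in for real device output.

```shell
#!/bin/sh
# Pull the packed Vulkan version out of `pm list features` output (read from stdin).
# Prints nothing if the device does not advertise the feature.
vulkan_packed_version() {
    tr -d '\r' |
        sed -n 's/^feature:android\.hardware\.vulkan\.version=\([0-9][0-9]*\)$/\1/p' |
        head -n 1
}

# Decode the packed value: (major << 22) | (minor << 12) | patch.
decode_vulkan_version() {
    echo "$(( $1 >> 22 )).$(( ($1 >> 12) & 1023 )).$(( $1 & 4095 ))"
}

# Sample line standing in for real device output, so the sketch runs without a phone.
sample='feature:android.hardware.vulkan.version=4206592'
packed=$(printf '%s\n' "$sample" | vulkan_packed_version)
decode_vulkan_version "$packed"   # prints 1.3.0
```

On a real device, replace the sample with `adb shell pm list features | vulkan_packed_version`. An empty result suggests the phone will fall into the CPU rows of the table.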
Performance Benchmarks
| Task | Pixel 8 Pro (Tensor G3) | Galaxy S24 (SD 8 Gen 3) |
|---|---|---|
| Text generation (tok/s) | 13 | 18 |
| First token latency | ~0.5s | ~0.3s |
| 512→512 token throughput | 10 tok/s | 15 tok/s |
| RAM usage (peak) | 2.9 GB | 2.7 GB |
| Battery drain (per hour) | ~20% | ~16% |
Tokens/sec measured with Gemma 4 E2B Q4_K_M and a 512-token prompt. The GPU path uses GPU compute via MLC-LLM or the AI Edge SDK.
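The table's figures translate into a rough wait time for a full reply: first-token latency plus the number of output tokens divided by the generation speed. A minimal sketch of that arithmetic follows; the 200-token reply length and the function name are assumptions chosen for illustration.

```shell
#!/bin/sh
# Rough wall-clock estimate: first-token latency plus generation time.
# Arguments: latency in seconds, decode speed in tok/s, number of output tokens.
estimate_seconds() {
    awk -v latency="$1" -v rate="$2" -v tokens="$3" \
        'BEGIN { printf "%.1f\n", latency + tokens / rate }'
}

# 200-token reply using the table's figures
estimate_seconds 0.5 13 200   # Pixel 8 Pro column: prints 15.9
estimate_seconds 0.3 18 200   # Galaxy S24 column: prints 11.4
```

This ignores thermal throttling and prompt length, both of which can stretch real-world times.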
Tips for Best Performance
- Enable GPU acceleration: GPU inference is 4–6× faster than CPU
- Use Q4_K_M quantization for the best quality/speed balance on mobile
- Set Android performance mode (Settings → Battery → Performance; the exact path varies by manufacturer) before long sessions
- Keep prompts under 2K tokens, since longer contexts increase first-token latency significantly
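For the Termux + llama.cpp route, these tips map onto `llama-cli` flags: `--gpu-layers` for GPU offload, `-c` for context size, `-b` for batch size, and `--temp` for sampling temperature. The wrapper below is a sketch, not a tuned configuration. The function name, the `DRY_RUN` switch, and the specific values are assumptions; by default it only prints the command so it can be inspected first.

```shell
#!/bin/sh
# Illustrative settings (assumed values, tune for your device).
MODEL=/sdcard/gemma4-e2b-q4_k_m.gguf
CTX=4096        # context window: room for a sub-2K prompt plus the reply
BATCH=512       # prompt-processing batch size
TEMP=0.7        # sampling temperature
N_PREDICT=200   # maximum tokens to generate

# Print the llama-cli command by default; set DRY_RUN=0 to execute it.
run_gemma() {
    prompt=$1
    set -- ./build/bin/llama-cli -m "$MODEL" -c "$CTX" -b "$BATCH" \
        --temp "$TEMP" -n "$N_PREDICT" --gpu-layers 99 -p "$prompt"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_gemma "Explain machine learning simply"
```

Run it from the llama.cpp checkout with `DRY_RUN=0` once the printed command looks right.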