Mobile Guide

Gemma 4 for Android

Run Google's Gemma 4 directly on your Android phone — fully offline, zero cloud costs, complete data privacy. The E2B edge model delivers real-time AI on any flagship Android with a Snapdragon 8 Gen 2 or newer.

Android Kotlin Vulkan GPU On-Device Offline

Why Run Gemma 4 on Android?

Full privacy — prompts and responses never leave your device
Works offline — no Wi-Fi or mobile data needed
Free after download — no API bills, no subscriptions
Instant response on Snapdragon 8 Gen 3 / Dimensity 9300

Which Model to Use

Model	Storage	Active RAM	Speed (SD 8 Gen 3)	Recommendation
Gemma 4 E2B Q4_K_M	~2.4 GB	~2.5 GB	14–20 tok/s	Best for Android
Gemma 4 E2B Q8	~4.4 GB	~4.6 GB	9–13 tok/s	Higher quality
Gemma 4 E4B Q4_K_M	~4.2 GB	~4.5 GB	8–12 tok/s	Larger model

E2B Q4_K_M is the recommended starting point for Android. It fits in ~2.4 GB of storage, uses ~2.5 GB RAM during inference, and runs at practical speeds on any flagship chipset from 2023 onward.

Method 1 — Google AI Edge / MediaPipe (Kotlin)

The official Google path. The AI Edge SDK integrates directly with Android Studio and uses GPU delegate for hardware-accelerated inference. Ideal for building your own app.

build.gradle

// build.gradle (app)
dependencies {
    implementation("com.google.ai.edge.litert:litert:1.0.1")
    implementation("com.google.ai.edge.litert:litert-gpu:1.0.1")
}

Kotlin Inference

import com.google.ai.edge.litert.LiteRtSession

// Initialize with E2B model asset
val session = LiteRtSession.createFromAsset(
    context,
    "gemma4-e2b-it-q4.task",
    LiteRtSession.Options.builder()
        .setPreferredBackend(LiteRtSession.Backend.GPU)
        .build()
)

// Generate response
val result = session.generateResponse("What is Gemma 4?")
println(result)

Method 2 — MLC-LLM Android App

MLC-LLM provides a pre-built Android APK that runs Gemma 4 GGUF models using Vulkan GPU acceleration. Download the APK from the MLC-LLM GitHub releases, install it, then load the Gemma 4 E2B Q4 model from within the app. No coding required — works on any Android 10+ device with a modern GPU.

Step 1

Download MLC-LLM APK from GitHub Releases page

Step 2

Install APK (enable "Unknown sources" in settings)

Step 3

Open app → Add Model → select Gemma 4 E2B Q4

Method 3 — Termux + llama.cpp (Power Users)

For developers who want full control. Install Termux from F-Droid, compile llama.cpp with OpenCL or Vulkan support, download the GGUF model, and run inference from the terminal. Slower setup but gives you direct access to all llama.cpp flags including context length, temperature, and batch size.

# Install Termux from F-Droid (not Play Store)
# Then inside Termux:
pkg update && pkg install clang cmake git

# Clone and build llama.cpp with Vulkan
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j4

# Download Gemma 4 E2B Q4_K_M GGUF
# (copy to phone storage or wget from Hugging Face)

# Run inference
./build/bin/llama-cli \
  -m /sdcard/gemma4-e2b-q4_k_m.gguf \
  -p "Explain machine learning simply" \
  -n 200 --gpu-layers 99

Android Hardware Requirements

Chipset	GPU API	E2B Q4 Speed	Devices
Snapdragon 8 Elite	Vulkan — Full	18–25 tok/s	Galaxy S25, Xiaomi 15
Snapdragon 8 Gen 3	Vulkan — Full	14–20 tok/s	Galaxy S24, OnePlus 12
Snapdragon 8 Gen 2	Vulkan — Full	10–15 tok/s	Galaxy S23, Pixel 8 Pro
Dimensity 9300 / 9400	Vulkan — Full	12–18 tok/s	Xiaomi 14, vivo X100
Snapdragon 7s Gen 2	CPU fallback	3–5 tok/s	Mid-range devices
Older chipsets	CPU only	1–3 tok/s	Not recommended

Mid-range chipsets (Snapdragon 7s Gen 2 and below) can run E2B Q4 via CPU but expect 2–4 tokens/sec. Flagship chips with Vulkan/OpenCL GPU compute deliver 10–20+ tokens/sec.

Performance Benchmarks

Task	Pixel 8 Pro (SD 8 Gen 2)	Galaxy S24 (SD 8 Gen 3)
Text generation (tok/s)	13	18
First token latency	~0.5s	~0.3s
512→512 token throughput	10 tok/s	15 tok/s
RAM usage (peak)	2.9 GB	2.7 GB
Battery drain (per hour)	~20%	~16%

Tokens/sec measured with Gemma 4 E2B Q4_K_M, 512-token prompt. GPU path uses Vulkan compute via MLC-LLM or AI Edge SDK.

Tips for Best Performance

Enable GPU acceleration — Vulkan delegate is 4–6× faster than CPU
Use Q4_K_M quantization for best quality/speed on mobile
Set Android performance mode (Settings → Battery → Performance) before long sessions
Keep prompts under 2K tokens — longer contexts slow first-token latency significantly