Gemma 4 for Android

Run Google's Gemma 4 directly on your Android phone: fully offline, with no cloud costs and complete data privacy. The E2B edge model generates roughly 10–25 tokens per second on flagship Android phones with a Snapdragon 8 Gen 2 or newer.

Why Run Gemma 4 on Android?

  • Full privacy — prompts and responses never leave your device
  • Works offline — no Wi-Fi or mobile data needed
  • Free after download — no API bills, no subscriptions
  • Fast responses on flagship chips: roughly 12–20 tok/s on Snapdragon 8 Gen 3 / Dimensity 9300

Which Model to Use

| Model | Storage | Active RAM | Speed (SD 8 Gen 3) | Recommendation |
|---|---|---|---|---|
| Gemma 4 E2B Q4_K_M | ~2.4 GB | ~2.5 GB | 14–20 tok/s | Best for Android |
| Gemma 4 E2B Q8 | ~4.4 GB | ~4.6 GB | 9–13 tok/s | Higher quality |
| Gemma 4 E4B Q4_K_M | ~4.2 GB | ~4.5 GB | 8–12 tok/s | Larger model |

E2B Q4_K_M is the recommended starting point for Android. It fits in ~2.4 GB of storage, uses ~2.5 GB RAM during inference, and runs at practical speeds on any flagship chipset from 2023 onward.
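
As a rough illustration of how the table might be applied, the sketch below lists the variants whose active-RAM figure fits within a memory budget. The `ModelVariant` class, the `variantsThatFit` function, and the 1.0 GB headroom default are hypothetical names and values chosen for this example; the storage and RAM numbers are the approximate figures from the table.

```kotlin
// Approximate figures copied from the model table; names are illustrative.
data class ModelVariant(val name: String, val storageGb: Double, val activeRamGb: Double)

val variants = listOf(
    ModelVariant("Gemma 4 E2B Q4_K_M", 2.4, 2.5),
    ModelVariant("Gemma 4 E2B Q8", 4.4, 4.6),
    ModelVariant("Gemma 4 E4B Q4_K_M", 4.2, 4.5),
)

// Returns every variant whose active RAM fits in (freeRamGb - headroomGb).
// The headroom is an assumed safety margin for the app and the OS,
// not a measured value.
fun variantsThatFit(freeRamGb: Double, headroomGb: Double = 1.0): List<ModelVariant> =
    variants.filter { it.activeRamGb <= freeRamGb - headroomGb }
```

With 4.0 GB free, `variantsThatFit(4.0)` returns only E2B Q4_K_M; with 6.0 GB free it returns all three variants, and with 3.0 GB free it returns an empty list.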

Method 1 — Google AI Edge / MediaPipe (Kotlin)

The official Google path. The AI Edge SDK integrates directly with Android Studio and uses a GPU delegate for hardware-accelerated inference. It is the best fit if you are building your own app.

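build.gradle

// build.gradle (app)
dependencies {
    // MediaPipe LLM Inference API (part of Google AI Edge); prefer the latest release
    implementation("com.google.mediapipe:tasks-genai:0.10.24")
}

Kotlin Inference

import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions

// Point the engine at the E2B model file on device storage.
// The path is an example; setModelPath expects a file path, not an asset name.
val options = LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/llm/gemma4-e2b-it-q4.task")
    .setPreferredBackend(LlmInference.Backend.GPU)
    .setMaxTokens(1024)
    .build()
val llm = LlmInference.createFromOptions(context, options)

// Generate a response (blocking call; run it off the main thread)
val result = llm.generateResponse("What is Gemma 4?")
println(result)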

Method 2 — MLC-LLM Android App

MLC-LLM provides a pre-built Android APK that runs models with GPU acceleration. It loads weights converted to MLC's own format rather than GGUF files. Download the APK from the MLC-LLM GitHub releases, install it, then load the Gemma 4 E2B Q4 model from within the app. No coding is required, and it works on any Android 10+ device with a modern GPU.

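Step 1: Download the MLC-LLM APK from the GitHub Releases page

Step 2: Install the APK (allow "Install unknown apps" for your browser or file manager when prompted; older Android versions call this setting "Unknown sources")

Step 3: Open the app → Add Model → select Gemma 4 E2B Q4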

Method 3 — Termux + llama.cpp (Power Users)

For developers who want full control. Install Termux from F-Droid, compile llama.cpp with OpenCL or Vulkan support, download the GGUF model, and run inference from the terminal. Slower setup but gives you direct access to all llama.cpp flags including context length, temperature, and batch size.

# Install Termux from F-Droid (not Play Store)
# Then inside Termux:
pkg update && pkg install clang cmake git
termux-setup-storage   # grant Termux access to shared storage (/sdcard)

# Clone and build llama.cpp with Vulkan
# (the Vulkan backend also needs the Vulkan headers and the glslc shader compiler installed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4

# Download Gemma 4 E2B Q4_K_M GGUF
# (copy to phone storage or wget from Hugging Face)

# Run inference (-n: max tokens to generate, --gpu-layers: layers offloaded to the GPU)
./build/bin/llama-cli \
  -m /sdcard/gemma4-e2b-q4_k_m.gguf \
  -p "Explain machine learning simply" \
  -n 200 --gpu-layers 99

Android Hardware Requirements

| Chipset | GPU API | E2B Q4 Speed | Devices |
|---|---|---|---|
| Snapdragon 8 Elite | Vulkan (full) | 18–25 tok/s | Galaxy S25, Xiaomi 15 |
| Snapdragon 8 Gen 3 | Vulkan (full) | 14–20 tok/s | Galaxy S24, OnePlus 12 |
| Snapdragon 8 Gen 2 | Vulkan (full) | 10–15 tok/s | Galaxy S23, OnePlus 11 |
| Dimensity 9300 / 9400 | Vulkan (full) | 12–18 tok/s | vivo X100, vivo X200 |
| Snapdragon 7s Gen 2 | CPU fallback | 3–5 tok/s | Mid-range devices |
| Older chipsets | CPU only | 1–3 tok/s | Not recommended |

Mid-range chipsets (Snapdragon 7s Gen 2 and below) can run E2B Q4 on the CPU, but expect only about 1–5 tokens/sec. Flagship chips with Vulkan/OpenCL GPU compute deliver 10–25 tokens/sec.
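
To make these speeds concrete, the following minimal sketch converts a decode speed into the wall-clock time for a reply. The function name `generationSeconds` is hypothetical, and the calculation ignores prompt processing and first-token latency, so treat the results as lower bounds.

```kotlin
// Seconds needed to generate `tokens` output tokens at a given decode speed.
// Ignores prompt processing time, so real replies take somewhat longer.
fun generationSeconds(tokens: Int, tokensPerSecond: Double): Double {
    require(tokensPerSecond > 0.0) { "tokensPerSecond must be positive" }
    return tokens / tokensPerSecond
}
```

For a 200-token reply, `generationSeconds(200, 10.0)` gives 20 seconds at the low end of the flagship range, while `generationSeconds(200, 3.0)` gives about 67 seconds on a CPU-only mid-range device.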

Performance Benchmarks

| Task | Pixel 8 Pro (Tensor G3) | Galaxy S24 (SD 8 Gen 3) |
|---|---|---|
| Text generation (tok/s) | 13 | 18 |
| First token latency | ~0.5 s | ~0.3 s |
| 512→512 token throughput | 10 tok/s | 15 tok/s |
| RAM usage (peak) | 2.9 GB | 2.7 GB |
| Battery drain (per hour) | ~20% | ~16% |

Tokens/sec measured with Gemma 4 E2B Q4_K_M and a 512-token prompt. The GPU path runs through MLC-LLM or the AI Edge SDK.
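
One way to read the benchmark figures is to estimate the time and battery cost of a single long reply. The sketch below is an approximation under stated assumptions: it treats the 512→512 throughput as the sustained decode rate, adds the first-token latency on top, and assumes battery drain is linear in time. The `DeviceBenchmark` class and both functions are hypothetical names introduced for this example.

```kotlin
// Benchmark figures taken from the table; field names are illustrative.
data class DeviceBenchmark(
    val name: String,
    val firstTokenSeconds: Double,
    val throughputTokPerSec: Double,   // "512→512 token throughput" row
    val batteryPercentPerHour: Double,
)

val pixel8Pro = DeviceBenchmark("Pixel 8 Pro", 0.5, 10.0, 20.0)
val galaxyS24 = DeviceBenchmark("Galaxy S24", 0.3, 15.0, 16.0)

// Estimated wall time: first-token latency plus decode time.
fun responseSeconds(d: DeviceBenchmark, outputTokens: Int): Double =
    d.firstTokenSeconds + outputTokens / d.throughputTokPerSec

// Estimated battery used for that reply, assuming drain is linear in time.
fun batteryPercent(d: DeviceBenchmark, outputTokens: Int): Double =
    responseSeconds(d, outputTokens) / 3600.0 * d.batteryPercentPerHour
```

Under these assumptions, a 512-token reply takes about 51.7 seconds and roughly 0.29% of the battery on the Pixel 8 Pro, versus about 34.4 seconds and roughly 0.15% on the Galaxy S24.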

Tips for Best Performance

  • Enable GPU acceleration — Vulkan delegate is 4–6× faster than CPU
  • Use Q4_K_M quantization for best quality/speed on mobile
  • Set Android performance mode before long sessions, where your phone offers one (typically Settings → Battery → Performance; the exact path varies by manufacturer)
  • Keep prompts under 2K tokens; longer contexts increase first-token latency significantly
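
The prompt-length tip can be enforced with a simple guard before calling the model. The sketch below is only a heuristic: it assumes roughly 4 characters per token, a common rule of thumb for English text rather than the output of Gemma's actual tokenizer, and the function names and the 2000-token default are hypothetical.

```kotlin
import kotlin.math.ceil

// Rough token estimate using an assumed characters-per-token ratio.
// This is NOT the real tokenizer; treat the result as approximate.
fun estimateTokens(text: String, charsPerToken: Double = 4.0): Int =
    ceil(text.length / charsPerToken).toInt()

// If the estimate exceeds the budget, keep only the tail of the prompt
// (the most recent text); otherwise return the prompt unchanged.
fun trimToBudget(text: String, maxTokens: Int = 2000, charsPerToken: Double = 4.0): String {
    val maxChars = (maxTokens * charsPerToken).toInt()
    return if (text.length <= maxChars) text else text.takeLast(maxChars)
}
```

Keeping the tail suits chat-style prompts where the latest turns matter most; for a document followed by a question, trimming the middle or summarizing may work better.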
