llamafu vs llama.rn vs flutter_llama_cpp: On-Device LLM for Flutter in 2026
A practical comparison of llamafu, llama.rn, and flutter_llama_cpp for running LLMs on Flutter — what each does well, what each doesn't, and which to pick.
The question
Should I use llamafu, llama.rn, or flutter_llama_cpp to run an LLM in my Flutter app?
The Flutter-on-device-LLM ecosystem has matured in 2025–2026. There are now three serious contenders, and the differences are real but not always obvious. This post is the comparison we wished we had when we started llamafu.
The 60-second version: llamafu is the research-instrumented runtime; llama.rn is the React-Native bridge; flutter_llama_cpp is the community-maintained FFI binding. They share the same inference engine (llama.cpp) but differ in measurement, ergonomics, and Flutter-native integration.
What each project is
llamafu is a Flutter runtime built around llama.cpp with a measurement-first methodology. Every inference is instrumented: tokens/s per device, memory ceilings, KV-cache utilisation, and quantisation quality trade-offs are all first-class outputs. The goal is to publish the measurements so the community can compare deployments honestly.
llama.rn is the React Native equivalent — a JS-callable binding to llama.cpp, with a smaller API surface focused on getting models running quickly. It is the de-facto choice for cross-platform mobile (iOS, Android) when the app is RN-based.
flutter_llama_cpp is a community-maintained FFI binding that exposes llama.cpp to Dart directly. It is the most Dart-native of the three and the most configurable at the engine level (KV cache, sampling, batching).
The comparison at a glance
| Dimension | llamafu | llama.rn | flutter_llama_cpp |
|---|---|---|---|
| Primary goal | Research-instrumented runtime | Production mobile inference | Dart-native binding |
| Engine | llama.cpp | llama.cpp | llama.cpp |
| Flutter integration | First-class via LlamafuEngine widget | Via platform channels | Direct FFI |
| Default model size | 7B Q4_K_M | 3-7B Q4 | 3-13B configurable |
| Memory measurements | First-class (publishes per-device numbers) | No | No |
| Quantisation benchmarks | Published (Q4/Q5/Q8 quality delta) | No | No |
| Tool calling | Yes | Yes | Limited |
| Streaming | Yes | Yes | Yes |
| Multi-model (hot-swap) | Yes | Limited | Yes |
| Embedding | Yes | No | No |
| iOS support | Yes | Yes | Yes |
| Android support | Yes | Yes | Yes |
| macOS / Windows / Linux | Yes | No (mobile-first) | Yes |
| Active development | Yes | Yes | Yes |
| License | MIT | MIT | MIT |
| Production users | (early) | Many | Many |
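
The "Default model size" row is easier to weigh with a rough estimate of how much space the weights take. A minimal sketch in plain Dart, assuming effective bits-per-weight figures that are ballpark values for these llama.cpp quantisation types rather than llamafu measurements; `estimateWeightsGiB` is a hypothetical helper, not part of any of the three libraries.

```dart
/// Rough size of quantised model weights in GiB.
/// [params] is the parameter count, [bitsPerWeight] the effective
/// bits stored per weight for the chosen quantisation.
double estimateWeightsGiB(double params, double bitsPerWeight) {
  final bytes = params * bitsPerWeight / 8;
  return bytes / (1024 * 1024 * 1024);
}

void main() {
  // Assumed ballpark values; real GGUF files vary by model and llama.cpp version.
  const assumedBitsPerWeight = {'Q4_K_M': 4.8, 'Q5_K_M': 5.7, 'Q8_0': 8.5};

  for (final entry in assumedBitsPerWeight.entries) {
    final gib = estimateWeightsGiB(7e9, entry.value);
    print('7B ${entry.key}: ~${gib.toStringAsFixed(1)} GiB of weights');
  }
}
```

The estimate covers weights only. The KV cache and runtime buffers come on top, which is why measured per-device memory peaks matter more than file size.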
When to use which
Use llamafu when:
- You are doing research on on-device LLM performance and need the measurements, not just the inference.
- You need to publish a benchmark (paper, blog, internal report) and want the numbers to be defensible.
- You want a single API that works on iOS, Android, macOS, Windows, and Linux without per-platform code paths.
- You are building an agent on top of the runtime (tool calls, memory) and want a coherent API surface.
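For the benchmark-publishing case above, a single run is rarely enough to defend a number. A minimal sketch in plain Dart of summarising repeated tokens/s readings by median and range; `summarise` and `ThroughputSummary` are hypothetical names for illustration, not part of llamafu, and the sample values are placeholders rather than real measurements.

```dart
/// Summary of repeated throughput measurements (tokens per second).
class ThroughputSummary {
  final double median;
  final double min;
  final double max;
  const ThroughputSummary(this.median, this.min, this.max);
}

/// Returns the median, minimum, and maximum of [tokensPerSecond].
ThroughputSummary summarise(List<double> tokensPerSecond) {
  if (tokensPerSecond.isEmpty) {
    throw ArgumentError('need at least one sample');
  }
  // Sort a copy so the caller's list is left untouched.
  final sorted = [...tokensPerSecond]..sort();
  final mid = sorted.length ~/ 2;
  final median = sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  return ThroughputSummary(median, sorted.first, sorted.last);
}

void main() {
  // Placeholder readings; in practice, collect one value per run.
  final samples = [18.2, 21.5, 20.9, 21.1, 20.7];
  final s = summarise(samples);
  print('median ${s.median} tok/s (min ${s.min}, max ${s.max})');
}
```

The median is used instead of the mean because the first run on a phone often includes model load and cache warm-up, which would skew an average.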
Use llama.rn when:
- Your app is React Native, not Flutter.
- You are shipping fast and the default model parameters are good enough.
- Your team already knows llama.rn, which is community-maintained.
Use flutter_llama_cpp when:
- You are pure-Dart and want the lowest-level FFI access to llama.cpp.
- You want to customise the engine (sampling, KV cache, batching) at the C-API level.
- You are migrating an existing llama.cpp C++ project to Flutter.
What llamafu does that the others don’t
Three things, each of them a different kind of value:
- Published measurements. llamafu’s GitHub releases include per-device token/s numbers for every supported model and quantisation level. The README contains a benchmark table; docs.llamafu.dev has the full data. This means you can defend your choice of model + device to a stakeholder with data, not vibes.
- Cross-platform API surface. llamafu has a single Dart API that runs on iOS, Android, macOS, Windows, and Linux. The LlamafuEngine widget is Flutter-native; the underlying bindings are the same code path everywhere. The same model file works on every platform.
- First-class agent support. llamafu is designed to be the inference substrate for ukkin (our on-device mobile AI agent). Tool calls, structured output, and reasoning traces are part of the API, not bolted on.
What llamafu does NOT do
- It is not the fastest option for raw token/s. If you only care about throughput and don’t need the measurements or the agent support, llama.rn is comparable.
- It is not the most flexible at the C-API level. If you need to call llama.cpp internals directly, flutter_llama_cpp gives you more control.
- It does not have a managed model hub. You bring your own GGUF files. This is intentional (we believe in local models, not managed) but worth knowing.
A realistic 30-minute eval
```shell
# 1. Add llamafu to your Flutter app
flutter pub add llamafu

# 2. Drop a GGUF model in your assets
mkdir -p assets/models
curl -L -o assets/models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf
```

```dart
// 3. Wire it up
import 'package:llamafu/llamafu.dart';

// `await` needs an async context, so the calls live in an async function.
Future<void> runEval() async {
  final engine = LlamafuEngine.fromAsset(
    'assets/models/qwen2.5-1.5b-instruct-q4_k_m.gguf',
    contextSize: 2048,
  );

  final response = await engine.generate('Hello in three words.');
  print(response.text);
  print('Tokens/s: ${response.metrics.tokensPerSecond}');
  print('Memory peak: ${response.metrics.peakMemoryMb} MB');
}
```
If you don’t have a GGUF file yet, llamafu will also load from a URL with LlamafuEngine.fromUrl(...).
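The contextSize: 2048 setting in the snippet above is one of the main drivers of peak memory, because the KV cache grows linearly with context length. A rough sketch in plain Dart of the usual size estimate (one K and one V tensor per layer); `kvCacheMiB` is a hypothetical helper, and the layer count, KV-head count, and head dimension are assumed values for a small grouped-query-attention model of roughly the size used above, so check them against your model's own config before relying on the result.

```dart
/// Approximate KV-cache size in MiB for a transformer
/// that uses grouped-query attention.
double kvCacheMiB({
  required int layers,
  required int kvHeads,
  required int headDim,
  required int contextSize,
  int bytesPerElement = 2, // assumes a 16-bit cache
}) {
  // The leading 2 accounts for one K tensor plus one V tensor per layer.
  final bytes =
      2 * layers * kvHeads * headDim * contextSize * bytesPerElement;
  return bytes / (1024 * 1024);
}

void main() {
  // Assumed architecture values, for illustration only.
  final mib = kvCacheMiB(
    layers: 28,
    kvHeads: 2,
    headDim: 128,
    contextSize: 2048,
  );
  print('Estimated KV cache: ${mib.toStringAsFixed(0)} MiB');
}
```

With these assumed values the estimate is 56 MiB at a 2048-token context. Doubling contextSize doubles it, which is worth keeping in mind when comparing the measured memory peak across context sizes.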
What to read next
- Running Language Models on Your Phone: The llamafu Experiment — the full llamafu architecture post
- Autonomous Mobile Agents: ukkin’s Architecture for On-Device AI — the agent layer that runs on llamafu
- llamafu repository
- llama.rn repository
- flutter_llama_cpp repository