llamafu vs llama.rn vs flutter_llama_cpp: On-Device LLM for Flutter in 2026

The question

Should I use llamafu, llama.rn, or flutter_llama_cpp to run an LLM in my Flutter app?

The Flutter-on-device-LLM ecosystem has matured in 2025–2026. There are now three serious contenders, and the differences are real but not always obvious. This post is the comparison we wished we had when we started llamafu.

The 60-second version: llamafu is the research-instrumented runtime; llama.rn is the React Native bridge; flutter_llama_cpp is the community-maintained FFI binding. They share the same inference engine (llama.cpp) but differ in measurement, ergonomics, and Flutter-native integration.

What each project is

llamafu is a Flutter runtime built around llama.cpp with a measurement-first methodology. Every inference is instrumented: tokens/s per device, memory ceilings, KV-cache utilisation, and quantisation quality trade-offs are all first-class outputs. The goal is to publish the measurements so the community can compare deployments honestly.
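To make "tokens/s" concrete, here is a minimal sketch of how a throughput figure can be derived from a token stream. It is not llamafu's instrumentation: `ThroughputSample`, `measure`, and `fakeTokenStream` are hypothetical names, and the fake stream stands in for a real model. Note that this simple version folds the wait for the first token into the figure, whereas llama.cpp's own benchmark tool reports prompt processing and generation separately.

```dart
// Minimal sketch: time a token stream and derive tokens/s.
// All names here are hypothetical; none of them come from llamafu.

class ThroughputSample {
  ThroughputSample(this.tokens, this.elapsed);

  final int tokens;
  final Duration elapsed;

  /// Tokens per second; 0.0 when no measurable time has elapsed.
  double get tokensPerSecond => elapsed.inMicroseconds == 0
      ? 0.0
      : tokens * Duration.microsecondsPerSecond / elapsed.inMicroseconds;
}

/// Drains [tokens] and records how many arrived and how long it took.
Future<ThroughputSample> measure(Stream<String> tokens) async {
  final watch = Stopwatch()..start();
  var count = 0;
  await for (final _ in tokens) {
    count++;
  }
  watch.stop();
  return ThroughputSample(count, watch.elapsed);
}

/// Stand-in for a model: emits [n] tokens, one every [gap].
Stream<String> fakeTokenStream(int n, Duration gap) async* {
  for (var i = 0; i < n; i++) {
    await Future<void>.delayed(gap);
    yield 'tok$i';
  }
}

Future<void> main() async {
  final sample =
      await measure(fakeTokenStream(20, const Duration(milliseconds: 10)));
  print('${sample.tokens} tokens in ${sample.elapsed.inMilliseconds} ms');
  print('tokens/s: ${sample.tokensPerSecond.toStringAsFixed(1)}');
}
```

Numbers from a loop like this only become comparable across devices when the model, quantisation, prompt, and context size are held fixed between runs.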

llama.rn is the React Native equivalent: a JS-callable binding to llama.cpp with a smaller API surface, focused on getting models running quickly. It is the de facto choice for cross-platform mobile (iOS, Android) when the app is RN-based.

flutter_llama_cpp is a community-maintained FFI binding that exposes llama.cpp to Dart directly. It is the most Dart-native of the three and the most configurable at the engine level (KV cache, sampling, batching).

The dimensions

Dimension	llamafu	llama.rn	flutter_llama_cpp
Primary goal	Research-instrumented runtime	Production mobile inference	Dart-native binding
Engine	llama.cpp	llama.cpp	llama.cpp
Flutter integration	First-class via `LlamafuEngine` widget	Via platform channels	Direct FFI
Default model size	7B Q4_K_M	3-7B Q4	3-13B configurable
Memory measurements	First-class (publishes per-device numbers)	No	No
Quantisation benchmarks	Published (`Q4`/`Q5`/`Q8` quality delta)	No	No
Tool calling	Yes	Yes	Limited
Streaming	Yes	Yes	Yes
Multi-model (hot-swap)	Yes	Limited	Yes
Embedding	Yes	No	No
iOS support	Yes	Yes	Yes
Android support	Yes	Yes	Yes
macOS / Windows / Linux	Yes	No (mobile-first)	Yes
Active development	Yes	Yes	Yes
License	MIT	MIT	MIT
Production users	(early)	Many	Many

When to use which

Use llamafu when:

You are doing research on on-device LLM performance and need the measurements, not just the inference.
You need to publish a benchmark (paper, blog, internal report) and want the numbers to be defensible.
You want a single API that works on iOS, Android, macOS, Windows, and Linux without per-platform code paths.
You are building an agent on top of the runtime (tool calls, memory) and want a coherent API surface.

Use llama.rn when:

Your app is React Native, not Flutter.
You are shipping fast and the default model parameters are good enough.
llama.rn is what your team already knows.

Use flutter_llama_cpp when:

You are pure-Dart and want the lowest-level FFI access to llama.cpp.
You want to customise the engine (sampling, KV cache, batching) at the C-API level.
You are migrating an existing llama.cpp C++ project to Flutter.

What llamafu does that the others don’t

Three things, each of them a different kind of value:

Published measurements. llamafu’s GitHub releases include per-device tokens/s numbers for every supported model and quantisation level. The README contains a benchmark table; docs.llamafu.dev has the full data. This means you can defend your choice of model and device to a stakeholder with data, not vibes.
Cross-platform API surface. llamafu has a single Dart API that runs on iOS, Android, macOS, Windows, and Linux. The LlamafuEngine widget is Flutter-native; the underlying bindings are the same code path everywhere. The same model file works on every platform.
First-class agent support. llamafu is designed to be the inference substrate for ukkin (our on-device mobile AI agent). Tool calls, structured output, and reasoning traces are part of the API, not bolted on.
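To illustrate what a tool call involves at the application level, here is a minimal sketch of parsing a model's output as a structured call and dispatching it. It does not use llamafu's API: the JSON shape (`tool` plus `arguments`), the `tools` registry, and `dispatchToolCall` are all assumptions made for this example.

```dart
import 'dart:convert';

/// A tool is a named function from JSON arguments to a string result.
typedef Tool = String Function(Map<String, dynamic> args);

/// Hypothetical tool registry; a real agent would expose device
/// capabilities here.
final Map<String, Tool> tools = {
  'add': (args) => '${(args['a'] as num) + (args['b'] as num)}',
};

/// Parses [modelOutput] as a tool call and runs it.
/// Returns null when the output is not a valid call to a known tool.
String? dispatchToolCall(String modelOutput) {
  Object? decoded;
  try {
    decoded = jsonDecode(modelOutput);
  } on FormatException {
    return null; // Not JSON: treat it as an ordinary text reply.
  }
  if (decoded is! Map<String, dynamic>) return null;

  final tool = tools[decoded['tool']];
  final args = decoded['arguments'];
  if (tool == null || args is! Map<String, dynamic>) return null;
  return tool(args);
}

void main() {
  // Assumed call format: {"tool": <name>, "arguments": {...}}
  print(dispatchToolCall('{"tool": "add", "arguments": {"a": 2, "b": 3}}'));
  print(dispatchToolCall('Just a normal sentence.'));
}
```

The first call prints `5` and the second prints `null`. Returning null for anything that is not a well-formed call lets the caller fall back to showing the text as a normal reply.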

What llamafu does NOT do

It is not the fastest option for raw tokens/s. If you only care about throughput and need neither the measurements nor the agent support, llama.rn delivers comparable speed.
It is not the most flexible at the C-API level. If you need to call llama.cpp internals directly, flutter_llama_cpp gives you more control.
It does not have a managed model hub. You bring your own GGUF files. This is intentional (we believe in local models, not managed ones) but worth knowing.
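Because you supply the GGUF file yourself, a quick header check can catch a bad download (for example, an HTML error page saved under a `.gguf` name) before the engine tries to load it. The sketch below relies only on the GGUF header layout: a 4-byte ASCII magic `GGUF` followed by a 32-bit version number, assumed little-endian here. `ggufVersion` is a hypothetical helper, not part of llamafu.

```dart
import 'dart:io';
import 'dart:typed_data';

/// Returns the GGUF version if [bytes] starts with a GGUF header,
/// otherwise null.
int? ggufVersion(Uint8List bytes) {
  if (bytes.length < 8) return null;
  const magic = [0x47, 0x47, 0x55, 0x46]; // ASCII "GGUF"
  for (var i = 0; i < magic.length; i++) {
    if (bytes[i] != magic[i]) return null;
  }
  // The version is a uint32 stored right after the magic
  // (little-endian assumed).
  return ByteData.sublistView(bytes).getUint32(4, Endian.little);
}

Future<void> main(List<String> args) async {
  if (args.isEmpty) {
    stderr.writeln('usage: dart run check_gguf.dart <path-to-model.gguf>');
    return;
  }
  final file = await File(args.first).open();
  try {
    // Only the first 8 bytes are needed, so the model is never fully read.
    final header = await file.read(8);
    final version = ggufVersion(header);
    print(version == null ? 'Not a GGUF file' : 'GGUF version $version');
  } finally {
    await file.close();
  }
}
```

This only validates the header; it says nothing about whether the tensors inside are complete or supported by a given llama.cpp build.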

A realistic 30-minute eval

# 1. Add llamafu to your Flutter app
flutter pub add llamafu

# 2. Drop a GGUF model in your assets (and list assets/models/ under flutter: assets: in pubspec.yaml)
mkdir -p assets/models
curl -L -o assets/models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf

// 3. Wire it up (Dart)
import 'package:llamafu/llamafu.dart';

Future<void> main() async {
  final engine = LlamafuEngine.fromAsset(
    'assets/models/qwen2.5-1.5b-instruct-q4_k_m.gguf',
    contextSize: 2048,
  );
  // generate() is awaited, so the call has to live inside an async function.
  final response = await engine.generate('Hello in three words.');
  print(response.text);
  print('Tokens/s: ${response.metrics.tokensPerSecond}');
  print('Memory peak: ${response.metrics.peakMemoryMb} MB');
}

If you don’t have a GGUF file yet, llamafu will also load from a URL with LlamafuEngine.fromUrl(...).
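When reading the reported peak memory, it helps to know roughly how much of it the KV cache accounts for at a given `contextSize`. The sketch below uses the standard estimate for a transformer with grouped-query attention: one key and one value vector per layer, per position, per KV head. The architecture numbers for Qwen2.5-1.5B (28 layers, 2 KV heads, head dimension 128) and the 16-bit cache entries are assumptions to check against the model's own config; `kvCacheBytes` is a hypothetical helper, not a llamafu function.

```dart
/// Estimated KV-cache size in bytes for a transformer with
/// grouped-query attention.
int kvCacheBytes({
  required int layers,
  required int contextSize,
  required int kvHeads,
  required int headDim,
  int bytesPerElement = 2, // Assumes 16-bit cache entries.
}) {
  // Factor 2: one key vector and one value vector per layer,
  // per position, per KV head.
  return 2 * layers * contextSize * kvHeads * headDim * bytesPerElement;
}

void main() {
  // Assumed Qwen2.5-1.5B shape; verify against the model's config.
  final bytes = kvCacheBytes(
    layers: 28,
    contextSize: 2048,
    kvHeads: 2,
    headDim: 128,
  );
  print('${(bytes / (1024 * 1024)).toStringAsFixed(0)} MiB'); // 56 MiB
}
```

Under these assumptions the cache is about 56 MiB at a 2048-token context and grows linearly with `contextSize`, so doubling the context doubles this part of the footprint. Model weights and compute buffers come on top of it.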