Running Language Models on Your Phone: The llamafu Experiment
What happens when you run a full LLM on mobile hardware with zero cloud dependency — memory, latency, and model quality on consumer devices.
The Premise
We wanted to answer a straightforward question: what happens when you run a real language model entirely on a phone, with no cloud fallback? Not a toy model. Not a keyword matcher dressed up as AI. A genuine large language model, running inference locally, processing natural language, and generating coherent responses — all within the memory, thermal, and battery constraints of a device that fits in your pocket.
llamafu is our answer. It is a Flutter application that wraps llama.cpp to run quantised LLMs on iOS and Android devices. This article reports what we found: what works, what does not, and where the boundaries lie.
The Setup
llamafu uses llama.cpp as its inference engine. llama.cpp is a C/C++ implementation of LLM inference that is optimised for consumer hardware — it supports quantisation, runs on CPUs without requiring a GPU, and has been aggressively optimised for memory efficiency and throughput on ARM processors.
Flutter provides the application layer. It handles the user interface, file management (model loading, conversation history), and the bridge between Dart and the native C++ inference engine via FFI. Flutter was chosen for cross-platform support: a single codebase targets both iOS and Android, which matters when you are testing on a range of devices with different hardware profiles.
The models we tested ranged from 1.5 billion to 13 billion parameters, at various quantisation levels. The devices ranged from a 2021 mid-range Android phone (6 GB RAM, Snapdragon 695) to a 2025 flagship (16 GB RAM, Snapdragon 8 Elite) and recent iPhones (6-8 GB RAM, recent A-series chips).
Memory: The Hard Constraint
Memory is the binding constraint on mobile. It determines which models you can load and how much context you can maintain.
A 7-billion-parameter model at full 16-bit precision requires approximately 14 GB of memory just for the weights. That exceeds the total RAM of most phones, and the operating system and other apps need a share of whatever is available. Quantisation is not optional — it is mandatory.
At 4-bit quantisation (Q4_K_M), a 7B model requires roughly 4.1 GB for weights. On a device with 8 GB of total RAM and perhaps 4-5 GB available to the app, this is tight but feasible. The KV cache for the attention mechanism adds memory proportional to context length. At 2048 tokens of context, the KV cache for a 7B model adds roughly 500 MB. At 4096 tokens, it doubles. This means that on an 8 GB device running a Q4 7B model, you have approximately 2048-3072 tokens of usable context before memory pressure causes the OS to kill the process.
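The arithmetic above is easy to reproduce. A minimal sketch, using the standard formulas (weight bytes = parameters × bits-per-weight / 8; KV cache = 2 × layers × KV width × context length × bytes per element) with illustrative architecture constants that are assumptions here, not taken from any specific model:

```python
# Back-of-envelope memory estimates for on-device inference.
# The formulas are standard; the constants (effective bits per weight
# for a given quant, layer count, KV width) vary by model.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Weight memory in bytes for a quantised model."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, kv_dim: int, ctx_tokens: int,
                   bytes_per_elem: int = 2) -> float:
    """KV cache bytes: keys + values, per layer, per token (fp16 default)."""
    return 2 * n_layers * kv_dim * ctx_tokens * bytes_per_elem

GB = 1e9
# A 7B model at ~4.85 effective bits/weight (typical of Q4_K_M mixes):
w = weight_bytes(7e9, 4.85)
# A 32-layer model with a 4096-wide fp16 KV cache at 2048 tokens.
# Grouped-query attention shrinks kv_dim, which is why measured
# figures often land well below this full-attention number:
kv = kv_cache_bytes(32, 4096, 2048)
print(f"weights: {w / GB:.1f} GB, KV cache: {kv / GB:.2f} GB")
```

The KV term scales linearly with context length, which is the "at 4096 tokens, it doubles" behaviour described above.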
On devices with 6 GB or less, the 7B class is effectively off-limits. 3B-class models ran reliably on these devices; quantised 7B models gave mixed results: they would load, but the OS frequently killed them during extended conversations.
The 13B class requires 16 GB devices and even then leaves little headroom. We treat this as the current upper bound for practical mobile deployment.
Quantisation Trade-Offs
Quantisation reduces memory at the cost of model quality. The question is how much quality you lose.
We tested Q4_K_M (4-bit), Q5_K_M (5-bit), and Q8_0 (8-bit) quantisations of a 7B model across a benchmark of 200 tasks spanning summarisation, question answering, simple reasoning, code generation, and conversational response.
The results, briefly:
Q8 vs. full precision: Nearly indistinguishable on our benchmark. Quality degradation was within the noise of human evaluation. The problem with Q8 is purely memory — at roughly 7.5 GB for a 7B model, it only fits on high-end devices, and the context window is severely limited.
Q5 vs. Q8: A small but measurable degradation, primarily on tasks requiring precise numerical reasoning and multi-step logic. Summarisation, question answering, and conversational tasks were essentially unaffected. Q5 at roughly 5.3 GB is a reasonable sweet spot for 8 GB devices.
Q4 vs. Q5: A more noticeable degradation. Factual accuracy dropped by approximately 4% on our QA benchmark. Reasoning tasks showed a larger gap — roughly 8% degradation on multi-step problems. Conversational quality remained acceptable for most purposes. Q4 at roughly 4.1 GB is the practical choice for most current devices, accepting the quality trade-off.
The honest summary: Q4 quantisation of a 7B model produces results that are useful for many tasks but noticeably worse than the full-precision model on anything requiring careful reasoning. You get a capable assistant for simple queries, summarisation, and drafting. You do not get a reliable reasoner.
Latency Measurements
Inference speed on mobile varies dramatically by device and model size. We report tokens per second for generation (the metric that determines how fast text appears).
7B Q4 on a 2025 flagship Android (Snapdragon 8 Elite): 18-22 tokens/second. This feels responsive — text appears at a comfortable reading speed. Initial prompt processing (prefill) for a 500-token prompt takes approximately 1.5 seconds.
7B Q4 on a 2023 mid-range Android (Snapdragon 778G): 6-9 tokens/second. Usable but noticeably slow. Long responses feel sluggish. Prefill for 500 tokens takes approximately 4 seconds.
7B Q4 on iPhone 15 Pro (A17 Pro): 15-18 tokens/second. GPU acceleration through llama.cpp’s Metal backend keeps performance competitive despite lower raw clock speeds.
3B Q4 on a 2021 mid-range Android (Snapdragon 695): 8-12 tokens/second. Acceptable for short interactions. The smaller model compensates for the slower hardware.
For context, most people read at 3-5 words per second, and a token is roughly 0.75 words. So 10 tokens/second corresponds to roughly 7.5 words/second — faster than reading speed. Anything above about 8 tokens/second feels subjectively responsive for conversational use.
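The reading-speed comparison above is simple arithmetic; spelled out, with the rough 0.75 words-per-token conversion it relies on:

```python
# Tokens/s to effective words/s, using the rough English average of
# 0.75 words per token cited above. The 5 words/s threshold stands in
# for a fast reader; both constants are approximations.

WORDS_PER_TOKEN = 0.75

def words_per_second(tokens_per_second: float) -> float:
    return tokens_per_second * WORDS_PER_TOKEN

def outpaces_reader(tokens_per_second: float,
                    reading_speed_wps: float = 5.0) -> bool:
    """True when generation keeps ahead of a fast reader."""
    return words_per_second(tokens_per_second) >= reading_speed_wps

print(words_per_second(10))   # 7.5 words/s
print(outpaces_reader(10))    # True
print(outpaces_reader(6))     # False: 4.5 words/s lags a fast reader
```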
What Works
Simple information tasks. Asking questions, getting definitions, summarising short texts. The model handles these well even at Q4 quantisation on modest hardware.
Privacy-sensitive applications. Medical questions, personal journal analysis, financial document summarisation — any task where the user would prefer that their data never leave the device. This is the strongest argument for on-device inference: not that it is better than cloud inference, but that it is private by construction.
Offline operation. On a plane, in a rural area, in a country with unreliable connectivity. llamafu works identically whether the network is available or not. There is no degradation, no fallback, no “please check your internet connection.”
Drafting and brainstorming. Generating first drafts of emails, messages, and short documents. The quality is adequate for getting started, even if the output requires editing.
What Does Not Work
Complex reasoning. Multi-step mathematical reasoning, complex logical deductions, and nuanced analytical tasks degrade significantly at Q4 quantisation. The model produces plausible-sounding but incorrect answers more frequently than its full-precision counterpart.
Long context. The memory-limited context window (2048-3072 tokens on typical devices) is a serious constraint. Analysing a long document, maintaining a multi-turn conversation over many exchanges, or working with code files that exceed a few hundred lines — these all hit the context ceiling. The model does not fail gracefully when context is truncated; it simply loses track of earlier information.
Speed on older devices. On hardware more than two or three years old, inference is slow enough to be frustrating. Users accustomed to the instant responses of cloud-based assistants will not accept 3-4 second delays before text begins appearing.
Tasks requiring current knowledge. The model’s knowledge is frozen at its training cutoff. It cannot access the internet, check current information, or learn from the user’s corrections. This is inherent to on-device inference without retrieval augmentation.
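The context ceiling in particular forces truncation somewhere. A common mitigation, sketched here as an assumption rather than a description of llamafu's internals, is to pin the system prompt and drop the oldest turns until the remainder fits the token budget:

```python
# Sliding-window history trimming (an illustrative technique, not
# necessarily what llamafu does): keep the system prompt, then keep
# the newest turns that fit within the token budget.
from typing import Callable

def trim_history(system_prompt: str, turns: list[str],
                 budget_tokens: int,
                 count_tokens: Callable[[str], int]) -> list[str]:
    """Return [system_prompt] + the newest turns fitting budget_tokens."""
    used = count_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(turns):          # walk newest-first
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        used += cost
        kept.append(turn)
    return [system_prompt] + kept[::-1]   # restore chronological order

# A crude whitespace counter stands in for the model's real tokenizer:
toks = lambda s: len(s.split())
print(trim_history("sys", ["a b", "c d e", "f g"], 6, toks))
# ['sys', 'c d e', 'f g'] — the oldest turn is dropped
```

Explicit trimming at least makes the loss of earlier turns predictable, instead of the silent truncation described above.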
Thermal and Battery Impact
Running continuous inference on a phone generates significant heat. In sustained use (multiple minutes of continuous generation), device surface temperatures rose by 8-12 degrees Celsius on the devices we tested. Several devices triggered thermal throttling after 5-10 minutes of continuous inference, reducing generation speed by 30-50%.
Battery impact is proportional to compute. On a 2025 flagship, running llamafu for 30 minutes of active inference consumed approximately 12-15% of battery capacity. This is comparable to 30 minutes of intensive gaming. Casual use — a few queries per hour — has negligible battery impact.
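The drain figures translate directly into a usage budget. A trivial estimator, using the midpoint of the measured range above (13.5% per 30 minutes is our flagship figure; any given device will differ):

```python
# Rough battery budgeting from a measured drain rate. The 13.5%/30min
# example rate is the midpoint of the range reported above.

def minutes_until(pct_now: float, pct_floor: float,
                  drain_pct_per_30min: float) -> float:
    """Minutes of continuous inference before hitting pct_floor."""
    return (pct_now - pct_floor) / drain_pct_per_30min * 30

print(minutes_until(80, 20, 13.5))  # ~133 minutes from 80% down to 20%
```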
Honest Conclusions
llamafu demonstrates that on-device LLM inference is real and useful today, not in some speculative future. It also demonstrates that it is meaningfully limited compared to cloud inference. The limitations are not software bugs to be fixed — they are hardware constraints that will improve gradually as mobile processors gain more memory and compute, but will not vanish.
The right framing is not “on-device instead of cloud” but “on-device when it matters.” Privacy-sensitive tasks. Offline scenarios. Simple queries where latency and availability matter more than peak quality. For these use cases, llamafu provides genuine value today. For complex reasoning, long-context analysis, and tasks requiring large models, cloud inference remains substantially superior.
We will continue to update llamafu as new models and optimisation techniques emerge. The trajectory is encouraging — each generation of mobile hardware makes a larger class of models practical, and each advance in quantisation and architecture makes models more efficient. The gap between on-device and cloud will narrow. It will not close soon, but it will narrow, and the useful operating envelope for on-device inference will expand with it.