AI in Your Pocket: Running an LLM from a USB Stick

Published February 14, 2025
Large language models (LLMs) like GPT, BERT, Llama, and Mistral are becoming widely used in applications ranging from chatbots to coding assistants. However, running these models typically requires powerful hardware, such as high-end GPUs or dedicated AI accelerators, which can be costly and power-hungry. In a recent video, Binh Pham from the YouTube channel Build With Binh demonstrates how to make LLMs more accessible: he built a USB stick that runs a local LLM entirely offline, using nothing more than a modified Raspberry Pi Zero W. The device presents itself to the host computer as an ordinary flash drive, and users interact with the LLM simply by creating a file with a specific name; the model treats the filename as a prompt, generates text, and writes it into the file. Instead of a traditional chat interface or API, this interaction method leverages Linux USB gadget mode and the mass-storage profile to detect and respond to file changes dynamically.
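The video doesn't publish the exact code, but the file-driven loop can be illustrated with a minimal sketch. This is a hypothetical simplification: it polls a directory for new files instead of hooking the gadget-mode mass-storage backing file, and the `respond` function is a stand-in for the actual llama.cpp call.

```python
import os
import time

def scan_for_prompts(root, seen):
    """Return paths of files that appeared since the last scan."""
    new = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new.append(path)
    return new

def respond(path):
    """Stand-in for the LLM: treat the filename (minus extension)
    as the prompt and write generated text into the file."""
    prompt = os.path.splitext(os.path.basename(path))[0]
    generated = f"(model output for prompt: {prompt!r})"  # llama.cpp would go here
    with open(path, "w") as f:
        f.write(generated)

def watch(root, interval=0.5, max_iterations=None):
    """Poll the shared directory and answer each new file once.
    A real gadget-mode setup would watch the backing image instead."""
    seen = set()
    i = 0
    while max_iterations is None or i < max_iterations:
        for path in scan_for_prompts(root, seen):
            respond(path)
        time.sleep(interval)
        i += 1
```

From the host's point of view, the workflow is just "create `write a haiku.txt` on the drive, wait, reopen it" — no app, driver, or network required.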

To achieve this, Binh modified the llama.cpp framework to compile for the Pi Zero's ARM1176JZF-S CPU, which lacks ARM's NEON instruction support. Standard builds of llama.cpp assume newer ARM architectures, making them incompatible with older devices. By adjusting the CMake build files, he got llama.cpp compiling and running models on the Pi Zero despite its 512 MB of RAM and single-core processor. Benchmarks on this tiny system reveal the trade-offs of such low-power hardware: Tiny-15M generates a token every 223 ms, while the larger Lamini-T5-Flan-77M takes 2.5 seconds per token and SmolLM2-136M requires 2.2 seconds per token. These results highlight both the feasibility and the limitations of running LLMs on ultra-low-power devices: while performance is slow compared to modern AI accelerators, the project points toward localized, privacy-focused AI solutions that do not rely on cloud connectivity.
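To put those latencies in perspective, a quick back-of-the-envelope conversion turns the reported per-token times into throughput and into the wall-clock cost of a modest 50-token reply (the 50-token length is an illustrative assumption, not from the video):

```python
# Per-token latencies reported in the article, in seconds per token
latencies = {
    "Tiny-15M": 0.223,
    "SmolLM2-136M": 2.2,
    "Lamini-T5-Flan-77M": 2.5,
}

def throughput(sec_per_token):
    """Tokens per second, given seconds per token."""
    return 1.0 / sec_per_token

def reply_time(sec_per_token, tokens=50):
    """Wall-clock seconds to generate a reply of `tokens` tokens."""
    return sec_per_token * tokens

for name, s in latencies.items():
    print(f"{name}: {throughput(s):.1f} tok/s, "
          f"{reply_time(s):.0f} s for a 50-token reply")
```

The contrast is stark: the 15M-parameter model manages roughly 4.5 tokens per second, while the ~100M-parameter models need about two minutes for the same 50-token reply.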