
Running LLMs on AMD Ryzen AI NPU via Linux

Developers can now run large language models directly on AMD Ryzen AI NPU hardware on Linux using the FastFlowLM runtime and Lemonade Server, bypassing the CPU and GPU entirely.

What It Is

AMD’s Ryzen AI 300- and 400-series processors include a dedicated Neural Processing Unit (NPU): specialized silicon designed for AI workloads. Until recently, this hardware sat mostly idle on Linux systems. The FastFlowLM runtime, combined with Lemonade Server, now enables developers to run large language models directly on the NPU rather than relying on the CPU or a discrete GPU.

The implementation uses AMD’s IRON compiler and upstream kernel drivers to access the NPU. Unlike GPU-based inference, the NPU operates as a separate processing unit optimized for neural network operations, with its own power envelope and thermal characteristics. This means language models can run continuously without competing for graphics resources or triggering laptop cooling fans.

The technical stack requires a recent Linux kernel: the amdxdna NPU driver landed upstream in 6.14, and some distributions have backported it to older 6.x kernels. The FastFlowLM runtime handles model loading and inference scheduling, while Lemonade Server provides the API layer through which applications interact with the models.

Why It Matters

This development opens NPU hardware to the Linux AI development community for the first time. Previously, AMD’s AI accelerators remained largely inaccessible outside Windows environments with vendor-specific tooling. Developers working on edge AI applications, local assistants, or privacy-focused tools now have another deployment target.

Power efficiency represents the primary advantage. NPUs consume significantly less energy than GPUs for inference tasks, making them viable for always-on AI features in laptops. A model running on the NPU might draw 5-10 watts versus 30-50 watts on a discrete GPU. For battery-powered devices, this translates to hours of additional runtime.
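To put those figures in perspective, a quick back-of-envelope calculation. The 60 Wh battery capacity is an illustrative assumption, and the wattages are midpoints of the ranges above:

```python
# Rough battery-life comparison for sustained inference.
# Assumes a 60 Wh laptop battery (illustrative) and the midpoints
# of the power ranges cited above; real-world draw varies by model.
battery_wh = 60
npu_watts = 8    # midpoint of 5-10 W
gpu_watts = 40   # midpoint of 30-50 W

npu_hours = battery_wh / npu_watts
gpu_hours = battery_wh / gpu_watts
print(f"NPU: {npu_hours:.1f} h, GPU: {gpu_hours:.1f} h")  # NPU: 7.5 h, GPU: 1.5 h
```

Even allowing for idle-system draw that this sketch ignores, the gap of several hours is what makes always-on NPU inference plausible on battery.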

The separation of AI workloads from graphics processing also matters for multitasking. Developers can run language models for code completion or documentation while the GPU handles rendering, gaming, or video encoding, with no resource conflicts. This architectural separation mirrors Apple’s approach with its Neural Engine.

Open-source tooling around NPU access could accelerate local-first AI applications. As models become more efficient and NPU capabilities expand, running capable language models entirely offline becomes practical for more use cases.

Getting Started

The complete setup guide lives at https://lemonade-server.ai/flm_npu_linux.html with detailed installation steps. The process involves installing the Lemonade SDK and FastFlowLM runtime.

First, verify kernel support for the AMD NPU driver:
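For example (amdxdna is the upstream AMD NPU driver module; consult the setup guide if your distribution packages it differently):

```shell
# Check the running kernel version
uname -r

# Check whether the amdxdna NPU driver is loaded (no output = not loaded)
lsmod | grep amdxdna || echo "amdxdna not loaded"

# Inspect the driver module, if present
modinfo amdxdna 2>/dev/null || echo "amdxdna module not found"
```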

Clone the necessary repositories:
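The clone step pulls both projects side by side:

```shell
# Server component (API layer)
git clone https://github.com/lemonade-sdk/lemonade.git

# NPU-optimized runtime
git clone https://github.com/FastFlowLM/FastFlowLM.git
```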

The Lemonade repository at https://github.com/lemonade-sdk/lemonade contains the server component, while FastFlowLM at https://github.com/FastFlowLM/FastFlowLM provides the NPU-optimized runtime. Installation typically involves building from source and configuring model paths.
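Once both components are installed, applications reach the models through Lemonade Server's HTTP API. The sketch below assumes an OpenAI-style chat-completions endpoint on localhost; the port, endpoint path, and model name are placeholders to adjust for your installation:

```python
import json
import urllib.request

# Placeholder: adjust for your Lemonade Server installation.
BASE_URL = "http://localhost:8000/api/v1"

def build_chat_request(prompt: str, model: str = "npu-llm") -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request."""
    payload = {
        "model": model,  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str, model: str = "npu-llm") -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

An application would then call chat("Draft a commit message for this diff") and route the reply wherever it is needed, with inference running on the NPU behind the server.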

Models need conversion to NPU-compatible formats through the IRON compiler toolchain. The documentation covers which model architectures currently work and any quantization requirements for fitting within NPU memory constraints.

Context

This approach competes with several existing local LLM solutions. llama.cpp remains the most popular cross-platform option, primarily targeting CPUs and GPUs through various backends. ONNX Runtime supports NPUs on Windows but lacks the same Linux integration. ROCm enables AMD GPU usage but requires more power and generates more heat.

NPU inference has limitations. Memory constraints restrict model sizes: current AMD NPUs typically support models up to a few billion parameters, and larger models still require GPU or CPU fallback. Performance also varies by model architecture, with some designs mapping better to NPU execution patterns than others.
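The memory ceiling is easy to estimate from parameter count and quantization level; this sketch counts weights only, ignoring activations and KV cache:

```python
def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage for a model, in gigabytes (10^9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 3B-parameter model at common quantization levels:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(3, bits):.1f} GB")  # 6.0, 3.0, 1.5 GB
```

This is why the quantization requirements mentioned in the documentation matter: dropping from 16-bit to 4-bit weights shrinks a 3B model from roughly 6 GB to 1.5 GB, the difference between fitting in NPU-accessible memory or not.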

The requirement for recent kernel versions limits compatibility with enterprise Linux distributions that maintain older kernels. Developers on Ubuntu LTS or RHEL-based systems may need custom kernel builds or wait for backported drivers.

Despite constraints, NPU-based inference fills a specific niche: continuous, low-power AI tasks on laptops and compact systems. As AMD refines NPU capabilities and model optimization techniques improve, this hardware could become the default target for local AI applications where efficiency trumps raw performance.