
GPU Kernel Optimizer for llama.cpp on AMD Cards

kernel-anvil is a profiling tool that generates optimized GPU kernel configurations for llama.cpp on AMD graphics cards by analyzing the layer shapes in a GGUF model.

What It Is

kernel-anvil analyzes each unique layer shape in a GGUF model and produces a JSON configuration file that llama.cpp can load at runtime. Instead of using identical thread and block settings across all layers, regardless of whether they are small attention layers or massive feed-forward network layers, kernel-anvil tailors the execution parameters to each layer's specific dimensions.
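The generated file maps each layer geometry to its tuned launch parameters. The actual schema is not documented here, so the following is only an illustrative sketch; every field name and value is an assumption:

```json
{
  "device": "Radeon RX 7900 XTX",
  "shapes": {
    "4096x4096":  { "block_size": 128, "threads": 256 },
    "4096x14336": { "block_size": 256, "threads": 512 }
  }
}
```

The key point is that entries are keyed by shape, not by layer index, so every layer sharing a geometry reuses the same tuned configuration.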

The tool works by running quick benchmarks on the actual hardware, measuring how different kernel configurations perform for each layer geometry. This profiling process completes in under a second, after which llama.cpp can reference the generated JSON file to apply optimal settings for each layer during inference. No recompilation of llama.cpp is required; the configuration loads dynamically at startup.
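The profiling loop described above can be sketched as follows. This is not kernel-anvil's actual implementation; it is a minimal stand-in that shows the shape of the search: time each candidate configuration per layer geometry, keep the fastest, and emit the result as JSON. The `benchmark` function here only simulates kernel cost, and all configuration fields are assumptions:

```python
# Sketch of a per-shape autotuning loop (illustrative, not kernel-anvil's code).
import json
import time

def benchmark(shape, config):
    """Stand-in for launching the real GPU kernel with a given config.
    A real implementation would time the actual kernel on the device;
    here we fake a cost that depends on how well the config fits the shape."""
    rows, cols = shape
    block = config["block_size"]
    waste = (-rows) % block  # penalty when the block size doesn't divide the rows
    t0 = time.perf_counter()
    _ = sum(range(1000))     # token amount of work so the timer measures something
    return time.perf_counter() - t0 + waste * 1e-6

def tune(shapes, candidates):
    """For each unique layer shape, keep the fastest candidate configuration."""
    best = {}
    for shape in shapes:
        timings = [(benchmark(shape, c), c) for c in candidates]
        best[f"{shape[0]}x{shape[1]}"] = min(timings, key=lambda t: t[0])[1]
    return best

shapes = [(4096, 4096), (4096, 14336)]            # e.g. attention vs. FFN geometry
candidates = [{"block_size": b} for b in (64, 128, 256)]
config = tune(shapes, candidates)
print(json.dumps(config, indent=2))
```

Because only unique shapes are benchmarked, rather than every layer, a search like this can finish quickly even on large models, which is consistent with the sub-second profiling time reported above.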

Why It Matters

AMD GPU owners running local language models have historically faced performance challenges compared to NVIDIA hardware, partly due to less mature optimization in inference frameworks. kernel-anvil addresses a fundamental inefficiency: treating all matrix operations identically when their computational characteristics differ significantly.

The performance gains are substantial. Testing on a Radeon RX 7900 XTX showed Qwen3.5-27B Q4_K_M inference jumping from 12 tokens per second to 27 tokens per second—a 2.25x speedup. This transforms the practical usability of larger models on AMD hardware, making 27B parameter models run at speeds previously associated with much smaller models.

For developers and researchers working with AMD GPUs, this represents a path to competitive performance without switching hardware or waiting for upstream framework improvements. The approach also demonstrates how layer-specific tuning can unlock performance that generic configurations leave on the table, a principle that could extend to other inference engines and hardware platforms.

Getting Started

kernel-anvil requires the smithy-shape-configs branch of llama.cpp, which includes a small patch to mmvq.cu (approximately 50 lines) that enables loading external kernel configurations. The tool currently supports RDNA3 architecture cards including the 7900 XTX, 7900 XT, and 7800 XT.

Installation and usage follow this workflow:

kernel-anvil gguf-optimize ~/Models/qwen-27b.gguf

This generates a configuration file, typically stored in ~/.cache/smithy/qwen-27b.json. To use the optimized settings:
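Assuming the patched branch reads the configuration through a command-line flag, an invocation might look like the line below. The `--kernel-config` flag name is an illustrative assumption; `-m` and `-ngl` are standard llama.cpp options:

```shell
llama-cli -m ~/Models/qwen-27b.gguf -ngl 999 \
  --kernel-config ~/.cache/smithy/qwen-27b.json
```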

The -ngl 999 flag offloads all layers to the GPU, ensuring the optimized kernel configurations apply throughout the model. Each GGUF file needs profiling only once; the resulting JSON can be reused across inference sessions.

The project repository is available at https://github.com/Smithy-AI/kernel-anvil for those interested in the implementation details or contributing improvements.

Context

kernel-anvil represents a middle ground between fully automated inference engines and manual kernel tuning. Tools like vLLM and TensorRT-LLM include sophisticated auto-tuning systems, but they often focus on NVIDIA hardware or require model conversion. kernel-anvil works directly with GGUF files and targets AMD GPUs specifically.

The approach has limitations. It currently supports only RDNA3 architecture, leaving RDNA2 and older cards without optimization. The profiling assumes consistent GPU conditions—thermal throttling or background processes during profiling could produce suboptimal configurations. Additionally, the technique optimizes for specific quantization formats; switching from Q4_K_M to Q8_0 would require re-profiling.

Compared to waiting for upstream llama.cpp improvements, kernel-anvil offers immediate gains but requires maintaining a patched branch. As AMD GPU support in llama.cpp matures, some of these optimizations may become unnecessary or get incorporated into the main codebase. Until then, the tool provides a practical way to extract significantly better performance from AMD hardware without deep GPU programming knowledge.