True FP4 Inference on RTX 4090 GPUs with CUDA
AdaLLM enables true 4-bit floating point inference on RTX 4090 GPUs using custom CUDA kernels that keep computation in low-precision formats (including an FP8 key-value cache) throughout, avoiding the silent FP16 fallback common to other quantization approaches.
What It Is
AdaLLM implements genuine FP4 (4-bit floating point) inference on consumer RTX 4090 GPUs without falling back to FP16 precision during computation. Many quantization approaches advertise 4-bit weights but silently upcast them to higher-precision formats during actual calculations, which forfeits much of the memory-bandwidth benefit. This project tackles that problem head-on with custom CUDA kernels that handle FP8 decoding and maintain an FP8 key-value cache throughout the entire inference pipeline.
The implementation currently supports the Qwen3 and Gemma3 model families, using NVIDIA's NVFP4 quantization format. The repository at https://github.com/BenChaliah/NVFP4-on-4090-vLLM provides a modified version of vLLM that keeps computations in lower-precision formats rather than expanding them to FP16 during computation.
Why It Matters
Consumer GPUs like the RTX 4090 pack 24GB of VRAM, which sounds generous until you try to run modern language models. A standard 8B-parameter model in FP16 format consumes roughly 18GB, leaving minimal headroom for batch processing or longer context windows. Larger models simply won't fit at all.
AdaLLM changes this calculus significantly. Qwen3-8B drops to approximately 7.5GB of memory usage, freeing up space for larger batch sizes or enabling multiple model instances on a single card. More impressively, Gemma3-27B fits into 20GB, making a model of that scale accessible on hardware that previously couldn't handle it.
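The arithmetic behind these numbers is straightforward. A rough sketch (weights only; the quoted figures also include KV cache and runtime overhead, and the 4.5 bits/weight figure assumes NVFP4's one FP8 scale per 16-value block):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given parameter count and bit width."""
    return n_params * bits_per_weight / 8 / 1e9

# FP16: 16 bits per weight
fp16_gb = weight_memory_gb(8e9, 16)    # 16.0 GB for an 8B model, weights alone

# NVFP4: 4-bit values plus an FP8 scale shared by each 16-value block
# => 4 + 8/16 = 4.5 effective bits per weight
nvfp4_gb = weight_memory_gb(8e9, 4.5)  # 4.5 GB for the same model
```

The gap between the 4.5GB weight footprint and the observed ~7.5GB total reflects the KV cache, activations, and framework overhead.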
Research labs and independent developers working with limited GPU budgets gain the most immediate benefit. Running larger models locally becomes feasible without resorting to cloud instances or multi-GPU setups. The tradeoff involves a 20-25% throughput reduction compared to FP16, but for many applications, fitting the model in memory matters more than peak speed.
Getting Started
Installation requires pulling directly from the GitHub repository:
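A typical source install might look like the following; the exact steps are not reproduced here, so treat these commands as a sketch and defer to the repository README:

```shell
# Hypothetical install sketch -- exact steps may differ; check the repo README.
# Building the custom CUDA kernels requires the CUDA toolkit and an Ada GPU.
git clone https://github.com/BenChaliah/NVFP4-on-4090-vLLM.git
cd NVFP4-on-4090-vLLM
pip install -e .
```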
Serving a quantized model uses straightforward commands:
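Since this is a vLLM fork, serving presumably follows vLLM's CLI conventions; the model path and flags below are placeholders, not confirmed syntax:

```shell
# Hypothetical invocation -- model path and flags are assumptions; see repo docs.
vllm serve <path-to-nvfp4-model> --quantization nvfp4
```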
The system achieves 469 tokens per second at batch size 16 on a single RTX 4090 when running Qwen3-8B. Performance scales with batch size, so applications that can queue multiple requests see better GPU utilization.
Models need to be in NVFP4 format specifically. The repository documentation at https://github.com/BenChaliah/NVFP4-on-4090-vLLM includes details on compatible model variants and conversion processes for other architectures.
Context
Traditional quantization methods like GPTQ or AWQ compress weights to 4 bits but often decompress them during matrix multiplications. This hybrid approach saves storage space but doesn’t fully address memory bandwidth constraints during inference. AdaLLM’s approach keeps data in lower precision formats throughout the computation graph.
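To make the distinction concrete, here is a simplified Python sketch of block-scaled FP4 quantization: each block of values shares one higher-precision scale, and individual values are snapped to the small set of magnitudes representable in FP4 (E2M1). This is illustrative only, not the project's kernel code; the real kernels operate on packed 4-bit data with FP8 scales on the GPU.

```python
# Magnitudes representable in FP4 (E2M1): sign bit, 2 exponent bits, 1 mantissa bit.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
CODEBOOK = sorted({s * v for v in FP4_MAGNITUDES for s in (1.0, -1.0)})

def quantize_block(block):
    """Quantize a block of floats to (scale, FP4 codebook indices)."""
    scale = max(abs(x) for x in block) / 6.0 or 1.0  # 6.0 = largest FP4 magnitude
    codes = [min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x / scale))
             for x in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a block's scale and codes."""
    return [scale * CODEBOOK[c] for c in codes]
```

The "hybrid" schemes the paragraph describes dequantize like this on the fly and then multiply in FP16; AdaLLM's kernels instead operate on the packed low-precision representation directly.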
The main limitation involves mixture-of-experts (MoE) architectures. While technically functional, MoE models haven’t received optimization attention yet and run slower than their dense counterparts. Dense transformer models represent the current sweet spot for this implementation.
Compared to alternatives like llama.cpp’s Q4 formats or ExLlamaV2, AdaLLM targets a specific niche: developers who want maximum memory efficiency on Ada Lovelace architecture GPUs without sacrificing too much throughput. The 20-25% speed penalty sits between aggressive quantization schemes that trade more speed for memory and conservative approaches that preserve performance at the cost of VRAM.
The project remains early-stage with support limited to two model families. Broader model compatibility and MoE optimizations would expand its utility, but the core achievement - true low-precision inference without hidden fallbacks - demonstrates what’s possible when custom kernels target specific hardware capabilities directly.