AdaLLM: True FP4 Inference on RTX 4090s Without FP16 Fallback
AdaLLM enables genuine 4-bit floating-point inference on RTX 4090 GPUs without reverting to 16-bit precision, delivering faster, more memory-efficient large language model inference on consumer hardware.
Someone built AdaLLM to run NVFP4 quantized models natively on RTX 4090s without sneaky FP16 fallbacks eating VRAM.
The repo at https://github.com/BenChaliah/NVFP4-on-4090-vLLM uses a custom FP8 decode kernel + FP8 KV cache for the full inference path. Works with Qwen3 and Gemma3 models right now.
Quick setup:
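The repo's README documents the exact steps; the commands below are a plausible sketch only (the requirements file and the NVFP4 quantization flag are assumptions, not verified against the repo):

```shell
# Hypothetical setup sketch -- consult the repo README for the real steps.
git clone https://github.com/BenChaliah/NVFP4-on-4090-vLLM
cd NVFP4-on-4090-vLLM
pip install -r requirements.txt   # assumed dependency file name

# vLLM-style serve invocation; the exact model ID and quantization
# flag for this fork are assumptions.
vllm serve Qwen/Qwen3-8B --quantization nvfp4
```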
The numbers look pretty good - Qwen3-8B uses ~7.5GB instead of the usual 18GB for FP16, though throughput drops 20-25%. At batch size 16, it hits 469 tokens/sec on a single 4090.
Gemma3-27B squeezes into 20GB, which is wild for a model that size.
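Those footprints line up with back-of-envelope weight math: 4 bits per parameter is a quarter of FP16, and the gap between raw weight size and the reported total is the KV cache, activations, and runtime overhead. A rough sketch (ignores those overheads and per-block scale factors in NVFP4):

```python
# Approximate weight memory for a dense model at a given precision.
# GB here means 10^9 bytes; real checkpoints add scale factors and overhead.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * bits_per_param / 8  # params * bytes-per-param

print(weight_gb(8, 16))   # Qwen3-8B at FP16: 16.0 GB of weights alone
print(weight_gb(8, 4))    # Qwen3-8B at FP4:   4.0 GB (+~3.5 GB runtime -> ~7.5 GB)
print(weight_gb(27, 4))   # Gemma3-27B at FP4: 13.5 GB (+overhead -> ~20 GB)
```

The ~7.5 GB and ~20 GB totals in the post are consistent with 4-bit weights plus a few gigabytes of FP8 KV cache and runtime state.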
One catch: MoE models work but aren’t optimized yet, so they’re slower than expected. Dense models are the sweet spot for now.