AdaLLM: True FP4 Inference on RTX 4090s Without FP16 Fallback

AdaLLM enables genuine 4-bit floating-point inference on RTX 4090 GPUs without reverting to 16-bit precision, delivering faster, more memory-efficient large language model inference.

Someone built AdaLLM to run NVFP4 quantized models natively on RTX 4090s without sneaky FP16 fallbacks eating VRAM.

The repo at https://github.com/BenChaliah/NVFP4-on-4090-vLLM uses a custom FP8 decode kernel + FP8 KV cache for the full inference path. Works with Qwen3 and Gemma3 models right now.
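For context on what "true FP4" decoding involves: NVFP4 stores weights as 4-bit E2M1 floats (1 sign bit, 2 exponent bits, 1 mantissa bit) with shared FP8 block scales. Here's a minimal Python sketch of the E2M1 value mapping — this illustrates the standard FP4 format, not AdaLLM's actual CUDA kernel:

```python
# E2M1 has only eight non-negative representable magnitudes.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]

def unpack_byte(b: int) -> tuple[float, float]:
    """One byte packs two FP4 codes; low nibble first."""
    return decode_fp4(b & 0xF), decode_fp4(b >> 4)

# In NVFP4, each small block of values shares an FP8 scale, so a
# dequantized weight is decode_fp4(code) * block_scale.
```

The tiny value set is why FP4 halves memory relative to FP8 while still covering a useful dynamic range once block scales are applied.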

Quick setup:
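(The original setup snippet is missing here. As a purely hypothetical sketch of the usual clone-and-install flow — everything beyond the repo URL is an assumption, so check the repo's README for the real steps:)

```shell
# Hypothetical setup sketch -- consult the repo README for actual commands.
git clone https://github.com/BenChaliah/NVFP4-on-4090-vLLM
cd NVFP4-on-4090-vLLM
pip install -e .   # assumes a standard pip-installable layout
```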

The numbers look pretty good - Qwen3-8B uses ~7.5GB instead of the usual 18GB for FP16, though throughput drops 20-25%. At batch size 16, it hits 469 tokens/sec on a single 4090.

Gemma3-27B squeezes into 20GB, which is wild for a model that size.
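Those memory numbers pass a back-of-envelope check. Counting weights only (FP4 block scales, KV cache, and activations account for the remainder):

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB for a given precision."""
    return n_params * bits_per_param / 8 / 2**30

# Qwen3-8B: FP16 weights alone are ~14.9 GiB; FP4 cuts that to ~3.7 GiB,
# leaving room for KV cache and activations inside the reported ~7.5 GB.
print(round(weight_gib(8e9, 16), 1))   # ~14.9
print(round(weight_gib(8e9, 4), 1))    # ~3.7
# Gemma3-27B: ~12.6 GiB of FP4 weights, consistent with fitting in 20 GB.
print(round(weight_gib(27e9, 4), 1))   # ~12.6
```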

One catch: MoE models work but aren’t optimized yet, so they’re slower than expected. Dense models are the sweet spot for now.