AdaLLM: True FP4 Inference on RTX 4090s Without FP16 Fallback
AdaLLM enables genuine 4-bit floating-point inference on RTX 4090 GPUs without reverting to 16-bit precision, delivering faster, more memory-efficient large language model inference on consumer hardware.
Someone built AdaLLM to run NVFP4 quantized models natively on RTX 4090s without sneaky FP16 fallbacks eating VRAM.
The repo at https://github.com/BenChaliah/NVFP4-on-4090-vLLM uses a custom FP8 decode kernel + FP8 KV cache for the full inference path. Works with Qwen3 and Gemma3 models right now.
Quick setup:
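The repo's README documents the exact steps; the commands below are a plausible sketch only (the requirements file and the NVFP4 quantization flag are assumptions, not verified against the repo):

```shell
# Hypothetical setup sketch -- consult the repo README for the real steps.
git clone https://github.com/BenChaliah/NVFP4-on-4090-vLLM
cd NVFP4-on-4090-vLLM
pip install -r requirements.txt   # assumed dependency file name

# vLLM-style serve invocation; the exact model ID and quantization
# flag for this fork are assumptions.
vllm serve Qwen/Qwen3-8B --quantization nvfp4
```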
The numbers look pretty good - Qwen3-8B uses ~7.5GB instead of the usual 18GB for FP16, though throughput drops 20-25%. At batch size 16, it hits 469 tokens/sec on a single 4090.
Gemma3-27B squeezes into 20GB, which is wild for a model that size.
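Those footprints line up with back-of-envelope weight math: 4 bits per parameter is a quarter of FP16, and the gap between raw weight size and the reported total is the KV cache, activations, and runtime overhead. A rough sketch (ignores those overheads and per-block scale factors in NVFP4):

```python
# Approximate weight memory for a dense model at a given precision.
# GB here means 10^9 bytes; real checkpoints add scale factors and overhead.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * bits_per_param / 8  # params * bytes-per-param

print(weight_gb(8, 16))   # Qwen3-8B at FP16: 16.0 GB of weights alone
print(weight_gb(8, 4))    # Qwen3-8B at FP4:   4.0 GB (+~3.5 GB runtime -> ~7.5 GB)
print(weight_gb(27, 4))   # Gemma3-27B at FP4: 13.5 GB (+overhead -> ~20 GB)
```

The ~7.5 GB and ~20 GB totals in the post are consistent with 4-bit weights plus a few gigabytes of FP8 KV cache and runtime state.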
One catch: MoE models work but aren’t optimized yet, so they’re slower than expected. Dense models are the sweet spot for now.