True FP4 Inference on RTX 4090 GPUs with CUDA

Running large language models on consumer hardware has hit a wall. A 70-billion parameter model needs about 140GB for its weights alone at half precision (2 bytes per parameter), putting it far beyond the reach of even high-end gaming GPUs. Quantization techniques have helped, but most “4-bit” implementations still perform calculations at higher precision, converting weights on the fly and giving up the speed of true 4-bit operations.

The Breakthrough

NVIDIA’s recent CUDA toolkits enable genuine FP4 (4-bit floating point) inference through native tensor core support. Unlike earlier quantization methods that stored weights in 4-bit format but computed in FP16 or INT8, this path performs matrix multiplications directly in 4-bit precision. One caveat applies to the RTX 4090 itself: NVIDIA’s published specifications for the card’s Ada Lovelace tensor cores list FP8 as the lowest floating point precision, and native FP4 tensor core instructions arrive with the later Blackwell architecture. The 1.32 petaflops figure often attached to the 4090 is its peak FP8 tensor throughput with sparsity, not an FP4 number. On a 4090, FP4 therefore acts as a storage format whose values are expanded to a wider type before the multiply.

The implementation works through NVIDIA’s CUTLASS library and kernel primitives that map onto the GPU’s tensor core instructions. Developers reach these capabilities through updated versions of popular inference frameworks, with initial support reported in TensorRT-LLM and vLLM. The format is usually written E2M1: 1 sign bit, 2 exponent bits, and 1 mantissa bit. Four bits give only 16 codes, so the format suits neural network weights only after proper calibration.
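
To make the bit layout concrete, here is a minimal sketch that decodes every 4-bit code. It assumes the E2M1 convention used by the OCP Microscaling formats (exponent bias of 1, subnormals, no infinities or NaNs); the article does not say which convention a given kernel uses, and `decode_e2m1` is a hypothetical helper name.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) into a Python float."""
    if not 0 <= code <= 15:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0b1000 else 1.0   # bit 3: sign
    exponent = (code >> 1) & 0b11           # bits 2-1: exponent
    mantissa = code & 0b1                   # bit 0: mantissa
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 and an assumed exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# The eight non-negative values; the other eight codes are their negatives.
E2M1_MAGNITUDES = [decode_e2m1(code) for code in range(8)]
print(E2M1_MAGNITUDES)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The grid is dense near zero and sparse near its maximum of 6, which is why a scale factor has to map each group of weights onto it.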

Technical Implementation Details

On hardware with FP4 tensor cores, the kernels operate on fixed tile sizes, typically 16x16x64 for matrix multiplication. This means the GPU loads 16x64 elements from the first matrix, 64x16 from the second, and accumulates results into a 16x16 output tile. The accumulation happens in higher precision (FP16 or FP32) to prevent overflow, while the memory bandwidth savings come from reading weights in their packed 4-bit form.
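
The tile decomposition can be sketched on the CPU in plain Python. This is an illustration of the loop structure only, not a model of any real tensor core instruction: `tiled_matmul` is a hypothetical name, the inputs are ordinary lists of numbers, and Python floats stand in for the wide accumulator.

```python
TILE_M, TILE_N, TILE_K = 16, 16, 64  # tile shape quoted in the text


def tiled_matmul(a, b):
    """Multiply a (M x K) by b (K x N), given as lists of rows, tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, TILE_M):
        for j0 in range(0, n, TILE_N):
            rows = range(i0, min(i0 + TILE_M, m))
            cols = range(j0, min(j0 + TILE_N, n))
            # Output-tile accumulator held in wide precision (Python floats).
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for k0 in range(0, k, TILE_K):
                depth = range(k0, min(k0 + TILE_K, k))
                # One tile step: a (rows x depth) slab times a (depth x cols) slab.
                for i in rows:
                    for j in cols:
                        acc[i, j] += sum(a[i][p] * b[p][j] for p in depth)
            # Write the finished tile back to the output matrix.
            for (i, j), value in acc.items():
                c[i][j] = value
    return c
```

Each pass over `k0` touches only one 16x64 slab and one 64x16 slab, which is the unit of work the paragraph above describes.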

Quantization-aware training or post-training quantization remains necessary to prepare models for FP4 inference. Because the range of representable values is so narrow, scaling factors must be calibrated carefully per layer or tensor group. Many implementations use block-wise quantization in which 128 or 256 elements share a common scale factor, balancing memory efficiency against accuracy.
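
Block-wise quantization can be sketched in a few lines. The version below is a simplified illustration under stated assumptions: it uses absolute-maximum scaling, the E2M1 value grid (0, 0.5, 1, 1.5, 2, 3, 4, 6), and plain nearest-value rounding, and it keeps the scale as a Python float. Production kernels typically store the scale in a compact type and calibrate it on real activations. All function names are hypothetical.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # assumed FP4 grid
FP4_MAX = 6.0


def quantize_block(values):
    """Return (scale, quantized) where quantized holds signed grid values."""
    amax = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / FP4_MAX if amax > 0 else 1.0
    quantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - target))
        quantized.append(nearest if v >= 0 else -nearest)
    return scale, quantized


def quantize_blockwise(weights, block_size=128):
    """Split a flat weight list into blocks that each share one scale."""
    return [
        quantize_block(weights[start:start + block_size])
        for start in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """Rebuild approximate weights from (scale, quantized) pairs."""
    return [scale * q for scale, quantized in blocks for q in quantized]
```

With a 16-bit scale per 128-element block (an assumption for illustration), the storage cost is 4 + 16/128 = 4.125 bits per weight, which is why larger blocks save memory and smaller blocks track outliers more closely.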

Memory bandwidth, rather than compute capacity, becomes the primary bottleneck. An RTX 4090 with roughly 1TB/s of memory bandwidth can theoretically deliver 2 trillion 4-bit values per second to its tensor cores. For a 70B parameter model quantized to FP4, the weights alone occupy roughly 35GB, which exceeds the card’s 24GB of VRAM before activation memory and the KV cache are counted. Running such a model on a single 4090 therefore requires offloading part of the weights to system memory, adding a second GPU, or choosing a smaller model.
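
The arithmetic behind these figures is easy to check. The sketch below uses the numbers quoted in the text (70B parameters, 1TB/s, 24GB) and assumes batch-1 decoding in which every weight is read once per generated token; it ignores scale factors, activations, and the KV cache, so the results are upper bounds rather than predictions. Note that 35GB of weights on their own is more than 24GB of VRAM.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone."""
    return num_params * bits_per_weight / 8


def max_tokens_per_second(total_weight_bytes, bandwidth_bytes_per_s):
    """Bandwidth ceiling if every weight is read once per generated token."""
    return bandwidth_bytes_per_s / total_weight_bytes


PARAMS = 70e9      # 70B parameters
BANDWIDTH = 1e12   # 1TB/s, as quoted in the text
VRAM = 24e9        # 24GB

fp16_bytes = weight_bytes(PARAMS, 16)       # 140GB
fp4_bytes = weight_bytes(PARAMS, 4)         # 35GB
fp4_values_per_s = BANDWIDTH * 8 / 4        # 2 trillion 4-bit values per second
fits_in_vram = fp4_bytes <= VRAM            # False: 35GB > 24GB
ceiling = max_tokens_per_second(fp4_bytes, BANDWIDTH)

print(f"FP16 weights: {fp16_bytes / 1e9:.0f} GB")
print(f"FP4 weights:  {fp4_bytes / 1e9:.0f} GB, fits in VRAM: {fits_in_vram}")
print(f"Bandwidth ceiling: {ceiling:.1f} tokens/s")
```

The ceiling only applies when the weights are resident in VRAM; any portion streamed over PCIe from system memory is limited by that much slower link.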

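Example code for initializing FP4 inference in TensorRT-LLM (class and enum names change between releases, so check them against the installed version):

import tensorrt_llm
from tensorrt_llm.quantization import QuantMode

# Load the Hugging Face checkpoint and request 4-bit quantized weights.
# QuantMode.FP4_AWQ is the name given here; confirm it exists in your release.
model = tensorrt_llm.models.LLaMAForCausalLM.from_hugging_face(
    model_dir="meta-llama/Llama-2-70b-hf",
    dtype="float16",  # precision for activations and non-quantized tensors
    quant_mode=QuantMode.FP4_AWQ
)

# Build a TensorRT engine sized for single-request, interactive use.
engine = model.to_trt(
    max_batch_size=1,     # one sequence at a time
    max_input_len=2048,   # longest prompt, in tokens
    max_output_len=512    # longest generation, in tokens
)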

Impact on the ML Community

Researchers working with large models gain access to architectures that previously required datacenter GPUs. The quoted throughput for a 70B parameter model on a single RTX 4090 is 25-35 tokens per second, depending on sequence length and batch size. That range sits near the roughly 29 tokens per second ceiling implied by reading 35GB of weights per token at 1TB/s, so it presumes the weights are resident in VRAM. Even with that caveat, 4-bit weights lower the cost of experimenting with state-of-the-art models for individuals and small teams.

Commercial applications benefit from reduced deployment costs. Serving infrastructure that previously required A100 or H100 GPUs can, for some workloads, migrate to consumer hardware, cutting per-GPU costs from $10,000-30,000 to under $2,000. The tradeoff is slightly lower accuracy, typically a 1-2% degradation on standard benchmarks compared to FP16 inference, which is acceptable for many production use cases.

Looking at the Bigger Picture

True FP4 support continues the precision reduction trend in neural network inference. Earlier transitions from FP32 to FP16 to INT8 each unlocked new deployment scenarios, and FP4 follows this pattern. The format suits inference better than training, because gradients need more precision than 4 bits provide.

Competition between GPU manufacturers will likely intensify around low-precision capabilities. AMD’s upcoming RDNA 4 architecture and Intel’s Battlemage GPUs will need comparable features to remain competitive in the AI inference market. The standardization of FP4 formats across hardware vendors remains an open question, potentially fragmenting the ecosystem if proprietary implementations diverge.

For practitioners, FP4 inference offers a practical path to running large models locally. The combination of accessible hardware and maturing software tooling removes significant barriers to experimentation and deployment.
