Training 20B Models at 7x Longer Context on 24GB GPUs

This article explains how researchers trained 20-billion-parameter language models with context windows seven times longer using only 24GB GPUs.

The trick: Unsloth's new RL optimizations make it possible to train with 7x longer context windows than before.

What it actually does:

  • Trains 20B models at 20K context on a 24GB GPU (normally impossible)
  • Pushes Qwen3-8B to 110K context on an H100
  • Works with Llama, Gemma, and other models out of the box
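
To see why 20K context on a 24GB card is "normally impossible," a back-of-envelope sketch helps. The vocabulary size below is an assumed, illustrative figure (not from the article), but the scaling is the point: the logit tensor alone grows linearly with context length.

```python
# Back-of-envelope: why long-context training normally blows up GPU memory.
# The ~150K vocabulary size is an assumed, illustrative number.

def logit_memory_gb(seq_len: int, vocab_size: int, bytes_per_value: int = 2) -> float:
    """Memory needed to materialize the full logit tensor for one
    sequence in bf16/fp16 (2 bytes per value)."""
    return seq_len * vocab_size * bytes_per_value / 1e9

# At 20K context with a ~150K-token vocabulary, the logits for a single
# rollout already consume ~6 GB -- a quarter of a 24GB card, before
# weights, gradients, optimizer state, or activations are counted.
print(f"{logit_memory_gb(20_000, 150_000):.1f} GB")  # prints "6.0 GB"
```

That is one sequence; RL methods like GRPO generate multiple rollouts per prompt, so avoiding materializing these tensors is where the memory savings come from.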

The wild part is that all the features stack: FP8 training, long-context support, and memory-efficient RL work together without breaking anything. Full benchmarks at https://unsloth.ai/docs/new/grpo-long-context if you're curious about the technical details.
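
As a rough illustration of why FP8 matters on a 24GB card: the bytes-per-parameter figures below are standard datatype sizes, and the 20B model size is from the article, but this sketch deliberately ignores gradients, optimizer state, and activations, so real budgets are tighter still.

```python
# Rough sketch: parameter memory at different precisions for a 20B model.
# Ignores gradients, optimizer state, and activations (real budgets are tighter).

def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1e9

n_params = 20e9  # 20B parameters

print(f"BF16: {weight_memory_gb(n_params, 2):.0f} GB")  # 40 GB: over a 24GB budget
print(f"FP8:  {weight_memory_gb(n_params, 1):.0f} GB")  # 20 GB: fits, barely
```

Halving bytes per weight is what lets the raw parameters squeeze onto a single consumer GPU at all; the long-context and RL tricks then have to fit in whatever headroom is left.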