Training 20B Models at 7x Longer Context on 24GB GPUs
This article explains how researchers trained 20-billion-parameter language models with seven-times-longer context windows on a single 24GB GPU.
Unsloth's new RL optimizations make it possible to train language models with 7x longer context windows.
The setup:
- Install from https://github.com/unslothai/unsloth
- Combines three memory tricks: weight-sharing with vLLM, Flex Attention, and async gradient checkpointing
- Free notebooks at https://docs.unsloth.ai/get-started/unsloth-notebooks
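To see why one of the three tricks above helps, here is a back-of-envelope sketch of gradient checkpointing's memory saving. This is illustrative arithmetic, not Unsloth's actual implementation, and the layer count and per-layer activation size are made-up numbers: storing every layer's activations costs O(L) memory, while keeping roughly sqrt(L) checkpoints and recomputing the segments in between costs O(sqrt(L)) at the price of one extra forward pass.

```python
import math

def activation_memory_gb(layers, gb_per_layer, checkpointing=False):
    """Rough activation-memory estimate for one forward pass.

    Without checkpointing, every layer's activations are kept for the
    backward pass. With sqrt(L)-style checkpointing, only ~sqrt(L)
    checkpoints are kept, plus one segment of ~L/sqrt(L) layers that is
    recomputed and held at a time.
    """
    if not checkpointing:
        return layers * gb_per_layer  # keep everything
    segments = math.isqrt(layers) or 1
    return (segments + layers // segments) * gb_per_layer

# Hypothetical 48-layer model with 0.5 GB of activations per layer:
full = activation_memory_gb(48, 0.5)        # 24.0 GB
ckpt = activation_memory_gb(48, 0.5, True)  # 7.0 GB
print(full, ckpt)
```

The "async" variant Unsloth describes hides the recomputation cost by overlapping it with other work; the memory math above is the same either way.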
What it actually does:
- Trains 20B models at 20K context on a 24GB GPU (normally impossible)
- Pushes Qwen3-8B to 110K context on an H100
- Works with Llama, Gemma, and other models out of the box
The wild part is that all of these features compose: FP8 training, long-context support, and memory-efficient RL stack together without breaking anything. Full benchmarks are at https://unsloth.ai/docs/new/grpo-long-context if you're curious about the technical details.