Unsloth Kernels: Fine-Tune 30B MoE on Consumer GPUs
Unsloth Kernels enable efficient fine-tuning of 30-billion-parameter Mixture of Experts models on consumer-grade GPUs through optimized memory management and custom Triton kernels.
Unsloth's new kernels make it possible to fine-tune massive MoE models on surprisingly cheap hardware.
The specs are pretty wild:
- gpt-oss-20b fits in 12.8GB VRAM (runs on a single RTX 3090)
- Qwen3-30B needs just 63GB for 16-bit LoRA
- 12x faster training with 35% less memory than before
- Works on consumer GPUs, not just data-center stuff
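The VRAM figures above are roughly what you'd expect from a back-of-envelope weight count. A minimal sketch, assuming gpt-oss-20b is loaded with 4-bit quantized weights and Qwen3-30B with 16-bit weights (the source doesn't state the quantization; these assumptions make the numbers line up, with LoRA adapters, optimizer state, and activations accounting for the remaining few GB):

```python
# Rough back-of-envelope for the reported VRAM figures.
# Assumptions (not stated in the source): gpt-oss-20b at 4-bit weights,
# Qwen3-30B at 16-bit weights; LoRA/optimizer/activation overhead is extra.
gpt_oss_gb = 20e9 * 0.5 / 1e9  # 4-bit = 0.5 bytes/param -> 10.0 GB base weights
qwen3_gb = 30e9 * 2.0 / 1e9    # 16-bit = 2 bytes/param  -> 60.0 GB base weights
print(gpt_oss_gb, qwen3_gb)    # compare with the reported 12.8 GB and 63 GB
```

The gap between the base-weight estimate and the reported total is the training overhead the custom kernels help keep small.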
They built custom Triton kernels that optimize the grouped matrix multiplications in MoE architectures. The memory savings grow with model size and context length: the bigger your model and the longer your sequences, the more you save.
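To see what "grouped matrix multiplications" means in an MoE layer, here is a minimal NumPy sketch (toy shapes, not Unsloth's actual implementation): each token is routed to one expert, and instead of launching a tiny matmul per token, tokens are sorted by expert so each expert runs one matmul over a contiguous block. This is the access pattern a grouped-GEMM kernel fuses into a single launch.

```python
import numpy as np

# Toy MoE routing: E experts, T tokens, hidden dim D, expert output dim H.
# Shapes and routing are illustrative only.
rng = np.random.default_rng(0)
E, T, D, H = 4, 16, 8, 32
x = rng.standard_normal((T, D))      # token activations
w = rng.standard_normal((E, D, H))   # one weight matrix per expert
assign = rng.integers(0, E, size=T)  # router's expert choice per token

# Naive dispatch: one small matmul per token (many tiny kernel launches).
naive = np.stack([x[t] @ w[assign[t]] for t in range(T)])

# Grouped approach: sort tokens by expert so each expert's tokens are
# contiguous, then do one matmul per expert over its whole block.
order = np.argsort(assign, kind="stable")
sorted_assign = assign[order]
grouped = np.empty((T, H))
for e in range(E):
    idx = order[sorted_assign == e]      # original indices of expert e's tokens
    grouped[idx] = x[idx] @ w[e]         # one batched matmul per expert

assert np.allclose(naive, grouped)
```

A real grouped-GEMM kernel takes this one step further and issues all per-expert blocks in a single fused launch, which is where the speed and memory wins come from.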
Free Colab notebooks to try it:
- gpt-oss (20B): https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb
- Main repo: https://github.com/unslothai/unsloth
It works with Qwen3, DeepSeek R1/V3, and GLM models too. The whole thing is open source, so anyone can run billion-parameter MoE fine-tuning without renting expensive cloud instances.