Unsloth Kernels: 12x Faster MoE Training, 12GB VRAM
Someone found that Unsloth’s new kernels make training huge Mixture of Experts models way more accessible on consumer hardware.
The speedup is pretty wild - they’re claiming 12x faster training with 35% less VRAM. What makes this interesting is you can now fine-tune massive models like gpt-oss-20b in just 12.8GB VRAM or Qwen3-30B in 63GB (16-bit LoRA).
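Those VRAM numbers roughly check out with a back-of-the-envelope estimate (my own rough arithmetic, not Unsloth's accounting): weights alone at 4-bit for a 20B model and 16-bit for a 30B model land close to the quoted totals, with the remainder going to LoRA adapters, optimizer state, and activations.

```python
# Rough VRAM estimate: weight memory only. Assumes the gap up to the
# quoted totals is LoRA adapters, optimizer state, and activations.
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

gpt_oss_20b = weight_vram_gb(20, 4)   # 4-bit weights  -> 10.0 GB
qwen3_30b  = weight_vram_gb(30, 16)   # 16-bit weights -> 60.0 GB

print(f"gpt-oss-20b weights: {gpt_oss_20b:.1f} GB (quoted total: 12.8 GB)")
print(f"Qwen3-30B weights:   {qwen3_30b:.1f} GB (quoted total: 63 GB)")
```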
Works on everything from RTX 3090s to H100s. Get started with their free notebooks:
- gpt-oss (20B): https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb
- Main repo: https://github.com/unslothai/unsloth
The memory savings grow with model size and context length, so the bigger the model, the more you save. Under the hood are custom Triton kernels combining grouped GEMM with LoRA optimizations, built on top of Transformers v5's already-improved MoE support.
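To make "grouped GEMM" concrete, here's a minimal NumPy sketch (my own illustration, not Unsloth's Triton code): in an MoE layer, tokens routed to the same expert are gathered into one contiguous matmul per expert, rather than doing a separate matmul per token; a fused kernel would batch these groups into a single launch.

```python
import numpy as np

def moe_grouped_gemm(x, expert_weights, assignments):
    """Route each token to one expert and apply that expert's weights.

    x:              (tokens, d_in) input activations
    expert_weights: (n_experts, d_in, d_out) per-expert weight matrices
    assignments:    (tokens,) expert index chosen by the router
    """
    out = np.empty((x.shape[0], expert_weights.shape[2]))
    for e in range(expert_weights.shape[0]):
        idx = np.nonzero(assignments == e)[0]  # tokens routed to expert e
        if idx.size:
            # One contiguous GEMM per expert instead of per-token matmuls;
            # a fused Triton kernel would process all groups in one launch.
            out[idx] = x[idx] @ expert_weights[e]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
w = rng.standard_normal((2, 4, 3))        # 2 experts: d_in=4 -> d_out=3
assign = rng.integers(0, 2, size=8)
y = moe_grouped_gemm(x, w, assign)

# Sanity check: matches the naive per-token computation.
naive = np.stack([x[t] @ w[assign[t]] for t in range(8)])
assert np.allclose(y, naive)
```

The win comes from replacing many tiny matmuls with a few large ones, which keeps the GPU's matrix units busy; Unsloth's kernels fuse the LoRA adapter math into the same pass.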
Turns out you can now fine-tune 30B+ parameter models on consumer GPUs without selling a kidney.