Unsloth Kernels: 12x Faster MoE Training, 12GB VRAM
Someone found that Unsloth’s new kernels make training huge Mixture of Experts models way more accessible on consumer hardware.
The speedup is pretty wild - they’re claiming 12x faster training with 35% less VRAM. What makes this interesting is you can now fine-tune massive models like gpt-oss-20b in just 12.8GB VRAM or Qwen3-30B in 63GB (16-bit LoRA).
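Those VRAM numbers roughly check out with a back-of-the-envelope estimate (my own rough arithmetic, not Unsloth's accounting): weights alone at 4-bit for a 20B model and 16-bit for a 30B model land close to the quoted totals, with the remainder going to LoRA adapters, optimizer state, and activations.

```python
# Rough VRAM estimate: weight memory only. Assumes the gap up to the
# quoted totals is LoRA adapters, optimizer state, and activations.
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

gpt_oss_20b = weight_vram_gb(20, 4)   # 4-bit weights  -> 10.0 GB
qwen3_30b  = weight_vram_gb(30, 16)   # 16-bit weights -> 60.0 GB

print(f"gpt-oss-20b weights: {gpt_oss_20b:.1f} GB (quoted total: 12.8 GB)")
print(f"Qwen3-30B weights:   {qwen3_30b:.1f} GB (quoted total: 63 GB)")
```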
Works on everything from RTX 3090s to H100s. Get started with their free notebooks:
- gpt-oss (20B): https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb
- Main repo: https://github.com/unslothai/unsloth
The memory savings grow with model size and context length, so the bigger the model, the more you save. Under the hood are custom Triton kernels combining grouped GEMM with LoRA optimizations, built on top of Transformers v5's already-improved MoE support.
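To make "grouped GEMM" concrete, here's a minimal NumPy sketch (my own illustration, not Unsloth's Triton code): in an MoE layer, tokens routed to the same expert are gathered into one contiguous matmul per expert, rather than doing a separate matmul per token; a fused kernel would batch these groups into a single launch.

```python
import numpy as np

def moe_grouped_gemm(x, expert_weights, assignments):
    """Route each token to one expert and apply that expert's weights.

    x:              (tokens, d_in) input activations
    expert_weights: (n_experts, d_in, d_out) per-expert weight matrices
    assignments:    (tokens,) expert index chosen by the router
    """
    out = np.empty((x.shape[0], expert_weights.shape[2]))
    for e in range(expert_weights.shape[0]):
        idx = np.nonzero(assignments == e)[0]  # tokens routed to expert e
        if idx.size:
            # One contiguous GEMM per expert instead of per-token matmuls;
            # a fused Triton kernel would process all groups in one launch.
            out[idx] = x[idx] @ expert_weights[e]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
w = rng.standard_normal((2, 4, 3))        # 2 experts: d_in=4 -> d_out=3
assign = rng.integers(0, 2, size=8)
y = moe_grouped_gemm(x, w, assign)

# Sanity check: matches the naive per-token computation.
naive = np.stack([x[t] @ w[assign[t]] for t in range(8)])
assert np.allclose(y, naive)
```

The win comes from replacing many tiny matmuls with a few large ones, which keeps the GPU's matrix units busy; Unsloth's kernels fuse the LoRA adapter math into the same pass.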
Turns out you can now fine-tune 30B+ parameter models on consumer GPUs without selling a kidney.