Unsloth Kernels: 12x Faster MoE Training, 12GB VRAM

Unsloth Kernels achieves 12x faster Mixture of Experts model training while using only 12GB of VRAM, through optimized kernel implementations and memory optimizations.

Someone found that Unsloth’s new kernels make training huge Mixture of Experts models way more accessible on consumer hardware.

The speedup is pretty wild - they’re claiming 12x faster training with 35% less VRAM. What makes this interesting is you can now fine-tune massive models like gpt-oss-20b in just 12.8GB VRAM or Qwen3-30B in 63GB (16-bit LoRA).
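To see why LoRA keeps the VRAM footprint this low, here's a back-of-the-envelope sketch (toy dimensions chosen for illustration, not Unsloth's actual configuration): instead of updating a full weight matrix, LoRA trains two small low-rank factors, so gradients and optimizer state only exist for a tiny fraction of the parameters.

```python
# Rough illustration of why LoRA keeps trainable-parameter memory low.
# Instead of updating the full weight W (d_out x d_in), LoRA trains two
# small factors B (d_out x r) and A (r x d_in); the forward pass uses
# W + B @ A, but only B and A get gradients and optimizer state.
# Dimensions below are toy values, not Unsloth's real config.
d_out, d_in, r = 4096, 4096, 16

full_params = d_out * d_in            # trainable params, full fine-tune
lora_params = d_out * r + r * d_in    # trainable params with LoRA

print(lora_params / full_params)      # -> 0.0078125, i.e. ~0.8%
```

With under 1% of the parameters trainable, the optimizer-state and gradient memory that dominates full fine-tuning mostly disappears, which is why 30B-class models fit in consumer-GPU budgets.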

Works on everything from RTX 3090s to H100s, and you can get started with their free notebooks.

The memory savings grow with model size and context length, so the bigger the model and the longer the context, the more you save. Under the hood they're using custom Triton kernels with grouped-GEMM + LoRA optimizations that build on top of Transformers v5's already-improved MoE support.
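The grouped-GEMM idea itself is simple to sketch. In a Mixture of Experts layer, a router sends each token to one expert; rather than doing one small matmul per token, you gather all tokens assigned to the same expert and run a single large matmul per expert. The NumPy toy below (hypothetical dimensions, no relation to Unsloth's fused Triton kernels) shows the grouping trick and checks it against the naive per-token loop:

```python
import numpy as np

# Toy MoE forward pass: each token is routed to exactly one expert.
# "Grouped GEMM" = gather the tokens for each expert, do one matmul
# per expert, scatter results back -- instead of one tiny matmul per
# token. Dimensions are illustrative toy values.
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 8, 4, 6, 2

x = rng.standard_normal((n_tokens, d_model))
experts = [rng.standard_normal((d_model, d_ff)) for _ in range(n_experts)]
assignment = rng.integers(0, n_experts, size=n_tokens)  # router output

# Grouped GEMM: one matmul per expert over its batch of tokens.
out = np.empty((n_tokens, d_ff))
for e in range(n_experts):
    idx = np.where(assignment == e)[0]
    out[idx] = x[idx] @ experts[e]

# Reference: the naive token-by-token loop that grouping replaces.
ref = np.stack([x[i] @ experts[assignment[i]] for i in range(n_tokens)])
assert np.allclose(out, ref)
```

On a GPU, the grouped version wins because each expert's matmul is large enough to saturate the hardware, and fused kernels avoid materializing the gather/scatter intermediates, which is where the VRAM savings come from.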

Turns out you can now fine-tune 30B+ parameter models on consumer GPUs without selling a kidney.