New llama.cpp Models Need GGUF Quantizations
Users must convert newly supported llama.cpp models to the GGUF format, typically via quantization, before they can be used with the llama.cpp inference engine for local inference.
Someone noticed that llama.cpp just added support for two new models, but there’s a gap before the usual quantized versions show up.
The releases:
- Step3.5-Flash: https://github.com/ggml-org/llama.cpp/releases/tag/b7964
- Kimi-Linear-48B-A3B: https://github.com/ggml-org/llama.cpp/releases/tag/b7957
Checking the usual Hugging Face spots (Kimi GGUFs & Step-3.5 GGUFs) turns up nothing from the popular quantizers yet - quants will probably land today or tomorrow.
Quick workaround: The ik_llama community already has a Step-3.5-Flash GGUF up at https://huggingface.co/ubergarm/Step-3
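If you don't want to wait for the usual quantizers, you can roll your own GGUF with the tools that ship in the llama.cpp repo. A minimal sketch - the model path, output filenames, and Q4_K_M quant type here are illustrative, not prescribed:

```shell
# Convert the original Hugging Face checkpoint to a full-precision GGUF.
# convert_hf_to_gguf.py lives in the llama.cpp repo root; it needs the
# full HF model directory (weights + config + tokenizer).
python convert_hf_to_gguf.py /path/to/hf-model \
    --outfile model-f16.gguf --outtype f16

# Quantize the f16 GGUF down to a smaller type, e.g. Q4_K_M.
# llama-quantize is built alongside the other llama.cpp binaries.
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```

One caveat: for brand-new architectures like these, the conversion script and runtime only understand the model from the llama.cpp build that added support, so make sure you're on b7957/b7964 or later.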