llama.cpp Adds Step-3.5-Flash & Kimi-Linear-48B
The llama.cpp project added native support for the Step-3.5-Flash and Kimi-Linear-48B-A3B-Instruct models, though community-created GGUF quantizations remain largely unavailable.
What It Is
The llama.cpp project recently added native support for two new language models: Step-3.5-Flash and Kimi-Linear-48B-A3B-Instruct. These additions arrived through releases b7964 and b7957 respectively, available at https://github.com/ggml-org/llama.cpp/releases/tag/b7964 and https://github.com/ggml-org/llama.cpp/releases/tag/b7957. However, the community-created GGUF quantizations that most developers rely on haven’t appeared yet from the usual sources.
GGUF quantizations compress models into smaller sizes while maintaining acceptable performance, making them practical for local deployment. Without these quantized versions, running these new models requires significantly more VRAM and storage - often prohibitive for consumer hardware.
Why It Matters
This gap between official support and community quantizations highlights a dependency pattern in the local AI ecosystem. While llama.cpp maintainers can add model architecture support quickly, the quantization work typically falls to community contributors who convert and test different compression levels.
Step-3.5-Flash represents StepFun AI’s latest iteration, while Kimi-Linear-48B-A3B brings Moonshot AI’s mixture-of-experts model to the llama.cpp ecosystem: 48 billion total parameters with roughly 3 billion active per token, as the A3B suffix indicates. The two models offer distinct capabilities - Step-3.5-Flash emphasizes speed and efficiency, while Kimi-Linear’s larger total parameter count targets more complex reasoning tasks while keeping per-token compute modest through sparse activation.
The delay matters most for developers who’ve standardized on GGUF workflows. Teams using automated pipelines to test new models, researchers comparing architectures, and hobbyists with limited hardware all wait for quantized versions before experimentation becomes feasible. A 48B model in full precision might require 96GB of VRAM, while a Q4_K_M quantization could run on 32GB.
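The arithmetic behind those figures is easy to sketch. The bits-per-weight values below are rough community estimates rather than official llama.cpp numbers, so treat the results as ballpark sizes only:

```python
# Rough GGUF file-size estimator: parameters * bits-per-weight / 8.
# Bits-per-weight figures are approximate community estimates,
# not official llama.cpp numbers.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q2_K": 2.63,
}

def est_size_gb(params_billions: float, quant: str) -> float:
    """Estimated model file size in GB for a given quantization type."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

# A 48B model: ~96 GB at F16, ~29 GB at Q4_K_M
print(f"F16:    {est_size_gb(48, 'F16'):.0f} GB")
print(f"Q4_K_M: {est_size_gb(48, 'Q4_K_M'):.0f} GB")
```

Actual VRAM needs run somewhat higher than the file size once the KV cache and compute buffers are accounted for, which is why the text above pairs a ~29 GB Q4_K_M file with a 32 GB card.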
Getting Started
For Step-3.5-Flash, an early GGUF quantization exists at https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF. Developers can download and test immediately:
# Download a specific quantization
huggingface-cli download ubergarm/Step-3.5-Flash-GGUF \
  step-3.5-flash-q4_k_m.gguf --local-dir ./models

# Run with llama.cpp
./llama-cli -m ./models/step-3.5-flash-q4_k_m.gguf \
  -p "Explain quantum computing" -n 256
For Kimi-Linear-48B-A3B, checking https://huggingface.co/models?library=gguf&other=base_model:quantized:moonshotai%2FKimi-Linear-48B-A3B-Instruct&sort=created shows the current quantization status. As of this writing, popular quantizers haven’t published versions yet.
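That search URL can be rebuilt for any base model, which is handy for scripting periodic checks. The helper below is a hypothetical sketch; it only assembles the query string used by the Hub’s model search page, in the same shape as the link above:

```python
from urllib.parse import urlencode

def gguf_search_url(base_model: str) -> str:
    """Build a Hugging Face Hub search URL listing GGUF quantizations
    of a given base model (hypothetical helper; mirrors the query
    parameters of the link used in the text)."""
    params = {
        "library": "gguf",
        "other": f"base_model:quantized:{base_model}",
        "sort": "created",
    }
    # urlencode percent-escapes ':' and '/'; the Hub accepts both
    # the escaped and unescaped forms.
    return "https://huggingface.co/models?" + urlencode(params)

print(gguf_search_url("moonshotai/Kimi-Linear-48B-A3B-Instruct"))
```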
Developers comfortable with quantization can create their own using llama.cpp’s conversion tools. The process requires downloading the original model weights, converting to GGUF format, then applying quantization:
# Convert the downloaded weights to a full-precision GGUF,
# then quantize (paths are illustrative)
python convert_hf_to_gguf.py ./original-model-dir \
  --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
Context
This situation recurs with every new model architecture. Popular quantizers like TheBloke (now retired), MaziyarPanahi, and bartowski typically publish comprehensive quantization sets within days of new model releases, but timing varies based on model size and complexity.
Alternative approaches exist. Some developers run models through Ollama, which handles quantization automatically but offers less granular control. Others use vLLM or text-generation-webui, which support different quantization schemes like AWQ or GPTQ - though these require different infrastructure than GGUF-focused workflows.
The 48B parameter size of Kimi-Linear presents particular challenges. Even aggressive quantization to Q2_K produces files around 20GB, pushing the limits of consumer GPUs. Developers might consider waiting for IQ (importance-weighted quantization) variants, which often provide better quality-to-size ratios for larger models.
For production deployments, this gap reinforces the value of maintaining fallback options. Teams relying on cutting-edge models should architect systems that can gracefully handle delays in quantized availability, whether through temporary use of alternative models or acceptance of higher resource requirements during transition periods.