Optimizing llama-server Speed with Batch Tweaks
Learn how adjusting batch size parameters in llama-server can significantly improve inference speed and throughput for large language model deployments.
Someone figured out how to squeeze more speed out of llama-server by tweaking batch settings and cache parameters.
Their working config:
llama-server \
--jinja \
--host 0.0.0.0 \
-m /mnt/models1/GLM/GLM-4.7-Flash-Q8_0.gguf \
--ctx-size 200000 \
--parallel 1 \
--batch-size 2048 \
--ubatch-size 1024 \
--flash-attn on \
--cache-ram 61440 \
--context-shift
Key parts: --batch-size 2048 sets the logical batch (how many tokens are queued per decode call), while --ubatch-size 1024 sets the physical micro-batch actually pushed through the model at once, so larger chunks are processed per pass. --flash-attn on enables FlashAttention to speed up attention computation, and --cache-ram 61440 (60 GiB) lets the server keep more cached context in memory.
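To make the batch/ubatch relationship concrete, here is a small illustrative sketch (not llama.cpp's actual code) of how a logical batch of prompt tokens gets split into --ubatch-size micro-batches; the function name is hypothetical:

```python
from math import ceil

def microbatch_chunks(n_tokens: int, ubatch_size: int) -> list[int]:
    """Split n_tokens of queued prompt into physical micro-batches,
    mirroring how a logical batch is cut up by --ubatch-size."""
    full, rem = divmod(n_tokens, ubatch_size)
    return [ubatch_size] * full + ([rem] if rem else [])

# A full 2048-token logical batch with --ubatch-size 1024 runs as two passes:
print(microbatch_chunks(2048, 1024))  # [1024, 1024]
# An odd-sized prompt leaves a short final chunk:
print(microbatch_chunks(2500, 1024))  # [1024, 1024, 452]
```

Fewer, larger passes mean better GPU utilization during prompt processing, which is where these two flags buy most of their speedup.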
Next step they’re planning: self-speculative decoding (https://www.reddit.com/r/LocalLLaMA/comments/1qpjc4a/add_selfspeculative_decoding_no_draft_model/), which drafts multiple tokens ahead and verifies them in a single pass, without needing a separate draft model.
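The core idea behind any speculative decoding scheme can be shown with a toy sketch. This is a simplified greedy-only verification step, with hypothetical names and inputs (real implementations compare probability distributions, not just argmax picks): drafted tokens are accepted while they match the target model's own greedy choice, and the first mismatch is replaced by the target's token.

```python
def verify_draft(draft_tokens: list[int], target_greedy: list[int]) -> list[int]:
    """Greedy speculative-decoding verification (toy sketch):
    accept drafted tokens while they agree with the target model's
    greedy picks at the same positions; on the first disagreement,
    substitute the target's token and stop."""
    accepted = []
    for drafted, target in zip(draft_tokens, target_greedy):
        if drafted == target:
            accepted.append(drafted)
        else:
            accepted.append(target)  # correction from the target model
            break
    return accepted

print(verify_draft([5, 9, 2, 7], [5, 9, 3, 7]))  # [5, 9, 3]
print(verify_draft([5, 9], [5, 9]))              # [5, 9]
```

Because all drafted positions are verified in one forward pass of the target model, each accepted token beyond the first is nearly free, which is where the speedup comes from.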