Optimizing llama-server Speed with Batch Tweaks
Learn how adjusting batch size parameters in llama-server can significantly improve inference speed and throughput for large language model deployments.
Someone figured out how to squeeze more speed out of llama-server by tweaking batch settings and cache parameters.
Their working config:
llama-server \
--jinja \
--host 0.0.0.0 \
-m /mnt/models1/GLM/GLM-4.7-Flash-Q8_0.gguf \
--ctx-size 200000 \
--parallel 1 \
--batch-size 2048 \
--ubatch-size 1024 \
--flash-attn on \
--cache-ram 61440 \
--context-shift
Key parts: --batch-size 2048 sets the logical batch (how many tokens are queued per decode call), while --ubatch-size 1024 sets the physical micro-batch actually pushed through the model at once, so larger chunks are processed per pass. --flash-attn on enables FlashAttention to speed up attention computation, and --cache-ram 61440 (60 GiB) lets the server keep more cached context in memory.
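To make the batch/ubatch relationship concrete, here is a small illustrative sketch (not llama.cpp's actual code) of how a logical batch of prompt tokens gets split into --ubatch-size micro-batches; the function name is hypothetical:

```python
from math import ceil

def microbatch_chunks(n_tokens: int, ubatch_size: int) -> list[int]:
    """Split n_tokens of queued prompt into physical micro-batches,
    mirroring how a logical batch is cut up by --ubatch-size."""
    full, rem = divmod(n_tokens, ubatch_size)
    return [ubatch_size] * full + ([rem] if rem else [])

# A full 2048-token logical batch with --ubatch-size 1024 runs as two passes:
print(microbatch_chunks(2048, 1024))  # [1024, 1024]
# An odd-sized prompt leaves a short final chunk:
print(microbatch_chunks(2500, 1024))  # [1024, 1024, 452]
```

Fewer, larger passes mean better GPU utilization during prompt processing, which is where these two flags buy most of their speedup.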
Next step they’re planning: self-speculative decoding (https://www.reddit.com/r/LocalLLaMA/comments/1qpjc4a/add_selfspeculative_decoding_no_draft_model/), which drafts multiple tokens ahead and verifies them in a single pass, without needing a separate draft model.
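The core idea behind any speculative decoding scheme can be shown with a toy sketch. This is a simplified greedy-only verification step, with hypothetical names and inputs (real implementations compare probability distributions, not just argmax picks): drafted tokens are accepted while they match the target model's own greedy choice, and the first mismatch is replaced by the target's token.

```python
def verify_draft(draft_tokens: list[int], target_greedy: list[int]) -> list[int]:
    """Greedy speculative-decoding verification (toy sketch):
    accept drafted tokens while they agree with the target model's
    greedy picks at the same positions; on the first disagreement,
    substitute the target's token and stop."""
    accepted = []
    for drafted, target in zip(draft_tokens, target_greedy):
        if drafted == target:
            accepted.append(drafted)
        else:
            accepted.append(target)  # correction from the target model
            break
    return accepted

print(verify_draft([5, 9, 2, 7], [5, 9, 3, 7]))  # [5, 9, 3]
print(verify_draft([5, 9], [5, 9]))              # [5, 9]
```

Because all drafted positions are verified in one forward pass of the target model, each accepted token beyond the first is nearly free, which is where the speedup comes from.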