
Optimizing llama-server Speed with Batch Tweaks

Learn how adjusting batch size parameters in llama-server can significantly improve inference speed and throughput for large language model deployments.

Someone figured out how to squeeze more speed out of llama-server by tweaking batch settings and cache parameters.

Their working config:

 llama-server \
 --jinja \
 --host 0.0.0.0 \
 -m /mnt/models1/GLM/GLM-4.7-Flash-Q8_0.gguf \
 --ctx-size 200000 \
 --parallel 1 \
 --batch-size 2048 \
 --ubatch-size 1024 \
 --flash-attn on \
 --cache-ram 61440 \
 --context-shift

Key parts: --batch-size 2048 sets the logical batch (how many tokens get queued per scheduling pass) and --ubatch-size 1024 the physical micro-batch actually sent to the backend, so prompts are processed in larger chunks. --flash-attn on enables FlashAttention kernels, which speed up the attention computation, and --cache-ram 61440 (60 GiB) lets the server keep more processed context cached in RAM.
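As a rough sketch of how the two batch settings interact (illustrative only, not llama.cpp's actual scheduler), prompt tokens are grouped into logical batches of --batch-size, and each logical batch is then split into physical micro-batches of --ubatch-size:

```python
def plan_microbatches(n_tokens, batch_size=2048, ubatch_size=1024):
    """Illustrative split of a prompt into logical batches,
    each processed as one or more physical micro-batches."""
    plan = []
    for start in range(0, n_tokens, batch_size):
        logical = min(batch_size, n_tokens - start)  # one logical batch
        micro = []
        off = 0
        while off < logical:
            step = min(ubatch_size, logical - off)   # one micro-batch
            micro.append(step)
            off += step
        plan.append(micro)
    return plan

# A 5000-token prompt with the config above:
print(plan_microbatches(5000))  # -> [[1024, 1024], [1024, 1024], [904]]
```

Larger values mean fewer passes over a long prompt, at the cost of more memory per pass.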

Next step they’re planning: self-speculative decoding (https://www.reddit.com/r/LocalLLaMA/comments/1qpjc4a/add_selfspeculative_decoding_no_draft_model/), which predicts multiple tokens per step using the model itself rather than a separate draft model.
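The propose-and-verify loop behind speculative decoding can be sketched with a toy greedy model (the draft and target here are hypothetical stand-in functions, not llama.cpp's implementation; in the self-speculative variant the draft would come from the same model, e.g. via early exit):

```python
def speculative_step(draft, target, prefix, k=4):
    """One toy speculative-decoding step: 'draft' proposes k tokens,
    'target' verifies them; the longest agreeing prefix is accepted,
    then one token from the target is appended (greedy, deterministic)."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):                 # cheap draft pass
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in proposed:                 # verify drafts against the target
        expected = target(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's correction ends the step
            break
    else:
        accepted.append(target(ctx))   # all drafts accepted: one bonus token
    return accepted

# Toy models: when draft == target, all k drafts are accepted,
# so one step yields k + 1 tokens instead of 1.
target = lambda ctx: len(ctx) % 7
print(speculative_step(target, target, [0], k=4))  # -> [1, 2, 3, 4, 5]
```

The speedup comes from verifying the k drafted tokens in a single batched target pass; when the draft often agrees with the target, each step emits several tokens for roughly the cost of one.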