coding by Promptsicle Team

Optimizing llama-server Throughput with Batching

This guide explains how to configure batching parameters in llama-server to maximize throughput by processing multiple requests simultaneously.

Boosting llama-server Performance with Batch Settings

A research team running continuous inference workloads noticed their llama-server instance was processing requests slower than expected, despite having adequate GPU resources. After investigating the default configuration, they discovered that adjusting batch-related parameters increased their throughput by nearly 300% without any hardware changes.

How Batch Processing Works in llama-server

The llama-server implementation from llama.cpp processes multiple inference requests simultaneously through batching mechanisms. Rather than handling each prompt sequentially, the server groups compatible requests together and processes them in parallel, maximizing GPU utilization.

Three primary parameters control this behavior:

The parallel sequences setting (-np or --parallel) determines how many independent requests the server can handle concurrently. Setting it to 4 allows four different users to receive responses simultaneously, with the server interleaving their token generation.

Batch size (-b or --batch-size) sets the logical batch size: the maximum number of tokens the server submits to the model in one decode call. A batch size of 512 means the server can evaluate up to 512 tokens at once across all active sequences.

Physical batch size (-ub or --ubatch-size) caps the number of tokens actually computed in each forward pass. This parameter helps manage VRAM usage by splitting large logical batches into smaller physical chunks.
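To make the split between logical and physical batches concrete, here is a minimal Python sketch of the arithmetic. It is not llama.cpp's scheduler: the function name split_into_ubatches is hypothetical, and clamping the physical size to the logical size is an assumption about how the two values interact.

```python
def split_into_ubatches(n_tokens: int, n_batch: int, n_ubatch: int) -> list[list[int]]:
    """Return, for each logical batch, the sizes of its physical chunks."""
    if n_tokens < 0 or n_batch <= 0 or n_ubatch <= 0:
        raise ValueError("n_tokens must be >= 0 and batch sizes must be > 0")
    # Assumption: a physical batch is never larger than the logical batch.
    n_ubatch = min(n_ubatch, n_batch)

    plan = []
    remaining = n_tokens
    while remaining > 0:
        logical = min(remaining, n_batch)   # tokens in this logical batch (-b)
        chunks = []
        left = logical
        while left > 0:
            step = min(left, n_ubatch)      # tokens in this physical chunk (-ub)
            chunks.append(step)
            left -= step
        plan.append(chunks)
        remaining -= logical
    return plan


# A 3000-token prompt with -b 2048 -ub 512
print(split_into_ubatches(3000, 2048, 512))
```

With -b 2048 -ub 512, a 3000-token prompt is submitted as two logical batches: the first is computed in four 512-token chunks, the second in chunks of 512 and 440 tokens.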

A typical configuration might look like:

./llama-server -m model.gguf -c 4096 -np 8 -b 2048 -ub 512

This setup allows 8 parallel sequences, processes up to 2048 tokens per batch, and sends them to the GPU in chunks of 512 tokens.
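One detail worth checking with this command is how much context each sequence actually gets. In many llama.cpp builds the total context set with -c is divided evenly among the parallel slots; the sketch below treats that as an assumption, since newer builds may share the KV cache differently. Both function names are hypothetical.

```python
def context_per_slot(n_ctx: int, n_parallel: int) -> int:
    """Tokens available to each slot, assuming -c is split evenly across -np slots."""
    if n_ctx <= 0 or n_parallel <= 0:
        raise ValueError("n_ctx and n_parallel must be positive")
    return n_ctx // n_parallel


def total_context_needed(per_slot_ctx: int, n_parallel: int) -> int:
    """Value to pass to -c so that every slot gets per_slot_ctx tokens."""
    if per_slot_ctx <= 0 or n_parallel <= 0:
        raise ValueError("per_slot_ctx and n_parallel must be positive")
    return per_slot_ctx * n_parallel


print(context_per_slot(4096, 8))       # per-slot window for -c 4096 -np 8
print(total_context_needed(4096, 8))   # -c needed for 8 slots of 4096 tokens
```

Under that assumption, -c 4096 -np 8 leaves each request 512 tokens for prompt and response combined, while giving each of the 8 slots a 4096-token window would need -c 32768.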

Real-World Performance Gains

The relationship between these parameters directly impacts throughput and latency characteristics. Higher parallel sequence counts improve overall throughput when serving multiple users but can increase individual response times if the batch size isn’t scaled accordingly.

Testing with a 13B parameter model on an RTX 4090 showed distinct performance profiles. With default settings (parallel=1, batch=512), the server processed approximately 28 tokens per second for a single user. Increasing parallel sequences to 4 while maintaining the same batch size dropped per-user speed to 18 tokens per second, but total system throughput reached 72 tokens per second.

Raising the batch size to 2048 with 4 parallel sequences recovered individual performance to 25 tokens per second while maintaining high aggregate throughput. The physical batch size of 512 kept VRAM usage at 18GB, well within the card’s capacity.
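The aggregate figures follow from simple multiplication, sketched below in Python. The helper name aggregate_tps is hypothetical, and the calculation assumes every slot is busy and every user sees the same per-user rate.

```python
def aggregate_tps(per_user_tps: float, n_parallel: int) -> float:
    """Total tokens/s if every slot is busy and each user sees per_user_tps."""
    return per_user_tps * n_parallel


BASELINE = 28.0  # single user, parallel=1, batch=512 (figure reported in the text)

for label, per_user, slots in [
    ("np=4, b=512", 18.0, 4),
    ("np=4, b=2048", 25.0, 4),
]:
    total = aggregate_tps(per_user, slots)
    print(f"{label}: {total:.0f} tok/s total, "
          f"{total / BASELINE:.2f}x the single-user baseline")
```

The 72 tokens per second result matches the figure reported above. The 100 tokens per second for the second configuration is derived, not measured, and only holds if all four users sustain 25 tokens per second at once.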

Memory constraints become the limiting factor for batch optimization. Each additional parallel sequence requires context space, and larger batch sizes demand more VRAM. Quantized models (Q4_K_M or Q5_K_M) provide more headroom for aggressive batch settings compared to full-precision weights.
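For a rough sense of how context translates into VRAM, the sketch below estimates KV cache size with the standard transformer formula (keys plus values, per layer, per token). The model dimensions are illustrative values for a Llama-2-13B-style architecture and are an assumption, since the text does not say which 13B model was tested. The estimate ignores model weights, compute buffers, and any KV cache quantization.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


GIB = 2 ** 30

# Assumed dimensions (Llama-2-13B-style, f16 cache); not taken from the article.
LAYERS, KV_HEADS, HEAD_DIM = 40, 40, 128

for slots in (1, 4, 8):
    total_ctx = 4096 * slots  # keep a 4096-token window per slot
    gib = kv_cache_bytes(LAYERS, total_ctx, KV_HEADS, HEAD_DIM) / GIB
    print(f"{slots} slot(s) x 4096 tokens: {gib:.3f} GiB of KV cache")
```

With these assumed dimensions, one 4096-token slot costs about 3.1 GiB in f16 and four such slots about 12.5 GiB, which is why per-slot context usually has to shrink as the parallel sequence count grows.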

The server’s continuous batching algorithm also affects efficiency. Unlike traditional static batching, llama-server can add new requests to existing batches mid-generation, reducing wait times. This works best when the batch size significantly exceeds the parallel sequence count, giving the scheduler flexibility to pack requests efficiently.
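The difference between static and continuous batching can be illustrated with a toy Python model. This is a sketch, not the llama-server scheduler: it counts decode steps only, assumes each active slot produces one token per step, and ignores prompt processing. The function names are hypothetical.

```python
from collections import deque


def _check(lengths: list[int], n_slots: int) -> None:
    if n_slots <= 0 or any(n <= 0 for n in lengths):
        raise ValueError("n_slots and all request lengths must be positive")


def static_batching_steps(lengths: list[int], n_slots: int) -> int:
    """Decode steps when each group of n_slots requests must finish together."""
    _check(lengths, n_slots)
    return sum(max(lengths[i:i + n_slots])
               for i in range(0, len(lengths), n_slots))


def continuous_batching_steps(lengths: list[int], n_slots: int) -> int:
    """Decode steps when a queued request takes over a slot as soon as it frees up."""
    _check(lengths, n_slots)
    queue = deque(lengths)
    active = []  # tokens still to generate for each busy slot
    steps = 0
    while queue or active:
        # Fill any free slots from the queue before the next step.
        while queue and len(active) < n_slots:
            active.append(queue.popleft())
        # One decode step: every active slot emits one token;
        # slots whose request is now complete are freed.
        active = [left - 1 for left in active if left - 1 > 0]
        steps += 1
    return steps


requests = [10, 2, 2, 2]  # tokens to generate per request
print(static_batching_steps(requests, n_slots=2))
print(continuous_batching_steps(requests, n_slots=2))
```

In this example the static schedule needs 12 steps because the first short request waits for the 10-token one, while the continuous schedule finishes in 10 steps by feeding the short requests through the second slot.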

Tuning for Specific Workloads

Different deployment scenarios benefit from different configurations. API services handling many short requests should prioritize high parallel sequence counts with moderate batch sizes. A chatbot serving 20 concurrent users might use -np 20 -b 1024 -ub 256, emphasizing responsiveness over raw throughput.

Batch processing jobs like document summarization benefit from the opposite approach. Setting -np 2 -b 4096 -ub 1024 maximizes processing speed for fewer, longer contexts. The larger physical batch size makes sense here since latency matters less than completion time.
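The two profiles above can be kept side by side in a small launcher helper. The Python sketch below only builds the argument list from flags already shown in this guide (-m, -c, -np, -b, -ub); the function name, the PROFILES dictionary, the model path, and the context value are hypothetical placeholders.

```python
import shlex

# Profile values taken from the two scenarios described above.
PROFILES = {
    "chatbot": {"n_parallel": 20, "n_batch": 1024, "n_ubatch": 256},
    "summarization": {"n_parallel": 2, "n_batch": 4096, "n_ubatch": 1024},
}


def llama_server_args(model: str, n_ctx: int, n_parallel: int, n_batch: int,
                      n_ubatch: int, binary: str = "./llama-server") -> list[str]:
    """Build an argv list for llama-server; does not start the process."""
    if min(n_ctx, n_parallel, n_batch, n_ubatch) <= 0:
        raise ValueError("all numeric settings must be positive")
    if n_ubatch > n_batch:
        raise ValueError("physical batch (-ub) should not exceed logical batch (-b)")
    return [binary, "-m", model, "-c", str(n_ctx),
            "-np", str(n_parallel), "-b", str(n_batch), "-ub", str(n_ubatch)]


# Placeholder model path and context size; adjust for the actual deployment.
for name, profile in PROFILES.items():
    print(name, "->", shlex.join(llama_server_args("model.gguf", 16384, **profile)))
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting problems when the model path contains spaces.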

Monitoring tools help identify bottlenecks. When started with the --metrics flag, the server exposes Prometheus-compatible metrics at http://localhost:8080/metrics, showing queue depth, active slots, and processing times. Consistently full queues suggest increasing parallel sequences, while low GPU utilization indicates batch sizes could grow.
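A short Python sketch for reading that endpoint is shown below. It parses the plain-text Prometheus format in a simplified way (no timestamps, labels kept as part of the name). The metric names in the sample, llamacpp:requests_processing and llamacpp:requests_deferred, are assumptions that may differ between builds, and the sample values are made up for illustration.

```python
import urllib.request


def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse 'name value' lines, skipping comments and malformed lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if not name:
            continue
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics


def fetch_metrics(url: str = "http://localhost:8080/metrics",
                  timeout: float = 5.0) -> dict[str, float]:
    """Fetch and parse metrics from a running server (not called here)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


def queue_is_backing_up(metrics: dict[str, float],
                        deferred_key: str = "llamacpp:requests_deferred") -> bool:
    """True if requests are waiting for a free slot (assumed metric name)."""
    return metrics.get(deferred_key, 0.0) > 0


# Illustrative sample, not real server output.
SAMPLE = """\
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 4
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 3
"""

parsed = parse_prometheus_text(SAMPLE)
print(parsed)
print("queue backing up:", queue_is_backing_up(parsed))
```

Polling fetch_metrics() every few seconds and checking whether the deferred count stays above zero is one simple way to decide whether more parallel sequences are warranted.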

Future Developments

The llama.cpp project continues refining batch processing capabilities. Recent commits have introduced speculative decoding support, in which a small draft model proposes several tokens and the main model verifies them in a single forward pass. Because verification is itself a batched operation, this could further amplify the benefits of proper batch configuration.
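The idea behind speculative decoding can be sketched in a few lines of Python. This is a toy greedy variant with stand-in functions for the draft and target models, not the llama.cpp implementation; all names are hypothetical. A real system would verify all proposed tokens in one batched forward pass instead of the loop shown here.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]


def speculative_step(prefix: list[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> list[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target model checks each proposed position in order.
    accepted = []
    ctx = list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models: the target counts upward modulo 5; the draft agrees only
# while the context is shorter than 3 tokens.
def target(ctx: list[int]) -> int:
    return len(ctx) % 5


def draft(ctx: list[int]) -> int:
    return len(ctx) % 5 if len(ctx) < 3 else 0


print(speculative_step([0], draft, target, k=4))
```

Here the draft's first two tokens are accepted and the third is corrected by the target, so one round yields three tokens where plain decoding would have produced one.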

Flash Attention integration in newer builds reduces memory overhead for large batch sizes, potentially allowing even more aggressive settings on the same hardware. Community benchmarks suggest this could enable 50% higher parallel sequence counts without additional VRAM.

As model architectures evolve toward mixture-of-experts designs, batch optimization will become more nuanced. Different experts may have varying computational costs, requiring dynamic batch sizing based on which experts activate for each request. The groundwork laid by current batch configuration options positions llama-server well for these advances.