GLM-4-Flash-7B Benchmarked: Strong Performance on Consumer GPUs
Someone benchmarked GLM-4-Flash-7B and found it’s surprisingly capable for a 7B model, with solid performance on consumer hardware.
Key numbers from their testing:
On H200 with vLLM at 64K context:
- Single user: 207 tok/s, 35ms TTFT
- 32 concurrent users: 2,267 tok/s, 85ms TTFT
- Peak: 4,398 tok/s
On RTX 6000 Ada (48GB) with llama.cpp GGUF:
- Q4_K_XL quant: 112 tok/s
- Q6_K_XL quant: 100 tok/s
- Q8_K_XL quant: 91 tok/s
They used the vLLM benchmark CLI (https://docs.vllm.ai/en/latest/benchmarking/cli/) with 500 prompts from the InstructCoder dataset. The Unsloth dynamic quants worked well for consumer GPU deployment.
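For anyone wanting to reproduce a similar run, here's a sketch of what that setup might look like with vLLM's benchmark CLI. The model identifier, context flag, and HF dataset path are assumptions, not details confirmed in the post; check the linked docs for the exact flags your vLLM version supports.

```shell
# Serve the model (model name and max context are assumptions; adjust to your checkpoint)
vllm serve zai-org/GLM-4-Flash-7B --max-model-len 65536

# In another shell: benchmark the running server with 500 InstructCoder prompts
# (HF dataset path is an assumption; concurrency mirrors the post's 32-user test)
vllm bench serve \
  --model zai-org/GLM-4-Flash-7B \
  --dataset-name hf \
  --dataset-path likaixin/InstructCoder \
  --num-prompts 500 \
  --max-concurrency 32
```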
The model handles 64K context on a single H200, though stretching to its full 200K context window takes 2xH200. Pretty decent option for anyone looking at smaller models that can still handle longer contexts without completely tanking throughput.
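Why 64K fits on one card but 200K spills onto a second comes down largely to KV-cache growth, which is linear in context length. A rough sizing sketch follows; every architecture number in it (layers, KV heads, head dim, dtype) is a hypothetical placeholder, not GLM-4-Flash-7B's published config.

```python
# Rough KV-cache sizing: grows linearly with context length.
# All architecture numbers are hypothetical placeholders,
# NOT GLM-4-Flash-7B's actual config.
def kv_cache_bytes(tokens: int,
                   layers: int = 32,
                   kv_heads: int = 4,       # assumes GQA
                   head_dim: int = 128,
                   bytes_per_elem: int = 2  # fp16/bf16
                   ) -> int:
    # 2x for keys + values, per layer, per KV head, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

gib = 1024 ** 3
print(f"64K ctx:  {kv_cache_bytes(64 * 1024) / gib:.1f} GiB per sequence")
print(f"200K ctx: {kv_cache_bytes(200 * 1024) / gib:.1f} GiB per sequence")
```

Under these assumed numbers a single 200K-token sequence is only ~12.5 GiB, but the cache is per sequence: at 32 concurrent users it multiplies accordingly, on top of the model weights, which is how long contexts push past a single card's memory.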