
GLM-4-Flash-7B Benchmarked: Strong Performance on Consumer GPUs

GLM-4-Flash-7B demonstrates competitive benchmark performance on consumer-grade GPUs, offering efficient inference speeds and strong accuracy across language tasks.

Someone benchmarked GLM-4-Flash-7B and found it’s surprisingly capable for a 7B model, with solid performance on consumer hardware.

Key numbers from their testing:

On H200 with vLLM at 64K context:

  • Single user: 207 tok/s, 35ms TTFT
  • 32 concurrent users: 2,267 tok/s, 85ms TTFT
  • Peak: 4,398 tok/s

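The H200 numbers above imply that per-user speed falls off as concurrency rises. A quick sanity check using only the figures from the post (arithmetic only, no new data):

```python
single_user = 207     # tok/s with 1 user
aggregate_32 = 2267   # tok/s total across 32 concurrent users

# Per-user throughput at 32 concurrent users
per_user_32 = aggregate_32 / 32           # ~70.8 tok/s each

# How close 32 users get to perfectly linear scaling of the single-user rate
scaling_efficiency = aggregate_32 / (single_user * 32)  # ~0.34

print(f"{per_user_32:.1f} tok/s per user, "
      f"{scaling_efficiency:.0%} of linear scaling")
```

So each of the 32 users still sees roughly 70 tok/s, which is very usable even though aggregate scaling is about a third of ideal.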
On RTX 6000 Ada (48GB) with llama.cpp GGUF:

  • Q4_K_XL quant: 112 tok/s
  • Q6_K_XL quant: 100 tok/s
  • Q8_K_XL quant: 91 tok/s
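The speed ordering of those quants tracks their on-disk size. A back-of-the-envelope weight-size estimate, using approximate bits-per-weight figures for common llama.cpp quant families (the Unsloth `_XL` dynamic variants keep selected layers at higher precision, so real GGUF files run somewhat larger than these floors):

```python
PARAMS = 7e9  # ~7B parameters

# Rough effective bits-per-weight for llama.cpp quant families (approximate)
BPW = {"Q4_K": 4.5, "Q6_K": 6.6, "Q8_0": 8.5}

for name, bpw in BPW.items():
    gib = PARAMS * bpw / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name}: ~{gib:.1f} GiB of weights")
```

All three fit comfortably in a 48GB card alongside KV cache, which is why the RTX 6000 Ada runs work at every quant level tested.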

They used the vLLM benchmark CLI (https://docs.vllm.ai/en/latest/benchmarking/cli/) with 500 prompts from the InstructCoder dataset. Unsloth's dynamic quants worked well for consumer-GPU deployment.
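A rough reproduction sketch using the vLLM benchmark CLI's `serve` mode; the model ID here is a placeholder and the flags are paraphrased from the linked docs, so check them against your vLLM version before running:

```shell
# Start an OpenAI-compatible vLLM server (model ID is a placeholder)
vllm serve zai-org/GLM-4-Flash-7B --max-model-len 65536

# In another shell: replay 500 InstructCoder prompts against it
vllm bench serve \
  --model zai-org/GLM-4-Flash-7B \
  --dataset-name hf \
  --dataset-path likaixin/InstructCoder \
  --num-prompts 500
```

The benchmark reports TTFT and output token throughput, which is where the numbers above come from.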

The model handles 64K context on a single H200, though stretching to the full 200K context window needs two H200s. Pretty decent option for anyone looking at smaller models that can still handle long contexts without completely tanking throughput.
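Why long context gets expensive: KV cache grows linearly with context length (and with batch size). A purely illustrative estimate with an assumed architecture (32 layers, 4 GQA KV heads, head dim 128, fp16 cache; these are NOT GLM-4-Flash-7B's published config):

```python
# Assumed, illustrative architecture -- not the model's actual config
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 4, 128, 2

# K and V caches, per token, across all layers
per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # bytes

for ctx in (65536, 200_000):
    gib = per_token * ctx / 2**30
    print(f"{ctx:>7} tokens: ~{gib:.1f} GiB KV cache per sequence")
```

Multiply the per-sequence figure by the number of concurrent requests (vLLM pre-allocates cache blocks), and it's clear why a full 200K window at real batch sizes spills past a single card.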