
FlashHead: 4× Faster LLM Inference with IR-Based Head

FlashHead accelerates large language model inference by up to four times with an information retrieval-based language model head.

Someone found a way to speed up small language models by swapping out the “head” architecture - the part that predicts the next token.

FlashHead replaces the traditional language model head with an information retrieval approach that’s way faster but keeps perfect accuracy. On an RTX 3500, Llama 3.2 1B goes from 130 to 163 tokens/sec in BF16 (25% faster). Stack it with 4-bit quantization and it hits 485 tokens/sec - nearly 4× the original speed.
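The post doesn’t describe FlashHead’s internals, but the general idea behind a retrieval-based LM head is easy to sketch: instead of multiplying the hidden state against the full vocabulary embedding matrix, shortlist candidate tokens via a cheap nearest-cluster lookup, then score only those exactly. Everything below (clustering scheme, names, sizes) is illustrative, not FlashHead’s actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, C = 1000, 64, 16  # vocab size, hidden dim, number of clusters

W = rng.standard_normal((V, d)).astype(np.float32)  # output embedding table
centroids = W[rng.choice(V, C, replace=False)]      # crude cluster centers

# Offline: assign every vocab row to its nearest centroid (by dot product).
assign = np.argmax(W @ centroids.T, axis=1)
buckets = [np.flatnonzero(assign == c) for c in range(C)]

def full_head(h):
    # Baseline LM head: score all V tokens with one big matmul.
    return int(np.argmax(W @ h))

def retrieval_head(h, n_probe=4):
    # IR-style head: probe the n_probe closest clusters, then score
    # only the tokens in those buckets exactly.
    top_c = np.argsort(centroids @ h)[-n_probe:]
    cand = np.concatenate([buckets[c] for c in top_c])
    return int(cand[np.argmax(W[cand] @ h)])
```

The win is that the retrieval path touches only a fraction of the `V × d` matrix per token; getting this to match the full head exactly (as FlashHead claims) is the hard part.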

Works as a drop-in replacement with vLLM:

 --model embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16
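Assuming vLLM’s standard CLI and its OpenAI-compatible server (the model name is from the post; port 8000 is vLLM’s default), serving and querying it would look like:

```shell
# Launch an OpenAI-compatible server with the FlashHead checkpoint
vllm serve embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16

# Query it like any other vLLM-served model
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16",
       "prompt": "Hello", "max_tokens": 32}'
```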

Pretty cool that it stacks on top of existing optimizations like quantization rather than replacing them. Models behave identically to their originals, just generate tokens faster.

GitHub: https://github.com/embedl/embedl-models