coding by Promptsicle Team

Hardware-First Guide to Selecting Open-Source LLMs

A comprehensive guide that helps developers choose the right open-source language model based on their available hardware specifications, memory constraints,

A developer with a single RTX 4090 wants to run a coding assistant locally. Another team has access to eight A100 GPUs and needs a model for document analysis. The hardware available dictates which open-source LLMs will actually run, making GPU specifications the starting point for model selection rather than an afterthought.

Matching Models to Memory Constraints

VRAM capacity determines the maximum model size that can run efficiently. A 7B-parameter model in 16-bit precision needs approximately 14GB for its weights alone (7 billion parameters at 2 bytes each), while 4-bit quantization reduces this to roughly 4GB. Llama 3.1 8B fits comfortably on consumer GPUs with 12GB of VRAM when quantized, while the 70B variant needs about 35GB for weights at 4-bit, so plan for at least 40GB once the KV cache and runtime overhead are included.

Memory requirements scale with both parameter count and precision. Running a model in fp16 doubles the weight memory compared to 8-bit quantization, but avoids quantization error. The trade-off becomes critical at hardware boundaries: a 13B model in 4-bit quantization (about 6.5GB of weights) fits on a 16GB GPU with room left for context, while the same model in 8-bit (about 13GB of weights) leaves almost no headroom on that card and is more comfortable on a 24GB GPU.
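
The arithmetic behind these boundaries can be sketched in a few lines. This is a weights-only estimate: the helper name `weight_gb` and the convention of 1 GB = 10^9 bytes are assumptions made for illustration, and real usage adds KV cache, activations, and framework overhead on top.

```python
# Weights-only memory estimate: parameters x bits per parameter.
# Illustrative helper, not part of any library.

def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


if __name__ == "__main__":
    # Compare a few model sizes across common precisions.
    for size in (7, 13, 70):
        row = ", ".join(
            f"{bits}-bit: {weight_gb(size, bits):.1f} GB" for bits in (16, 8, 4)
        )
        print(f"{size}B -> {row}")
```

Comparing each figure against the card's VRAM, minus a few gigabytes of headroom, gives a quick first filter before trying a model.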

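Multi-GPU setups enable larger models by dividing the weights across devices, either by splitting individual weight matrices (tensor parallelism) or by assigning whole layers to different GPUs (pipeline or layer splitting). Mixtral 8x7B has about 47B total parameters, and all of them must be resident in memory even though only a subset of experts activates per token. Sparse activation lowers the compute per token, not the memory footprint. Quantized to 4-bit, its weights come to roughly 24GB, which is why it can run across dual 24GB GPUs. Tools like https://github.com/ggerganov/llama.cpp report how much memory they allocate for weights and the KV cache when a model loads, which makes it easy to check an estimate against the chosen context length, batch size, and quantization scheme.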
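
To separate the memory question from the compute question for a mixture-of-experts model, the sketch below treats the two quantities independently. The 47B total comes from the text above; the figure of roughly 13B active parameters per token is the commonly cited number for Mixtral 8x7B and is treated here as an assumption, as are the helper names, the even split across GPUs, and the 2GB per-GPU reserve.

```python
# Mixture-of-experts: memory follows total parameters, per-token compute
# follows active parameters. All names and figures here are illustrative.

TOTAL_PARAMS_B = 47.0    # total parameters in billions (from the text)
ACTIVE_PARAMS_B = 13.0   # assumed parameters used per token, in billions
RESERVE_GB = 2.0         # assumed per-GPU headroom for cache and runtime


def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def per_gpu_gb(params_billions: float, bits_per_param: int, num_gpus: int) -> float:
    """Weight memory per GPU, assuming an even split with no replication."""
    return weights_gb(params_billions, bits_per_param) / num_gpus


if __name__ == "__main__":
    for bits in (16, 8, 4):
        share = per_gpu_gb(TOTAL_PARAMS_B, bits, num_gpus=2)
        verdict = "fits" if share + RESERVE_GB <= 24 else "does not fit"
        print(f"{bits}-bit: {share:.2f} GB of weights per GPU, {verdict} on 24 GB cards")
    # Only the active experts are read for each token.
    print(f"Weights touched per token at 4-bit: {weights_gb(ACTIVE_PARAMS_B, 4):.1f} GB")
```

Under these assumptions only the 4-bit variant fits on two 24GB cards, while the per-token work resembles that of a much smaller dense model.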

Compute Requirements Beyond Memory

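Inference speed depends on GPU compute capability and memory bandwidth, not just capacity. An RTX 3090 with 24GB of VRAM runs inference slower than an RTX 4090 with the same memory because of architectural improvements. Tensor cores in newer GPUs accelerate the matrix operations central to transformer models, so a 4090 can approach twice the speed of a 3090 on compute-bound work such as prompt processing. Token-by-token generation is usually limited by memory bandwidth instead, where the two cards are much closer.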
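
A rough way to reason about generation speed: at batch size 1, producing each new token requires reading roughly all of the model weights from VRAM, so memory bandwidth sets a ceiling on tokens per second. The sketch below applies that rule of thumb. The bandwidth figures are published peak specifications (about 936 GB/s for the RTX 3090 and 1008 GB/s for the RTX 4090), the helper name is hypothetical, and real throughput lands below this ceiling. Compute-bound phases such as prompt processing depend on tensor throughput and are not covered by this bound.

```python
# Bandwidth-bound ceiling for single-stream token generation.
# Rule of thumb only; measured speeds are lower.

def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when every token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights = 7 * 4 / 8  # 7B model at 4-bit, weights only: 3.5 GB
    for name, bandwidth in (("RTX 3090", 936.0), ("RTX 4090", 1008.0)):
        ceiling = max_tokens_per_second(bandwidth, weights)
        print(f"{name}: at most {ceiling:.0f} tokens/s")
```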

Context length increases both compute and memory demands. Processing a 32K-token prompt takes far more computation than a 4K-token prompt on the same model, because attention cost grows with sequence length and the KV cache grows with every token kept in context. Models like Mistral 7B support extended contexts but run slower when actually using them. Hardware with higher memory bandwidth, such as the H100’s HBM3 compared with the GDDR6X on consumer GPUs, handles long contexts more efficiently.
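
The memory side of long contexts can be estimated from the KV cache, which stores one key and one value vector per layer for every token in context. The sketch below uses the standard size formula. The architecture numbers (32 layers, 8 key/value heads, head dimension 128) are the commonly published values for Mistral 7B and are assumptions here, and the calculation ignores sliding-window attention, which can cap the cache in practice.

```python
# KV cache size: 2 (keys and values) x layers x KV heads x head dim
# x tokens x bytes per value. Function name is illustrative.

def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # fp16
) -> int:
    """Return the KV cache size in bytes for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed Mistral 7B-style configuration.
    for tokens in (4096, 32768):
        size = kv_cache_bytes(32, 8, 128, tokens)
        print(f"{tokens} tokens: {size / GIB:.1f} GiB")  # 0.5 GiB, then 4.0 GiB
```

The cache grows linearly with context, so an eightfold increase in context length costs eight times the cache memory per concurrent sequence.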

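Quantization methods affect speed differently across hardware. GPTQ is widely used on consumer GPUs, while AWQ is often reported to run faster on datacenter cards. Results vary with the kernels and serving framework in use, so benchmark both on the target hardware. The llama.cpp framework uses the GGUF format, whose quantization schemes run on CPUs and can also offload layers to a GPU, enabling models to run on systems without dedicated GPUs. Before downloading anything, a rough memory estimate helps narrow the choices: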

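# Example memory calculation for model selection (rough heuristic, not a measurement)
def estimate_vram_gb(params_billions, bits_per_param, context_length=2048):
    """Estimate VRAM in GB from parameter count, precision, and context length."""
    model_size = params_billions * bits_per_param / 8  # weights in GB
    # Heuristic context term that grows with context length and model size
    context_overhead = context_length * params_billions * 0.0001
    return model_size + context_overhead + 2  # +2GB for runtime overhead

# 13B model at 4-bit quantization
print(f"VRAM needed: {estimate_vram_gb(13, 4):.1f}GB")  # ~11.2GB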

Selecting Models for Specific Hardware Profiles

Consumer GPUs in the 12-16GB range suit 7B-13B models with 4-bit quantization. Llama 3.1 8B, Mistral 7B, and Phi-3 Medium (14B, just above that range) work well in this tier for coding, summarization, and general chat. These models handle 4K-8K context windows comfortably, which is sufficient for most applications.

Professional GPUs with 24-48GB enable 30B-70B models at 4-bit, with 70B needing the upper end of that range, or several smaller models loaded simultaneously. CodeLlama 34B generally produces better code than 7B alternatives, while Mixtral 8x7B offers GPT-3.5-class performance. This hardware tier supports 16K+ context lengths and can run multiple specialized models for different tasks.

Datacenter deployments with 80GB+ per GPU can run 70B+ models in higher precision or serve multiple concurrent users. Llama 3.1 70B approaches GPT-4 performance on many benchmarks, while Falcon 180B pushes capabilities further at the cost of extreme resource requirements. These setups benefit from frameworks like vLLM (https://github.com/vllm-project/vllm) that optimize throughput through continuous batching.
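
The three hardware profiles above can be expressed as a simple lookup. This is a sketch of the tiers as described in this guide, not a recommendation engine: the `Tier` class and `pick_tier` function are hypothetical names, and treating each tier's lower bound as a threshold (so a 20GB card maps to the consumer tier) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    min_vram_gb: int
    model_range: str
    examples: Tuple[str, ...]


# Ordered from largest to smallest so the first match is the best tier.
TIERS = (
    Tier("datacenter", 80, "70B+ in higher precision",
         ("Llama 3.1 70B", "Falcon 180B")),
    Tier("professional", 24, "30B-70B, quantized",
         ("CodeLlama 34B", "Mixtral 8x7B")),
    Tier("consumer", 12, "7B-13B, 4-bit",
         ("Llama 3.1 8B", "Mistral 7B", "Phi-3 Medium")),
)


def pick_tier(vram_gb: float) -> Optional[Tier]:
    """Return the highest tier whose minimum VRAM the GPU meets, else None."""
    for tier in TIERS:
        if vram_gb >= tier.min_vram_gb:
            return tier
    return None


if __name__ == "__main__":
    for vram in (8, 16, 24, 48, 80):
        tier = pick_tier(vram)
        label = f"{tier.name}: {tier.model_range}" if tier else "below the tiers in this guide"
        print(f"{vram} GB -> {label}")
```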

Evolving Hardware Landscape

Emerging quantization techniques continue to reduce memory requirements without proportional quality loss. 3-bit and even 2-bit quantization methods now bring 70B models within reach of consumer hardware, such as a single 24GB card or a pair of them, though with measurable accuracy degradation. The gap between consumer and datacenter capabilities narrows as optimization techniques improve.
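
Turning the memory estimate around shows what lower bit widths buy: for a fixed VRAM budget, the largest model that fits grows as bits per parameter shrink. The sketch below is a weights-only bound with an assumed 2GB reserve and a hypothetical helper name. Real quantization formats store scales and other metadata, so their effective bits per weight are somewhat higher than the nominal value.

```python
# Largest parameter count whose weights fit in a VRAM budget.
# Weights-only bound; the reserve for cache and runtime is an assumption.

def max_params_billions(
    vram_gb: float, bits_per_param: float, reserve_gb: float = 2.0
) -> float:
    """Return the largest model size (billions of parameters) that fits."""
    if bits_per_param <= 0:
        raise ValueError("bits_per_param must be positive")
    usable_gb = vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0.0
    return usable_gb * 8 / bits_per_param


if __name__ == "__main__":
    for bits in (4, 3, 2):
        limit = max_params_billions(24, bits)
        fits = "yes" if limit >= 70 else "no"
        print(f"24 GB at {bits}-bit: up to {limit:.1f}B parameters, 70B fits: {fits}")
```

Under these assumptions a 70B model fits on a single 24GB card only at 2-bit, which matches the point that the most aggressive schemes are what bring this size class to consumer hardware.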

Specialized inference hardware from companies like Groq and Cerebras promises order-of-magnitude speedups for specific model architectures. These platforms may shift selection criteria from parameter count to tokens-per-second, making smaller, faster models competitive with larger alternatives. The hardware-first approach remains essential, but the specific calculations continue evolving with each generation of accelerators.