
Hardware-First Guide to Selecting Open-Source LLMs

A hardware-first framework categorizes open-source language model selection into three VRAM tiers: unlimited, medium, and small, helping developers choose models that will actually run on their systems.

Framework for Choosing LLMs by Hardware Constraints

What It Is

A hardware-first framework for selecting open-source language models flips the traditional approach of choosing models based on parameter counts or benchmark scores. Instead, it categorizes recommendations into three tiers based on available VRAM: unlimited (>128GB), medium (8-128GB), and small (<8GB). This classification acknowledges that most developers face real hardware limitations rather than having infinite compute resources at their disposal.
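The three tiers reduce to a simple lookup. A minimal sketch, using the thresholds above (the function name is illustrative, not from any library):

```python
def vram_tier(vram_gb: float) -> str:
    """Map available VRAM (in GB) to one of the three hardware tiers."""
    if vram_gb > 128:
        return "unlimited"  # >128GB: full-precision large models
    if vram_gb >= 8:
        return "medium"     # 8-128GB: mid-size or quantized large models
    return "small"          # <8GB: small quantized models

print(vram_tier(24))   # medium
print(vram_tier(4))    # small
print(vram_tier(160))  # unlimited
```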

The framework recognizes that a 70-billion parameter model might achieve impressive results on paper, but remains completely impractical for someone running a consumer GPU with 8GB of memory. By starting with hardware constraints, developers can immediately filter out models that won’t run on their systems and focus on options that will actually execute without constant out-of-memory errors or glacial inference speeds.

This approach also embraces the reality that different tasks require different models anyway. A developer might use a smaller, faster model for classification tasks while reserving larger models for complex reasoning - assuming their hardware supports both options.

Why It Matters

This framework addresses a persistent gap between model discussions and practical deployment. Technical forums and research papers frequently highlight state-of-the-art models without acknowledging that most developers work with standard consumer hardware. A developer with an RTX 4060 gains nothing from recommendations about models requiring 80GB of VRAM across multiple GPUs.

Individual developers and small teams benefit most from this hardware-centric approach. Rather than wasting time downloading and attempting to run incompatible models, they can identify viable options immediately. This saves bandwidth, storage space, and hours of troubleshooting quantization settings or context length adjustments.

The framework also shifts conversations away from purely theoretical performance comparisons. Benchmark scores matter less when a model won’t fit in memory. By grounding recommendations in actual hardware availability, discussions become more actionable and less focused on aspirational setups that most practitioners can’t access.

Organizations running on-premises infrastructure gain clarity about which models match their existing GPU investments. Instead of assuming cloud deployment is the only path forward, teams can evaluate what runs locally on their current hardware before committing to ongoing API costs or cloud compute expenses.

Getting Started

Developers can implement this framework by first checking their available VRAM. On systems with NVIDIA GPUs, running nvidia-smi in a terminal displays total and available memory. For AMD cards, rocm-smi provides similar information.
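For scripting, nvidia-smi can emit machine-readable CSV via `nvidia-smi --query-gpu=memory.total,memory.free --format=csv,noheader,nounits`. A small sketch of parsing that output into usable numbers; the sample string below is illustrative, not real telemetry:

```python
def parse_nvidia_smi_csv(output: str) -> list:
    """Parse nvidia-smi's CSV output (one line per GPU, memory in MiB)."""
    gpus = []
    for line in output.strip().splitlines():
        total, free = (int(x) for x in line.split(","))
        gpus.append({"total_mib": total, "free_mib": free})
    return gpus

# Hypothetical output from a single 24GB card:
sample = "24576, 23100\n"
print(parse_nvidia_smi_csv(sample))  # [{'total_mib': 24576, 'free_mib': 23100}]
```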

Once hardware capacity is known, model selection becomes straightforward. For systems under 8GB, quantized versions of smaller models like Phi-3 or Llama 3.2 (3B parameters) offer reasonable performance. These can be loaded using libraries like transformers or llama.cpp:


from transformers import AutoModelForCausalLM

# device_map="auto" spreads layers across available GPU/CPU memory;
# torch_dtype="auto" uses the precision stored in the model checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
    torch_dtype="auto"
)

Systems in the 8-128GB range can handle larger models with quantization. Llama 3.1 (8B or 70B with 4-bit quantization) or Mixtral models become viable options. Tools like Ollama (https://ollama.ai) simplify running these models locally with automatic memory management.
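A back-of-the-envelope weights-only estimate shows why 4-bit quantization brings 70B models into this tier. This sketch ignores KV cache and runtime overhead, which add more on top:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate VRAM needed for model weights alone, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# 70B at 4-bit needs ~35 GB of weights; at fp16 it would need ~140 GB
print(weight_memory_gb(70, 4))   # 35.0
print(weight_memory_gb(70, 16))  # 140.0
```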

For unlimited tier setups with >128GB VRAM, full-precision versions of the largest open models become accessible, including Llama 3.1 405B or specialized fine-tuned variants.

Context

This framework complements rather than replaces other selection criteria. Benchmark performance, licensing terms, and task-specific capabilities still matter - but only after confirming a model will actually run on available hardware.

Alternative approaches include cloud APIs, which bypass hardware constraints entirely but introduce ongoing costs and latency. Services like Together AI (https://together.ai) or Replicate (https://replicate.com) provide access to large models without local infrastructure, though this trades control and privacy for convenience.

The framework’s main limitation is its focus on inference rather than training. Fine-tuning even small models requires significantly more memory than inference, often pushing developers toward cloud solutions regardless of their inference hardware.
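The gap can be sketched with a common rule of thumb: fp16 inference needs roughly 2 bytes per parameter for weights, while mixed-precision Adam training needs around 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and two optimizer moments). The exact multipliers vary by setup, so treat these as rough assumptions:

```python
def finetune_vs_inference_gb(params_billions: float) -> tuple:
    """Rough weights-related memory for fp16 inference (~2 bytes/param)
    vs. full mixed-precision Adam fine-tuning (~16 bytes/param), in decimal GB.
    Activations and batch-size effects are excluded."""
    inference = params_billions * 2
    training = params_billions * 16
    return inference, training

# Even a 3B model: ~6 GB to serve, but ~48 GB to fully fine-tune
print(finetune_vs_inference_gb(3))  # (6, 48)
```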

Model quantization techniques continue evolving, potentially shifting which models fit into each tier. GGUF format and newer quantization methods sometimes enable running larger models on smaller hardware than previously possible, though usually with some quality tradeoff.