
M5 Max vs M3 Max: LLM Inference Benchmarks Revealed

What It Is

Recent benchmark testing has put Apple’s M5 Max and M3 Max chips head-to-head for local large language model inference, revealing significant performance differences between the two generations. The tests measured tokens per second (tok/s) across various model architectures, including both dense models and Mixture of Experts (MoE) configurations. Dense models activate all parameters for every inference, while MoE models selectively activate subsets of parameters, potentially offering better efficiency.
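To make the dense-versus-MoE distinction concrete, here is a minimal toy sketch of top-k expert routing. All parameter counts and expert sizes below are made up for illustration and do not correspond to any model in the benchmarks:

```python
# Toy illustration of dense vs. Mixture of Experts (MoE) activation.
# All numbers are hypothetical; no real model architecture is represented.

def dense_active_params(total_params: int) -> int:
    """A dense model runs every parameter for every token."""
    return total_params

def moe_active_params(shared_params: int, num_experts: int,
                      params_per_expert: int, top_k: int) -> int:
    """An MoE model runs shared layers plus only the top-k routed experts."""
    return shared_params + top_k * params_per_expert

# Hypothetical 64-expert model with 2 experts active per token:
total = 1_000_000_000 + 64 * 500_000_000   # ~33B total parameters
active = moe_active_params(1_000_000_000, 64, 500_000_000, top_k=2)  # ~2B

print(f"total: {total:,}  active per token: {active:,}")
```

Per-token work tracks the active count, not the total, which is why MoE models can be far cheaper to run than their headline parameter counts suggest.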

The benchmarks covered models ranging from 27B to 122B parameters, testing them at different context lengths and batch sizes. Context length refers to how much text the model processes at once - longer contexts require more memory bandwidth and computational resources. Batch processing allows multiple prompts to be handled simultaneously, which can improve throughput for applications serving multiple users.
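One reason longer contexts are costly is the key-value (KV) cache, which grows linearly with context length. A rough estimate, using hypothetical architecture numbers (48 layers, 8 KV heads of dimension 128, fp16) rather than those of any tested model:

```python
# Rough KV-cache size estimate: why long contexts strain memory.
# Architecture numbers here are hypothetical, not from any benchmarked model.

def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x for the separate key and value tensors; fp16 = 2 bytes per element
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

for ctx in (4_096, 65_536):
    gib = kv_cache_bytes(48, 8, 128, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB KV cache")
```

Going from 4K to 64K context multiplies the cache by 16x, so every generated token must stream far more data through memory, which is where bandwidth differences between chip generations show up.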

Why It Matters

These results have immediate implications for developers and researchers running local AI workloads. The M5 Max’s 67% speed advantage on the Qwen 35B-A3B model (134.5 tok/s versus 80.3 tok/s) translates to noticeably faster response times in real-world applications. For teams building AI-powered tools that run entirely on-device, this performance gap affects user experience directly.
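As a quick sanity check on the headline figure, using the tok/s numbers quoted above:

```python
# Verifying the reported 67% advantage from the quoted throughput numbers.
m5_max, m3_max = 134.5, 80.3   # tok/s on the Qwen 35B-A3B test
speedup = m5_max / m3_max
print(f"{(speedup - 1) * 100:.0f}% faster")
```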

The long-context performance differential is particularly striking. At 65K token contexts, the M5 Max maintains 19.6 tok/s while the M3 Max drops to 6.8 tok/s - nearly a 3x difference. This matters for applications processing entire documents, codebases, or lengthy conversations where context windows stretch into tens of thousands of tokens.
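In wall-clock terms, here is what those long-context rates mean for a hypothetical 1,000-token response (the response length is an assumption for illustration):

```python
# Wall-clock impact of 19.6 vs 6.8 tok/s at a 65K-token context.
m5_tps, m3_tps = 19.6, 6.8
tokens_to_generate = 1_000   # e.g., a long summary of the context (assumed)

print(f"M5 Max: {tokens_to_generate / m5_tps:.0f} s")
print(f"M3 Max: {tokens_to_generate / m3_tps:.0f} s")
print(f"ratio: {m5_tps / m3_tps:.2f}x")
```

Roughly 51 seconds versus nearly two and a half minutes for the same output, a difference users will feel.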

Perhaps most revealing is the MoE efficiency finding. The 122B parameter model with only 10B active parameters outpaced the 27B dense model on both machines. This challenges assumptions about model selection - total parameter count proves less important than active parameter efficiency. Organizations evaluating which models to deploy locally should reconsider their criteria.
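The compute side of that finding can be sketched with the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token (an approximation, not a measurement from these benchmarks):

```python
# Per-token compute scales with *active* parameters, which is why a
# 122B-total/10B-active MoE can outrun a 27B dense model.
def flops_per_token(active_params):
    return 2 * active_params   # rough rule of thumb for a forward pass

dense_27b = flops_per_token(27e9)
moe_10b_active = flops_per_token(10e9)
print(f"MoE does {moe_10b_active / dense_27b:.0%} of the dense model's work")
```

By this estimate the MoE model does only about 37% of the dense model's per-token compute, despite having 4.5x the total parameters.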

Batching behavior also diverges between generations. The M5 Max scaled to 2.54x aggregate throughput at a batch size of 4 on the 35B model, while the M3 Max sometimes saw performance degrade with batching. This matters for multi-user scenarios and applications processing multiple requests concurrently.
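Splitting that scaling figure into aggregate and per-stream throughput makes the tradeoff explicit (numbers from the M5 Max results above):

```python
# Aggregate vs. per-stream throughput when batching on the M5 Max (35B model).
single_stream = 134.5   # tok/s at batch size 1, from the benchmarks
scaling = 2.54          # reported throughput scaling at batch size 4
batch = 4

aggregate = single_stream * scaling
per_stream = aggregate / batch
print(f"aggregate: {aggregate:.1f} tok/s, per stream: {per_stream:.1f} tok/s")
```

Each individual request gets slower (about 85 tok/s instead of 134.5), but the machine serves four requests at once with far higher total throughput, which is the right trade for a multi-user server.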

Getting Started

Developers interested in replicating these benchmarks can use llama.cpp, the popular inference engine that powers many local LLM deployments. Installation on macOS is straightforward, for example by building from source with CMake (Metal acceleration is enabled by default on Apple Silicon):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

The resulting binaries, including llama-bench, are placed in build/bin.

For running benchmarks with specific models, the command structure follows this pattern, where -p sets the prompt-processing token count and -n the number of tokens to generate:

./llama-bench -m models/qwen-35b-a3b.gguf -n 512 -p 128

The full benchmark results with detailed charts are available at https://claude.ai/public/artifacts/c9fba245-e734-4b3b-be44-a6cabdec6f8.

Model files in GGUF format can be downloaded from Hugging Face repositories. The Qwen models tested are available at https://huggingface.co/Qwen, while other popular options include Llama and Mistral variants.

Context

These benchmarks focus specifically on inference speed, but other factors influence model selection. Memory requirements, quantization quality, and task-specific accuracy all matter. The M3 Max with 128GB unified memory can still run larger models than an M5 Max with less RAM, even if slower.
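A back-of-envelope memory estimate shows why RAM capacity can outweigh speed. The 4.5 bits-per-weight figure below is a rough average for common 4-bit GGUF quantizations; exact sizes vary by scheme:

```python
# Approximate weight memory at different quantization levels.
# 4.5 bits/weight is a rough figure for common 4-bit GGUF quants (assumption).
def weight_gib(params, bits_per_weight):
    return params * bits_per_weight / 8 / 2**30

for params, name in [(27e9, "27B dense"), (122e9, "122B MoE")]:
    print(f"{name}: fp16 {weight_gib(params, 16):.0f} GiB, "
          f"~4-bit {weight_gib(params, 4.5):.0f} GiB")
```

By this estimate the 122B model needs roughly 64 GiB of weight memory even at 4-bit, fitting comfortably in 128GB of unified memory but not in a smaller configuration, before accounting for the KV cache.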

Alternative hardware options exist for local inference. NVIDIA RTX 4090 GPUs often outperform Apple Silicon for pure inference speed, though they lack the unified memory architecture that simplifies deployment. AMD’s MI300 series targets enterprise workloads with massive memory bandwidth.

Cloud-based inference through providers like Together AI or Replicate offers different tradeoffs - no upfront hardware cost but ongoing API expenses and latency from network round-trips. For privacy-sensitive applications or offline requirements, local inference remains essential despite hardware costs.

The MoE efficiency finding aligns with broader industry trends. Models like Mixtral and DeepSeek-V2 demonstrate that sparse activation patterns can deliver strong performance with reduced computational overhead. This architectural approach will likely influence future model development as efficiency becomes increasingly important.