Best Local LLMs for MacBook M5 16GB in 2026: What Actually Runs Well
Practical guide to running local LLMs on a 16GB M5 — model picks, quantization, real-world speeds, and the ones that don't fit. No marketing fluff.
You bought the standard MacBook M5 with 16GB of unified memory, and now you want to run LLMs locally. The good news: 2026 has excellent open-weight models that run well on this hardware. The honest news: you can’t run the 70B+ frontier models, and you’ll need to pick deliberately.
This article walks through eight models worth your disk space, how each performs on a 16GB M5, and which to install first depending on what you want to do.
TL;DR — what to pick by use case
| Use case | Pick | Why |
|---|---|---|
| All-around assistant | Llama 3.3 8B Instruct (Q4_K_M) | Best general quality at this size; ~5 GB on disk |
| Coding | Qwen2.5-Coder 7B (Q4_K_M) | Beats Llama at code; tight fit on 16GB |
| Reasoning / chain-of-thought | DeepSeek-R1-Distill-Qwen 7B | Reasoning traces for free; surprisingly capable |
| Long context (>32k tokens) | Llama 3.3 8B | Stable up to 128k context in practice |
| Smallest fast model | Gemma 3 4B (Q4_K_M) | ~2.5 GB, fast, decent quality |
| Multilingual | Qwen 2.5 7B | Best non-English by a noticeable margin |
What 16GB unified memory means for local LLMs
Unified memory on Apple Silicon is shared between the CPU, GPU, and Neural Engine. After macOS overhead (~3-4 GB) and your other open apps (~3-4 GB), the realistic working budget is roughly 8-9 GB for the model itself.
Quantization is essential. Q4_K_M (4-bit quantization, K-quant variant M) cuts a model’s disk footprint to roughly a third of its FP16 size with minimal quality loss. A 7B model at full FP16 is ~13 GB; at Q4_K_M it’s ~4.7 GB, which fits comfortably on 16GB.
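If you want to sanity-check a model before downloading it, the size math fits in one line. A rough sketch (the 7.6B parameter count and ~5 effective bits per weight are assumptions for a typical "7B" model at Q4_K_M):

```bash
# File size ≈ parameter count × effective bits per weight ÷ 8
# Q4_K_M averages roughly 5 bits per weight once higher-precision tensors are counted
python3 -c "params = 7.6e9; bpw = 5.0; print(f'{params * bpw / 8 / 1e9:.1f} GB')"  # -> 4.8 GB
# Budget another 1-2 GB on top for the KV cache and runtime overhead
```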
Practical rule of thumb:
- Model file under 6 GB: feels snappy
- Model file 6-8 GB: works but adds memory pressure
- Model file over 8 GB: starts swapping, slows materially
- 30B+ models at any quantization: not viable on 16GB
That rules out 30B and 70B parameter models entirely. What remains viable: 4B-8B models at Q4-Q5 run comfortably, and 12-14B models at Q4 work with tight headroom.
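Once Ollama is installed (setup steps are at the end of this article), two built-in commands make the rule of thumb concrete:

```bash
# On-disk size of every model you've pulled
ollama list

# Memory footprint of whatever is loaded right now (run it while a chat is active)
ollama ps
```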
M5-specific performance notes
The M5 (versus the M4) brings improvements in memory bandwidth and Neural Engine throughput. For local LLM inference specifically, expect a 7B model at Q4 to land in the 40-70 tokens-per-second range, and a 13B model at Q4 around 12-25 t/s. Performance varies by inference backend.
The single biggest performance lever beyond hardware is using a backend optimized for Apple Silicon. Apple’s MLX framework often runs 1.3-2x faster than the llama.cpp Metal backend for the same Q4 model on the same hardware. LM Studio can run MLX models natively; Ollama uses llama.cpp’s Metal backend.
(Numbers should be benchmarked on your own hardware before relying on them — actual results vary with thermal state, other running processes, and model-specific implementation details.)
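To get your own numbers, `ollama run` accepts a `--verbose` flag that prints timing stats after each response; the prompt below is arbitrary:

```bash
# The "eval rate" line in the output is your generation speed in tokens/sec
ollama run llama3.3 --verbose "Summarize the plot of Hamlet in three sentences."
```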
The eight models worth your disk space
1. Llama 3.3 8B Instruct (Q4_K_M)
Meta’s general-purpose chat model. Strong writing, decent coding, reasonable knowledge, stable at long contexts up to 128k tokens.
- Disk: ~4.9 GB
- Real-world: 50-70 t/s on M5 16GB
- Strengths: general assistance, summarization, instruction-following
- Weaknesses: not specialized; loses to Qwen on multilingual and to Qwen-Coder on code
- Pull: `ollama pull llama3.3`
2. Qwen 2.5 7B Instruct (Q4_K_M)
Alibaba’s instruction-tuned model. Outstanding multilingual support and strong general reasoning.
- Disk: ~4.7 GB
- Real-world: 55-75 t/s
- Strengths: multilingual, math, structured output
- Weaknesses: writing tone less natural in English than Llama
- Pull: `ollama pull qwen2.5`
3. Qwen2.5-Coder 7B Instruct (Q4_K_M)
Code-specialized variant of Qwen 2.5. Trained on substantially more code. Beats general-purpose models for refactoring, completion, and explanation tasks.
- Disk: ~4.7 GB
- Real-world: 55-75 t/s
- Strengths: code understanding, refactoring, bug detection
- Weaknesses: weaker on general conversation
- Pull: `ollama pull qwen2.5-coder`
4. DeepSeek-R1-Distill-Qwen 7B (Q4_K_M)
A distillation of DeepSeek’s R1 reasoning model. Emits `<think>...</think>` blocks showing chain-of-thought before final answers. Good for math, multi-step problems, and tasks where you want to see why the model arrived at an answer.
- Disk: ~4.7 GB
- Real-world: 45-65 t/s (reasoning traces mean more tokens before the final answer, so responses take longer end to end)
- Strengths: math, multi-step reasoning, transparent thinking
- Weaknesses: verbose; not ideal when you just want a quick one-liner
- Pull: `ollama pull deepseek-r1`
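If you script against it and only want the final answer, the trace is easy to strip. A minimal sketch, assuming the reasoning arrives as literal `<think>...</think>` tags in the plain-text output:

```bash
# Drop everything from <think> through </think>, keep only the final answer
ollama run deepseek-r1 "A train leaves at 9:40 and arrives at 11:05. How long is the trip?" | sed '/<think>/,/<\/think>/d'
```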
5. Gemma 3 4B (Q4_K_M)
Google’s small, efficient model. Fastest of the bunch, smallest disk footprint, surprisingly capable for its size.
- Disk: ~2.5 GB
- Real-world: 80-110 t/s
- Strengths: speed, low resource use, decent general quality
- Weaknesses: smaller knowledge ceiling, weaker code
- Pull: `ollama pull gemma3:4b`
6. Gemma 3 12B (Q4_K_M)
Larger Gemma variant. Tight on 16GB but workable. Better general quality than the 7B tier.
- Disk: ~7.5 GB
- Real-world: 22-35 t/s
- Strengths: better general writing than 7B models
- Weaknesses: memory-tight; close other apps before loading
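- Pull (tag assumed from Ollama’s library naming; verify before pulling): `ollama pull gemma3:12b`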
7. Phi-4 14B (Q4_K_M)
Microsoft’s “small but clever” model. Punches above its weight on reasoning benchmarks.
- Disk: ~8.5 GB
- Real-world: 18-28 t/s
- Strengths: reasoning, math, structured tasks
- Weaknesses: very memory-tight on 16GB; check license terms for commercial use
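- Pull (tag assumed from Ollama’s library naming; verify before pulling): `ollama pull phi4`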
8. Mistral 7B Instruct v0.3 (Q4_K_M)
Older but stable. Reasonable baseline. Mostly worth knowing because many existing tutorials reference it.
- Disk: ~4.5 GB
- Real-world: 55-70 t/s
- Strengths: stability, broad community support
- Weaknesses: older training data; superseded by Llama 3.3 and Qwen 2.5 on most tasks
- Pull: `ollama pull mistral`
Quick quantization guide
- Q4_K_M: best quality/size tradeoff for 16GB Mac. Recommended default.
- Q5_K_M: roughly 20% larger files for a small quality gain. Use it if you have headroom.
- Q8_0: near-lossless quality, but roughly double the size of Q4. Impractical for >7B models on 16GB.
- GGUF vs MLX: MLX is faster on Apple Silicon but smaller ecosystem. GGUF works everywhere. If you’re using LM Studio, prefer MLX variants when available.
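If you want a specific quant rather than the default, Ollama exposes per-quantization tags, and MLX builds are usually fetched through the mlx-lm package instead. A sketch (the exact tag and repo names below are assumptions; check the Ollama library and the mlx-community Hugging Face org for current identifiers):

```bash
# Pull a specific quant instead of the default tag
ollama pull qwen2.5:7b-instruct-q5_K_M

# Run an MLX-converted model with Apple's mlx-lm package
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit --prompt "Hello"
```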
Workflow examples
Daily assistant + coding (most users):
- Default: Llama 3.3 8B for chat, summarization, writing
- Code mode: switch to Qwen2.5-Coder 7B
- Disk used: ~9.6 GB for both
Math and structured reasoning:
- DeepSeek-R1-Distill-Qwen 7B + Llama 3.3 8B
- Disk: ~9.6 GB
- Use DeepSeek for problem-solving, Llama for general queries
Speed-first / older machine fallback:
- Gemma 3 4B as primary
- Disk: 2.5 GB
- 80+ t/s feels instant
Polyglot / non-English-heavy:
- Qwen 2.5 7B as primary
- Qwen2.5-Coder for code work
- Both at ~9.4 GB total
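Switching between models doesn’t require restarting anything; Ollama keeps them side by side, and its local HTTP API takes the model name per request. A minimal sketch against the default endpoint (prompts are arbitrary):

```bash
# Route general questions to the generalist...
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Draft a polite two-sentence meeting decline.",
  "stream": false
}'

# ...and code questions to the code specialist
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder",
  "prompt": "Explain what this does: git rebase --onto main feature~3 feature",
  "stream": false
}'
```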
When 16GB isn’t enough
Symptoms you’ve outgrown 16GB:
- Loading a 13B+ model triggers swap and inference slows to single-digit t/s
- The fan stays loud even during ordinary use
- Browser tabs start reloading after using LLMs heavily
- You want to run multiple models simultaneously
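To confirm the first symptom rather than guess, macOS reports swap and memory pressure from the terminal:

```bash
# Non-zero "used" here while a model is loaded means you are swapping
sysctl vm.swapusage

# The last line of the report is the system-wide free-memory percentage
memory_pressure | tail -n 1
```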
Three options:
- Use smaller models more deliberately. Often the 4B or 7B model is “good enough” — the urge to run bigger is sometimes vanity.
- Run big models in the cloud. Runpod, Lambda Labs, Vast.ai rent H100s/A100s by the hour. ~$1-3/hour for 70B-class workloads.
- Upgrade to a 24GB or 32GB Mac. Genuine cost-benefit calculation if local LLMs are central to your work.
Most users running a couple of 7B models for daily assistance and code work do not need to upgrade.
Privacy and battery notes
All eight models run 100% locally. No data leaves your machine once the model weights are downloaded. Verify your inference tool’s default telemetry settings (LM Studio sends usage analytics by default; can be disabled).
Battery impact during inference is real: expect 2-3x normal drain rate. Plug in for sessions over 20 minutes. Fan noise is a function of model size, query length, and ambient temperature — Gemma 3 4B barely warms the chassis; Phi-4 14B will ramp the fans.
What to install first
If you’re new to local LLMs and want to start:
- Install Ollama: `brew install ollama`, then `ollama serve`
- Pull two models to start: `ollama pull llama3.3` and `ollama pull qwen2.5-coder`
- Test both: `ollama run llama3.3` and `ollama run qwen2.5-coder`
- If you want a chat UI, install Open WebUI or LM Studio
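To confirm the server is actually up before wiring in a UI, the local API lists whatever you’ve pulled:

```bash
# Should return JSON listing llama3.3 and qwen2.5-coder once both pulls finish
curl -s http://localhost:11434/api/tags
```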
That’s ~10 GB of disk, two solid models for distinct tasks, and a working setup in under 30 minutes.
For a deeper comparison of the runners themselves (Ollama vs LM Studio vs Jan.ai), see our runner comparison. For a head-to-head of model families, see Llama vs Qwen vs DeepSeek on Apple Silicon.