Quant Hub

Llama 3.1 70B Instruct

70B

Meta Llama 3.1

Meta's frontier 70B model. Requires 40GB+ VRAM; dual 3090 or M2 Ultra.

33.4 GB

min VRAM

131K

ctx

tok/s

Formats

Pro GPU, Mac / Apple Silicon

EXL2 3.5bpw
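
As a rough sanity check on figures like the ones above, the memory taken by quantized weights can be approximated as parameter count times bits per weight, divided by 8. The sketch below assumes that simple formula and decimal gigabytes, and it ignores KV cache and runtime overhead. The function name `weight_gb` is illustrative only and is not part of any tool listed on this page.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in decimal GB.

    Assumption: every parameter is stored at the same average bit width;
    KV cache, activations, and runtime overhead are not included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# A 70B model at 3.5 bits per weight (the EXL2 variant listed above).
estimate = weight_gb(70, 3.5)
print(f"{estimate:.1f} GB for weights alone")
```

The estimate lands a little under the listed 33.4 GB minimum, which would be consistent with the listing including some cache and runtime overhead, though the page does not say how its figure was derived.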

Llama 3.2 3B Instruct

Meta Llama 3.2

Tiny but capable. Runs on 4GB VRAM or 8GB RAM, even on phones via llama.cpp.

2.0 GB

min VRAM

131K

ctx

420

tok/s

Formats

Qwen2.5 7B Instruct

Alibaba Qwen2.5

Alibaba's highly optimized 7B. Punches well above its weight, especially in coding.

4.8 GB

min VRAM

131K

ctx

245

tok/s

Formats

Qwen2.5 14B Instruct

14B

Alibaba Qwen2.5

The sweet spot between performance and resource usage. 16GB VRAM with Q4.

9.2 GB

min VRAM

131K

ctx

138

tok/s

Formats

Qwen2.5 32B Instruct

32B

Alibaba Qwen2.5

Near-GPT-4 reasoning on a 24GB VRAM card (Q4_K_S). Groundbreaking value.

16.4 GB

min VRAM

131K

ctx

tok/s

Formats

DeepSeek-Coder-V2-Lite Instruct

16B

DeepSeek

MoE architecture coding model. Active params ~2.4B, total ~16B. Exceptional code quality.

9.8 GB

min VRAM

164K

ctx

192

tok/s

Formats

Phi-3.5 Mini Instruct

3.8B

Microsoft Phi

Microsoft's tiny powerhouse. Best 4B model for on-device deployment.

2.5 GB

min VRAM

131K

ctx

385

tok/s

Formats

Mistral Nemo 12B Instruct

12B

Mistral AI

Mistral + NVIDIA collaboration. 128K context, excellent multilingual support.

7.8 GB

min VRAM

131K

ctx

148

tok/s

Formats
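
The description above says 128K while the ctx field shows 131K. Both appear to describe the same window: 128 × 1024 = 131,072 tokens, which rounds to 131K in decimal thousands. The sketch below assumes the page labels context that way; this is an inference from the listed values, not documented behavior of the site, and `ctx_label` is a hypothetical helper.

```python
def ctx_label(tokens: int) -> str:
    """Format a token count as a decimal-thousands label, e.g. 131072 -> '131K'.

    Assumption: the catalog rounds tokens / 1000 to the nearest integer.
    """
    return f"{round(tokens / 1000)}K"


# Binary "K" context sizes and the labels they would produce under this assumption.
for binary_k in (16, 32, 128, 160, 256):
    print(f"{binary_k}K tokens (binary) -> {ctx_label(binary_k * 1024)}")
```

Under this assumption, the 33K, 131K, 164K, and 262K labels elsewhere in the list correspond to 32K, 128K, 160K, and 256K windows respectively.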

Gemma 2 9B Instruct

Google Gemma 2

Google's compact Gemma 2 with sliding window attention. Punches above its 9B size.

5.8 GB

min VRAM

ctx

188

tok/s

Formats

Qwen2.5 72B Instruct

72B

Alibaba Qwen2.5

Flagship Qwen2.5. Requires dual 4090 or A100 80GB. Exceptional reasoning at scale.

33.8 GB

min VRAM

131K

ctx

tok/s

Formats

DeepSeek-R1-Distill-Qwen-14B

14B

DeepSeek

R1 reasoning distilled into 14B. Huge community interest; excellent chain-of-thought.

9.2 GB

min VRAM

131K

ctx

128

tok/s

Formats

Llama 3.3 70B Instruct

70B

Meta Llama 3.3

Latest Meta 70B with improved multilingual. Drop-in upgrade from Llama 3.1 70B.

38.2 GB

min VRAM

131K

ctx

tok/s

Formats

Pro GPU, Mac / Apple Silicon

Mistral Small 24B Instruct

24B

Mistral AI

Mistral's efficient 24B. Strong multilingual; fits on 24GB with Q4.

13.5 GB

min VRAM

33K

ctx

tok/s

Formats

Qwen2.5-Coder 32B Instruct

32B

Alibaba Qwen2.5

Top-tier open coding model. HumanEval scores competitive with GPT-4o at 32B scale.

16.4 GB

min VRAM

131K

ctx

tok/s

Formats

Qwen2.5-Coder 7B Instruct

Alibaba Qwen2.5

Best 7B coding model. Ideal for local dev assistants on 8–16GB VRAM.

4.8 GB

min VRAM

131K

ctx

248

tok/s

Formats

Qwen2.5 3B Instruct

Alibaba Qwen2.5

Tiny Qwen2.5 for edge devices. Runs on 4GB VRAM or Raspberry Pi class hardware.

2.1 GB

min VRAM

33K

ctx

340

tok/s

Formats

Llama 3.2 1B Instruct

Meta Llama 3.2

Ultra-light Llama for mobile and embedded. Sub-2GB VRAM with Q4.

1.0 GB

min VRAM

131K

ctx

520

tok/s

Formats

DeepSeek-R1-Distill-Llama-70B

70B

DeepSeek

R1 reasoning in Llama 70B architecture. Top open reasoning model for dual-GPU setups.

38.2 GB

min VRAM

131K

ctx

tok/s

Formats

Codestral 22B

22B

Mistral AI

Mistral's dedicated code model. Supports 80+ programming languages and fill-in-the-middle completion.

13.2 GB

min VRAM

33K

ctx

tok/s

Formats

Mixtral 8x7B Instruct

47B MoE

Mistral AI

Classic MoE model. ~13B active params per token; needs 32GB+ VRAM for Q4.

25.2 GB

min VRAM

33K

ctx

tok/s

Formats
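
The MoE entries in this list quote both total and active parameters. A common way to read them: all experts must be resident, so memory scales with the total count, while per-token compute scales with the active count. The sketch below illustrates that split using the Mixtral figures above. The nominal 4.0 bits per weight and the function name are assumptions for illustration; real Q4 formats typically use slightly more than 4 bits on average.

```python
def moe_footprint(total_b: float, active_b: float, bits_per_weight: float):
    """Return (weight size in decimal GB, fraction of weights used per token).

    Assumption: all experts stay loaded, so memory depends on total_b,
    while per-token compute depends only on active_b.
    """
    weights_gb = total_b * bits_per_weight / 8
    active_fraction = active_b / total_b
    return weights_gb, active_fraction


# Mixtral 8x7B: ~47B total, ~13B active; 4.0 bpw is a nominal Q4 assumption.
weights_gb, active_fraction = moe_footprint(47, 13, 4.0)
print(f"weights: {weights_gb:.1f} GB, active per token: {active_fraction:.0%}")
```

This is why an MoE model can generate at speeds closer to a much smaller dense model while still needing the VRAM of a large one.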

Command R 35B

35B

Cohere

Cohere's RAG-optimised model. Excellent retrieval-augmented generation.

20.5 GB

min VRAM

131K

ctx

tok/s

Formats

Yi 1.5 34B Chat

34B

01.AI Yi

01.AI's strong bilingual (EN/ZH) model. Competitive with Qwen 32B.

19.8 GB

min VRAM

ctx

tok/s

Formats

Solar 10.7B Instruct

11B

Upstage

Depth-upscaled 10.7B that punches above its weight. Strong on reasoning benchmarks.

6.5 GB

min VRAM

ctx

168

tok/s

Formats

StarCoder2 15B

15B

BigCode

BigCode's open code model trained on 600+ programming languages. Great for polyglot dev.

9.2 GB

min VRAM

16K

ctx

115

tok/s

Formats

Llama 3.2 11B Vision Instruct

11B

Meta Llama 3.2

Multimodal Llama with image understanding. Vision encoder adds ~2GB VRAM overhead.

9.5 GB

min VRAM

131K

ctx

tok/s

Formats

Qwen2-VL 7B Instruct

Alibaba Qwen2

Vision-language model with video understanding. Strong OCR and chart reading.

6.0 GB

min VRAM

33K

ctx

tok/s

Formats

Nous Hermes 3 Llama 3.1 8B

NousResearch

Fine-tuned Llama 3.1 8B with improved roleplay and instruction following.

5.4 GB

min VRAM

131K

ctx

232

tok/s

Formats

GGUF, EXL2

WizardLM-2 7B

Microsoft / WizardLM

Evol-Instruct fine-tuned Mistral-based 7B. Strong complex instruction handling.

4.8 GB

min VRAM

33K

ctx

218

tok/s

Formats

Granite 3.1 8B Instruct

IBM Granite

IBM's enterprise-grade 8B. Strong RAG and tool-use; permissive Apache 2.0 license.

5.0 GB

min VRAM

131K

ctx

195

tok/s

Formats

GGUF, GPTQ

GPTQ INT4

Gemma 2 2B Instruct

Google Gemma 2

Ultra-compact Gemma 2. Runs on 4GB VRAM; great for edge prototyping.

1.8 GB

min VRAM

ctx

450

tok/s

Formats

Gemma 2 27B Instruct

27B

Google Gemma 2

Largest open Gemma 2. Strong reasoning; needs 24GB+ VRAM at Q4.

16.2 GB

min VRAM

ctx

tok/s

Formats

Qwen2.5 0.5B Instruct

0.5B

Alibaba Qwen2.5

Smallest Qwen2.5. Ideal for Raspberry Pi, phones, and ultra-low-latency demos.

0.6 GB

min VRAM

33K

ctx

620

tok/s

Formats

Qwen2.5 1.5B Instruct

1.5B

Alibaba Qwen2.5

Tiny Qwen with 32K context. Surprisingly capable for summarisation and chat.

1.4 GB

min VRAM

33K

ctx

480

tok/s

Formats

Phi-3 Medium 14B Instruct

14B

Microsoft Phi

Microsoft's mid-size Phi-3. Excellent quality-per-GB on 16GB cards.

8.8 GB

min VRAM

131K

ctx

135

tok/s

Formats

Phi-4 Mini Instruct

3.8B

Microsoft Phi

Latest Phi mini with improved math and code. Strong 4B-class performer.

2.5 GB

min VRAM

131K

ctx

395

tok/s

Formats

Mistral 7B Instruct v0.3

Mistral AI

Classic Mistral 7B v0.3. Still a reliable baseline for local chat APIs.

4.6 GB

min VRAM

33K

ctx

225

tok/s

Formats

DeepSeek-V2-Lite Chat

16B

DeepSeek

MoE general model (~2.4B active). Long context and strong multilingual chat.

9.6 GB

min VRAM

164K

ctx

188

tok/s

Formats

DeepSeek-R1-Distill-Qwen-7B

DeepSeek

R1 reasoning in a 7B footprint. Best value for 8–12GB VRAM CoT experiments.

5.2 GB

min VRAM

131K

ctx

210

tok/s

Formats

GGUF, EXL2

DeepSeek-R1-Distill-Qwen-32B

32B

DeepSeek

R1 distilled to 32B. Near-frontier reasoning on a single 24GB card (Q3/Q4).

16.8 GB

min VRAM

131K

ctx

tok/s

Formats

Llama 3.2 90B Vision Instruct

90B

Meta Llama 3.2

Flagship multimodal Llama. Requires dual 4090 or A100; vision adds ~3GB overhead.

44.2 GB

min VRAM

131K

ctx

tok/s

Formats

OLMo 2 7B Instruct

Allen AI OLMo

Fully open training pipeline from Allen AI. Great for reproducibility research.

5.3 GB

min VRAM

ctx

150

tok/s

Formats

InternLM2 7B Chat

Shanghai AI Lab

Strong bilingual (EN/ZH) 7B from Shanghai AI Lab. Competitive with Qwen 7B.

4.9 GB

min VRAM

33K

ctx

215

tok/s

Formats

InternLM2 20B Chat

20B

Shanghai AI Lab

Mid-size InternLM2 with excellent Chinese comprehension. Fits 24GB at Q4.

13.8 GB

min VRAM

33K

ctx

tok/s

Formats

Aya 23 8B

Cohere For AI

Multilingual specialist covering 23 languages. Strong for non-English local apps.

5.0 GB

min VRAM

ctx

205

tok/s

Formats

OpenChat 3.6 8B

OpenChat

C-RLFT fine-tuned Llama 3 8B. Known for natural conversational tone.

5.4 GB

min VRAM

ctx

228

tok/s

Formats

GGUF, EXL2

Zephyr 7B Beta

HuggingFaceH4

DPO-aligned Mistral 7B. Classic choice for helpful chat baselines.

5.2 GB

min VRAM

33K

ctx

155

tok/s

Formats

Stable LM 2 12B Chat

12B

Stability AI

Stability AI's 12B chat model. Solid general-purpose option for 16GB GPUs.

7.2 GB

min VRAM

ctx

142

tok/s

Formats

Falcon 3 10B Instruct

10B

TII UAE

Technology Innovation Institute's latest Falcon. Good multilingual and code mix.

6.2 GB

min VRAM

33K

ctx

155

tok/s

Formats

GGUF, GPTQ

GPTQ INT4

Jamba 1.5 Mini

12B

AI21 Labs

Hybrid SSM-Transformer with 256K context. Efficient long-document QA on 16GB.

7.5 GB

min VRAM

262K

ctx

125

tok/s

Formats