Quant Hub

Structured index of open-source quantized models — no file hosting, just precise metadata · 51 models in index


Llama 3.1 8B Instruct

8B

Meta Llama 3.1

Meta's flagship 8B model with 128K context. Best-in-class for local deployment.

3.2 GB

min VRAM

131K

ctx

235

tok/s

Formats

GGUF · AWQ · EXL2
Consumer GPU · Mac / Apple Silicon · CPU / VPS
EXL2 4.65bpw
Hugging Face

Llama 3.1 70B Instruct

70B

Meta Llama 3.1

Meta's frontier 70B model. Requires 40GB+ VRAM; dual 3090 or M2 Ultra.

33.4 GB

min VRAM

131K

ctx

62

tok/s

Formats

GGUF · AWQ · EXL2
Pro GPU · Mac / Apple Silicon
EXL2 3.5bpw
Hugging Face
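
As a rough cross-check on the min VRAM figures, weight memory can be estimated from the parameter count and the bits per weight of the quantization. This is a minimal sketch, assuming decimal gigabytes and ignoring KV cache, activations, and runtime overhead. The function name is hypothetical, and this is not necessarily how the index computes its figures.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the quantized weights alone, in decimal GB."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# 70B parameters at EXL2 3.5bpw, the quantization listed on the card above.
estimate = weight_memory_gb(70, 3.5)
print(f"{estimate:.1f} GB for weights only")
```

The card's 33.4 GB figure is higher than this weights-only estimate, which would be consistent with some allowance for cache and overhead, though the index does not state how the figure is derived.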

Llama 3.2 3B Instruct

3B

Meta Llama 3.2

Tiny but capable. Runs on 4GB VRAM or 8GB RAM, even on phones via llama.cpp.

2.0 GB

min VRAM

131K

ctx

420

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
AWQ INT4
Hugging Face

Qwen2.5 7B Instruct

7B

Alibaba Qwen2.5

Alibaba's highly optimized 7B. Punches well above its weight, especially in coding.

4.8 GB

min VRAM

131K

ctx

245

tok/s

Formats

GGUF · AWQ · EXL2
Consumer GPU · Mac / Apple Silicon · CPU / VPS
EXL2 4.65bpw
Hugging Face

Qwen2.5 14B Instruct

14B

Alibaba Qwen2.5

The sweet spot between performance and resource usage. 16GB VRAM with Q4.

9.2 GB

min VRAM

131K

ctx

138

tok/s

Formats

GGUF · AWQ · EXL2
Consumer GPU · Mac / Apple Silicon
EXL2 4.65bpw
Hugging Face

Qwen2.5 32B Instruct

32B

Alibaba Qwen2.5

Near-GPT-4 reasoning on a 24GB VRAM card (Q4_K_S). Groundbreaking value.

16.4 GB

min VRAM

131K

ctx

68

tok/s

Formats

GGUF · EXL2
Consumer GPU · Pro GPU
EXL2 3.5bpw
Hugging Face

DeepSeek-Coder-V2-Lite Instruct

16B

DeepSeek

MoE architecture coding model. Active params ~2.4B, total ~16B. Exceptional code quality.

9.8 GB

min VRAM

164K

ctx

192

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

Phi-3.5 Mini Instruct

3.8B

Microsoft Phi

Microsoft's tiny powerhouse. Best 4B model for on-device deployment.

2.5 GB

min VRAM

131K

ctx

385

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
AWQ INT4
Hugging Face

Mistral Nemo 12B Instruct

12B

Mistral AI

Mistral + NVIDIA collaboration. 128K context, excellent multilingual support.

7.8 GB

min VRAM

131K

ctx

148

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

Gemma 2 9B Instruct

9B

Google Gemma 2

Google's compact Gemma 2 with sliding window attention. Punches above 9B.

5.8 GB

min VRAM

8K

ctx

188

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

Qwen2.5 72B Instruct

72B

Alibaba Qwen2.5

Flagship Qwen2.5. Requires dual 4090 or A100 80G. Exceptional reasoning at scale.

33.8 GB

min VRAM

131K

ctx

48

tok/s

Formats

GGUF · AWQ · EXL2
Pro GPU
EXL2 3.5bpw
Hugging Face

DeepSeek-R1-Distill-Qwen-14B

14B

DeepSeek

R1 reasoning distilled into 14B. Huge community interest; excellent chain-of-thought.

9.2 GB

min VRAM

131K

ctx

128

tok/s

Formats

GGUF · EXL2 · AWQ
Consumer GPU
EXL2 4.65bpw
Hugging Face

Llama 3.3 70B Instruct

70B

Meta Llama 3.3

Latest Meta 70B with improved multilingual. Drop-in upgrade from Llama 3.1 70B.

38.2 GB

min VRAM

131K

ctx

54

tok/s

Formats

GGUF · AWQ
Pro GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

Mistral Small 24B Instruct

24B

Mistral AI

Mistral's efficient 24B. Strong multilingual; fits on 24GB with Q4.

13.5 GB

min VRAM

33K

ctx

88

tok/s

Formats

GGUF · AWQ · EXL2
Consumer GPU · Pro GPU
EXL2 4.65bpw
Hugging Face

Qwen2.5-Coder 32B Instruct

32B

Alibaba Qwen2.5

Top-tier open coding model. HumanEval scores competitive with GPT-4o at 32B scale.

16.4 GB

min VRAM

131K

ctx

65

tok/s

Formats

GGUF · EXL2 · AWQ
Consumer GPU · Pro GPU
EXL2 3.5bpw
Hugging Face

Qwen2.5-Coder 7B Instruct

7B

Alibaba Qwen2.5

Best 7B coding model. Ideal for local dev assistants on 8–16GB VRAM.

4.8 GB

min VRAM

131K

ctx

248

tok/s

Formats

GGUF · AWQ · EXL2
Consumer GPU · Mac / Apple Silicon · CPU / VPS
EXL2 4.65bpw
Hugging Face

Qwen2.5 3B Instruct

3B

Alibaba Qwen2.5

Tiny Qwen2.5 for edge devices. Runs on 4GB VRAM or Raspberry Pi class hardware.

2.1 GB

min VRAM

33K

ctx

340

tok/s

Formats

GGUF
Consumer GPU · Mac / Apple Silicon · CPU / VPS
GGUF Q4_K_M
Hugging Face

Llama 3.2 1B Instruct

1B

Meta Llama 3.2

Ultra-light Llama for mobile and embedded. Sub-2GB VRAM with Q4.

1.0 GB

min VRAM

131K

ctx

520

tok/s

Formats

GGUF
Consumer GPU · Mac / Apple Silicon · CPU / VPS
GGUF Q4_K_M
Hugging Face

DeepSeek-R1-Distill-Llama-70B

70B

DeepSeek

R1 reasoning in Llama 70B architecture. Top open reasoning model for dual-GPU setups.

38.2 GB

min VRAM

131K

ctx

52

tok/s

Formats

GGUF · AWQ
Pro GPU
AWQ INT4
Hugging Face

Codestral 22B

22B

Mistral AI

Mistral's dedicated code model. Supports 80+ programming languages and fill-in-the-middle completion.

13.2 GB

min VRAM

33K

ctx

72

tok/s

Formats

GGUF · AWQ
Consumer GPU · Pro GPU
AWQ INT4
Hugging Face

Mixtral 8x7B Instruct

47B MoE

Mistral AI

Classic MoE model. ~13B active params per token; needs 32GB+ VRAM for Q4.

25.2 GB

min VRAM

33K

ctx

62

tok/s

Formats

GGUF · AWQ
Pro GPU
AWQ INT4
Hugging Face
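
The gap between active and total parameters matters for sizing a mixture-of-experts model: all experts have to be resident in memory, while only the active subset is computed for each token. This is a rough sketch using the 47B total and ~13B active counts from the card above. The 4.5 bits per weight for a Q4-class quantization is an assumption, not a figure from this index.

```python
TOTAL_PARAMS_B = 47.0    # total parameters, from the card above
ACTIVE_PARAMS_B = 13.0   # approximate active parameters per token, from the card
BITS_PER_WEIGHT = 4.5    # assumed average for a Q4-class quantization


def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Decimal GB needed to hold the given number of quantized weights."""
    return params_billions * bits_per_weight / 8


# Memory is driven by the total count; per-token compute by the active count.
resident_gb = weights_gb(TOTAL_PARAMS_B, BITS_PER_WEIGHT)
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"weights resident in memory: {resident_gb:.1f} GB")
print(f"share of weights used per token: {active_fraction:.0%}")
```

Under these assumptions the model is sized like a dense ~47B model for memory purposes, even though each token touches well under a third of the weights.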

Command R 35B

35B

Cohere

Cohere's RAG-optimised model. Excellent retrieval-augmented generation.

20.5 GB

min VRAM

131K

ctx

55

tok/s

Formats

GGUF · GPTQ
Pro GPU
GPTQ INT4
Hugging Face

Yi 1.5 34B Chat

34B

01.AI Yi

01.AI's strong bilingual (EN/ZH) model. Competitive with Qwen 32B.

19.8 GB

min VRAM

4K

ctx

52

tok/s

Formats

GGUF · AWQ
Consumer GPU · Pro GPU
AWQ INT4
Hugging Face

Solar 10.7B Instruct

11B

Upstage

Depth-upscaled 10.7B that punches above its weight. Strong on reasoning benchmarks.

6.5 GB

min VRAM

4K

ctx

168

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

StarCoder2 15B

15B

BigCode

BigCode's open code model trained on 600+ programming languages. Great for polyglot dev.

9.2 GB

min VRAM

16K

ctx

115

tok/s

Formats

GGUF · GPTQ
Consumer GPU
GPTQ INT4
Hugging Face

Llama 3.2 11B Vision Instruct

11B

Meta Llama 3.2

Multimodal Llama with image understanding. Vision encoder adds ~2GB VRAM overhead.

9.5 GB

min VRAM

131K

ctx

88

tok/s

Formats

GGUF
Consumer GPU · Mac / Apple Silicon
GGUF Q4_K_M
Hugging Face

Qwen2-VL 7B Instruct

7B

Alibaba Qwen2

Vision-language model with video understanding. Strong OCR and chart reading.

6.0 GB

min VRAM

33K

ctx

95

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

Nous Hermes 3 Llama 3.1 8B

8B

NousResearch

Fine-tuned Llama 3.1 8B with improved roleplay and instruction following.

5.4 GB

min VRAM

131K

ctx

232

tok/s

Formats

GGUF · EXL2
Consumer GPU · Mac / Apple Silicon · CPU / VPS
EXL2 4.65bpw
Hugging Face

WizardLM-2 7B

7B

Microsoft / WizardLM

Evol-Instruct fine-tuned Mistral-based 7B. Strong complex instruction handling.

4.8 GB

min VRAM

33K

ctx

218

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
AWQ INT4
Hugging Face

Granite 3.1 8B Instruct

8B

IBM Granite

IBM's enterprise-grade 8B. Strong RAG and tool-use; permissive Apache 2.0 license.

5.0 GB

min VRAM

131K

ctx

195

tok/s

Formats

GGUF · GPTQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
GPTQ INT4
Hugging Face

Gemma 2 2B Instruct

2B

Google Gemma 2

Ultra-compact Gemma 2. Runs on 4GB VRAM; great for edge prototyping.

1.8 GB

min VRAM

8K

ctx

450

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
AWQ INT4
Hugging Face

Gemma 2 27B Instruct

27B

Google Gemma 2

Largest open Gemma 2. Strong reasoning; needs 24GB+ VRAM at Q4.

16.2 GB

min VRAM

8K

ctx

58

tok/s

Formats

GGUF · AWQ
Consumer GPU · Pro GPU
AWQ INT4
Hugging Face

Qwen2.5 0.5B Instruct

0.5B

Alibaba Qwen2.5

Smallest Qwen2.5. Ideal for Raspberry Pi, phones, and ultra-low-latency demos.

0.6 GB

min VRAM

33K

ctx

620

tok/s

Formats

GGUF
Consumer GPU · Mac / Apple Silicon · CPU / VPS
GGUF Q4_K_M
Hugging Face

Qwen2.5 1.5B Instruct

1.5B

Alibaba Qwen2.5

Tiny Qwen with 32K context. Surprisingly capable for summarisation and chat.

1.4 GB

min VRAM

33K

ctx

480

tok/s

Formats

GGUF
Consumer GPU · Mac / Apple Silicon · CPU / VPS
GGUF Q4_K_M
Hugging Face

Phi-3 Medium 14B Instruct

14B

Microsoft Phi

Microsoft's mid-size Phi-3. Excellent quality-per-GB on 16GB cards.

8.8 GB

min VRAM

131K

ctx

135

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

Phi-4 Mini Instruct

3.8B

Microsoft Phi

Latest Phi mini with improved math and code. Strong 4B-class performer.

2.5 GB

min VRAM

131K

ctx

395

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
AWQ INT4
Hugging Face

Mistral 7B Instruct v0.3

7B

Mistral AI

Classic Mistral 7B v0.3. Still a reliable baseline for local chat APIs.

4.6 GB

min VRAM

33K

ctx

225

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
AWQ INT4
Hugging Face

DeepSeek-V2-Lite Chat

16B

DeepSeek

MoE general model (~2.4B active). Long context and strong multilingual chat.

9.6 GB

min VRAM

164K

ctx

188

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

DeepSeek-R1-Distill-Qwen-7B

7B

DeepSeek

R1 reasoning in a 7B footprint. Best value for 8–12GB VRAM CoT experiments.

5.2 GB

min VRAM

131K

ctx

210

tok/s

Formats

GGUF · EXL2
Consumer GPU · Mac / Apple Silicon · CPU / VPS
EXL2 4.65bpw
Hugging Face

DeepSeek-R1-Distill-Qwen-32B

32B

DeepSeek

R1 distilled to 32B. Near-frontier reasoning on a single 24GB card (Q3/Q4).

16.8 GB

min VRAM

131K

ctx

65

tok/s

Formats

GGUF · EXL2
Consumer GPU · Pro GPU
EXL2 3.5bpw
Hugging Face

Llama 3.2 90B Vision Instruct

90B

Meta Llama 3.2

Flagship multimodal Llama. Requires dual 4090 or A100; vision adds ~3GB overhead.

44.2 GB

min VRAM

131K

ctx

28

tok/s

Formats

GGUF
Pro GPU
GGUF Q3_K_M
Hugging Face

OLMo 2 7B Instruct

7B

Allen AI OLMo

Fully open training pipeline from Allen AI. Great for reproducibility research.

5.3 GB

min VRAM

4K

ctx

150

tok/s

Formats

GGUF
Consumer GPU · Mac / Apple Silicon · CPU / VPS
GGUF Q4_K_M
Hugging Face

InternLM2 7B Chat

7B

Shanghai AI Lab

Strong bilingual (EN/ZH) 7B from Shanghai AI Lab. Competitive with Qwen 7B.

4.9 GB

min VRAM

33K

ctx

215

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
AWQ INT4
Hugging Face

InternLM2 20B Chat

20B

Shanghai AI Lab

Mid-size InternLM2 with excellent Chinese comprehension. Fits 24GB at Q4.

13.8 GB

min VRAM

33K

ctx

78

tok/s

Formats

GGUF
Consumer GPU · Pro GPU
GGUF Q4_K_M
Hugging Face

Aya 23 8B

8B

Cohere For AI

Multilingual specialist covering 23 languages. Strong for non-English local apps.

5.0 GB

min VRAM

8K

ctx

205

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon · CPU / VPS
AWQ INT4
Hugging Face

OpenChat 3.6 8B

8B

OpenChat

C-RLFT fine-tuned Llama 3 8B. Known for natural conversational tone.

5.4 GB

min VRAM

8K

ctx

228

tok/s

Formats

GGUF · EXL2
Consumer GPU · Mac / Apple Silicon · CPU / VPS
EXL2 4.65bpw
Hugging Face

Zephyr 7B Beta

7B

HuggingFaceH4

DPO-aligned Mistral 7B. Classic choice for helpful, harmless chat baselines.

5.2 GB

min VRAM

33K

ctx

155

tok/s

Formats

GGUF
Consumer GPU · Mac / Apple Silicon · CPU / VPS
GGUF Q4_K_M
Hugging Face

Stable LM 2 12B Chat

12B

Stability AI

Stability AI's 12B chat model. Solid general-purpose option for 16GB GPUs.

7.2 GB

min VRAM

4K

ctx

142

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

Falcon 3 10B Instruct

10B

TII UAE

Technology Innovation Institute's latest Falcon. Good multilingual and code mix.

6.2 GB

min VRAM

33K

ctx

155

tok/s

Formats

GGUF · GPTQ
Consumer GPU · Mac / Apple Silicon
GPTQ INT4
Hugging Face

Jamba 1.5 Mini

12B

AI21 Labs

Hybrid SSM-Transformer MoE (12B active, 52B total) with 256K context. Efficient long-document QA on 16GB.

7.5 GB

min VRAM

262K

ctx

125

tok/s

Formats

GGUF · AWQ
Consumer GPU · Mac / Apple Silicon
AWQ INT4
Hugging Face

DBRX Instruct

132B

Databricks

MoE flagship (~36B active). Needs multi-GPU; strong code and reasoning at scale.

63.2 GB

min VRAM

33K

ctx

18

tok/s

Formats

GGUF
Pro GPU
GGUF Q3_K_M
Hugging Face
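
Because every card carries the same fields (min VRAM, context, tok/s, formats, hardware), the index lends itself to simple programmatic filtering. The sketch below shows one way such records could be represented and queried. The `ModelCard` type and the `fits` helper are hypothetical names for illustration and do not reflect any published schema or API; the three sample entries are copied from the cards above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """One index entry; field names are illustrative, not an official schema."""
    name: str
    min_vram_gb: float
    context: str
    tok_per_s: int
    formats: tuple[str, ...]
    hardware: tuple[str, ...]


# Three entries copied from the cards above.
CARDS = [
    ModelCard("Llama 3.1 8B Instruct", 3.2, "131K", 235,
              ("GGUF", "AWQ", "EXL2"),
              ("Consumer GPU", "Mac / Apple Silicon", "CPU / VPS")),
    ModelCard("Qwen2.5 32B Instruct", 16.4, "131K", 68,
              ("GGUF", "EXL2"), ("Consumer GPU", "Pro GPU")),
    ModelCard("DBRX Instruct", 63.2, "33K", 18,
              ("GGUF",), ("Pro GPU",)),
]


def fits(cards, vram_gb, fmt):
    """Return cards that offer `fmt` and whose min VRAM is within the budget."""
    return [c for c in cards if fmt in c.formats and c.min_vram_gb <= vram_gb]


# Which GGUF models fit a 24 GB card?
for card in fits(CARDS, vram_gb=24, fmt="GGUF"):
    print(card.name, card.min_vram_gb, "GB")
```

Keeping formats and hardware as tuples of separate labels makes membership tests such as `fmt in c.formats` straightforward.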