IntermediateEdge / Local 10 min read

ExLlamaV2 on RTX 4090: Full Setup Guide

Install ExLlamaV2, load an EXL2 quant, and serve an OpenAI-compatible API in under 10 minutes.

ExLlamaV2EXL2RTX 4090API

Install

ExLlamaV2 requires NVIDIA CUDA. Python 3.10+ recommended.

bash

pip install exllamav2
# Or clone for latest:
git clone https://github.com/turboderp/exllamav2
cd exllamav2 && pip install -r requirements.txt

Run inference

Point at a downloaded EXL2 model directory. Adjust -c for context length.

bash

python examples/chat.py \
  -m ./models/Llama-3.1-8B-exl2-4.65bpw \
  -c 4096 -gs 16

Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.