Back to Cookbook
IntermediateServer / VPS 12 min read
Multi-Model API Server on RTX 4090 with vLLM
Serve multiple AWQ-quantized models with vLLM's continuous batching for production-grade throughput.
vLLMAWQNVIDIARTX 4090APIDocker
Why vLLM + AWQ?
vLLM's PagedAttention and continuous batching deliver 3-5x throughput over naive inference. AWQ gives the best accuracy at INT4 for NVIDIA.
text
Model: Qwen2.5-7B-Instruct-AWQ
Hardware: RTX 4090 24GB
Throughput: ~220 tok/s (batch=1), ~1400 tok/s (batch=8)
VRAM used: 4.8 GB weights + KV cacheInstall vLLM
Install inside a virtual environment with CUDA 12.1.
bash
python3 -m venv venv && source venv/bin/activate
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121Serve the model
Start an OpenAI-compatible server with quantization support.
bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-AWQ \
--quantization awq \
--max-model-len 32768 \
--gpu-memory-utilization 0.85 \
--port 8000Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.