IntermediateServer / VPS 12 min read

Multi-Model API Server on RTX 4090 with vLLM

Serve multiple AWQ-quantized models with vLLM's continuous batching for production-grade throughput.

vLLMAWQNVIDIARTX 4090APIDocker

Why vLLM + AWQ?

vLLM's PagedAttention and continuous batching deliver 3-5x throughput over naive inference. AWQ gives the best accuracy at INT4 for NVIDIA.

text

Model: Qwen2.5-7B-Instruct-AWQ
Hardware: RTX 4090 24GB
Throughput: ~220 tok/s (batch=1), ~1400 tok/s (batch=8)
VRAM used: 4.8 GB weights + KV cache

Install vLLM

Install inside a virtual environment with CUDA 12.1.

bash

python3 -m venv venv && source venv/bin/activate
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

Serve the model

Start an OpenAI-compatible server with quantization support.

bash

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --port 8000

Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.