Deployment Cookbook

12 min read

Multi-Model API Server on RTX 4090 with vLLM

Serve multiple AWQ-quantized models with vLLM's continuous batching for production-grade throughput.

vLLM, AWQ, NVIDIA, RTX 4090

Docker, Ollama, Open WebUI, Compose

Beginner · Docker

5 min read

Docker Compose LLM Stack: Ollama + Open WebUI

A production-ready Docker Compose stack that gives you a local ChatGPT experience with one command.

Beginner · Edge / Local

7 min read

What Can You Run on RTX 4060 Ti 16G?

A practical guide to picking the right model and quant level for NVIDIA's best budget 16GB card.

RTX 4060 Ti, GGUF, EXL2, VRAM

DeepSeek-R1, EXL2, GGUF, ExLlamaV2

9 min read

DeepSeek-R1 Distill 14B: EXL2 vs GGUF

Head-to-head on RTX 4090 — when to pick turboderp EXL2 over bartowski GGUF.

10 min read

ExLlamaV2 on RTX 4090: Full Setup Guide

Install ExLlamaV2, load an EXL2 quant, and serve an OpenAI-compatible API in under 10 minutes.

ExLlamaV2, EXL2, RTX 4090, API

70B, Multi-GPU, llama.cpp, tensor-split

Advanced · Edge / Local

11 min read

Running 70B on Dual RTX 3090 with llama.cpp

Tensor-split across two 24GB cards to run Llama 3.1 70B or Qwen2.5 72B at Q4.

Qwen2.5-Coder, 32B, RTX 4090, GGUF

8 min read

Qwen2.5-Coder 32B on a Single RTX 4090

The best open coding model that fits in 24GB — quant selection and tuning tips.

Mac, M3 Pro, Ollama, Unified Memory

Beginner · Mac / Apple

6 min read

Mac M3 Pro: Realistic Model Limits

What actually fits in 18GB or 36GB unified memory with Ollama and llama.cpp.

Beginner · Edge / Local

9 min read

llama.cpp on Windows with CUDA

Build llama.cpp with NVIDIA GPU support on Windows 11 — the path of least resistance for PC gamers.

Windows, llama.cpp, CUDA, GGUF

8 min read

TabbyAPI: ExLlamaV2 with a Web UI

Wrap ExLlamaV2 in TabbyAPI for a polished OpenAI-compatible server with streaming and model hot-swap.

TabbyAPI, ExLlamaV2, API, EXL2

GGUF, llama.cpp, quantize, custom

Advanced · Edge / Local

12 min read

Quantize Your Own Model to GGUF

Use llama.cpp's quantize tool to convert any HF model to GGUF Q4_K_M for local inference.

Advanced · Server / VPS

10 min read

vLLM + AWQ in Production: Tuning Guide

gpu-memory-utilization, max-model-len, and batching knobs for stable API serving.

vLLM, AWQ, production, API

7 min read

CPU Inference: OpenBLAS Tuning for llama.cpp

Maximize tokens/sec on a CPU-only VPS with thread count and BLAS backend tuning.

CPU, llama.cpp, OpenBLAS, VPS

8GB VRAM, RTX 3060, RTX 4060, GGUF

Beginner · Edge / Local

8 min read

8GB GPU Starter Guide: 3060 / 4060 / 3070

The most common local LLM hardware tier — which models, quants, and context lengths actually fit in 8GB VRAM.

Beginner · Mac / Apple

7 min read

M1 / M2 Mac 8GB: Realistic Ollama Limits

Unified memory is shared with macOS — here is what actually works on base MacBooks without swapping.

M1, M2, 8GB RAM, Ollama

10 min read

WSL2 + Ollama GPU Passthrough on Windows

Run Ollama with NVIDIA GPU acceleration inside WSL2 — the most reliable Windows path for local LLMs.

WSL2, Windows, Ollama, NVIDIA

Intermediate · Docker

9 min read

Docker: Ollama with NVIDIA GPU Passthrough

Containerised Ollama with GPU access — isolate models, pin versions, and run alongside other services.

Docker, Ollama, NVIDIA, GPU

Windows, Ollama, NVIDIA, Desktop

11 min read

Nginx Reverse Proxy for Local LLM APIs

Put Ollama or llama.cpp behind Nginx with TLS, rate limiting, and a stable /v1 endpoint for your apps.

AMD GPU + llama.cpp via ROCm (Quick Start)

Run GGUF models on Radeon RX 7900 / 6800 series with llama.cpp HIP backend — what works and what does not.

Ollama on Windows (Native, No WSL)

Install the Windows Ollama app for the simplest path — GPU works on NVIDIA; AMD is CPU-only for now.