Deployment Cookbook
Battle-tested guides for running LLMs on real hardware
Run Llama 3.1 8B on a €20/month VPS
A complete guide to running a private LLM API on a budget Linux VPS using llama.cpp server mode.
Mac M3 Max: The Ultimate Local LLM Setup
Maximise your Apple Silicon with Ollama. Run multiple models, set up an OpenAI-compatible API, and tune Metal GPU layers.
Multi-Model API Server on RTX 4090 with vLLM
Serve multiple AWQ-quantized models with vLLM's continuous batching for production-grade throughput.
Docker Compose LLM Stack: Ollama + Open WebUI
A production-ready Docker Compose stack that gives you a local ChatGPT experience with one command.
What Can You Run on RTX 4060 Ti 16GB?
A practical guide to picking the right model and quant level for NVIDIA's best budget 16GB card.
DeepSeek-R1 Distill 14B: EXL2 vs GGUF
Head-to-head on RTX 4090 — when to pick turboderp EXL2 over bartowski GGUF.
ExLlamaV2 on RTX 4090: Full Setup Guide
Install ExLlamaV2, load an EXL2 quant, and serve an OpenAI-compatible API in under 10 minutes.
Running 70B on Dual RTX 3090 with llama.cpp
Tensor-split across two 24GB cards to run Llama 3.1 70B or Qwen2.5 72B at Q4.
Qwen2.5-Coder 32B on a Single RTX 4090
The best open coding model that fits in 24GB — quant selection and tuning tips.
Mac M3 Pro: Realistic Model Limits
What actually fits in 18GB or 36GB unified memory with Ollama and llama.cpp.
llama.cpp on Windows with CUDA
Build llama.cpp with NVIDIA GPU support on Windows 11 — the path of least resistance for PC gamers.
TabbyAPI: ExLlamaV2 with a Web UI
Wrap ExLlamaV2 in TabbyAPI for a polished OpenAI-compatible server with streaming and model hot-swap.
Quantize Your Own Model to GGUF
Convert a supported Hugging Face model to GGUF with llama.cpp's conversion script, then use its quantize tool to produce a Q4_K_M file for local inference.
vLLM + AWQ in Production: Tuning Guide
gpu-memory-utilization, max-model-len, and batching knobs for stable API serving.
CPU Inference: OpenBLAS Tuning for llama.cpp
Maximise tokens/sec on a CPU-only VPS with thread count and BLAS backend tuning.
8GB GPU Starter Guide: 3060 / 4060 / 3070
The most common local LLM hardware tier — which models, quants, and context lengths actually fit in 8GB VRAM.
M1 / M2 Mac 8GB: Realistic Ollama Limits
Unified memory is shared with macOS — here is what actually works on base MacBooks without swapping.
WSL2 + Ollama GPU Passthrough on Windows
Run Ollama with NVIDIA GPU acceleration inside WSL2 — the most reliable Windows path for local LLMs.
Docker: Ollama with NVIDIA GPU Passthrough
Containerised Ollama with GPU access — isolate models, pin versions, and run alongside other services.
Nginx Reverse Proxy for Local LLM APIs
Put Ollama or llama.cpp behind Nginx with TLS, rate limiting, and a stable /v1 endpoint for your apps.
AMD GPU + llama.cpp via ROCm (Quick Start)
Run GGUF models on Radeon RX 7900 / 6800 series cards with the llama.cpp HIP backend: what works and what does not.
Ollama on Windows (Native, No WSL)
Install the Windows Ollama app for the simplest path — GPU works on NVIDIA; AMD is CPU-only for now.