Intermediate · Server / VPS · 7 min read

CPU Inference: OpenBLAS Tuning for llama.cpp

Maximize tokens/sec on a CPU-only VPS with thread count and BLAS backend tuning.

CPU · llama.cpp · OpenBLAS · VPS

Thread count

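Set -t to the number of physical cores, not hyperthreads. On many x86 VPS plans a vCPU is one hyperthread, so the physical core count can be half the vCPU count. Use -tb 1 for single-batch interactive use; note that -tb (--threads-batch) sets the thread count for prompt and batch processing, so long prompts will be ingested on a single thread. Omit -tb to fall back to the -t value.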

bash

./build/bin/llama-server \
  -m ./models/Llama-3.1-8B-Q4_K_M.gguf \
  -t 8 -tb 1 -c 4096 \
  --host 0.0.0.0 --port 8080
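
To pick a value for -t, the physical core count can be read from the CPU topology. The following is a minimal sketch, assuming a Linux host with lscpu (util-linux) and nproc (coreutils). The variable names PHYS_CORES and LOGICAL_CPUS are illustrative, and inside a VM the reported topology is whatever the hypervisor exposes, which may not match the real hardware.

```shell
# Logical CPUs usable by this process (respects affinity and cpuset limits).
LOGICAL_CPUS=$(nproc)

# lscpu -p prints one CSV row per logical CPU; unique (core, socket) pairs
# therefore correspond to physical cores.
if command -v lscpu >/dev/null 2>&1; then
  PHYS_CORES=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
else
  PHYS_CORES=0
fi

# Fall back to the logical count if the topology could not be read,
# and never exceed the CPUs this process is actually allowed to use.
if [ "${PHYS_CORES:-0}" -lt 1 ] || [ "$PHYS_CORES" -gt "$LOGICAL_CPUS" ]; then
  PHYS_CORES=$LOGICAL_CPUS
fi

echo "physical cores: ${PHYS_CORES}, logical CPUs: ${LOGICAL_CPUS}"
```

The result can then be passed to the server as -t "$PHYS_CORES" in place of the hard-coded 8.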

Expected performance

An 8-core VPS with an OpenBLAS build reaches roughly 8–15 tok/s on an 8B model at Q4_K_M quantization. That is usable for a personal API, but not for production throughput.

text

Hetzner CX32 (8 vCPU, 32GB): ~12 tok/s
AWS c7i.2xlarge (8 vCPU): ~15 tok/s
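
To compare your own host against these figures, the steps can be sketched as follows: build llama.cpp against OpenBLAS, then sweep thread counts with llama-bench. This is a sketch under assumptions, not a verified recipe. It assumes a Debian or Ubuntu host, a recent llama.cpp checkout where the CMake options are named GGML_BLAS and GGML_BLAS_VENDOR (older trees used different names), and the model path from the server command above. According to the llama.cpp build notes, BLAS mainly helps prompt processing at larger batch sizes rather than token generation, so expect any gain to show up in the prompt (pp) numbers.

```shell
# Assumption: Debian/Ubuntu package name for the OpenBLAS development files.
sudo apt-get install -y libopenblas-dev

# Configure and build with the BLAS backend set to OpenBLAS.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j

# Benchmark prompt processing (-p 512 tokens) and generation (-n 128 tokens)
# at several thread counts; adjust the list to your physical core count.
./build/bin/llama-bench \
  -m ./models/Llama-3.1-8B-Q4_K_M.gguf \
  -t 2,4,8 -p 512 -n 128
```

llama-bench reports tokens per second for each thread count, which makes it easy to see where adding threads stops helping.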

Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.