llama.cpp
The C++ inference engine powering most local LLMs.
Open Source · CPU-capable
Actually Free · No Signup · Open Source · Watermark-Free
High-throughput LLM serving for GPUs.
Self-hosted server · Datacenter GPU (80 GB+)
For desktops you usually want Ollama / llama.cpp instead.
Production-grade inference server with PagedAttention, continuous batching, and tensor parallelism. The standard when you need to serve thousands of requests per second on H100s/A100s/L40s.
Free / OSS. Run it on your own datacenter GPUs.
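To make "continuous batching" concrete, here is a toy sketch in Python. It is not the server's actual scheduler: the function names, the workload, and the one-token-per-step model are all assumptions made for illustration. It only counts how many decode steps a fixed set of requests needs when finished requests free their slot immediately, compared with a static batch that waits for its longest request.

```python
from collections import deque


def continuous_batching_steps(lengths, max_batch):
    """Count decode steps when free slots are refilled before every step.

    lengths: tokens each request still has to generate (hypothetical workload).
    max_batch: how many requests can decode together in one step.
    """
    waiting = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        # Finished requests leave right away, freeing their slots.
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


def static_batching_steps(lengths, max_batch):
    """Baseline: each fixed batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), max_batch):
        steps += max(lengths[start:start + max_batch])
    return steps


if __name__ == "__main__":
    workload = [8, 2, 2, 2]  # one long request, three short ones
    print("continuous:", continuous_batching_steps(workload, max_batch=2))
    print("static:", static_batching_steps(workload, max_batch=2))
```

In this toy model the short requests slip into the slot next to the long one instead of waiting for it, which is the intuition behind the throughput gain. Real servers also have to manage KV-cache memory and prefill, which this sketch ignores.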