DATASHEET // VLLM

vLLM

High-throughput LLM serving for GPUs.

OPEN SOURCE24–80 GB VRAMSelf-hosted server

Actually FreeNo SignupOpen SourceWatermark-FreeAPI

Visit vLLMUPDATED 2026-05-14 · DIRECT LINK

github.com/vllm-project/vllm

HARDWARE REQUIREMENTS //

Self-hosted server · Datacenter GPU (80 GB+)

24–80 GB VRAM

Min VRAM

24 GB

Rec. VRAM

80 GB

Min RAM

64 GB

Rec. RAM

256 GB

Disk

500 GB

GPU class

Datacenter GPU

CUDA 12.xNo Apple SiliconGPU RequiredQuant: FP16, BF16, FP8 +3

For desktops you usually want Ollama / llama.cpp instead.

[ EDITORIAL PICK ]

Why we recommend vLLM

DERIVED FROM METADATA — NOT SPONSORED

Open source
Source is public — you can audit it, fork it, and you'll never lose access to your workflows if vLLM the company changes direction.
Top-tier pick
Power-user score 93/100 — consistently rated highly by people who use this every day, not just benchmark chasers.
6 quant formats
Supports FP16, BF16, FP8 and 3 more — you can dial VRAM use up or down to match your card.
Hosted API too
Both self-hostable and available as a hosted API — prototype on someone else's GPU, deploy on yours.

[ EVIDENCE NOTE ]

Documentation-led datasheet

This page summarizes upstream documentation, release information, and editorially reviewed catalogue fields. It is not presented as a hands-on benchmark. Verify changing requirements at the official project; report stale data through our corrections channel.

Memory guide →

AT-A-GLANCE SIGNALS //

DERIVED FROM THIS PAGE'S DATA

Install difficulty
Standard
A standard local install — download, install dependencies, point at your GPU.
Hardware comfort
Enthusiast
Needs 24 GB minimum — RTX 3090 / 4090 territory.
Ecosystem
Strong devkit
Open-source AND ships an API — easy to integrate, possible to host yourself.
Verification
Recent
Catalogue entry last updated 63 days ago — re-verification due soon.

[ COMMUNITY GUIDES & WORKFLOWS ]

Tutorials & deep-dives for vLLM

Hand-picked from YouTube, Reddit, GitHub, and the wider web. Each link goes straight to the source — we don't intercept or rewrite anything.

[ MORE IN THIS NICHE ]

Other local llm runners tools we rate

Three picks across different tradeoffs — so you don't end up with three near-clones of vLLM.

LIGHTEST HARDWARE //

Transformers

The library every LLM ships against first.

OPEN SOURCE2–24 GB VRAM

BEST FREE OPTION //

llama.cpp

The C++ inference engine powering most local LLMs.

OPEN SOURCECPU-CAPABLE

TOP QUALITY //

Ollama

One-command local LLM runtime.

OPEN SOURCECPU-CAPABLE

What is vLLM?

Production-grade inference server: PagedAttention, continuous batching, tensor parallelism. The standard when you need to serve thousands of requests/sec on H100s/A100s/L40s.

Pros & cons

✓ PROS

Best throughput in OSS
Tensor parallel across many GPUs
OpenAI-compatible server

– CONS

CUDA-only realistically
Aimed at servers, not desktops

What's actually free?

Free / OSS. Run it on your own datacenter GPUs.

✓ Actually FreeNo SignupOpen SourceWatermark-Free

Alternatives

llama.cpp

The C++ inference engine powering most local LLMs.

OPEN SOURCECPU-CAPABLE