vLLM
High-throughput LLM serving for GPUs.
- Min VRAM: 24 GB
- GPU class: Datacenter GPU
- Quant: FP16
A 6 GB card is the practical floor for serious local AI in 2026. With aggressive quantization (NF4, GGUF Q4), you can run SDXL, Flux dev, and small LLMs. Long video diffusion stays out of reach — but you can absolutely learn the entire stack on this hardware.
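To see why 4-bit quantization is what makes a 6 GB card workable, here is a rough back-of-the-envelope sketch. It only counts model weights plus a flat overhead allowance. The function names, the 1.5 GB overhead figure, and the 7B example size are illustrative assumptions, not measurements. Real NF4 and GGUF Q4 files also store scaling data, so their effective bits per weight run somewhat above 4.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billions: float, bits_per_weight: float,
         vram_gb: float = 6.0, overhead_gb: float = 1.5) -> bool:
    """Rough fit check: weights plus an assumed flat overhead.

    overhead_gb is a hypothetical allowance for activations, KV cache,
    and the CUDA context; actual usage depends on the tool and settings.
    """
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Hypothetical 7B-parameter LLM at FP16 versus nominal 4-bit weights.
    for label, bits in [("FP16", 16), ("4-bit", 4)]:
        size = weight_memory_gb(7, bits)
        print(f"7B @ {label}: ~{size:.1f} GB weights, fits in 6 GB: {fits(7, bits)}")
```

Under these assumptions a 7B model needs about 14 GB of weights at FP16 but about 3.5 GB at 4 bits, which is the difference between not loading at all and loading with some room to spare.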
Sorted by best fit for this tier — tools designed around your VRAM budget first, then by our power-user score.
High-throughput LLM serving for GPUs.
World-foundation models for physical AI.
13B open-weight cinematic text-to-video.
Autoregressive video diffusion at 24 GB.
Genmo's 10B open-weight T2V, the first 'genuinely fluid' OSS video model.
Serverless Python for GPU workloads.
Modern training framework — Flux, SDXL, SD3 LoRAs in YAML.
Microsoft Research's structured 3D representation model.
Memory-efficient T2V via pyramidal flow matching.
Open-weight video diffusion from Alibaba.
The standard SDXL/Flux LoRA training UI.
The INRIA original — train your own splats.
Tencent's open 3D generator — multi-view, PBR, ready-to-use meshes.
Real-time-ish open video diffusion from Lightricks.
Modern alternative trainer for SD/SDXL/Flux.
Stability's MMDiT flagship at 8B params.
Diffusion-based photorealistic upscaler.
Open-source text-to-video diffusion from THUDM.
Dead-simple Flux LoRA training in a Gradio UI.
Image-to-video diffusion — 25 frames, 14 or 25 steps.
Animation motion modules for ComfyUI.
The nodal workflow engine for serious diffusion.
All the ControlNet preprocessors in one node pack.
The workhorse open-weight image model.
The "A1111 for LLMs" — multi-loader local chat UI.
Stable Diffusion baked into a real painting app.
Single-image to 3D mesh in under a second on a 4090.
Open-weight text-to-audio — 47-second sound effects and music.
Novel-view synthesis — generate any angle from a single image.
Hugging Face's go-to library for every diffusion model.
The original SD power-user webUI.
GPU-accelerated upscaling, frame-interp, denoise.
Whisper + speaker diarisation + word-level timestamps.
Multilingual voice cloning in 6 seconds.
Optimized A1111 fork for low-VRAM cards.
Free RIFE-based frame interpolation.
Self-hosted, GPU-accelerated coding autocompletion.
The voice-changer that took over Discord.
Stable Diffusion XL, dialed to one button.
Slow, but the quality is worth the wait.
The library every LLM ships against first.
The reference open-source speech-to-text model.
Whisper, 4× faster, same accuracy. CTranslate2 backend.
The most active open face-swap toolkit.
The default OSS upscaler, still.