vLLM
High-throughput LLM serving for GPUs.
- Min VRAM: 24 GB
- GPU class: Datacenter GPU
- Quant: FP16
Eighty gigabytes and up is datacenter territory, typically rented by the hour rather than owned. This is where you serve 70B+ LLMs at production scale, run full fine-tunes of large models, and run multi-GPU video diffusion pipelines.
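To see why 70B+ serving lands in this tier, here is a rough back-of-the-envelope sketch. It counts model weights only (no KV cache, activations, or framework overhead), and the helper names, the bytes-per-parameter table, and the 90% usable-VRAM figure are illustrative assumptions rather than numbers from any particular tool.

```python
import math

# Assumed bytes per parameter for common weight formats (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, quant: str = "fp16") -> float:
    """Approximate VRAM needed for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[quant]


def min_gpus(
    params_billions: float,
    quant: str = "fp16",
    gpu_gb: float = 80.0,
    usable_fraction: float = 0.9,  # assumption: leave ~10% headroom per card
) -> int:
    """Smallest number of GPUs whose usable VRAM can hold the weights."""
    usable_gb = gpu_gb * usable_fraction
    return math.ceil(weight_gb(params_billions, quant) / usable_gb)


if __name__ == "__main__":
    for size in (13, 70):
        print(f"{size}B @ FP16: ~{weight_gb(size):.0f} GB weights, "
              f"at least {min_gpus(size)} x 80 GB GPU(s)")
```

Treat the result as a floor: a 70B model at FP16 needs about 140 GB for weights alone, so it already spills past a single 80 GB card before you budget anything for context length or batch size.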
Sorted by best fit for this tier — tools designed around your VRAM budget first, then by our power-user score.
High-throughput LLM serving for GPUs.
World-foundation models for physical AI.
13B open-weight cinematic text-to-video.
Autoregressive video diffusion at 24 GB.
Genmo's 10B open-weight T2V, the first 'genuinely fluid' OSS video model.
Serverless Python for GPU workloads.
Modern training framework — Flux, SDXL, SD3 LoRAs in YAML.
Microsoft Research's structured 3D representation model.
Memory-efficient T2V via pyramidal flow matching.
Open-weight video diffusion from Alibaba.
The standard SDXL/Flux LoRA training UI.
The INRIA original — train your own splats.
Tencent's open 3D generator — multi-view, PBR, ready-to-use meshes.
Real-time-ish open video diffusion from Lightricks.
Modern alternative trainer for SD/SDXL/Flux.
Stability's MMDiT flagship at 8B params.
Diffusion-based photorealistic upscaler.
Open-source text-to-video diffusion from THUDM.
Dead-simple Flux LoRA training in a Gradio UI.
Image-to-video diffusion — 25 frames, 14 or 25 steps.
Animation motion modules for ComfyUI.
12B parameter open-weight diffusion model.
Reference-image conditioning for ComfyUI.
The open framework for NeRF and Gaussian Splatting research.
The underlying scripts powering Kohya & most LoRA trainers.
Black Forest Labs' 4-step distilled Flux.
On-demand GPU pods for ComfyUI, vLLM, training.
Zero-shot voice cloning TTS — 15 s of audio is enough.
Production-leaning SD studio with canvas & batch.
Power-user front-end that wraps ComfyUI.
Add motion to any SD checkpoint via a motion module.
Generate synchronized audio for any silent video.
Meta's text-to-music & sound-effect model family.
NVIDIA's seconds-to-train hash-grid NeRF.
Suno's expressive transformer-based TTS.
The nodal workflow engine for serious diffusion.
All the ControlNet preprocessors in one node pack.
The workhorse open-weight image model.
The "A1111 for LLMs" — multi-loader local chat UI.
Stable Diffusion baked into a real painting app.
Single-image to 3D mesh in under a second on a 4090.
Open-weight text-to-audio — 47-second sound effects and music.
Novel-view synthesis — generate any angle from a single image.
Hugging Face's go-to library for every diffusion model.
The original SD power-user webUI.
GPU-accelerated upscaling, frame-interp, denoise.
Whisper + speaker diarisation + word-level timestamps.
Multilingual voice cloning in 6 seconds.
Optimized A1111 fork for low-VRAM cards.
Free RIFE-based frame interpolation.
Self-hosted, GPU-accelerated coding autocompletion.
The voice-changer that took over Discord.
Stable Diffusion XL, dialed to one button.
Slow, but the quality is worth the wait.
The library every LLM ships against first.
The reference open-source speech-to-text model.
Whisper, 4× faster, same accuracy. CTranslate2 backend.
The most active open face-swap toolkit.
The default OSS upscaler, still.