vLLM
High-throughput LLM serving for GPUs.
- Min VRAM: 24 GB
- GPU class: Datacenter GPU
- Quant: FP16
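Forty-eight gigabytes is workstation territory. At this tier you can run the biggest open-weight video models at full precision, serve quantized 70B LLMs through vLLM (FP16 weights for a 70B model come to roughly 140 GB on their own, so 4-bit quantization is what makes one fit), and start full fine-tuning instead of just training LoRAs. The RTX A6000, RTX 6000 Ada, and L40S are the typical homes for this tier; the RTX 5090, at 32 GB, falls just short of it.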
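As a rough check on what fits in 48 GB, the sketch below estimates the memory needed for model weights alone at different precisions. It is a back-of-the-envelope calculation, not a vLLM API: the function names and the 10% headroom figure are assumptions made for illustration, and real deployments also need room for the KV cache, activations, and framework overhead.

```python
# Rough estimate of the VRAM needed just to hold model weights.
# Ignores KV cache, activations, and framework overhead, which all add more.

GIB = 1024 ** 3  # bytes per GiB


def weight_gib(params_billion: float, bits_per_param: int) -> float:
    """Size of the weights in GiB for a parameter count and precision."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / GIB


def fits(params_billion: float, bits_per_param: int, vram_gib: float,
         headroom: float = 0.10) -> bool:
    """True if the weights fit while leaving `headroom` (a fraction of VRAM) free.

    The 10% default is an illustrative assumption, not a measured value.
    """
    return weight_gib(params_billion, bits_per_param) <= vram_gib * (1 - headroom)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_gib(70, bits)
        print(f"70B at {bits}-bit: {size:.1f} GiB, "
              f"fits in 48 GiB: {fits(70, bits, 48)}")
```

Under these assumptions, a 70B model only fits the 48 GB tier at 4-bit precision; at FP16 or 8-bit the weights alone exceed the card.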
Sorted by best fit for this tier — tools designed around your VRAM budget first, then by our power-user score.
High-throughput LLM serving for GPUs.
World-foundation models for physical AI.
13B open-weight cinematic text-to-video.
Autoregressive video diffusion at 24 GB.
Genmo's 10B open-weight T2V, the first 'genuinely fluid' OSS video model.
Serverless Python for GPU workloads.
Modern training framework — Flux, SDXL, SD3 LoRAs in YAML.
Microsoft Research's structured 3D representation model.
Memory-efficient T2V via pyramidal flow matching.
Open-weight video diffusion from Alibaba.
The standard SDXL/Flux LoRA training UI.
The INRIA original — train your own splats.
Tencent's open 3D generator — multi-view, PBR, ready-to-use meshes.
Real-time-ish open video diffusion from Lightricks.
Modern alternative trainer for SD/SDXL/Flux.
Stability's MMDiT flagship at 8B params.
Diffusion-based photorealistic upscaler.
Open-source text-to-video diffusion from THUDM.
Dead-simple Flux LoRA training in a Gradio UI.
Image-to-video diffusion — 25 frames, 14 or 25 steps.
Animation motion modules for ComfyUI.
Hugging Face's go-to library for every diffusion model.
The original SD power-user webUI.
GPU-accelerated upscaling, frame-interp, denoise.
Whisper + speaker diarisation + word-level timestamps.
Multilingual voice cloning in 6 seconds.
Optimized A1111 fork for low-VRAM cards.
Free RIFE-based frame interpolation.
Self-hosted, GPU-accelerated coding autocompletion.
The voice-changer that took over Discord.
Stable Diffusion XL, dialed to one button.
Slow, but the quality is worth the wait.
The library every LLM ships against first.
The reference open-source speech-to-text model.
Whisper, 4× faster, same accuracy. CTranslate2 backend.
The most active open face-swap toolkit.
The default OSS upscaler, still.