FoundationVision/VAR
[NeurIPS 2024 Best Paper Award] [GPT beats diffusion 🔥] [Scaling laws in visual generation 📈] Official implementation of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
Implements next-scale prediction, a coarse-to-fine autoregressive approach where token generation proceeds by resolution levels rather than raster-scan order, enabling transformers to match or exceed diffusion model quality. Leverages a discrete VAE bottleneck and PyTorch 2.0+ with optional Flash-Attention and xformers backends for accelerated transformer inference on ImageNet-scale datasets. Provides pre-trained checkpoints (310M–2.3B parameters) on Hugging Face alongside a minimal training pipeline for custom image datasets.
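The coarse-to-fine generation order described above can be sketched in a few lines. This is an illustrative mock, not the official VAR API: `predict_scale` stands in for a transformer forward pass, and `VOCAB_SIZE` and the scale schedule are assumed values.

```python
import random

# Illustrative sketch of next-scale prediction (not the official VAR code):
# tokens are generated level by level, from a coarse 1x1 token map up to
# finer resolutions, each level conditioned on all previously generated levels.

VOCAB_SIZE = 4096        # size of the discrete VAE codebook (assumed)
SCALES = [1, 2, 3, 4]    # token-map side lengths, coarse to fine (assumed)

def predict_scale(context, side, rng):
    """Stand-in for a transformer forward pass: emits side*side tokens.
    A real model would condition on `context`; here tokens are random."""
    return [rng.randrange(VOCAB_SIZE) for _ in range(side * side)]

def generate(rng=None):
    rng = rng or random.Random(0)
    context = []         # all tokens generated so far, flattened
    token_maps = []      # one token map per scale
    for side in SCALES:
        tokens = predict_scale(context, side, rng)
        token_maps.append(tokens)
        context.extend(tokens)  # finer scales see all coarser tokens
    return token_maps, context

maps, ctx = generate()
print(len(ctx))  # total tokens = 1 + 4 + 9 + 16 = 30
```

Note that all `side * side` tokens of a scale are produced in one step, which is what lets next-scale prediction avoid the long raster-scan sequences of token-by-token autoregression.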
Stars: 8,641
Forks: 563
Language: Jupyter Notebook
License: MIT
Category: —
Last pushed: Nov 10, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/diffusion/FoundationVision/VAR"
Open to everyone: 100 requests/day with no key needed. Get a free API key for 1,000 requests/day.
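The same request can be made from Python with the standard library. A minimal sketch, assuming the URL layout shown in the curl example above; the response schema is not documented here, so the result is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the API URL for a repo, mirroring the curl example above."""
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch quality data for a repo. Requires network access; no API key
    is needed for up to 100 requests/day."""
    with urllib.request.urlopen(quality_url(category, owner, repo),
                                timeout=timeout) as resp:
        return json.load(resp)

# Example (performs a real HTTP request):
# data = fetch_quality("diffusion", "FoundationVision", "VAR")
```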
Related models
NVlabs/Sana
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
nerdyrodent/VQGAN-CLIP
Just playing with getting VQGAN+CLIP running locally, rather than having to use colab.
huggingface/finetrainers
Scalable and memory-optimized training of diffusion models
eps696/aphantasia
CLIP + FFT/DWT/RGB = text to image/video
AssemblyAI-Community/MinImagen
MinImagen: A minimal implementation of the Imagen text-to-image model