Multimodal Vision Language Diffusion Models
16 multimodal vision language models are tracked. One scores above 50 (Established tier). The highest-rated is zai-org/CogVideo at 52/100 with 12,515 stars.
Get all 16 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=diffusion&subcategory=multimodal-vision-language&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
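For scripted access, the same request can be made from Python. The following is a minimal sketch using only the standard library. The helper names `build_url` and `fetch_projects` are hypothetical, and because the response schema is not documented here, the payload is returned as parsed JSON without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="diffusion",
              subcategory="multimodal-vision-language",
              limit=20):
    """Build the query URL; defaults mirror the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and return the decoded JSON payload.

    The structure of the payload is an assumption left to the caller:
    inspect it before relying on specific keys.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` should return the same data as the curl command. Unauthenticated calls count toward the 100 requests/day limit, so caching the result locally is advisable when iterating.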
| # | Model | Score | Tier |
|---|---|---|---|
| 1 | zai-org/CogVideo<br>text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023) | 52 | Established |
| 2 | zhaorw02/DeepMesh<br>[ICCV 2025] Official code of DeepMesh: Auto-Regressive Artist-mesh Creation... | | Emerging |
| 3 | YangLing0818/RPG-DiffusionMaster<br>[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and... | | Emerging |
| 4 | thu-nics/FrameFusion<br>[ICCV'25] The official code of paper "Combining Similarity and Importance... | | Emerging |
| 5 | Yushi-Hu/tifa<br>TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with... | | Emerging |
| 6 | OpenMeshLab/MeshXL<br>[NeurIPS 2024] MeshXL: Neural Coordinate Field for Generative 3D Foundation... | | Experimental |
| 7 | ByteVisionLab/TokenFlow<br>[CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for... | | Experimental |
| 8 | j-min/DSG<br>Davidsonian Scene Graph (DSG) for Text-to-Image Evaluation (ICLR 2024) | | Experimental |
| 9 | YangLing0818/VideoTetris<br>[NeurIPS 2024] VideoTetris: Towards Compositional Text-To-Video Generation | | Experimental |
| 10 | jqin4749/MindVideo<br>Official code base for MinD-Video | | Experimental |
| 11 | showlab/VisorGPT<br>[NeurIPS 2023] Customize spatial layouts for conditional image synthesis... | | Experimental |
| 12 | InternRobotics/UniHSI<br>[ICLR 2024 Spotlight] Unified Human-Scene Interaction via Prompted Chain-of-Contacts | | Experimental |
| 13 | GradientSpaces/respace<br>Code for "ReSpace: Text-Driven 3D Indoor Scene Synthesis and Editing with... | | Experimental |
| 14 | YangLing0818/EditWorld<br>[ACM Multimedia 2025 Datasets Track] EditWorld: Simulating World Dynamics... | | Experimental |
| 15 | DAMO-NLP-SG/DiGIT<br>[NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling:... | | Experimental |
| 16 | LayoutLLM-T2I/LayoutLLM-T2I<br>Code for ACM MM'23 paper: LayoutLLM-T2I: Eliciting Layout Guidance from LLM... | | Experimental |