Multimodal Vision Language Diffusion Models

This directory tracks 16 multimodal vision-language diffusion projects. Only one scores above 50 (the Established tier): zai-org/CogVideo, the highest-rated at 52/100 with 12,515 stars.

Get all 16 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=diffusion&subcategory=multimodal-vision-language&limit=20"

The API is open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
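For scripted use, the same request can be sketched in Python with only the standard library. The helper names `build_url` and `fetch_quality` are hypothetical, and the shape of the JSON response is not described on this page, so the sketch returns the decoded payload untouched rather than assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the same query string the curl command uses (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_quality(url: str, timeout: float = 30.0):
    """GET the URL and decode the body as JSON (hypothetical helper).

    The payload structure is an unknown here, so it is returned as-is
    for the caller to inspect.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality(build_url("diffusion", "multimodal-vision-language"))` performs the request. Each call counts against the daily request limit, so caching the result locally is a reasonable choice when iterating.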

#	Model (repository and description)	Score (/100)	Tier	Stars	Language
1	zai-org/CogVideo text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)	52	Established	12,515	Python
2	zhaorw02/DeepMesh [ICCV 2025] Official code of DeepMesh: Auto-Regressive Artist-mesh Creation...	38	Emerging	700	Python
3	YangLing0818/RPG-DiffusionMaster [ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and...	36	Emerging	1,843	Jupyter Notebook
4	thu-nics/FrameFusion [ICCV'25] The official code of paper "Combining Similarity and Importance...	31	Emerging	71	Python
5	Yushi-Hu/tifa TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with...	30	Emerging	182	Python
6	OpenMeshLab/MeshXL [NeurIPS 2024] MeshXL: Neural Coordinate Field for Generative 3D Foundation...	29	Experimental	328	Python
7	ByteVisionLab/TokenFlow [CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for...	29	Experimental	449	Python
8	j-min/DSG Davidsonian Scene Graph (DSG) for Text-to-Image Evaluation (ICLR 2024)	27	Experimental	105	Jupyter Notebook
9	YangLing0818/VideoTetris [NeurIPS 2024] VideoTetris: Towards Compositional Text-To-Video Generation	26	Experimental	240	Python
10	jqin4749/MindVideo Official code base for MinD-Video	24	Experimental	390	Python
11	showlab/VisorGPT [NeurIPS 2023] Customize spatial layouts for conditional image synthesis...	24	Experimental	137	Python
12	InternRobotics/UniHSI [ICLR 2024 Spotlight] Unified Human-Scene Interaction via Prompted Chain-of-Contacts	24	Experimental	245	Python
13	GradientSpaces/respace Code for "ReSpace: Text-Driven 3D Indoor Scene Synthesis and Editing with...	23	Experimental	63	Python
14	YangLing0818/EditWorld [ACM Multimedia 2025 Datasets Track] EditWorld: Simulating World Dynamics...	21	Experimental	140	Python
15	DAMO-NLP-SG/DiGIT [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling:...	21	Experimental	79	Python
16	LayoutLLM-T2I/LayoutLLM-T2I Code for ACM MM'23 paper: LayoutLLM-T2I: Eliciting Layout Guidance from LLM...	12	Experimental	51	Python
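If the table is copied as text rather than fetched as JSON, its rows can be split into records. The sketch below assumes the columns are tab-separated, as they appear in this listing, and that the repository name is the first space-delimited token of the Model column. The `Project` type and `parse_row` function are hypothetical names, and only three rows are reproduced as sample input.

```python
from collections import Counter
from typing import NamedTuple


class Project(NamedTuple):
    rank: int
    repo: str
    description: str
    score: int
    tier: str
    stars: int
    language: str


def parse_row(line: str) -> Project:
    """Split one tab-separated table row into typed fields (assumed layout)."""
    rank, model, score, tier, stars, language = line.rstrip("\n").split("\t")
    # The Model column holds "owner/repo description"; split at the first space.
    repo, _, description = model.partition(" ")
    return Project(
        int(rank), repo, description, int(score), tier,
        int(stars.replace(",", "")),  # "12,515" -> 12515
        language,
    )


# Three rows copied from the table, with tabs written as \t.
SAMPLE_ROWS = [
    "1\tzai-org/CogVideo text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)\t52\tEstablished\t12,515\tPython",
    "3\tYangLing0818/RPG-DiffusionMaster [ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and...\t36\tEmerging\t1,843\tJupyter Notebook",
    "16\tLayoutLLM-T2I/LayoutLLM-T2I Code for ACM MM'23 paper: LayoutLLM-T2I: Eliciting Layout Guidance from LLM...\t12\tExperimental\t51\tPython",
]

projects = [parse_row(row) for row in SAMPLE_ROWS]
tier_counts = Counter(p.tier for p in projects)
```

Running the same parser over all 16 rows would give a per-tier tally that can be checked against the summary at the top of the page.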