Multimodal Vision-Language Transformer Models
There are 89 multimodal vision-language models tracked. 2 score above 50 (the Established tier). The highest-rated is om-ai-lab/VLM-R1 at 60/100 with 5,864 stars. Only 1 of the top 10 is actively maintained.
Get all 89 projects as JSON (the example below requests the top 20; raise `limit` to fetch all 89):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=multimodal-vision-language&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
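For programmatic use, here is a minimal sketch in Python with `requests`, using the same endpoint and query parameters as the curl call above. The response schema isn't documented on this page, so the script makes no assumptions about field names and simply prints the raw payload for inspection:

```python
import requests

# Endpoint and query parameters taken from the curl example above.
url = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "transformers",
    "subcategory": "multimodal-vision-language",
    "limit": 89,  # raise the limit to cover all 89 tracked projects
}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()  # fail loudly on rate limits or server errors

# The JSON structure is not documented on this page, so inspect it
# before relying on any particular field names.
payload = resp.json()
print(payload)
```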
| # | Model | Description | Score | Tier |
|---|---|---|---|---|
| 1 | om-ai-lab/VLM-R1 | Solve Visual Understanding with Reinforced VLMs | 60 | Established |
| 2 | fixie-ai/ultravox | A fast multimodal LLM for real-time voice | – | Established |
| 3 | KimMeen/Time-LLM | [ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting... | – | Emerging |
| 4 | ictnlp/LLaMA-Omni | LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction... | – | Emerging |
| 5 | deepseek-ai/Janus | Janus-Series: Unified Multimodal Understanding and Generation Models | – | Emerging |
| 6 | bytedance/SALMONN | SALMONN family: A suite of advanced multi-modal LLMs | – | Emerging |
| 7 | NVlabs/OmniVinci | OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and... | – | Emerging |
| 8 | showlab/Show-o | [ICLR & NeurIPS 2025] Repository for the Show-o series, One Single Transformer... | – | Emerging |
| 9 | bytedance/video-SALMONN-2 | video-SALMONN 2 is a powerful audio-visual large language model (LLM) that... | – | Emerging |
| 10 | cruiseresearchgroup/SensorLLM | [EMNLP 2025] Official implementation of "SensorLLM: Aligning Large Language... | – | Emerging |
| 11 | THU-SI/Spatial-MLLM | [NeurIPS 2025] Official implementation of Spatial-MLLM: Boosting MLLM... | – | Emerging |
| 12 | JAMESYJL/ShapeLLM-Omni | [NeurIPS 2025 Spotlight] A Native Multimodal LLM for 3D Generation and Understanding | – | Emerging |
| 13 | deepglint/unicom | Large-Scale Visual Representation Model | – | Emerging |
| 14 | InternLM/CapRL | [ICLR 2026] An official implementation of "CapRL: Stimulating Dense Image... | – | Emerging |
| 15 | InnovatorLM/Innovator-VL | Fully Open-source Multimodal Language Models for Science Discovery | – | Emerging |
| 16 | MIV-XJTU/JanusVLN | [ICLR 2026] Official implementation for "JanusVLN: Decoupling Semantics and... | – | Emerging |
| 17 | nv-tlabs/LLaMA-Mesh | Unifying 3D Mesh Generation with Language Models | – | Emerging |
| 18 | tosiyuki/LLaVA-JP | LLaVA-JP is a Japanese VLM trained with the LLaVA method | – | Emerging |
| 19 | jshilong/GPT4RoI | (ECCVW 2025) GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | – | Emerging |
| 20 | kohjingyu/fromage | 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to... | – | Emerging |
| 21 | TIGER-AI-Lab/QuickVideo | Quick Long Video Understanding [TMLR 2025] | – | Emerging |
| 22 | JosefAlbers/VL-JEPA | VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) in MLX | – | Emerging |
| 23 | mlvlab/Flipped-VQA | Large Language Models are Temporal and Causal Reasoners for Video Question... | – | Emerging |
| 24 | antoyang/FrozenBiLM | [NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional... | – | Emerging |
| 25 | kohjingyu/gill | 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with... | – | Emerging |
| 26 | OpenGVLab/VisionLLM | VisionLLM Series | – | Emerging |
| 27 | VITA-MLLM/Freeze-Omni | ✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with... | – | Emerging |
| 28 | boheumd/MA-LMM | (CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term... | – | Emerging |
| 29 | Fsoft-AIC/Grasp-Anything | Dataset and code for the ICRA 2024 paper "Grasp-Anything: Large-scale Grasp... | – | Emerging |
| 30 | VPGTrans/VPGTrans | Code for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA,... | – | Emerging |
| 31 | FoundationVision/UniTok | [NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding | – | Emerging |
| 32 | TIGER-AI-Lab/Vamba | Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid... | – | Emerging |
| 33 | qizekun/ShapeLLM | [ECCV 2024] ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | – | Emerging |
| 34 | JinhaoLee/WCA | [ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in... | – | Emerging |
| 35 | iflytek/VLE | VLE: Vision-Language Encoder (a vision-language multimodal pre-trained model) | – | Emerging |
| 36 | sshh12/multi_token | Embed arbitrary modalities (images, audio, documents, etc.) into large... | – | Emerging |
| 37 | baaivision/EVE | EVE Series: Encoder-Free Vision-Language Models from BAAI | – | Emerging |
| 38 | zd11024/NaviLLM | [CVPR 2024] Code for the paper 'Towards Learning a Generalist Model for... | – | Emerging |
| 39 | joslefaure/HERMES | [ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes... | – | Emerging |
| 40 | ximinng/LLM4SVG | [CVPR 2025] Official implementation for "Empowering LLMs to Understand and... | – | Emerging |
| 41 | SALT-NLP/LLaVAR | Code/Data for the paper "LLaVAR: Enhanced Visual Instruction Tuning for... | – | Emerging |
| 42 | fangyuan-ksgk/Mini-LLaVA | A minimal implementation of a LLaVA-style VLM with interleaved image & text &... | – | Emerging |
| 43 | AntonGuan/TimeOmni-1 | [ICLR 2026] Official implementation of "🦙 TimeOmni-1: Incentivizing Complex... | – | Emerging |
| 44 | MME-Benchmarks/Video-MME | ✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark... | – | Experimental |
| 45 | vbdi/divprune | [CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large... | – | Experimental |
| 46 | umbertocappellazzo/Llama-AVSR | Official PyTorch implementation of "Large Language Models are Strong... | – | Experimental |
| 47 | ExplainableML/WaffleCLIP | Official repository for the ICCV 2023 paper "Waffling around for... | – | Experimental |
| 48 | ziqipang/LM4VisualEncoding | [ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are... | – | Experimental |
| 49 | Tanveer81/ReVisionLLM | The official implementation of ReVisionLLM: Recursive... | – | Experimental |
| 50 | ExplainableML/Vision_by_Language | [ICLR 2024] Official repository for "Vision-by-Language for Training-Free... | – | Experimental |
| 51 | Wangbiao2/R1-Track | R1-Track: Direct Application of MLLMs to Visual Object Tracking via... | – | Experimental |
| 52 | Hon-Wong/VoRA | [Fully open] [Encoder-free MLLM] Vision as LoRA | – | Experimental |
| 53 | kkahatapitiya/LangRepo | Code for our ACL 2025 paper "Language Repository for Long Video Understanding" | – | Experimental |
| 54 | TencentARC/ST-LLM | [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language... | – | Experimental |
| 55 | cokeshao/HoliTom | [NeurIPS 2025] HoliTom: Holistic Token Merging for Fast Video Large Language Models | – | Experimental |
| 56 | YunzeMan/Lexicon3D | [NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D... | – | Experimental |
| 57 | Wang-ML-Lab/multimodal-needle-in-a-haystack | [NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking... | – | Experimental |
| 58 | yuecao0119/MMFuser | The official implementation of the paper "MMFuser: Multimodal Multi-Layer... | – | Experimental |
| 59 | peacelwh/VT-FSL | [NeurIPS 2025] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | – | Experimental |
| 60 | xinyanghuang7/Basic-Visual-Language-Model | Build a simple, basic multimodal large model from scratch 🤖 | – | Experimental |
| 61 | HYUNJS/STTM | [ICCV 2025] Multi-Granular Spatio-Temporal Token Merging for Training-Free... | – | Experimental |
| 62 | MYMY-young/DelimScaling | [ICLR 2026] Official implementation of "Enhancing Multi-Image Understanding... | – | Experimental |
| 63 | baldoarbol/BodyShapeGPT | Fine-tuned LLMs generate accurate 3D human avatars from textual descriptions... | – | Experimental |
| 64 | ParadoxZW/LLaVA-UHD-Better | A bug-free and improved implementation of LLaVA-UHD, based on the code from... | – | Experimental |
| 65 | Jacksonlark/open-mllms | Open LLMs for multimodal | – | Experimental |
| 66 | mbzuai-oryx/Video-LLaVA | PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models | – | Experimental |
| 67 | Victorwz/MLM_Filter | Official implementation of our paper "Finetuned Multimodal Language Models... | – | Experimental |
| 68 | 2toinf/IVM | [NeurIPS 2024] The official implementation of "Instruction-Guided Visual Masking" | – | Experimental |
| 69 | zengqunzhao/Exp-CLIP | [WACV'25 Oral] Enhancing Zero-Shot Facial Expression Recognition by LLM... | – | Experimental |
| 70 | agentic-learning-ai-lab/lifelong-memory | Code for LifelongMemory: Leveraging LLMs for Answering Queries in Long-form... | – | Experimental |
| 71 | WisconsinAIVision/YoLLaVA | 🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant (NeurIPS 2024) | – | Experimental |
| 72 | InternLM/Visual-ERM | Official implementation of "Visual-ERM: Reward Modeling for Visual Equivalence" | – | Experimental |
| 73 | astra-vision/LatteCLIP | [WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts | – | Experimental |
| 74 | lizhaoliu-Lec/CG-VLM | The official repo for Contrastive Vision-Language Alignment Makes... | – | Experimental |
| 75 | SlytherinGe/RSTeller | Vision-Language Dataset for Remote Sensing | – | Experimental |
| 76 | UCSC-VLAA/Sight-Beyond-Text | [TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal... | – | Experimental |
| 77 | fatemehpesaran310/Text2Chart31 | Official PyTorch implementation of "Text2Chart31: Instruction Tuning for... | – | Experimental |
| 78 | ProGamerGov/VLM-Captioning-Tools | Python scripts for captioning images with VLMs | – | Experimental |
| 79 | kyegomez/AudioFlamingo | Implementation of the model "AudioFlamingo" from the paper "Audio Flamingo:... | – | Experimental |
| 80 | InternRobotics/Grounded_3D-LLM | Code & data for Grounded 3D-LLM with Referent Tokens | – | Experimental |
| 81 | showlab/VisInContext | Official implementation of Leveraging Visual Tokens for Extended Text... | – | Experimental |
| 82 | Letian2003/MM_INF | An efficient multi-modal instruction-following data synthesis tool and the... | – | Experimental |
| 83 | ChenDelong1999/polite-flamingo | 🦩 Official repository of the paper "Visual Instruction Tuning with Polite... | – | Experimental |
| 84 | Traffic-Alpha/VLMLight | Official implementation of VLMLight | – | Experimental |
| 85 | bagh2178/GC-VLN | [CoRL 2025] GC-VLN: Instruction as Graph Constraints for Training-free... | – | Experimental |
| 86 | claws-lab/projection-in-MLLMs | Code and data for the ACL 2024 paper on 'Cross-Modal Projection in Multimodal... | – | Experimental |
| 87 | ai4ce/LLM4VPR | Can multimodal LLMs help visual place recognition? | – | Experimental |
| 88 | nkkbr/ViCA | The official implementation of ViCA2 (Visuospatial Cognitive... | – | Experimental |
| 89 | OpenM3D/M3DBench | [ECCV 2024] M3DBench introduces a comprehensive 3D instruction-following... | – | Experimental |