Multimodal Vision-Language Transformer Models

There are 89 multimodal vision-language models tracked. Two score above 50 (the Established tier). The highest-rated is om-ai-lab/VLM-R1 at 60/100 with 5,864 stars. One of the top 10 is actively maintained.

Get the projects as JSON (the example below fetches the top 20; raise `limit` to retrieve all 89):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=multimodal-vision-language&limit=20"

The API is open to everyone: 100 requests/day with no key required, or get a free key for 1,000 requests/day.
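The tier labels appear to follow fixed score thresholds (50 and above is Established, 30–49 is Emerging, below 30 is Experimental, inferred from the scores listed on this page, not from documented API behavior). A minimal Python sketch of that mapping, handy when post-processing the JSON locally:

```python
# Tier thresholds inferred from the scores/tiers listed on this page:
# >= 50 -> Established, 30-49 -> Emerging, < 30 -> Experimental.
def tier_for_score(score: int) -> str:
    """Map a 0-100 quality score to its apparent tier label."""
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

def group_by_tier(projects):
    """Bucket (name, score) pairs by tier, e.g. for a local summary."""
    buckets = {"Established": [], "Emerging": [], "Experimental": []}
    for name, score in projects:
        buckets[tier_for_score(score)].append(name)
    return buckets

if __name__ == "__main__":
    sample = [("om-ai-lab/VLM-R1", 60), ("deepseek-ai/Janus", 47), ("nkkbr/ViCA", 12)]
    print(group_by_tier(sample))
```

For example, the three sample entries above land in Established, Emerging, and Experimental respectively, matching the table.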

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | om-ai-lab/VLM-R1 | Solve Visual Understanding with Reinforced VLMs | 60 | Established |
| 2 | fixie-ai/ultravox | A fast multimodal LLM for real-time voice | 51 | Established |
| 3 | KimMeen/Time-LLM | [ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting... | 49 | Emerging |
| 4 | ictnlp/LLaMA-Omni | LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction... | 47 | Emerging |
| 5 | deepseek-ai/Janus | Janus-Series: Unified Multimodal Understanding and Generation Models | 47 | Emerging |
| 6 | bytedance/SALMONN | SALMONN family: A suite of advanced multi-modal LLMs | 47 | Emerging |
| 7 | NVlabs/OmniVinci | OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and... | 45 | Emerging |
| 8 | showlab/Show-o | [ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer... | 44 | Emerging |
| 9 | bytedance/video-SALMONN-2 | video-SALMONN 2 is a powerful audio-visual large language model (LLM) that... | 44 | Emerging |
| 10 | cruiseresearchgroup/SensorLLM | [EMNLP 2025] Official implementation of "SensorLLM: Aligning Large Language... | 43 | Emerging |
| 11 | THU-SI/Spatial-MLLM | [NeurIPS 2025] Official implementation of Spatial-MLLM: Boosting MLLM... | 40 | Emerging |
| 12 | JAMESYJL/ShapeLLM-Omni | [NeurIPS 2025 Spotlight] A Native Multimodal LLM for 3D Generation and Understanding | 38 | Emerging |
| 13 | deepglint/unicom | Large-Scale Visual Representation Model | 38 | Emerging |
| 14 | InternLM/CapRL | [ICLR 2026] An official implementation of "CapRL: Stimulating Dense Image... | 37 | Emerging |
| 15 | InnovatorLM/Innovator-VL | Fully Open-source Multimodal Language Models for Science Discovery | 36 | Emerging |
| 16 | MIV-XJTU/JanusVLN | [ICLR 2026] Official implementation for "JanusVLN: Decoupling Semantics and... | 35 | Emerging |
| 17 | nv-tlabs/LLaMA-Mesh | Unifying 3D Mesh Generation with Language Models | 35 | Emerging |
| 18 | tosiyuki/LLaVA-JP | LLaVA-JP is a Japanese VLM trained with the LLaVA method | 35 | Emerging |
| 19 | jshilong/GPT4RoI | (ECCVW 2025) GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | 34 | Emerging |
| 20 | kohjingyu/fromage | 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to... | 34 | Emerging |
| 21 | TIGER-AI-Lab/QuickVideo | Quick Long Video Understanding [TMLR 2025] | 34 | Emerging |
| 22 | JosefAlbers/VL-JEPA | VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) in MLX | 34 | Emerging |
| 23 | mlvlab/Flipped-VQA | Large Language Models are Temporal and Causal Reasoners for Video Question... | 34 | Emerging |
| 24 | antoyang/FrozenBiLM | [NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional... | 34 | Emerging |
| 25 | kohjingyu/gill | 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with... | 34 | Emerging |
| 26 | OpenGVLab/VisionLLM | VisionLLM Series | 34 | Emerging |
| 27 | VITA-MLLM/Freeze-Omni | ✨✨ Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with... | 34 | Emerging |
| 28 | boheumd/MA-LMM | (CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term... | 33 | Emerging |
| 29 | Fsoft-AIC/Grasp-Anything | Dataset and code for the ICRA 2024 paper "Grasp-Anything: Large-scale Grasp... | 33 | Emerging |
| 30 | VPGTrans/VPGTrans | Code for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA,... | 33 | Emerging |
| 31 | FoundationVision/UniTok | [NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding | 33 | Emerging |
| 32 | TIGER-AI-Lab/Vamba | Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid... | 32 | Emerging |
| 33 | qizekun/ShapeLLM | [ECCV 2024] ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | 32 | Emerging |
| 34 | JinhaoLee/WCA | [ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in... | 31 | Emerging |
| 35 | iflytek/VLE | VLE: Vision-Language Encoder (a vision-language multimodal pre-trained model) | 31 | Emerging |
| 36 | sshh12/multi_token | Embed arbitrary modalities (images, audio, documents, etc.) into large... | 31 | Emerging |
| 37 | baaivision/EVE | EVE Series: Encoder-Free Vision-Language Models from BAAI | 31 | Emerging |
| 38 | zd11024/NaviLLM | [CVPR 2024] Code for the paper "Towards Learning a Generalist Model for... | 30 | Emerging |
| 39 | joslefaure/HERMES | [ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes... | 30 | Emerging |
| 40 | ximinng/LLM4SVG | [CVPR 2025] Official implementation for "Empowering LLMs to Understand and... | 30 | Emerging |
| 41 | SALT-NLP/LLaVAR | Code/data for the paper "LLaVAR: Enhanced Visual Instruction Tuning for... | 30 | Emerging |
| 42 | fangyuan-ksgk/Mini-LLaVA | A minimal implementation of LLaVA-style VLM with interleaved image & text &... | 30 | Emerging |
| 43 | AntonGuan/TimeOmni-1 | [ICLR 2026] Official implementation of "🦙 TimeOmni-1: Incentivizing Complex... | 30 | Emerging |
| 44 | MME-Benchmarks/Video-MME | ✨✨ [CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark... | 29 | Experimental |
| 45 | vbdi/divprune | [CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large... | 28 | Experimental |
| 46 | umbertocappellazzo/Llama-AVSR | Official PyTorch implementation of "Large Language Models are Strong... | 28 | Experimental |
| 47 | ExplainableML/WaffleCLIP | Official repository for the ICCV 2023 paper "Waffling around for... | 28 | Experimental |
| 48 | ziqipang/LM4VisualEncoding | [ICLR 2024 Spotlight] "Frozen Transformers in Language Models are... | 28 | Experimental |
| 49 | Tanveer81/ReVisionLLM | The official implementation of ReVisionLLM: Recursive... | 28 | Experimental |
| 50 | ExplainableML/Vision_by_Language | [ICLR 2024] Official repository for "Vision-by-Language for Training-Free... | 28 | Experimental |
| 51 | Wangbiao2/R1-Track | R1-Track: Direct Application of MLLMs to Visual Object Tracking via... | 28 | Experimental |
| 52 | Hon-Wong/VoRA | [Fully open] [Encoder-free MLLM] Vision as LoRA | 27 | Experimental |
| 53 | kkahatapitiya/LangRepo | Code for the ACL 2025 paper "Language Repository for Long Video Understanding" | 27 | Experimental |
| 54 | TencentARC/ST-LLM | [ECCV 2024 🔥] Official implementation of the paper "ST-LLM: Large Language... | 27 | Experimental |
| 55 | cokeshao/HoliTom | [NeurIPS 2025] HoliTom: Holistic Token Merging for Fast Video Large Language Models | 27 | Experimental |
| 56 | YunzeMan/Lexicon3D | [NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D... | 26 | Experimental |
| 57 | Wang-ML-Lab/multimodal-needle-in-a-haystack | [NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking... | 26 | Experimental |
| 58 | yuecao0119/MMFuser | The official implementation of the paper "MMFuser: Multimodal Multi-Layer... | 26 | Experimental |
| 59 | peacelwh/VT-FSL | [NeurIPS 2025] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | 26 | Experimental |
| 60 | xinyanghuang7/Basic-Visual-Language-Model | Build a simple basic multimodal large model from scratch 🤖 | 26 | Experimental |
| 61 | HYUNJS/STTM | [ICCV 2025] Multi-Granular Spatio-Temporal Token Merging for Training-Free... | 24 | Experimental |
| 62 | MYMY-young/DelimScaling | [ICLR 2026] Official implementation of "Enhancing Multi-Image Understanding... | 24 | Experimental |
| 63 | baldoarbol/BodyShapeGPT | Fine-tuned LLMs generate accurate 3D human avatars from textual descriptions... | 24 | Experimental |
| 64 | ParadoxZW/LLaVA-UHD-Better | A bug-free and improved implementation of LLaVA-UHD, based on the code from... | 24 | Experimental |
| 65 | Jacksonlark/open-mllms | Open LLMs for multimodal | 23 | Experimental |
| 66 | mbzuai-oryx/Video-LLaVA | PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models | 23 | Experimental |
| 67 | Victorwz/MLM_Filter | Official implementation of the paper "Finetuned Multimodal Language Models... | 22 | Experimental |
| 68 | 2toinf/IVM | [NeurIPS 2024] The official implementation of "Instruction-Guided Visual Masking" | 22 | Experimental |
| 69 | zengqunzhao/Exp-CLIP | [WACV'25 Oral] Enhancing Zero-Shot Facial Expression Recognition by LLM... | 22 | Experimental |
| 70 | agentic-learning-ai-lab/lifelong-memory | Code for LifelongMemory: Leveraging LLMs for Answering Queries in Long-form... | 22 | Experimental |
| 71 | WisconsinAIVision/YoLLaVA | 🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant (NeurIPS 2024) | 22 | Experimental |
| 72 | InternLM/Visual-ERM | Official implementation of "Visual-ERM: Reward Modeling for Visual Equivalence" | 21 | Experimental |
| 73 | astra-vision/LatteCLIP | [WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts | 21 | Experimental |
| 74 | lizhaoliu-Lec/CG-VLM | The official repo for "Contrastive Vision-Language Alignment Makes... | 20 | Experimental |
| 75 | SlytherinGe/RSTeller | Vision-Language Dataset for Remote Sensing | 20 | Experimental |
| 76 | UCSC-VLAA/Sight-Beyond-Text | [TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal... | 20 | Experimental |
| 77 | fatemehpesaran310/Text2Chart31 | Official PyTorch implementation of "Text2Chart31: Instruction Tuning for... | 19 | Experimental |
| 78 | ProGamerGov/VLM-Captioning-Tools | Python scripts for captioning images with VLMs | 19 | Experimental |
| 79 | kyegomez/AudioFlamingo | Implementation of the model "AudioFlamingo" from the paper "Audio Flamingo:... | 19 | Experimental |
| 80 | InternRobotics/Grounded_3D-LLM | Code & data for Grounded 3D-LLM with Referent Tokens | 17 | Experimental |
| 81 | showlab/VisInContext | Official implementation of "Leveraging Visual Tokens for Extended Text... | 17 | Experimental |
| 82 | Letian2003/MM_INF | An efficient multi-modal instruction-following data synthesis tool and the... | 16 | Experimental |
| 83 | ChenDelong1999/polite-flamingo | 🦩 Official repository of the paper "Visual Instruction Tuning with Polite... | 15 | Experimental |
| 84 | Traffic-Alpha/VLMLight | Official implementation of VLMLight | 14 | Experimental |
| 85 | bagh2178/GC-VLN | [CoRL 2025] GC-VLN: Instruction as Graph Constraints for Training-free... | 14 | Experimental |
| 86 | claws-lab/projection-in-MLLMs | Code and data for the ACL 2024 paper "Cross-Modal Projection in Multimodal... | 12 | Experimental |
| 87 | ai4ce/LLM4VPR | Can multimodal LLMs help visual place recognition? | 12 | Experimental |
| 88 | nkkbr/ViCA | The official implementation of ViCA2 (Visuospatial Cognitive... | 12 | Experimental |
| 89 | OpenM3D/M3DBench | [ECCV 2024] M3DBench introduces a comprehensive 3D instruction-following... | 12 | Experimental |

Comparisons in this category