Multimodal Vision-Language Transformer Models
There are 89 multimodal vision-language models tracked. 2 score above 50 (the Established tier). The highest-rated is om-ai-lab/VLM-R1 at 60/100 with 5,864 stars. Only 1 of the top 10 is actively maintained.
Get all 89 projects as JSON (the example below requests the top 20; raise `limit` to fetch all 89):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=multimodal-vision-language&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
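For programmatic use, here is a minimal sketch in Python with `requests`, using the same endpoint and query parameters as the curl call above. The response schema isn't documented on this page, so the script makes no assumptions about field names and simply prints the raw payload for inspection:

```python
import requests

# Endpoint and query parameters taken from the curl example above.
url = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "transformers",
    "subcategory": "multimodal-vision-language",
    "limit": 89,  # raise the limit to cover all 89 tracked projects
}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()  # fail loudly on rate limits or server errors

# The JSON structure is not documented on this page, so inspect it
# before relying on any particular field names.
payload = resp.json()
print(payload)
```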
| # | Model | Description | Score | Tier |
|---|---|---|---|---|
| 1 | om-ai-lab/VLM-R1 | Solve Visual Understanding with Reinforced VLMs | 60 | Established |
| 2 | fixie-ai/ultravox | A fast multimodal LLM for real-time voice | – | Established |
| 3 | KimMeen/Time-LLM | [ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting... | – | Emerging |
| 4 | ictnlp/LLaMA-Omni | LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction... | – | Emerging |
| 5 | deepseek-ai/Janus | Janus-Series: Unified Multimodal Understanding and Generation Models | – | Emerging |
| 6 | bytedance/SALMONN | SALMONN family: A suite of advanced multi-modal LLMs | – | Emerging |
| 7 | NVlabs/OmniVinci | OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and... | – | Emerging |
| 8 | showlab/Show-o | [ICLR & NeurIPS 2025] Repository for the Show-o series, One Single Transformer... | – | Emerging |
| 9 | bytedance/video-SALMONN-2 | video-SALMONN 2 is a powerful audio-visual large language model (LLM) that... | – | Emerging |
| 10 | cruiseresearchgroup/SensorLLM | [EMNLP 2025] Official implementation of "SensorLLM: Aligning Large Language... | – | Emerging |
| 11 | THU-SI/Spatial-MLLM | [NeurIPS 2025] Official implementation of Spatial-MLLM: Boosting MLLM... | – | Emerging |
| 12 | JAMESYJL/ShapeLLM-Omni | [NeurIPS 2025 Spotlight] A Native Multimodal LLM for 3D Generation and Understanding | – | Emerging |
| 13 | deepglint/unicom | Large-Scale Visual Representation Model | – | Emerging |
| 14 | InternLM/CapRL | [ICLR 2026] An official implementation of "CapRL: Stimulating Dense Image... | – | Emerging |
| 15 | InnovatorLM/Innovator-VL | Fully Open-source Multimodal Language Models for Science Discovery | – | Emerging |
| 16 | MIV-XJTU/JanusVLN | [ICLR 2026] Official implementation for "JanusVLN: Decoupling Semantics and... | – | Emerging |
| 17 | nv-tlabs/LLaMA-Mesh | Unifying 3D Mesh Generation with Language Models | – | Emerging |
| 18 | tosiyuki/LLaVA-JP | LLaVA-JP is a Japanese VLM trained with the LLaVA method | – | Emerging |
| 19 | jshilong/GPT4RoI | (ECCVW 2025) GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | – | Emerging |
| 20 | kohjingyu/fromage | 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to... | – | Emerging |
| 21 | TIGER-AI-Lab/QuickVideo | Quick Long Video Understanding [TMLR 2025] | – | Emerging |
| 22 | JosefAlbers/VL-JEPA | VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) in MLX | – | Emerging |
| 23 | mlvlab/Flipped-VQA | Large Language Models are Temporal and Causal Reasoners for Video Question... | – | Emerging |
| 24 | antoyang/FrozenBiLM | [NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional... | – | Emerging |
| 25 | kohjingyu/gill | 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with... | – | Emerging |
| 26 | OpenGVLab/VisionLLM | VisionLLM Series | – | Emerging |
| 27 | VITA-MLLM/Freeze-Omni | ✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with... | – | Emerging |
| 28 | boheumd/MA-LMM | (CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term... | – | Emerging |
| 29 | Fsoft-AIC/Grasp-Anything | Dataset and code for the ICRA 2024 paper "Grasp-Anything: Large-scale Grasp... | – | Emerging |
| 30 | VPGTrans/VPGTrans | Code for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA,... | – | Emerging |
| 31 | FoundationVision/UniTok | [NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding | – | Emerging |
| 32 | TIGER-AI-Lab/Vamba | Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid... | – | Emerging |
| 33 | qizekun/ShapeLLM | [ECCV 2024] ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | – | Emerging |
| 34 | JinhaoLee/WCA | [ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in... | – | Emerging |
| 35 | iflytek/VLE | VLE: Vision-Language Encoder (a vision-language multimodal pre-trained model) | – | Emerging |
| 36 | sshh12/multi_token | Embed arbitrary modalities (images, audio, documents, etc.) into large... | – | Emerging |
| 37 | baaivision/EVE | EVE Series: Encoder-Free Vision-Language Models from BAAI | – | Emerging |
| 38 | zd11024/NaviLLM | [CVPR 2024] Code for the paper 'Towards Learning a Generalist Model for... | – | Emerging |
| 39 | joslefaure/HERMES | [ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes... | – | Emerging |
| 40 | ximinng/LLM4SVG | [CVPR 2025] Official implementation for "Empowering LLMs to Understand and... | – | Emerging |
| 41 | SALT-NLP/LLaVAR | Code/Data for the paper "LLaVAR: Enhanced Visual Instruction Tuning for... | – | Emerging |
| 42 | fangyuan-ksgk/Mini-LLaVA | A minimal implementation of a LLaVA-style VLM with interleaved image & text &... | – | Emerging |
| 43 | AntonGuan/TimeOmni-1 | [ICLR 2026] Official implementation of "🦙 TimeOmni-1: Incentivizing Complex... | – | Emerging |
| 44 | MME-Benchmarks/Video-MME | ✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark... | – | Experimental |
| 45 | vbdi/divprune | [CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large... | – | Experimental |
| 46 | umbertocappellazzo/Llama-AVSR | Official PyTorch implementation of "Large Language Models are Strong... | – | Experimental |
| 47 | ExplainableML/WaffleCLIP | Official repository for the ICCV 2023 paper "Waffling around for... | – | Experimental |
| 48 | ziqipang/LM4VisualEncoding | [ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are... | – | Experimental |
| 49 | Tanveer81/ReVisionLLM | The official implementation of ReVisionLLM: Recursive... | – | Experimental |
| 50 | ExplainableML/Vision_by_Language | [ICLR 2024] Official repository for "Vision-by-Language for Training-Free... | – | Experimental |
| 51 | Wangbiao2/R1-Track | R1-Track: Direct Application of MLLMs to Visual Object Tracking via... | – | Experimental |
| 52 | Hon-Wong/VoRA | [Fully open] [Encoder-free MLLM] Vision as LoRA | – | Experimental |
| 53 | kkahatapitiya/LangRepo | Code for our ACL 2025 paper "Language Repository for Long Video Understanding" | – | Experimental |
| 54 | TencentARC/ST-LLM | [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language... | – | Experimental |
| 55 | cokeshao/HoliTom | [NeurIPS 2025] HoliTom: Holistic Token Merging for Fast Video Large Language Models | – | Experimental |
| 56 | YunzeMan/Lexicon3D | [NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D... | – | Experimental |
| 57 | Wang-ML-Lab/multimodal-needle-in-a-haystack | [NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking... | – | Experimental |
| 58 | yuecao0119/MMFuser | The official implementation of the paper "MMFuser: Multimodal Multi-Layer... | – | Experimental |
| 59 | peacelwh/VT-FSL | [NeurIPS 2025] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | – | Experimental |
| 60 | xinyanghuang7/Basic-Visual-Language-Model | Build a simple, basic multimodal large model from scratch 🤖 | – | Experimental |
| 61 | HYUNJS/STTM | [ICCV 2025] Multi-Granular Spatio-Temporal Token Merging for Training-Free... | – | Experimental |
| 62 | MYMY-young/DelimScaling | [ICLR 2026] Official implementation of "Enhancing Multi-Image Understanding... | – | Experimental |
| 63 | baldoarbol/BodyShapeGPT | Fine-tuned LLMs generate accurate 3D human avatars from textual descriptions... | – | Experimental |
| 64 | ParadoxZW/LLaVA-UHD-Better | A bug-free and improved implementation of LLaVA-UHD, based on the code from... | – | Experimental |
| 65 | Jacksonlark/open-mllms | Open LLMs for multimodal | – | Experimental |
| 66 | mbzuai-oryx/Video-LLaVA | PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models | – | Experimental |
| 67 | Victorwz/MLM_Filter | Official implementation of our paper "Finetuned Multimodal Language Models... | – | Experimental |
| 68 | 2toinf/IVM | [NeurIPS 2024] The official implementation of "Instruction-Guided Visual Masking" | – | Experimental |
| 69 | zengqunzhao/Exp-CLIP | [WACV'25 Oral] Enhancing Zero-Shot Facial Expression Recognition by LLM... | – | Experimental |
| 70 | agentic-learning-ai-lab/lifelong-memory | Code for LifelongMemory: Leveraging LLMs for Answering Queries in Long-form... | – | Experimental |
| 71 | WisconsinAIVision/YoLLaVA | 🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant (NeurIPS 2024) | – | Experimental |
| 72 | InternLM/Visual-ERM | Official implementation of "Visual-ERM: Reward Modeling for Visual Equivalence" | – | Experimental |
| 73 | astra-vision/LatteCLIP | [WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts | – | Experimental |
| 74 | lizhaoliu-Lec/CG-VLM | The official repo for Contrastive Vision-Language Alignment Makes... | – | Experimental |
| 75 | SlytherinGe/RSTeller | Vision-Language Dataset for Remote Sensing | – | Experimental |
| 76 | UCSC-VLAA/Sight-Beyond-Text | [TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal... | – | Experimental |
| 77 | fatemehpesaran310/Text2Chart31 | Official PyTorch implementation of "Text2Chart31: Instruction Tuning for... | – | Experimental |
| 78 | ProGamerGov/VLM-Captioning-Tools | Python scripts for captioning images with VLMs | – | Experimental |
| 79 | kyegomez/AudioFlamingo | Implementation of the model "AudioFlamingo" from the paper "Audio Flamingo:... | – | Experimental |
| 80 | InternRobotics/Grounded_3D-LLM | Code & data for Grounded 3D-LLM with Referent Tokens | – | Experimental |
| 81 | showlab/VisInContext | Official implementation of Leveraging Visual Tokens for Extended Text... | – | Experimental |
| 82 | Letian2003/MM_INF | An efficient multi-modal instruction-following data synthesis tool and the... | – | Experimental |
| 83 | ChenDelong1999/polite-flamingo | 🦩 Official repository of the paper "Visual Instruction Tuning with Polite... | – | Experimental |
| 84 | Traffic-Alpha/VLMLight | Official implementation of VLMLight | – | Experimental |
| 85 | bagh2178/GC-VLN | [CoRL 2025] GC-VLN: Instruction as Graph Constraints for Training-free... | – | Experimental |
| 86 | claws-lab/projection-in-MLLMs | Code and data for the ACL 2024 paper on 'Cross-Modal Projection in Multimodal... | – | Experimental |
| 87 | ai4ce/LLM4VPR | Can multimodal LLMs help visual place recognition? | – | Experimental |
| 88 | nkkbr/ViCA | The official implementation of ViCA2 (Visuospatial Cognitive... | – | Experimental |
| 89 | OpenM3D/M3DBench | [ECCV 2024] M3DBench introduces a comprehensive 3D instruction-following... | – | Experimental |