Vision-Language Instruction-Tuning Transformer Models
Tools for training and fine-tuning multimodal models that combine vision and language through instruction-based learning. Includes efficient architectures, video understanding, and grounded vision-language models. Does NOT include general vision transformers, image captioning without instruction tuning, or non-multimodal LLM fine-tuning.
There are 34 vision-language instruction-tuning models tracked; one scores above 50 (the established tier). The highest-rated is TinyLLaVA/TinyLLaVA_Factory at 54/100 with 962 stars, and 1 of the top 10 is actively maintained.
Get the tracked projects as JSON (the query below returns the top 20; raise `limit` to retrieve all 34):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=vision-language-instruction-tuning&limit=20"
Open to everyone: 100 requests/day with no key needed, or get a free key for 1,000 requests/day.
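Once fetched, the JSON is easy to post-process locally. A minimal Python sketch, assuming a hypothetical response with a `projects` array carrying `name`, `score`, and `tier` fields (the endpoint's real field names and values may differ; the sample below uses placeholder data), that tallies projects per quality tier:

```python
import json
from collections import Counter

# Hypothetical response shape for illustration only; the actual payload
# returned by the /datasets/quality endpoint may use different field names.
# Scores other than the top entry's 54 are placeholders.
sample_response = json.loads("""
{
  "projects": [
    {"name": "TinyLLaVA/TinyLLaVA_Factory", "score": 54, "tier": "Established"},
    {"name": "zjunlp/EasyInstruct", "score": 40, "tier": "Emerging"},
    {"name": "BUAADreamer/Chinese-LLaVA-Med", "score": 20, "tier": "Experimental"}
  ]
}
""")

def count_by_tier(projects):
    """Tally how many projects fall into each quality tier."""
    return Counter(p["tier"] for p in projects)

tiers = count_by_tier(sample_response["projects"])
print(dict(tiers))  # one project per tier in this sample
```

The same pattern extends to filtering (e.g. keeping only `Established` entries) or sorting by `score` before rendering a table.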
| # | Model | Score | Tier |
|---|---|---|---|
| 1 | TinyLLaVA/TinyLLaVA_Factory - A Framework of Small-scale Large Multimodal Models | 54 | Established |
| 2 | zjunlp/EasyInstruct - [ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs. | | Emerging |
| 3 | haotian-liu/LLaVA - [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V... | | Emerging |
| 4 | DAMO-NLP-SG/Video-LLaMA - [EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language... | | Emerging |
| 5 | Instruction-Tuning-with-GPT-4/GPT-4-LLM - Instruction Tuning with GPT-4 | | Emerging |
| 6 | rese1f/MovieChat - [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | | Emerging |
| 7 | NVlabs/Eagle - Eagle: Frontier Vision-Language Models with Data-Centric Strategies | | Emerging |
| 8 | open-mmlab/Multimodal-GPT - Multimodal-GPT | | Emerging |
| 9 | X-PLUG/mPLUG-Owl - mPLUG-Owl: The Powerful Multi-modal Large Language Model Family | | Emerging |
| 10 | AdrianBZG/llama-multimodal-vqa - Multimodal Instruction Tuning for Llama 3 | | Emerging |
| 11 | FoundationVision/Groma - [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual... | | Emerging |
| 12 | ictnlp/LLaVA-Mini - LLaVA-Mini is a unified large multimodal model (LMM) that can support the... | | Emerging |
| 13 | shikiw/OPERA - [CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large... | | Emerging |
| 14 | aihao2000/DPN-LLaVA - Arxiv 25: Dynamic Pyramid Network for Efficient Multimodal Large Language Model | | Emerging |
| 15 | mlpc-ucsd/BLIVA - (AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich... | | Emerging |
| 16 | WisconsinAIVision/ViP-LLaVA - [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary... | | Emerging |
| 17 | madibabaiasl/MobileRobotGPT4LLaMA2024 - Deployment of Large Language Models to Control Mobile Robots at the Edge | | Emerging |
| 18 | Yxxxb/VoCo-LLaMA - [CVPR'2025] VoCo-LLaMA: This repo is the official implementation of... | | Emerging |
| 19 | mu-cai/matryoshka-mm - Matryoshka Multimodal Models | | Emerging |
| 20 | BUAADreamer/Chinese-LLaVA-Med - Large Chinese Language-and-Vision Assistant for BioMedicine (Chinese medical multimodal large model) | | Experimental |
| 21 | InternRobotics/PointLLM - [ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large... | | Experimental |
| 22 | OpenBMB/VisCPM - [ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat... | | Experimental |
| 23 | kyegomez/Qwen-VL - My personal implementation of the model from "Qwen-VL: A Frontier Large... | | Experimental |
| 24 | xiaoachen98/Open-LLaVA-NeXT - An open-source implementation for training LLaVA-NeXT. | | Experimental |
| 25 | shikiw/Modality-Integration-Rate - [ICCV 2025] The official code of the paper "Deciphering Cross-Modal... | | Experimental |
| 26 | visresearch/LLaVA-STF - The official implementation of "Learning Compact Vision Tokens for Efficient... | | Experimental |
| 27 | Honee-W/U-SAM - Official repository for U-SAM (Interspeech 2025) | | Experimental |
| 28 | nobel-postech/mirror - Code and data for "MIRROR: Multimodal Cognitive Reframing Therapy for... | | Experimental |
| 29 | Gary3410/TaPA - [arXiv 2023] Embodied Task Planning with Large Language Models | | Experimental |
| 30 | AdrienneDeganutti/DANTE-AD - "DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description"... | | Experimental |
| 31 | rese1f/STEVE - [ECCV 2024] STEVE in Minecraft is for See and Think: Embodied Agent in... | | Experimental |
| 32 | 0606zt/PanoLlama - [ICCV 2025 Highlight] Panorama Generation as a Next-Token Prediction Task. | | Experimental |
| 33 | ashleykleynhans/llava-docker - Docker image for LLaVA: Large Language and Vision Assistant | | Experimental |
| 34 | tvtung2902/poem_generator - AI-generated Vietnamese Luc Bat poetry using fine-tuned VinaLLaMA model. | | Experimental |