Vision Language Instruction Tuning Transformer Models

Tools for training and fine-tuning multimodal models that combine vision and language through instruction-based learning. Includes efficient architectures, video understanding, and grounded vision-language models. Does NOT include general vision transformers, image captioning without instruction tuning, or non-multimodal LLM fine-tuning.

34 vision-language instruction-tuning projects are tracked. One scores above 50 (the established tier). The highest-rated is TinyLLaVA/TinyLLaVA_Factory at 54/100 with 962 stars. One of the top 10 is actively maintained.

Get the projects as JSON (the query below returns the top 20; raise `limit` to fetch all 34):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=vision-language-instruction-tuning&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
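The response schema is not documented on this page, so here is a minimal Python sketch of working with the data once fetched. It assumes (hypothetically) that the endpoint returns a JSON array of project objects with `name` and `score` fields; a local payload stands in for the live API call:

```python
import json

# Hypothetical payload mirroring the top of the ranking below.
# The field names "name" and "score" are assumptions, not a
# documented response schema.
payload = json.loads("""
[
  {"name": "TinyLLaVA/TinyLLaVA_Factory", "score": 54},
  {"name": "zjunlp/EasyInstruct", "score": 48},
  {"name": "haotian-liu/LLaVA", "score": 47}
]
""")

# "Established" tier: score above 50 (per the summary above).
established = [p["name"] for p in payload if p["score"] > 50]
print(established)
```

Swapping the literal payload for `json.load()` over the curl output (or an HTTP client response) gives the same filtering against the live dataset.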

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | TinyLLaVA/TinyLLaVA_Factory | A Framework of Small-scale Large Multimodal Models | 54 | Established |
| 2 | zjunlp/EasyInstruct | [ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs. | 48 | Emerging |
| 3 | haotian-liu/LLaVA | [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V... | 47 | Emerging |
| 4 | DAMO-NLP-SG/Video-LLaMA | [EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language... | 46 | Emerging |
| 5 | Instruction-Tuning-with-GPT-4/GPT-4-LLM | Instruction Tuning with GPT-4 | 45 | Emerging |
| 6 | rese1f/MovieChat | [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 42 | Emerging |
| 7 | NVlabs/Eagle | Eagle: Frontier Vision-Language Models with Data-Centric Strategies | 40 | Emerging |
| 8 | open-mmlab/Multimodal-GPT | Multimodal-GPT | 38 | Emerging |
| 9 | X-PLUG/mPLUG-Owl | mPLUG-Owl: The Powerful Multi-modal Large Language Model Family | 38 | Emerging |
| 10 | AdrianBZG/llama-multimodal-vqa | Multimodal Instruction Tuning for Llama 3 | 34 | Emerging |
| 11 | FoundationVision/Groma | [ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual... | 34 | Emerging |
| 12 | ictnlp/LLaVA-Mini | LLaVA-Mini is a unified large multimodal model (LMM) that can support the... | 34 | Emerging |
| 13 | shikiw/OPERA | [CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large... | 34 | Emerging |
| 14 | aihao2000/DPN-LLaVA | [arXiv 2025] Dynamic Pyramid Network for Efficient Multimodal Large Language Model | 34 | Emerging |
| 15 | mlpc-ucsd/BLIVA | [AAAI 2024] BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich... | 34 | Emerging |
| 16 | WisconsinAIVision/ViP-LLaVA | [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary... | 31 | Emerging |
| 17 | madibabaiasl/MobileRobotGPT4LLaMA2024 | Deployment of Large Language Models to Control Mobile Robots at the Edge | 31 | Emerging |
| 18 | Yxxxb/VoCo-LLaMA | [CVPR 2025] VoCo-LLaMA: This repo is the official implementation of... | 31 | Emerging |
| 19 | mu-cai/matryoshka-mm | Matryoshka Multimodal Models | 30 | Emerging |
| 20 | BUAADreamer/Chinese-LLaVA-Med | Large Chinese Language-and-Vision Assistant for BioMedicine (Chinese medical multimodal large model) | 29 | Experimental |
| 21 | InternRobotics/PointLLM | [ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large... | 28 | Experimental |
| 22 | OpenBMB/VisCPM | [ICLR'24 Spotlight] Chinese and English Multimodal Large Model Series (Chat... | 28 | Experimental |
| 23 | kyegomez/Qwen-VL | My personal implementation of the model from "Qwen-VL: A Frontier Large... | 25 | Experimental |
| 24 | xiaoachen98/Open-LLaVA-NeXT | An open-source implementation for training LLaVA-NeXT. | 24 | Experimental |
| 25 | shikiw/Modality-Integration-Rate | [ICCV 2025] The official code of the paper "Deciphering Cross-Modal... | 24 | Experimental |
| 26 | visresearch/LLaVA-STF | The official implementation of "Learning Compact Vision Tokens for Efficient... | 24 | Experimental |
| 27 | Honee-W/U-SAM | Official repository for U-SAM (Interspeech 2025) | 23 | Experimental |
| 28 | nobel-postech/mirror | Code and data for "MIRROR: Multimodal Cognitive Reframing Therapy for... | 23 | Experimental |
| 29 | Gary3410/TaPA | [arXiv 2023] Embodied Task Planning with Large Language Models | 22 | Experimental |
| 30 | AdrienneDeganutti/DANTE-AD | "DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description"... | 21 | Experimental |
| 31 | rese1f/STEVE | [ECCV 2024] STEVE in Minecraft is for See and Think: Embodied Agent in... | 16 | Experimental |
| 32 | 0606zt/PanoLlama | [ICCV 2025 Highlight] Panorama Generation as a Next-Token Prediction Task. | 15 | Experimental |
| 33 | ashleykleynhans/llava-docker | Docker image for LLaVA: Large Language and Vision Assistant | 14 | Experimental |
| 34 | tvtung2902/poem_generator | AI-generated Vietnamese Luc Bat poetry using fine-tuned VinaLLaMA model. | 11 | Experimental |