Vision-Language Instruction-Tuning Transformer Models
Tools for training and fine-tuning multimodal models that combine vision and language through instruction-based learning. Includes efficient architectures, video understanding, and grounded vision-language models. Does NOT include general vision transformers, image captioning without instruction tuning, or non-multimodal LLM fine-tuning.
There are 34 vision-language instruction-tuning models tracked; one scores above 50 (the established tier). The highest-rated is TinyLLaVA/TinyLLaVA_Factory at 54/100 with 962 stars, and 1 of the top 10 is actively maintained.
Get the tracked projects as JSON (the query below returns the top 20; raise `limit` to retrieve all 34):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=vision-language-instruction-tuning&limit=20"
Open to everyone: 100 requests/day with no key needed, or get a free key for 1,000 requests/day.
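Once fetched, the JSON is easy to post-process locally. A minimal Python sketch, assuming a hypothetical response with a `projects` array carrying `name`, `score`, and `tier` fields (the endpoint's real field names and values may differ; the sample below uses placeholder data), that tallies projects per quality tier:

```python
import json
from collections import Counter

# Hypothetical response shape for illustration only; the actual payload
# returned by the /datasets/quality endpoint may use different field names.
# Scores other than the top entry's 54 are placeholders.
sample_response = json.loads("""
{
  "projects": [
    {"name": "TinyLLaVA/TinyLLaVA_Factory", "score": 54, "tier": "Established"},
    {"name": "zjunlp/EasyInstruct", "score": 40, "tier": "Emerging"},
    {"name": "BUAADreamer/Chinese-LLaVA-Med", "score": 20, "tier": "Experimental"}
  ]
}
""")

def count_by_tier(projects):
    """Tally how many projects fall into each quality tier."""
    return Counter(p["tier"] for p in projects)

tiers = count_by_tier(sample_response["projects"])
print(dict(tiers))  # one project per tier in this sample
```

The same pattern extends to filtering (e.g. keeping only `Established` entries) or sorting by `score` before rendering a table.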
| # | Model | Score | Tier |
|---|---|---|---|
| 1 | TinyLLaVA/TinyLLaVA_Factory - A Framework of Small-scale Large Multimodal Models | 54 | Established |
| 2 | zjunlp/EasyInstruct - [ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs. | | Emerging |
| 3 | haotian-liu/LLaVA - [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V... | | Emerging |
| 4 | DAMO-NLP-SG/Video-LLaMA - [EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language... | | Emerging |
| 5 | Instruction-Tuning-with-GPT-4/GPT-4-LLM - Instruction Tuning with GPT-4 | | Emerging |
| 6 | rese1f/MovieChat - [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | | Emerging |
| 7 | NVlabs/Eagle - Eagle: Frontier Vision-Language Models with Data-Centric Strategies | | Emerging |
| 8 | open-mmlab/Multimodal-GPT - Multimodal-GPT | | Emerging |
| 9 | X-PLUG/mPLUG-Owl - mPLUG-Owl: The Powerful Multi-modal Large Language Model Family | | Emerging |
| 10 | AdrianBZG/llama-multimodal-vqa - Multimodal Instruction Tuning for Llama 3 | | Emerging |
| 11 | FoundationVision/Groma - [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual... | | Emerging |
| 12 | ictnlp/LLaVA-Mini - LLaVA-Mini is a unified large multimodal model (LMM) that can support the... | | Emerging |
| 13 | shikiw/OPERA - [CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large... | | Emerging |
| 14 | aihao2000/DPN-LLaVA - Arxiv 25: Dynamic Pyramid Network for Efficient Multimodal Large Language Model | | Emerging |
| 15 | mlpc-ucsd/BLIVA - (AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich... | | Emerging |
| 16 | WisconsinAIVision/ViP-LLaVA - [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary... | | Emerging |
| 17 | madibabaiasl/MobileRobotGPT4LLaMA2024 - Deployment of Large Language Models to Control Mobile Robots at the Edge | | Emerging |
| 18 | Yxxxb/VoCo-LLaMA - [CVPR'2025] VoCo-LLaMA: This repo is the official implementation of... | | Emerging |
| 19 | mu-cai/matryoshka-mm - Matryoshka Multimodal Models | | Emerging |
| 20 | BUAADreamer/Chinese-LLaVA-Med - Large Chinese Language-and-Vision Assistant for BioMedicine (Chinese medical multimodal large model) | | Experimental |
| 21 | InternRobotics/PointLLM - [ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large... | | Experimental |
| 22 | OpenBMB/VisCPM - [ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat... | | Experimental |
| 23 | kyegomez/Qwen-VL - My personal implementation of the model from "Qwen-VL: A Frontier Large... | | Experimental |
| 24 | xiaoachen98/Open-LLaVA-NeXT - An open-source implementation for training LLaVA-NeXT. | | Experimental |
| 25 | shikiw/Modality-Integration-Rate - [ICCV 2025] The official code of the paper "Deciphering Cross-Modal... | | Experimental |
| 26 | visresearch/LLaVA-STF - The official implementation of "Learning Compact Vision Tokens for Efficient... | | Experimental |
| 27 | Honee-W/U-SAM - Official repository for U-SAM (Interspeech 2025) | | Experimental |
| 28 | nobel-postech/mirror - Code and data for "MIRROR: Multimodal Cognitive Reframing Therapy for... | | Experimental |
| 29 | Gary3410/TaPA - [arXiv 2023] Embodied Task Planning with Large Language Models | | Experimental |
| 30 | AdrienneDeganutti/DANTE-AD - "DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description"... | | Experimental |
| 31 | rese1f/STEVE - [ECCV 2024] STEVE in Minecraft is for See and Think: Embodied Agent in... | | Experimental |
| 32 | 0606zt/PanoLlama - [ICCV 2025 Highlight] Panorama Generation as a Next-Token Prediction Task. | | Experimental |
| 33 | ashleykleynhans/llava-docker - Docker image for LLaVA: Large Language and Vision Assistant | | Experimental |
| 34 | tvtung2902/poem_generator - AI-generated Vietnamese Luc Bat poetry using fine-tuned VinaLLaMA model. | | Experimental |