Multimodal Vision Language LLM Tools
LLMs designed to understand and generate content across vision, audio, video, and other temporal modalities. Includes models that process images, videos, 3D shapes, and audio alongside text. Does NOT include single-modality tools, general text-only LLMs, or tools that only caption or describe without deeper reasoning.
92 multimodal vision-language tools are tracked. Two score above 50 (the established tier). The highest-rated is jingyaogong/minimind-v at 63/100 with 6,712 stars. Two of the top 10 are actively maintained.
Get all 92 projects as JSON (the `limit` query parameter controls how many are returned per request):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=multimodal-vision-language&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
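For programmatic access rather than curl, the sketch below (Python, standard library only) fetches the same endpoint and prints the returned entries. It uses only the query parameters shown in the example above (`domain`, `subcategory`, `limit`); the response field names it prints (`name`, `score`, `tier`) and the `items` wrapper are assumptions about the JSON shape, not documented behavior, so inspect the actual response and adjust.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and query parameters taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "llm-tools",
    "subcategory": "multimodal-vision-language",
    "limit": 20,  # raise this to pull more of the 92 tracked projects
}

url = f"{BASE_URL}?{urllib.parse.urlencode(params)}"
with urllib.request.urlopen(url, timeout=30) as resp:
    data = json.load(resp)

# Assumed response shape: either a bare list of projects or an object with
# an "items" list; "name", "score", and "tier" are guessed field names.
items = data if isinstance(data, list) else data.get("items", [])
for rank, item in enumerate(items, start=1):
    print(f'{rank:>2}  {item.get("name")}  score={item.get("score")}  tier={item.get("tier")}')
```

With a free API key the same request is allowed 1,000 times per day; how the key is passed (header vs. query parameter) is not shown here, so check the service documentation before adding it.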
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | jingyaogong/minimind-v | 🚀 Train a 26M-parameter vision multimodal VLM from scratch in 1 hour! 🌏 Train a 26M-parameter VLM from scratch in... | 63 | Established |
| 2 | SkyworkAI/Skywork-R1V | Skywork-R1V is an advanced multimodal AI model series developed by Skywork... | | Established |
| 3 | NExT-GPT/NExT-GPT | Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large... | | Emerging |
| 4 | roboflow/vision-ai-checkup | Take your LLM to the optometrist. | | Emerging |
| 5 | OpenGVLab/InternVL | [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to... | | Emerging |
| 6 | InternLM/InternLM-XComposer | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for... | | Emerging |
| 7 | OpenGVLab/Ask-Anything | [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And... | | Emerging |
| 8 | zai-org/GLM-TTS | GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward... | | Emerging |
| 9 | JIA-Lab-research/MGM | Official repo for "Mini-Gemini: Mining the Potential of Multi-modality... | | Emerging |
| 10 | EvolvingLMMs-Lab/NEO | NEO Series: Native Vision-Language Models from First Principles | | Emerging |
| 11 | EvolvingLMMs-Lab/Otter | 🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of... | | Emerging |
| 12 | EvolvingLMMs-Lab/LLaVA-OneVision-1.5 | Fully Open Framework for Democratized Multimodal Training | | Emerging |
| 13 | connorkapoor/Palmetto | A simple web-based CAD workbench for discovering and creating DFM (Design... | | Emerging |
| 14 | huangwl18/VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | | Emerging |
| 15 | ihp-lab/Face-LLaVA | [WACV 2026] Face-LLaVA: Facial Expression and Attribute Understanding... | | Emerging |
| 16 | OceanGPT/OceanGPT | [沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks | | Emerging |
| 17 | bagh2178/SG-Nav | [NeurIPS 2024] SG-Nav: Online 3D Scene Graph Prompting for LLM-based... | | Emerging |
| 18 | LLaVA-VL/LLaVA-Plus-Codebase | LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills | | Emerging |
| 19 | thuml/iVideoGPT | Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World... | | Emerging |
| 20 | umuttt5738/neurosymbolic-vqa-program-generator | 🧠 Generate executable programs from natural language questions using a... | | Emerging |
| 21 | YvanYin/DrivingWorld | Code for "DrivingWorld: Constructing World Model for Autonomous Driving via... | | Emerging |
| 22 | JIA-Lab-research/LLMGA | This project is the official implementation of 'LLMGA: Multimodal Large... | | Emerging |
| 23 | yuanze-lin/Olympus | [CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router... | | Emerging |
| 24 | tincans-ai/gazelle | Joint speech-language model - respond directly to audio! | | Emerging |
| 25 | PKU-YuanGroup/Chat-UniVi | [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers... | | Emerging |
| 26 | FusionBrainLab/OmniFusion | OmniFusion, a multimodal model to communicate using text and images | | Emerging |
| 27 | dimitrismallis/CAD-Assistant | Code for our ICCV 2025 paper "CAD-Assistant: Tool-Augmented VLLMs as Generic... | | Emerging |
| 28 | SALT-NLP/Sketch2Code | Code for the paper: Sketch2Code: Evaluating Vision-Language Models for... | | Emerging |
| 29 | isaaccorley/goldeneye | GoldenEye is a library of geospatial vision-language models -- run any... | | Emerging |
| 30 | Pointcept/GPT4Point | [CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language... | | Emerging |
| 31 | wgcyeo/WorldMM | [CVPR 2026] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | | Emerging |
| 32 | MooreThreads/MooER | MooER: Moore-threads Open Omni model for speech-to-speech intERaction.... | | Emerging |
| 33 | isjinghao/OralGPT | [NeurIPS'25 \| CVPR'26] The official repo of OralGPT & MMOral Bench. | | Emerging |
| 34 | H-Freax/ThinkGrasp | [CoRL2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping... | | Emerging |
| 35 | greenland-dream/video-understanding | This repository provides core code for managing large volumes of video... | | Emerging |
| 36 | worldbench/VideoLucy | [NeurIPS 2025] Deep Memory Backtracking for Long Video Understanding | | Emerging |
| 37 | mbzuai-oryx/LLaVA-pp | 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) | | Experimental |
| 38 | Open3DA/LL3DA | [CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D... | | Experimental |
| 39 | nuldertien/PathBLIP-2 | This repository contains all code to support the paper: "On the Importance... | | Experimental |
| 40 | om-ai-lab/ZoomEye | [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming... | | Experimental |
| 41 | haesleinhuepf/vlm-pictionary | Play pictionary with Vision Language Models! | | Experimental |
| 42 | Hiram31/CADialogue | Official implementation of "CADialogue: A Multimodal LLM-Powered... | | Experimental |
| 43 | FuxiaoLiu/MMC | [NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM... | | Experimental |
| 44 | bigai-nlco/VideoTGB | [EMNLP 2024] A Video Chat Agent with Temporal Prior | | Experimental |
| 45 | WisconsinAIVision/YoChameleon | 🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025) | | Experimental |
| 46 | Toommo2/Text2CAD | 🚀 Convert natural language to real CAD artifacts with Text2CAD, an... | | Experimental |
| 47 | luxus180/LLaVA-OneVision-1.5 | 🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an... | | Experimental |
| 48 | DonaldTrump-coder/Informative-Scene-Reconstruction-App | A local software and cloud service system that integrates 3D functionalities... | | Experimental |
| 49 | Piero24/VLM-Object-Detection | A pipeline for object detection and segmentation using a Vision-Language... | | Experimental |
| 50 | Blinorot/ALARM | Official Implementation of "ALARM: Audio–Language Alignment for Reasoning Models" | | Experimental |
| 51 | smsnobin77/Awesome-Multimodal-Unlearning | This repo presents a survey of multimodal unlearning across vision,... | | Experimental |
| 52 | ShareGPT4Omni/ShareGPT4Video | [NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving... | | Experimental |
| 53 | yifanlu0227/ChatSim | [CVPR2024 Highlight] Editable Scene Simulation for Autonomous Driving via... | | Experimental |
| 54 | ZPider0/Multimodal | 🎤 Transform speech and text with this lightweight Python toolkit for... | | Experimental |
| 55 | showlab/VLog | [CVPR 2025] Video Narration as Vocabulary & Video as Long Document | | Experimental |
| 56 | tenghuilee/ScalingCapFusedVisionLM | number of tokens <=> performance to a vision language model | | Experimental |
| 57 | XduSyL/EventGPT | 🔥[CVPR2025] EventGPT: Event Stream Understanding with Multimodal Large... | | Experimental |
| 58 | timmylucy/GLM-ASR | 🔊 Enhance speech recognition with GLM-ASR-Nano-2512, a high-performance... | | Experimental |
| 59 | Hyeongkeun/LAVCap | Official Pytorch Implementation of 'LAVCap: LLM-based Audio-Visual... | | Experimental |
| 60 | fz-zsl/QuatRoPE | The official implementation for CVPR 2026 paper Scalable Object Relation... | | Experimental |
| 61 | OmniMMI/OmniMMI | [CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in... | | Experimental |
| 62 | SiyuWang0906/CAD-GPT | [AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial... | | Experimental |
| 63 | ShareGPT4Omni/ShareGPT4V | [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions | | Experimental |
| 64 | anymodality/anymodality | AnyModality is an open-source library to simplify MultiModal LLM inference... | | Experimental |
| 65 | whwu95/FreeVA | FreeVA: Offline MLLM as Training-Free Video Assistant | | Experimental |
| 66 | yophis/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | | Experimental |
| 67 | hpfield/Text2Touch | CoRL 2025 - Tactile In-Hand Manipulation with LLM-Designed Reward Functions | | Experimental |
| 68 | hamedR96/User-VLM | Personalized Vision Language Models for Social Human-Robot Interactions | | Experimental |
| 69 | termehtaheri/SAR-LM | Official implementation of "SAR-LM: Symbolic Audio Reasoning with Large... | | Experimental |
| 70 | MariyamSiddiqui/Zero-shot-image-to-text-generation-with-BLIP-2 | Zero-shot image-to-text generation using Salesforce's BLIP-2 model... | | Experimental |
| 71 | alexander-moore/vlm | Composition of Multimodal Language Models From Scratch | | Experimental |
| 72 | InternRobotics/VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | | Experimental |
| 73 | BaohaoLiao/road | [NeurIPS 2024] 3-in-1: 2D Rotary Adaptation for Efficient Finetuning,... | | Experimental |
| 74 | paxnea/LLM-multimodal-nudging | Zero-Shot Learning for Multimodal Nudging | | Experimental |
| 75 | Pittawat2542/driving-assessment-distillation | This repository contains the code and data for the paper "Speed Up!... | | Experimental |
| 76 | Atomic-man007/blip-vision-language | BLIP is a novel Vision-Language Pre-training (VLP) framework designed to... | | Experimental |
| 77 | mariyahendriksen/ecir2022_category_to_image_retrieval | This repository contains the code for the paper "Extending CLIP for... | | Experimental |
| 78 | ais-lab/FaceAIS_REACT24 | [FG 2024] Finite Scalar Quantization as Facial Tokenizer for Dyadic Reaction... | | Experimental |
| 79 | yueying-teng/generate-language-image-instruction-following-data | Mistral assisted visual instruction data generation by following LLaVA | | Experimental |
| 80 | sonkd/Visual-Question-Answering-on-VizWiz | Visual Question Answering on VizWiz, A Generative CLIP + LSTM Approach with... | | Experimental |
| 81 | engindeniz/vitis | [ICCV 2023 CLVL Workshop] Zero-Shot and Few-Shot Video Question Answering... | | Experimental |
| 82 | OpenShapeLab/ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, a... | | Experimental |
| 83 | ikun-llm/ikun-V | Multimodal Vision-Language Model 👁️ | | Experimental |
| 84 | zhudotexe/kani-vision | Kani extension for supporting vision-language models (VLMs). Comes with... | | Experimental |
| 85 | Jeremyyny/Value-Spectrum | Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value... | | Experimental |
| 86 | scb-10x/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | | Experimental |
| 87 | Flagro/OmniModKit | Multimodal LLM toolkit | | Experimental |
| 88 | Jshulgach/Grounded-SAM-2-Stream | Track anything in streaming with Grounding DINO, SAM 2, and LLM | | Experimental |
| 89 | PrateekJannu/Vision-GPT | Coding a Multi-Modal vision model like GPT-4o from scratch, inspired by... | | Experimental |
| 90 | KDEGroup/MMICT | Source code for TOMM'24 paper "MMICT: Boosting Multi-Modal Fine-Tuning with... | | Experimental |
| 91 | mahshid1378/VALL-E | PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), Reproduced Demo... | | Experimental |
| 92 | oncescuandreea/audio_egovlp | This is the official codebase used for obtaining the results in the ICASSP... | | Experimental |