Multimodal Vision Language LLM Tools
LLMs designed to understand and generate content across vision, audio, video, and other temporal modalities. Includes models that process images, videos, 3D shapes, and audio alongside text. Does NOT include single-modality tools, general text-only LLMs, or tools that only caption or describe without deeper reasoning.
92 multimodal vision-language tools are tracked. Two score above 50 (the established tier). The highest-rated is jingyaogong/minimind-v at 63/100 with 6,712 stars. Two of the top 10 are actively maintained.
Get all 92 projects as JSON (the `limit` query parameter controls how many are returned per request):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=multimodal-vision-language&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
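For programmatic access rather than curl, the sketch below (Python, standard library only) fetches the same endpoint and prints the returned entries. It uses only the query parameters shown in the example above (`domain`, `subcategory`, `limit`); the response field names it prints (`name`, `score`, `tier`) and the `items` wrapper are assumptions about the JSON shape, not documented behavior, so inspect the actual response and adjust.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and query parameters taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "llm-tools",
    "subcategory": "multimodal-vision-language",
    "limit": 20,  # raise this to pull more of the 92 tracked projects
}

url = f"{BASE_URL}?{urllib.parse.urlencode(params)}"
with urllib.request.urlopen(url, timeout=30) as resp:
    data = json.load(resp)

# Assumed response shape: either a bare list of projects or an object with
# an "items" list; "name", "score", and "tier" are guessed field names.
items = data if isinstance(data, list) else data.get("items", [])
for rank, item in enumerate(items, start=1):
    print(f'{rank:>2}  {item.get("name")}  score={item.get("score")}  tier={item.get("tier")}')
```

With a free API key the same request is allowed 1,000 times per day; how the key is passed (header vs. query parameter) is not shown here, so check the service documentation before adding it.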
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | jingyaogong/minimind-v | 🚀 Train a 26M-parameter vision multimodal VLM from scratch in 1 hour! 🌏 Train a 26M-parameter VLM from scratch in... | 63 | Established |
| 2 | SkyworkAI/Skywork-R1V | Skywork-R1V is an advanced multimodal AI model series developed by Skywork... | | Established |
| 3 | NExT-GPT/NExT-GPT | Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large... | | Emerging |
| 4 | roboflow/vision-ai-checkup | Take your LLM to the optometrist. | | Emerging |
| 5 | OpenGVLab/InternVL | [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to... | | Emerging |
| 6 | InternLM/InternLM-XComposer | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for... | | Emerging |
| 7 | OpenGVLab/Ask-Anything | [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And... | | Emerging |
| 8 | zai-org/GLM-TTS | GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward... | | Emerging |
| 9 | JIA-Lab-research/MGM | Official repo for "Mini-Gemini: Mining the Potential of Multi-modality... | | Emerging |
| 10 | EvolvingLMMs-Lab/NEO | NEO Series: Native Vision-Language Models from First Principles | | Emerging |
| 11 | EvolvingLMMs-Lab/Otter | 🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of... | | Emerging |
| 12 | EvolvingLMMs-Lab/LLaVA-OneVision-1.5 | Fully Open Framework for Democratized Multimodal Training | | Emerging |
| 13 | connorkapoor/Palmetto | A simple web-based CAD workbench for discovering and creating DFM (Design... | | Emerging |
| 14 | huangwl18/VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | | Emerging |
| 15 | ihp-lab/Face-LLaVA | [WACV 2026] Face-LLaVA: Facial Expression and Attribute Understanding... | | Emerging |
| 16 | OceanGPT/OceanGPT | [沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks | | Emerging |
| 17 | bagh2178/SG-Nav | [NeurIPS 2024] SG-Nav: Online 3D Scene Graph Prompting for LLM-based... | | Emerging |
| 18 | LLaVA-VL/LLaVA-Plus-Codebase | LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills | | Emerging |
| 19 | thuml/iVideoGPT | Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World... | | Emerging |
| 20 | umuttt5738/neurosymbolic-vqa-program-generator | 🧠 Generate executable programs from natural language questions using a... | | Emerging |
| 21 | YvanYin/DrivingWorld | Code for "DrivingWorld: Constructing World Model for Autonomous Driving via... | | Emerging |
| 22 | JIA-Lab-research/LLMGA | This project is the official implementation of 'LLMGA: Multimodal Large... | | Emerging |
| 23 | yuanze-lin/Olympus | [CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router... | | Emerging |
| 24 | tincans-ai/gazelle | Joint speech-language model - respond directly to audio! | | Emerging |
| 25 | PKU-YuanGroup/Chat-UniVi | [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers... | | Emerging |
| 26 | FusionBrainLab/OmniFusion | OmniFusion, a multimodal model to communicate using text and images | | Emerging |
| 27 | dimitrismallis/CAD-Assistant | Code for our ICCV 2025 paper "CAD-Assistant: Tool-Augmented VLLMs as Generic... | | Emerging |
| 28 | SALT-NLP/Sketch2Code | Code for the paper: Sketch2Code: Evaluating Vision-Language Models for... | | Emerging |
| 29 | isaaccorley/goldeneye | GoldenEye is a library of geospatial vision-language models -- run any... | | Emerging |
| 30 | Pointcept/GPT4Point | [CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language... | | Emerging |
| 31 | wgcyeo/WorldMM | [CVPR 2026] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | | Emerging |
| 32 | MooreThreads/MooER | MooER: Moore-threads Open Omni model for speech-to-speech intERaction.... | | Emerging |
| 33 | isjinghao/OralGPT | [NeurIPS'25 \| CVPR'26] The official repo of OralGPT & MMOral Bench. | | Emerging |
| 34 | H-Freax/ThinkGrasp | [CoRL2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping... | | Emerging |
| 35 | greenland-dream/video-understanding | This repository provides core code for managing large volumes of video... | | Emerging |
| 36 | worldbench/VideoLucy | [NeurIPS 2025] Deep Memory Backtracking for Long Video Understanding | | Emerging |
| 37 | mbzuai-oryx/LLaVA-pp | 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) | | Experimental |
| 38 | Open3DA/LL3DA | [CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D... | | Experimental |
| 39 | nuldertien/PathBLIP-2 | This repository contains all code to support the paper: "On the Importance... | | Experimental |
| 40 | om-ai-lab/ZoomEye | [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming... | | Experimental |
| 41 | haesleinhuepf/vlm-pictionary | Play pictionary with Vision Language Models! | | Experimental |
| 42 | Hiram31/CADialogue | Official implementation of "CADialogue: A Multimodal LLM-Powered... | | Experimental |
| 43 | FuxiaoLiu/MMC | [NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM... | | Experimental |
| 44 | bigai-nlco/VideoTGB | [EMNLP 2024] A Video Chat Agent with Temporal Prior | | Experimental |
| 45 | WisconsinAIVision/YoChameleon | 🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025) | | Experimental |
| 46 | Toommo2/Text2CAD | 🚀 Convert natural language to real CAD artifacts with Text2CAD, an... | | Experimental |
| 47 | luxus180/LLaVA-OneVision-1.5 | 🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an... | | Experimental |
| 48 | DonaldTrump-coder/Informative-Scene-Reconstruction-App | A local software and cloud service system that integrates 3D functionalities... | | Experimental |
| 49 | Piero24/VLM-Object-Detection | A pipeline for object detection and segmentation using a Vision-Language... | | Experimental |
| 50 | Blinorot/ALARM | Official Implementation of "ALARM: Audio–Language Alignment for Reasoning Models" | | Experimental |
| 51 | smsnobin77/Awesome-Multimodal-Unlearning | This repo presents a survey of multimodal unlearning across vision,... | | Experimental |
| 52 | ShareGPT4Omni/ShareGPT4Video | [NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving... | | Experimental |
| 53 | yifanlu0227/ChatSim | [CVPR2024 Highlight] Editable Scene Simulation for Autonomous Driving via... | | Experimental |
| 54 | ZPider0/Multimodal | 🎤 Transform speech and text with this lightweight Python toolkit for... | | Experimental |
| 55 | showlab/VLog | [CVPR 2025] Video Narration as Vocabulary & Video as Long Document | | Experimental |
| 56 | tenghuilee/ScalingCapFusedVisionLM | number of tokens <=> performance to a vision language model | | Experimental |
| 57 | XduSyL/EventGPT | 🔥[CVPR2025] EventGPT: Event Stream Understanding with Multimodal Large... | | Experimental |
| 58 | timmylucy/GLM-ASR | 🔊 Enhance speech recognition with GLM-ASR-Nano-2512, a high-performance... | | Experimental |
| 59 | Hyeongkeun/LAVCap | Official Pytorch Implementation of 'LAVCap: LLM-based Audio-Visual... | | Experimental |
| 60 | fz-zsl/QuatRoPE | The official implementation for CVPR 2026 paper Scalable Object Relation... | | Experimental |
| 61 | OmniMMI/OmniMMI | [CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in... | | Experimental |
| 62 | SiyuWang0906/CAD-GPT | [AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial... | | Experimental |
| 63 | ShareGPT4Omni/ShareGPT4V | [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions | | Experimental |
| 64 | anymodality/anymodality | AnyModality is an open-source library to simplify MultiModal LLM inference... | | Experimental |
| 65 | whwu95/FreeVA | FreeVA: Offline MLLM as Training-Free Video Assistant | | Experimental |
| 66 | yophis/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | | Experimental |
| 67 | hpfield/Text2Touch | CoRL 2025 - Tactile In-Hand Manipulation with LLM-Designed Reward Functions | | Experimental |
| 68 | hamedR96/User-VLM | Personalized Vision Language Models for Social Human-Robot Interactions | | Experimental |
| 69 | termehtaheri/SAR-LM | Official implementation of "SAR-LM: Symbolic Audio Reasoning with Large... | | Experimental |
| 70 | MariyamSiddiqui/Zero-shot-image-to-text-generation-with-BLIP-2 | Zero-shot image-to-text generation using Salesforce's BLIP-2 model... | | Experimental |
| 71 | alexander-moore/vlm | Composition of Multimodal Language Models From Scratch | | Experimental |
| 72 | InternRobotics/VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | | Experimental |
| 73 | BaohaoLiao/road | [NeurIPS 2024] 3-in-1: 2D Rotary Adaptation for Efficient Finetuning,... | | Experimental |
| 74 | paxnea/LLM-multimodal-nudging | Zero-Shot Learning for Multimodal Nudging | | Experimental |
| 75 | Pittawat2542/driving-assessment-distillation | This repository contains the code and data for the paper "Speed Up!... | | Experimental |
| 76 | Atomic-man007/blip-vision-language | BLIP is a novel Vision-Language Pre-training (VLP) framework designed to... | | Experimental |
| 77 | mariyahendriksen/ecir2022_category_to_image_retrieval | This repository contains the code for the paper "Extending CLIP for... | | Experimental |
| 78 | ais-lab/FaceAIS_REACT24 | [FG 2024] Finite Scalar Quantization as Facial Tokenizer for Dyadic Reaction... | | Experimental |
| 79 | yueying-teng/generate-language-image-instruction-following-data | Mistral assisted visual instruction data generation by following LLaVA | | Experimental |
| 80 | sonkd/Visual-Question-Answering-on-VizWiz | Visual Question Answering on VizWiz, A Generative CLIP + LSTM Approach with... | | Experimental |
| 81 | engindeniz/vitis | [ICCV 2023 CLVL Workshop] Zero-Shot and Few-Shot Video Question Answering... | | Experimental |
| 82 | OpenShapeLab/ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, a... | | Experimental |
| 83 | ikun-llm/ikun-V | Multimodal Vision-Language Model 👁️ | | Experimental |
| 84 | zhudotexe/kani-vision | Kani extension for supporting vision-language models (VLMs). Comes with... | | Experimental |
| 85 | Jeremyyny/Value-Spectrum | Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value... | | Experimental |
| 86 | scb-10x/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | | Experimental |
| 87 | Flagro/OmniModKit | Multimodal LLM toolkit | | Experimental |
| 88 | Jshulgach/Grounded-SAM-2-Stream | Track anything in streaming with Grounding DINO, SAM 2, and LLM | | Experimental |
| 89 | PrateekJannu/Vision-GPT | Coding a Multi-Modal vision model like GPT-4o from scratch, inspired by... | | Experimental |
| 90 | KDEGroup/MMICT | Source code for TOMM'24 paper "MMICT: Boosting Multi-Modal Fine-Tuning with... | | Experimental |
| 91 | mahshid1378/VALL-E | PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), Reproduced Demo... | | Experimental |
| 92 | oncescuandreea/audio_egovlp | This is the official codebase used for obtaining the results in the ICASSP... | | Experimental |