Multimodal Vision-Language LLM Tools

LLMs designed for understanding and generating content across vision, audio, video, and temporal modalities. Includes models that process images, videos, 3D shapes, and audio alongside text. Does NOT include single-modality tools, general text-only LLMs, or tools that only caption/describe without deeper reasoning.

There are 92 multimodal vision-language tools tracked; 2 score above 50 (the established tier). The highest-rated is jingyaogong/minimind-v at 63/100 with 6,712 stars, and 2 of the top 10 are actively maintained.

Get all 92 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=multimodal-vision-language&limit=92"

Open to everyone: 100 requests/day with no API key; a free key raises this to 1,000 requests/day.
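If you would rather pull the data in code, here is a minimal Python sketch against the same endpoint. Only the URL, query parameters, and rate limits above come from this page; the response shape and the per-record field names (`tool`, `score`, `tier`) are assumptions, so adjust them to the actual payload.

```python
import requests

# Same endpoint and query parameters as the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
PARAMS = {
    "domain": "llm-tools",
    "subcategory": "multimodal-vision-language",
    "limit": 92,  # fetch all 92 tracked projects
}

resp = requests.get(URL, params=PARAMS, timeout=30)
resp.raise_for_status()
payload = resp.json()

# ASSUMPTION: the payload is either a bare list of records or an object
# wrapping one; the field names "tool", "score", and "tier" are also
# assumptions, not documented on this page.
records = payload if isinstance(payload, list) else payload.get("items", [])
for rec in records:
    if rec.get("score", 0) > 50:  # "established" tier per the text above
        print(rec.get("tool"), rec.get("score"), rec.get("tier"))
```

If you have a free key, the API presumably expects it as a header or query parameter; the parameter name is not shown on this page, so it is left out of the sketch.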

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | jingyaogong/minimind-v | 🚀 [Large Models] Train a 26M-parameter visual multimodal VLM from scratch in 1 hour! 🌏 Train a 26M-parameter VLM from scratch in... | 63 | Established |
| 2 | SkyworkAI/Skywork-R1V | Skywork-R1V is an advanced multimodal AI model series developed by Skywork... | 51 | Established |
| 3 | NExT-GPT/NExT-GPT | Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large... | 48 | Emerging |
| 4 | roboflow/vision-ai-checkup | Take your LLM to the optometrist. | 48 | Emerging |
| 5 | OpenGVLab/InternVL | [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to... | 47 | Emerging |
| 6 | InternLM/InternLM-XComposer | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for... | 46 | Emerging |
| 7 | OpenGVLab/Ask-Anything | [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And... | 45 | Emerging |
| 8 | zai-org/GLM-TTS | GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward... | 45 | Emerging |
| 9 | JIA-Lab-research/MGM | Official repo for "Mini-Gemini: Mining the Potential of Multi-modality... | 45 | Emerging |
| 10 | EvolvingLMMs-Lab/NEO | NEO Series: Native Vision-Language Models from First Principles | 44 | Emerging |
| 11 | EvolvingLMMs-Lab/Otter | 🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of... | 44 | Emerging |
| 12 | EvolvingLMMs-Lab/LLaVA-OneVision-1.5 | Fully Open Framework for Democratized Multimodal Training | 41 | Emerging |
| 13 | connorkapoor/Palmetto | A simple web-based CAD workbench for discovering and creating DFM (Design... | 40 | Emerging |
| 14 | huangwl18/VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | 40 | Emerging |
| 15 | ihp-lab/Face-LLaVA | [WACV 2026] Face-LLaVA: Facial Expression and Attribute Understanding... | 39 | Emerging |
| 16 | OceanGPT/OceanGPT | [沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks | 36 | Emerging |
| 17 | bagh2178/SG-Nav | [NeurIPS 2024] SG-Nav: Online 3D Scene Graph Prompting for LLM-based... | 35 | Emerging |
| 18 | LLaVA-VL/LLaVA-Plus-Codebase | LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills | 35 | Emerging |
| 19 | thuml/iVideoGPT | Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World... | 35 | Emerging |
| 20 | umuttt5738/neurosymbolic-vqa-program-generator | 🧠 Generate executable programs from natural language questions using a... | 35 | Emerging |
| 21 | YvanYin/DrivingWorld | Code for "DrivingWorld: Constructing World Model for Autonomous Driving via... | 34 | Emerging |
| 22 | JIA-Lab-research/LLMGA | This project is the official implementation of 'LLMGA: Multimodal Large... | 34 | Emerging |
| 23 | yuanze-lin/Olympus | [CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router... | 34 | Emerging |
| 24 | tincans-ai/gazelle | Joint speech-language model - respond directly to audio! | 34 | Emerging |
| 25 | PKU-YuanGroup/Chat-UniVi | [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers... | 34 | Emerging |
| 26 | FusionBrainLab/OmniFusion | OmniFusion — a multimodal model to communicate using text and images | 34 | Emerging |
| 27 | dimitrismallis/CAD-Assistant | Code for our ICCV 2025 paper "CAD-Assistant: Tool-Augmented VLLMs as Generic... | 33 | Emerging |
| 28 | SALT-NLP/Sketch2Code | Code for the paper: Sketch2Code: Evaluating Vision-Language Models for... | 33 | Emerging |
| 29 | isaaccorley/goldeneye | GoldenEye is a library of geospatial vision-language models -- run any... | 32 | Emerging |
| 30 | Pointcept/GPT4Point | [CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language... | 32 | Emerging |
| 31 | wgcyeo/WorldMM | [CVPR 2026] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | 32 | Emerging |
| 32 | MooreThreads/MooER | MooER: Moore-threads Open Omni model for speech-to-speech intERaction.... | 32 | Emerging |
| 33 | isjinghao/OralGPT | [NeurIPS'25 \| CVPR'26] The official repo of OralGPT & MMOral Bench. | 31 | Emerging |
| 34 | H-Freax/ThinkGrasp | [CoRL2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping... | 31 | Emerging |
| 35 | greenland-dream/video-understanding | This repository provides core code for managing large volumes of video... | 31 | Emerging |
| 36 | worldbench/VideoLucy | [NeurIPS 2025] Deep Memory Backtracking for Long Video Understanding | 30 | Emerging |
| 37 | mbzuai-oryx/LLaVA-pp | 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) | 29 | Experimental |
| 38 | Open3DA/LL3DA | [CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D... | 29 | Experimental |
| 39 | nuldertien/PathBLIP-2 | This repository contains all code to support the paper: "On the Importance... | 28 | Experimental |
| 40 | om-ai-lab/ZoomEye | [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming... | 28 | Experimental |
| 41 | haesleinhuepf/vlm-pictionary | Play pictionary with Vision Language Models! | 26 | Experimental |
| 42 | Hiram31/CADialogue | Official implementation of "CADialogue: A Multimodal LLM-Powered... | 26 | Experimental |
| 43 | FuxiaoLiu/MMC | [NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM... | 26 | Experimental |
| 44 | bigai-nlco/VideoTGB | [EMNLP 2024] A Video Chat Agent with Temporal Prior | 25 | Experimental |
| 45 | WisconsinAIVision/YoChameleon | 🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025) | 25 | Experimental |
| 46 | Toommo2/Text2CAD | 🚀 Convert natural language to real CAD artifacts with Text2CAD, an... | 25 | Experimental |
| 47 | luxus180/LLaVA-OneVision-1.5 | 🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an... | 25 | Experimental |
| 48 | DonaldTrump-coder/Informative-Scene-Reconstruction-App | A local software and cloud service system that integrates 3D functionalities... | 25 | Experimental |
| 49 | Piero24/VLM-Object-Detection | A pipeline for object detection and segmentation using a Vision-Language... | 25 | Experimental |
| 50 | Blinorot/ALARM | Official Implementation of "ALARM: Audio–Language Alignment for Reasoning Models" | 25 | Experimental |
| 51 | smsnobin77/Awesome-Multimodal-Unlearning | This repo presents a survey of multimodal unlearning across vision,... | 25 | Experimental |
| 52 | ShareGPT4Omni/ShareGPT4Video | [NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving... | 24 | Experimental |
| 53 | yifanlu0227/ChatSim | [CVPR2024 Highlight] Editable Scene Simulation for Autonomous Driving via... | 24 | Experimental |
| 54 | ZPider0/Multimodal | 🎤 Transform speech and text with this lightweight Python toolkit for... | 24 | Experimental |
| 55 | showlab/VLog | [CVPR 2025] Video Narration as Vocabulary & Video as Long Document | 24 | Experimental |
| 56 | tenghuilee/ScalingCapFusedVisionLM | Number of tokens <=> performance of a vision-language model | 24 | Experimental |
| 57 | XduSyL/EventGPT | 🔥[CVPR2025] EventGPT: Event Stream Understanding with Multimodal Large... | 24 | Experimental |
| 58 | timmylucy/GLM-ASR | 🔊 Enhance speech recognition with GLM-ASR-Nano-2512, a high-performance... | 23 | Experimental |
| 59 | Hyeongkeun/LAVCap | Official PyTorch Implementation of 'LAVCap: LLM-based Audio-Visual... | 23 | Experimental |
| 60 | fz-zsl/QuatRoPE | The official implementation for CVPR 2026 paper Scalable Object Relation... | 22 | Experimental |
| 61 | OmniMMI/OmniMMI | [CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in... | 21 | Experimental |
| 62 | SiyuWang0906/CAD-GPT | [AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial... | 20 | Experimental |
| 63 | ShareGPT4Omni/ShareGPT4V | [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions | 20 | Experimental |
| 64 | anymodality/anymodality | AnyModality is an open-source library to simplify MultiModal LLM inference... | 20 | Experimental |
| 65 | whwu95/FreeVA | FreeVA: Offline MLLM as Training-Free Video Assistant | 20 | Experimental |
| 66 | yophis/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | 19 | Experimental |
| 67 | hpfield/Text2Touch | CoRL 2025 - Tactile In-Hand Manipulation with LLM-Designed Reward Functions | 19 | Experimental |
| 68 | hamedR96/User-VLM | Personalized Vision Language Models for Social Human-Robot Interactions | 19 | Experimental |
| 69 | termehtaheri/SAR-LM | Official implementation of “SAR-LM: Symbolic Audio Reasoning with Large... | 18 | Experimental |
| 70 | MariyamSiddiqui/Zero-shot-image-to-text-generation-with-BLIP-2 | Zero-shot image-to-text generation using Salesforce’s BLIP-2 model —... | 17 | Experimental |
| 71 | alexander-moore/vlm | Composition of Multimodal Language Models From Scratch | 17 | Experimental |
| 72 | InternRobotics/VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | 17 | Experimental |
| 73 | BaohaoLiao/road | [NeurIPS 2024] 3-in-1: 2D Rotary Adaptation for Efficient Finetuning,... | 17 | Experimental |
| 74 | paxnea/LLM-multimodal-nudging | Zero-Shot Learning for Multimodal Nudging | 16 | Experimental |
| 75 | Pittawat2542/driving-assessment-distillation | This repository contains the code and data for the paper "Speed Up!... | 16 | Experimental |
| 76 | Atomic-man007/blip-vision-language | BLIP is a novel Vision-Language Pre-training (VLP) framework designed to... | 15 | Experimental |
| 77 | mariyahendriksen/ecir2022_category_to_image_retrieval | This repository contains the code for the paper "Extending CLIP for... | 15 | Experimental |
| 78 | ais-lab/FaceAIS_REACT24 | [FG 2024] Finite Scalar Quantization as Facial Tokenizer for Dyadic Reaction... | 15 | Experimental |
| 79 | yueying-teng/generate-language-image-instruction-following-data | Mistral-assisted visual instruction data generation following LLaVA | 14 | Experimental |
| 80 | sonkd/Visual-Question-Answering-on-VizWiz | Visual Question Answering on VizWiz, A Generative CLIP + LSTM Approach with... | 14 | Experimental |
| 81 | engindeniz/vitis | [ICCV 2023 CLVL Workshop] Zero-Shot and Few-Shot Video Question Answering... | 14 | Experimental |
| 82 | OpenShapeLab/ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, a... | 14 | Experimental |
| 83 | ikun-llm/ikun-V | Multimodal vision-language model \| Vision-Language Model 👁️ | 14 | Experimental |
| 84 | zhudotexe/kani-vision | Kani extension for supporting vision-language models (VLMs). Comes with... | 13 | Experimental |
| 85 | Jeremyyny/Value-Spectrum | Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value... | 13 | Experimental |
| 86 | scb-10x/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | 12 | Experimental |
| 87 | Flagro/OmniModKit | Multimodal LLM toolkit | 12 | Experimental |
| 88 | Jshulgach/Grounded-SAM-2-Stream | Track anything in streaming with Grounding DINO, SAM 2, and LLM | 12 | Experimental |
| 89 | PrateekJannu/Vision-GPT | Coding a Multi-Modal vision model like GPT-4o from scratch, inspired by... | 10 | Experimental |
| 90 | KDEGroup/MMICT | Source code for TOMM'24 paper "MMICT: Boosting Multi-Modal Fine-Tuning with... | 10 | Experimental |
| 91 | mahshid1378/VALL-E | PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), Reproduced Demo... | 10 | Experimental |
| 92 | oncescuandreea/audio_egovlp | This is the official codebase used for obtaining the results in the ICASSP... | 10 | Experimental |