Multimodal Visual Grounding NLP Tools

Tools for grounding natural language in visual content (images, video, 3D scenes), including visual question answering, object localization, and cross-modal retrieval. Does NOT include general image captioning, multimodal pretraining without grounding focus, or speech-only cross-modal tasks.

There are 25 multimodal visual grounding tools tracked. The highest-rated is TheShadow29/awesome-grounding at 40/100 with 1,125 stars.

Get all 25 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=multimodal-visual-grounding&limit=25"
```

Open to everyone: 100 requests/day with no API key; a free key raises the limit to 1,000/day.
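For programmatic use, the endpoint above can be consumed directly. The sketch below groups tools by their quality tier; note that the JSON field names (`name`, `score`, `tier`) are assumptions based on the columns shown in the table, since the response schema is not documented here, and the sample data is a hypothetical two-entry excerpt.

```python
import json

# Hypothetical excerpt of the API's JSON response; the field names
# ("name", "score", "tier") are assumed, not documented on this page.
SAMPLE = json.loads("""[
  {"name": "gicheonkang/sglkt-visdial", "score": 29, "tier": "Experimental"},
  {"name": "TheShadow29/awesome-grounding", "score": 40, "tier": "Emerging"}
]""")

def by_tier(tools):
    """Group tool names by quality tier, highest score first."""
    groups = {}
    for tool in sorted(tools, key=lambda t: t["score"], reverse=True):
        groups.setdefault(tool["tier"], []).append(tool["name"])
    return groups

groups = by_tier(SAMPLE)
```

In a live setting you would replace `SAMPLE` with the parsed body of the `curl` request shown above.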

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | TheShadow29/awesome-grounding | awesome grounding: A curated list of research papers in visual grounding | 40 | Emerging |
| 2 | microsoft/XPretrain | Multi-modality pre-training | 34 | Emerging |
| 3 | TheShadow29/zsgnet-pytorch | Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects... | 34 | Emerging |
| 4 | TheShadow29/VidSitu | [CVPR21] Visual Semantic Role Labeling for Video Understanding... | 31 | Emerging |
| 5 | zeyofu/BLINK_Benchmark | This repo contains evaluation code for the paper "BLINK: Multimodal Large... | 30 | Emerging |
| 6 | qaixerabbas/awesome-multimodal-learning-with-imperfect-data | Multimodal Representation Learning under Imperfect Data Conditions: A Survey | 30 | Emerging |
| 7 | gicheonkang/sglkt-visdial | 🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with... | 29 | Experimental |
| 8 | MiuLab/DuaLUG | The implementation of the papers on dual learning of natural language... | 27 | Experimental |
| 9 | princeton-nlp/XTX | [ICLR 2022 Spotlight] Multi-Stage Episodic Control for Strategic Exploration... | 27 | Experimental |
| 10 | SkalskiP/awesome-foundation-and-multimodal-models | 👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper... | 26 | Experimental |
| 11 | MichiganNLP/Scalable-VLM-Probing | Probe Vision-Language Models | 25 | Experimental |
| 12 | fork123aniket/Graph-Neural-Network-based-Visual-Question-Answering | Implementation of GNNs for Visual Question Answering task in PyTorch | 25 | Experimental |
| 13 | 1989Ryan/paragon | [ICRA 2023] Differentiable parsing and visual grounding of natural language... | 23 | Experimental |
| 14 | tim-dickey/multi-modal-neural-network | Multi-modal neural network with double-loop learning that fuses vision and... | 20 | Experimental |
| 15 | benywon/LALM | Code and resource for ACL2021 paper 'Multi-Lingual Question Generation with... | 19 | Experimental |
| 16 | workforyou786/Large-Language-Model-Research-Paper | Multimodal AI: systems that can understand and generate information across... | 19 | Experimental |
| 17 | aimagelab/JARVIS | Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large... | 18 | Experimental |
| 18 | thunlp/cost-optimal-gqa | The code for the paper "Cost-Optimal Grouped-Query Attention for... | 18 | Experimental |
| 19 | PRITHIVSAKTHIUR/Molmo2-HF-Demo | A Gradio-based demonstration for the AllenAI Molmo2-8B multimodal model,... | 18 | Experimental |
| 20 | aistairc/VDAct | A Video-grounded Dialogue Dataset and Metric for Event-driven Activities | 15 | Experimental |
| 21 | candacelax/grounded-vision-parser | Semantic parser trained by using videos only instead of labeled logical forms | 15 | Experimental |
| 22 | psunlpgroup/MPlanner | ACL2025-Findings paper "Enhance Multimodal Consistency and Coherence for... | 14 | Experimental |
| 23 | zoppellarielena/Paper-Presentation-for-Natural-Language-Processing | This presentation, conducted for the "Natural Language Processing" course,... | 11 | Experimental |
| 24 | ChenBarryHu/TransformerVG | TransformerVG - 3D Visual Grounding with Transformers | 11 | Experimental |
| 25 | nmhongtram/gnn-surgical-understanding | Graph Reasoning for Visual Question Answering in Laparoscopic Scene Understanding | 10 | Experimental |