Multimodal Visual Grounding NLP Tools

Tools for grounding natural language in visual content (images, video, 3D scenes), including visual question answering, object localization, and cross-modal retrieval. Does NOT include general image captioning, multimodal pretraining without grounding focus, or speech-only cross-modal tasks.

There are 25 multimodal visual grounding tools tracked. The highest-rated is TheShadow29/awesome-grounding at 40/100 with 1,125 stars.

Get all 25 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=multimodal-visual-grounding&limit=25"
```

Open to everyone: 100 requests/day with no API key; a free key raises the limit to 1,000/day.
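For programmatic use, the endpoint above can be consumed directly. The sketch below groups tools by their quality tier; note that the JSON field names (`name`, `score`, `tier`) are assumptions based on the columns shown in the table, since the response schema is not documented here, and the sample data is a hypothetical two-entry excerpt.

```python
import json

# Hypothetical excerpt of the API's JSON response; the field names
# ("name", "score", "tier") are assumed, not documented on this page.
SAMPLE = json.loads("""[
  {"name": "gicheonkang/sglkt-visdial", "score": 29, "tier": "Experimental"},
  {"name": "TheShadow29/awesome-grounding", "score": 40, "tier": "Emerging"}
]""")

def by_tier(tools):
    """Group tool names by quality tier, highest score first."""
    groups = {}
    for tool in sorted(tools, key=lambda t: t["score"], reverse=True):
        groups.setdefault(tool["tier"], []).append(tool["name"])
    return groups

groups = by_tier(SAMPLE)
```

In a live setting you would replace `SAMPLE` with the parsed body of the `curl` request shown above.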

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | TheShadow29/awesome-grounding | awesome grounding: A curated list of research papers in visual grounding | 40 | Emerging |
| 2 | microsoft/XPretrain | Multi-modality pre-training | 34 | Emerging |
| 3 | TheShadow29/zsgnet-pytorch | Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects... | 34 | Emerging |
| 4 | TheShadow29/VidSitu | [CVPR21] Visual Semantic Role Labeling for Video Understanding... | 31 | Emerging |
| 5 | zeyofu/BLINK_Benchmark | This repo contains evaluation code for the paper "BLINK: Multimodal Large... | 30 | Emerging |
| 6 | qaixerabbas/awesome-multimodal-learning-with-imperfect-data | Multimodal Representation Learning under Imperfect Data Conditions: A Survey | 30 | Emerging |
| 7 | gicheonkang/sglkt-visdial | 🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with... | 29 | Experimental |
| 8 | MiuLab/DuaLUG | The implementation of the papers on dual learning of natural language... | 27 | Experimental |
| 9 | princeton-nlp/XTX | [ICLR 2022 Spotlight] Multi-Stage Episodic Control for Strategic Exploration... | 27 | Experimental |
| 10 | SkalskiP/awesome-foundation-and-multimodal-models | 👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper... | 26 | Experimental |
| 11 | MichiganNLP/Scalable-VLM-Probing | Probe Vision-Language Models | 25 | Experimental |
| 12 | fork123aniket/Graph-Neural-Network-based-Visual-Question-Answering | Implementation of GNNs for Visual Question Answering task in PyTorch | 25 | Experimental |
| 13 | 1989Ryan/paragon | [ICRA 2023] Differentiable parsing and visual grounding of natural language... | 23 | Experimental |
| 14 | tim-dickey/multi-modal-neural-network | Multi-modal neural network with double-loop learning that fuses vision and... | 20 | Experimental |
| 15 | benywon/LALM | Code and resource for ACL2021 paper 'Multi-Lingual Question Generation with... | 19 | Experimental |
| 16 | workforyou786/Large-Language-Model-Research-Paper | Multimodal AI: systems that can understand and generate information across... | 19 | Experimental |
| 17 | aimagelab/JARVIS | Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large... | 18 | Experimental |
| 18 | thunlp/cost-optimal-gqa | The code for the paper "Cost-Optimal Grouped-Query Attention for... | 18 | Experimental |
| 19 | PRITHIVSAKTHIUR/Molmo2-HF-Demo | A Gradio-based demonstration for the AllenAI Molmo2-8B multimodal model,... | 18 | Experimental |
| 20 | aistairc/VDAct | A Video-grounded Dialogue Dataset and Metric for Event-driven Activities | 15 | Experimental |
| 21 | candacelax/grounded-vision-parser | Semantic parser trained by using videos only instead of labeled logical forms | 15 | Experimental |
| 22 | psunlpgroup/MPlanner | ACL2025-Findings paper "Enhance Multimodal Consistency and Coherence for... | 14 | Experimental |
| 23 | zoppellarielena/Paper-Presentation-for-Natural-Language-Processing | This presentation, conducted for the "Natural Language Processing" course,... | 11 | Experimental |
| 24 | ChenBarryHu/TransformerVG | TransformerVG - 3D Visual Grounding with Transformers | 11 | Experimental |
| 25 | nmhongtram/gnn-surgical-understanding | Graph Reasoning for Visual Question Answering in Laparoscopic Scene Understanding | 10 | Experimental |