Multimodal Visual Grounding NLP Tools
Tools for grounding natural language in visual content (images, video, 3D scenes), including visual question answering, object localization, and cross-modal retrieval. Does NOT include general image captioning, multimodal pretraining without grounding focus, or speech-only cross-modal tasks.
This page tracks 25 multimodal visual grounding tools. The highest-rated is TheShadow29/awesome-grounding, scoring 40/100 with 1,125 stars.
Fetch the project list as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=multimodal-visual-grounding&limit=20"
```
The API is open to everyone at 100 requests/day with no key; a free key raises the limit to 1,000/day.
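The same query can be issued programmatically. A minimal Python sketch using only the standard library; note that the response field names in the commented-out section (`projects`, `name`, `score`) are assumptions, since the response schema is not documented here:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the quality-dataset query URL for a given domain/subcategory."""
    params = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE}?{params}"

url = build_url("nlp", "multimodal-visual-grounding")
print(url)

# Fetching requires network access; uncomment to run.
# The JSON field names below are assumptions about the response shape:
# with urlopen(url) as resp:
#     data = json.load(resp)
#     for item in data.get("projects", []):
#         print(item.get("name"), item.get("score"))
```

Keeping the URL construction in a helper makes it easy to swap in other tracked domains or raise the `limit` once you have an API key.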
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | TheShadow29/awesome-grounding | awesome grounding: A curated list of research papers in visual grounding | 40 | Emerging |
| 2 | microsoft/XPretrain | Multi-modality pre-training | | Emerging |
| 3 | TheShadow29/zsgnet-pytorch | Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects... | | Emerging |
| 4 | TheShadow29/VidSitu | [CVPR21] Visual Semantic Role Labeling for Video Understanding... | | Emerging |
| 5 | zeyofu/BLINK_Benchmark | This repo contains evaluation code for the paper "BLINK: Multimodal Large... | | Emerging |
| 6 | qaixerabbas/awesome-multimodal-learning-with-imperfect-data | Multimodal Representation Learning under Imperfect Data Conditions: A Survey | | Emerging |
| 7 | gicheonkang/sglkt-visdial | 🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with... | | Experimental |
| 8 | MiuLab/DuaLUG | The implementation of the papers on dual learning of natural language... | | Experimental |
| 9 | princeton-nlp/XTX | [ICLR 2022 Spotlight] Multi-Stage Episodic Control for Strategic Exploration... | | Experimental |
| 10 | SkalskiP/awesome-foundation-and-multimodal-models | 👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper... | | Experimental |
| 11 | MichiganNLP/Scalable-VLM-Probing | Probe Vision-Language Models | | Experimental |
| 12 | fork123aniket/Graph-Neural-Network-based-Visual-Question-Answering | Implementation of GNNs for Visual Question Answering task in PyTorch | | Experimental |
| 13 | 1989Ryan/paragon | [ICRA 2023] Differentiable parsing and visual grounding of natural language... | | Experimental |
| 14 | tim-dickey/multi-modal-neural-network | Multi-modal neural network with double-loop learning that fuses vision and... | | Experimental |
| 15 | benywon/LALM | code and resource for ACL2021 paper 'Multi-Lingual Question Generation with... | | Experimental |
| 16 | workforyou786/Large-Language-Model-Research-Paper | Multimodal AI: systems that can understand and generate information across... | | Experimental |
| 17 | aimagelab/JARVIS | Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large... | | Experimental |
| 18 | thunlp/cost-optimal-gqa | The code for the paper "Cost-Optimal Grouped-Query Attention for... | | Experimental |
| 19 | PRITHIVSAKTHIUR/Molmo2-HF-Demo | A Gradio-based demonstration for the AllenAI Molmo2-8B multimodal model,... | | Experimental |
| 20 | aistairc/VDAct | A Video-grounded Dialogue Dataset and Metric for Event-driven Activities | | Experimental |
| 21 | candacelax/grounded-vision-parser | Semantic parser trained by using videos only instead of labeled logical forms | | Experimental |
| 22 | psunlpgroup/MPlanner | ACL2025-Findings paper "Enhance Multimodal Consistency and Coherence for... | | Experimental |
| 23 | zoppellarielena/Paper-Presentation-for-Natural-Language-Processing | This presentation, conducted for the "Natural Language Processing" course,... | | Experimental |
| 24 | ChenBarryHu/TransformerVG | TransformerVG - 3D Visual Grounding with Transformers | | Experimental |
| 25 | nmhongtram/gnn-surgical-understanding | Graph Reasoning for Visual Question Answering in Laparoscopic Scene Understanding | | Experimental |