Vision-Language Model Frameworks

Frameworks and implementations for multimodal models that combine vision and language capabilities, including vision-language transformers, image-text generation, and visual question answering systems. Does NOT include single-modality models, general computer vision frameworks, or task-specific applications like document OCR or license plate recognition.

There are 114 vision-language model frameworks tracked. Three score above 50 (the established tier). The highest-rated is facebookresearch/mmf at 62/100 with 5,622 stars. One of the top 10 is actively maintained.

Get the ranked projects as JSON (raise `limit` to fetch all 114):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=vision-language-models&limit=20"
```

Open to everyone: 100 requests/day with no key needed. A free key raises that to 1,000/day.
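The same query can be issued programmatically. A minimal Python sketch using only the standard library; `build_url` is a helper introduced here for illustration, and the response schema is not documented on this page, so inspect the returned JSON before relying on any field names:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base endpoint for the quality dataset API (from the curl example above).
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the query URL with URL-encoded parameters."""
    params = urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE}?{params}"

# Request the full list of 114 projects.
url = build_url("ml-frameworks", "vision-language-models", limit=114)

# Uncomment to fetch (unauthenticated tier: 100 requests/day):
# with urlopen(url) as resp:
#     data = json.load(resp)
```

Because the parameters are URL-encoded by `urlencode`, the helper stays correct if a domain or subcategory name ever contains characters that need escaping.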

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | facebookresearch/mmf | A modular framework for vision & language multimodal research from Facebook... | 62 | Established |
| 2 | open-mmlab/mmpretrain | OpenMMLab Pre-training Toolbox and Benchmark | 60 | Established |
| 3 | adambielski/siamese-triplet | Siamese and triplet networks with online pair/triplet mining in PyTorch | 51 | Established |
| 4 | pliang279/awesome-multimodal-ml | Reading list for research topics in multimodal machine learning | 48 | Emerging |
| 5 | friedrichor/Awesome-Multimodal-Papers | A curated list of awesome multimodal studies | 46 | Emerging |
| 6 | mlfoundations/open_flamingo | An open-source framework for training large multimodal models | 45 | Emerging |
| 7 | HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis | Papers, code and datasets about deep learning and multi-modal learning for... | 44 | Emerging |
| 8 | KaiyangZhou/pytorch-vsumm-reinforce | Unsupervised video summarization with deep reinforcement learning (AAAI'18) | 44 | Emerging |
| 9 | jingyi0000/VLM_survey | Collection of awesome vision-language models for vision tasks | 43 | Emerging |
| 10 | kuanghuei/SCAN | PyTorch source code for "Stacked Cross Attention for Image-Text Matching" (ECCV 2018) | 43 | Emerging |
| 11 | kyegomez/HRTX | Multi-Modal Multi-Embodied Hivemind-like Iteration of RTX-2 | 43 | Emerging |
| 12 | codebyshibsankar/image_triplet_loss | Image similarity using triplet loss | 42 | Emerging |
| 13 | batra-mlp-lab/visdial | [CVPR 2017] Torch code for Visual Dialog | 42 | Emerging |
| 14 | kezhang-cs/Video-Summarization-with-LSTM | Implementation of our ECCV 2016 paper (Video Summarization with Long...) | 41 | Emerging |
| 15 | pliang279/MultiBench | [NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning | 40 | Emerging |
| 16 | vbalnt/tfeat | TFeat descriptor models for BMVC 2016 paper "Learning local feature... | 40 | Emerging |
| 17 | willxxy/awesome-mmps | Corpus of resources for multimodal machine learning with physiological... | 39 | Emerging |
| 18 | kyegomez/Med-PaLM | Towards Generalist Biomedical AI | 38 | Emerging |
| 19 | kyegomez/Fuyu | Implementation of Adept's Fuyu, an all-new multi-modality model, in PyTorch | 38 | Emerging |
| 20 | nekhtiari/image-similarity-measures | Implementation of eight evaluation metrics to... | 38 | Emerging |
| 21 | landskape-ai/triplet-attention | Official PyTorch implementation for "Rotate to Attend: Convolutional Triplet... | 37 | Emerging |
| 22 | Cloud-CV/VQA | CloudCV Visual Question Answering demo | 37 | Emerging |
| 23 | thubZ09/vision-language-model-research | Hub for researchers exploring VLMs and multimodal learning | 36 | Emerging |
| 24 | OpenBioLink/ThoughtSource | A central, open resource for data and tools related to chain-of-thought... | 36 | Emerging |
| 25 | Cadene/vqa.pytorch | Visual Question Answering in PyTorch | 36 | Emerging |
| 26 | aioz-ai/CFR_VQA | Coarse-to-Fine Reasoning for Visual Question Answering (CVPRW'22) | 35 | Emerging |
| 27 | thuiar/MIntRec | MIntRec: A New Dataset for Multimodal Intent Recognition (ACM MM 2022) | 35 | Emerging |
| 28 | maruya24/pytorch_robotics_transformer | A PyTorch re-implementation of RT-1 (Robotics Transformer) | 35 | Emerging |
| 29 | ManifoldRG/NEKO | Implementation of a GATO-style generalist multimodal model capable of image,... | 34 | Emerging |
| 30 | yuanze-lin/REVIVE | [NeurIPS 2022] Official code for REVIVE: Regional Visual Representation... | 34 | Emerging |
| 31 | abhshkdz/neural-vqa | Visual Question Answering in Torch | 34 | Emerging |
| 32 | thswodnjs3/CSTA | Official code of "CSTA: CNN-based Spatiotemporal Attention for Video... | 33 | Emerging |
| 33 | monjurulkarim/DSTA | Implementation code for the paper "A Dynamic Spatial-temporal... | 33 | Emerging |
| 34 | aioz-ai/MICCAI21_MMQ | Multiple Meta-model Quantifying for Medical Visual Question Answering (MICCAI 2021) | 33 | Emerging |
| 35 | mlbio-epfl/joint-inference | [ICLR 2025] Large (Vision) Language Models are Unsupervised In-Context Learners | 33 | Emerging |
| 36 | IBM/AdaMML | Official implementation of AdaMML (https://arxiv.org/abs/2105.05165) | 33 | Emerging |
| 37 | abhshkdz/neural-vqa-attention | Attention-based Visual Question Answering in Torch | 31 | Emerging |
| 38 | neulab/CulturalGround | Official resources for an EMNLP 2025 paper... | 31 | Emerging |
| 39 | TIGER-AI-Lab/VideoScore | Official repo for "VideoScore: Building Automatic Metrics to Simulate... | 31 | Emerging |
| 40 | subho406/OmniNet | Official PyTorch implementation of "OmniNet: A unified architecture for... | 29 | Experimental |
| 41 | williamcfrancis/Visual-Question-Answering-using-Stacked-Attention-Networks | PyTorch implementation of VQA using Stacked Attention Networks: Multimodal... | 29 | Experimental |
| 42 | zchuz/CoT-Reasoning-Survey | [ACL 2024] A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future | 29 | Experimental |
| 43 | RManLuo/MAMDR | Official code implementation for ICDE 23 paper MAMDR: A Model Agnostic... | 28 | Experimental |
| 44 | fansunqi/VideoTool | Official repository for NeurIPS'25 paper "Tool-Augmented Spatiotemporal... | 28 | Experimental |
| 45 | etornam45/vl-jepa | A VL-JEPA implementation taking direct inspiration from the original VL-JEPA paper | 28 | Experimental |
| 46 | real-stanford/semantic-abstraction | [CoRL 2022] Code for generating relevancies,... | 28 | Experimental |
| 47 | pliang279/MultiViz | [ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models | 27 | Experimental |
| 48 | invictus717/MiCo | [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale | 27 | Experimental |
| 49 | tgxs002/wikiscenes | Towers of Babel: Combining Images, Language, and 3D Geometry for Learning... | 27 | Experimental |
| 50 | pranv/ARC | Code for Attentive Recurrent Comparators | 27 | Experimental |
| 51 | nerdimite/neuro-symbolic-ai-soc | Neuro-Symbolic Visual Question Answering on Sort-of-CLEVR using PyTorch | 27 | Experimental |
| 52 | Jakobovski/decoupled-multimodal-learning | A decoupled, generative, unsupervised, multimodal neural architecture | 26 | Experimental |
| 53 | AlwaysFHao/TiM4Rec | [Neurocomputing 2025] Code for the paper "TiM4Rec: An Efficient... | 26 | Experimental |
| 54 | Skyyyy0920/MTNet | Code implementation for our paper "Learning Time Slot Preferences via... | 25 | Experimental |
| 55 | imneonizer/pytorch-triplet-loss | Birds 400-species image classification using PyTorch Metric Learning... | 25 | Experimental |
| 56 | kyegomez/AutoRT | Implementation of "AutoRT: Embodied Foundation Models for Large... | 25 | Experimental |
| 57 | tensorpix/benchmarking-cv-models | Benchmark computer vision ML models in 3 minutes | 25 | Experimental |
| 58 | Soumya-Chakraborty/Unsupervised-video-summarization-with-deep-GAN-reinforcement-learning | Unsupervised video summarization with deep (GAN) reinforcement learning | 25 | Experimental |
| 59 | Rishit-dagli/Astroformer | Official implementation of Astroformer, an ICLR... | 25 | Experimental |
| 60 | le-liang/Multimodal-Wireless | Python scripts and assets related to the Multimodal-Wireless dataset. The... | 25 | Experimental |
| 61 | kyegomez/MMCA-MGQA | Experiments around using Multi-Modal Causal Attention with Multi-Grouped... | 24 | Experimental |
| 62 | cpystan/WSI-VQA | [ECCV 2024] Official implementation of "WSI-VQA: Interpreting Whole Slide... | 23 | Experimental |
| 63 | AceCHQ/MMIQ | Evaluation code for the MM-IQ benchmark | 23 | Experimental |
| 64 | VectorInstitute/VLDBench | VLDBench: A large-scale benchmark for evaluating Vision-Language Models... | 23 | Experimental |
| 65 | ViLab-UCSD/LaGTran_ICML2024 | Code and models for the ICML 2024 paper "Tell, Don't Show!: Language... | 23 | Experimental |
| 66 | liveseongho/Awesome-Video-Language-Understanding | A survey on video and language understanding | 22 | Experimental |
| 67 | ntkhoa95/multimodal-for-vision | Vision Framework: A modular multi-agent system for computer vision tasks,... | 22 | Experimental |
| 68 | iluvn01/VFMTok | Leverage vision foundation models to transform visual data into effective... | 22 | Experimental |
| 69 | raminguyen/LLMP2 | Evaluating "Graphical Perception" with Multimodal Large Language Models | 22 | Experimental |
| 70 | Peachypie98/CBAM | CBAM: Convolutional Block Attention Module for CIFAR100 on VGG19 | 22 | Experimental |
| 71 | lilygeorgescu/MHCA | Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for... | 22 | Experimental |
| 72 | yousefkotp/Visual-Question-Answering | A lightweight deep learning model with a web application to answer... | 22 | Experimental |
| 73 | vtu81/NaiveVQA | A Visual Question Answering model implemented in MindSpore and PyTorch. The... | 21 | Experimental |
| 74 | zamaex96/Hybrid-CNN-LSTM-with-Spatial-Attention | Documents the training and evaluation of a hybrid CNN-LSTM attention... | 21 | Experimental |
| 75 | RobotiXX/multimodal-fusion-network | All the code for parsing, transforming and training... | 20 | Experimental |
| 76 | VQA-Team/Visual-Question-Answering | An Android application aimed to help the visually impaired by... | 20 | Experimental |
| 77 | schwettmann/visual-vocab | PyTorch-based tools for constructing a vocabulary of visual concepts in a GAN | 20 | Experimental |
| 78 | uakarsh/med-vqa | An approach for solving the problem of medical visual question answering | 20 | Experimental |
| 79 | kyegomez/NeVA | Open-source implementation of "NeVA: NeMo Vision and Language Assistant" | 20 | Experimental |
| 80 | kyegomez/MultiModal-ToT | Multi-modal Tree of Thoughts for DALL-E 3-like auto self-improvement | 20 | Experimental |
| 81 | rahuldevmuraleedharan/Neural-Navigator | Multi-modal Transformer that fuses vision and language to generate... | 19 | Experimental |
| 82 | yuhui-zh15/VLMClassifier | Official implementation of "Why are Visually-Grounded Language Models Bad at... | 18 | Experimental |
| 83 | naamiinepal/tunevlseg | [ACCV 2024] TuneVLSeg: Prompt Tuning Benchmark for Vision-Language... | 18 | Experimental |
| 84 | projectayre/ayre | Visual Question Answering with an added novel semantic analysis approach... | 17 | Experimental |
| 85 | fansunqi/AKeyS | Agentic Keyframe Search for Video Question Answering | 17 | Experimental |
| 86 | clear-nus/MuMMI | Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised... | 17 | Experimental |
| 87 | aaaastark/hybrid-model-with-cnn-lstm-python | Hybrid model with CNN and LSTM for the VMD dataset using Python | 16 | Experimental |
| 88 | cronenberg64/VLM-arch | Systematic benchmarking of modern vision backbones under small-data... | 16 | Experimental |
| 89 | SriramPingali/Multi-Modal-Recommendation-System | Official code for the paper "Towards developing a Multi Modal Video... | 16 | Experimental |
| 90 | rkl71/MambaRec | [CIKM 2025] Source code for "Modality Alignment with Multi-scale Bilateral... | 16 | Experimental |
| 91 | ankitsharma-tech/Image-Triplet-Loss | Image similarity using triplet loss | 16 | Experimental |
| 92 | guyyariv/vLMIG | Official PyTorch implementation of vLMIG: Improving... | 15 | Experimental |
| 93 | kyegomez/VisionLLaMA | Implementation of VisionLLaMA from the paper "VisionLLaMA: A Unified LLaMA... | 15 | Experimental |
| 94 | jesusp1234/multimodal-benchmarks | Benchmark retrieval systems across video, image, audio, and documents with... | 14 | Experimental |
| 95 | anggaumhar/dynamicvl | Benchmark multimodal large language models to enhance understanding of... | 14 | Experimental |
| 96 | anujanegi/VQA | Visual Question Answering system | 14 | Experimental |
| 97 | darkmax159159357/TypeR-models | DEPRECATED: merged into darkmax159159357/TypeR. See main repo for all... | 14 | Experimental |
| 98 | soominmyung/Pairwise_Siamese_transformer | Pairwise Preference Learning with Siamese Transformer Encoders | 14 | Experimental |
| 99 | ipoukoumondi/IWR-Bench | Evaluate LVLMs' ability to reconstruct dynamic, interactive webpages from... | 14 | Experimental |
| 100 | Soumya-Chakraborty/VL-JEPA | VL-JEPA: Joint Embedding Predictive Architecture for vision-language... | 13 | Experimental |
| 101 | RobinDong/tiny_multimodal | Tiny and simple implementation of multimodal models | 13 | Experimental |
| 102 | google/crossmodal-3600 | Crossmodal-3600 dataset | 13 | Experimental |
| 103 | holylovenia/awesome-multimodal-convai | Paper reading list for multimodal conversational AI | 12 | Experimental |
| 104 | YeLuoSuiYou/openstorypp | OpenStory++, a large-scale open-domain dataset focusing on... | 12 | Experimental |
| 105 | tristandb8/PyTorch-PaliGemma-2 | PyTorch implementation of PaliGemma 2 | 12 | Experimental |
| 106 | ved1beta/Paligemma | Vision-language model | 12 | Experimental |
| 107 | aiden200/VLM_Implementation | Implementing a video language model from scratch | 12 | Experimental |
| 108 | Hodasia/Awesome-Vision-Language-Finetune | Awesome list of vision-language prompt papers | 12 | Experimental |
| 109 | alsaniie/Image-Similarity-Index-SSIM-analysis- | In image processing, an image similarity index, also known as a similarity... | 12 | Experimental |
| 110 | lyuchenyang/Semantic-aware-VideoQA | Code for the ACL SRW 2023 paper "Semantic-aware Dynamic... | 12 | Experimental |
| 111 | Dafterfly/Quick_Vilt | A CLI and GUI for using the Vision-and-Language Transformer (ViLT) model for... | 12 | Experimental |
| 112 | MichiganNLP/wildqa | WildQA website code | 11 | Experimental |
| 113 | lyuchenyang/Efficient-VideoQA | Code for the ACL SustaiNLP 2023 paper "Is a Video worth n × n Images? A Highly... | 11 | Experimental |
| 114 | MohEsmail143/vizwiz-visual-question-answering | An implementation of the paper "Less is More", used to attempt the... | 10 | Experimental |