ML Frameworks: Vision-Language Models
Frameworks and implementations for multimodal models that combine vision and language capabilities, including vision-language transformers, image-text generation, and visual question answering systems. Does NOT include single-modality models, general computer vision frameworks, or task-specific applications like document OCR or license plate recognition.
114 vision-language model frameworks are tracked. Three score above 50 (the established tier). The highest-rated is facebookresearch/mmf at 62/100 with 5,622 stars. Only 1 of the top 10 is actively maintained.
Get the 114 tracked projects as JSON; the `limit` query parameter controls how many records a single request returns:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=vision-language-models&limit=20"
```
Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
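For programmatic use, here is a minimal Python sketch that fetches the endpoint and filters by tier. The response schema is an assumption (a JSON payload holding project records with `name`, `score`, and `tier` fields, possibly nested under a wrapper key such as `projects` or `data`); inspect the actual payload and adjust the keys before relying on it.

```python
# Minimal sketch: fetch the quality dataset and keep established-tier projects.
# ASSUMPTION: the response is JSON containing project records with "name",
# "score", and "tier" keys; the real field names may differ.
import json
import urllib.request

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=ml-frameworks&subcategory=vision-language-models&limit=20"
)

with urllib.request.urlopen(URL) as resp:
    payload = json.load(resp)

# Records may sit at the top level or under a wrapper key ("projects" and
# "data" are guesses); handle both shapes defensively.
records = payload if isinstance(payload, list) else (
    payload.get("projects") or payload.get("data") or []
)

established = [r for r in records if r.get("tier") == "Established"]
for r in sorted(established, key=lambda r: r.get("score", 0), reverse=True):
    print(f'{r.get("score", "?"):>4}  {r.get("name", "?")}')
```

Sorting by `score` client-side mirrors the ranking shown in the table below; raise `limit` in the URL if you want more than the first 20 records.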
| # | Framework | Description | Tier |
|---|---|---|---|
| 1 | facebookresearch/mmf | A modular framework for vision & language multimodal research from Facebook... | Established |
| 2 | open-mmlab/mmpretrain | OpenMMLab Pre-training Toolbox and Benchmark | Established |
| 3 | adambielski/siamese-triplet | Siamese and triplet networks with online pair/triplet mining in PyTorch | Established |
| 4 | pliang279/awesome-multimodal-ml | Reading list for research topics in multimodal machine learning | Emerging |
| 5 | friedrichor/Awesome-Multimodal-Papers | A curated list of awesome multimodal studies. | Emerging |
| 6 | mlfoundations/open_flamingo | An open-source framework for training large multimodal models. | Emerging |
| 7 | HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis | Papers, code and datasets about deep learning and multi-modal learning for... | Emerging |
| 8 | KaiyangZhou/pytorch-vsumm-reinforce | Unsupervised video summarization with deep reinforcement learning (AAAI'18) | Emerging |
| 9 | jingyi0000/VLM_survey | Collection of AWESOME vision-language models for vision tasks | Emerging |
| 10 | kuanghuei/SCAN | PyTorch source code for "Stacked Cross Attention for Image-Text Matching" (ECCV 2018) | Emerging |
| 11 | kyegomez/HRTX | Multi-Modal Multi-Embodied Hivemind-like Iteration of RTX-2 | Emerging |
| 12 | codebyshibsankar/image_triplet_loss | Image similarity using Triplet Loss | Emerging |
| 13 | batra-mlp-lab/visdial | [CVPR 2017] Torch code for Visual Dialog | Emerging |
| 14 | kezhang-cs/Video-Summarization-with-LSTM | Implementation of our ECCV 2016 paper (Video Summarization with Long... | Emerging |
| 15 | pliang279/MultiBench | [NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning | Emerging |
| 16 | vbalnt/tfeat | TFeat descriptor models for the BMVC 2016 paper "Learning local feature... | Emerging |
| 17 | willxxy/awesome-mmps | Corpus of resources for multimodal machine learning with physiological... | Emerging |
| 18 | kyegomez/Med-PaLM | Towards Generalist Biomedical AI | Emerging |
| 19 | kyegomez/Fuyu | Implementation of Adept's Fuyu, an all-new multi-modality model, in PyTorch | Emerging |
| 20 | nekhtiari/image-similarity-measures | Implementation of eight evaluation metrics to... | Emerging |
| 21 | landskape-ai/triplet-attention | Official PyTorch implementation for "Rotate to Attend: Convolutional Triplet... | Emerging |
| 22 | Cloud-CV/VQA | CloudCV Visual Question Answering Demo | Emerging |
| 23 | thubZ09/vision-language-model-research | Hub for researchers exploring VLMs and multimodal learning | Emerging |
| 24 | OpenBioLink/ThoughtSource | A central, open resource for data and tools related to chain-of-thought... | Emerging |
| 25 | Cadene/vqa.pytorch | Visual Question Answering in PyTorch | Emerging |
| 26 | aioz-ai/CFR_VQA | Coarse-to-Fine Reasoning for Visual Question Answering (CVPRW'22) | Emerging |
| 27 | thuiar/MIntRec | MIntRec: A New Dataset for Multimodal Intent Recognition (ACM MM 2022) | Emerging |
| 28 | maruya24/pytorch_robotics_transformer | A PyTorch re-implementation of RT-1 (Robotics Transformer) | Emerging |
| 29 | ManifoldRG/NEKO | Implementation of a GATO-style generalist multimodal model capable of image,... | Emerging |
| 30 | yuanze-lin/REVIVE | [NeurIPS 2022] Official code for REVIVE: Regional Visual Representation... | Emerging |
| 31 | abhshkdz/neural-vqa | Visual Question Answering in Torch | Emerging |
| 32 | thswodnjs3/CSTA | The official code of "CSTA: CNN-based Spatiotemporal Attention for Video... | Emerging |
| 33 | monjurulkarim/DSTA | Implementation code for the paper "A Dynamic Spatial-temporal... | Emerging |
| 34 | aioz-ai/MICCAI21_MMQ | Multiple Meta-model Quantifying for Medical Visual Question Answering (MICCAI 2021) | Emerging |
| 35 | mlbio-epfl/joint-inference | [ICLR 2025] Large (Vision) Language Models are Unsupervised In-Context Learners | Emerging |
| 36 | IBM/AdaMML | Official implementation of AdaMML (https://arxiv.org/abs/2105.05165). | Emerging |
| 37 | abhshkdz/neural-vqa-attention | Attention-based Visual Question Answering in Torch | Emerging |
| 38 | neulab/CulturalGround | Official resources for the EMNLP 2025 paper... | Emerging |
| 39 | TIGER-AI-Lab/VideoScore | Official repo for "VideoScore: Building Automatic Metrics to Simulate... | Emerging |
| 40 | subho406/OmniNet | Official PyTorch implementation of "OmniNet: A unified architecture for... | Experimental |
| 41 | williamcfrancis/Visual-Question-Answering-using-Stacked-Attention-Networks | PyTorch implementation of VQA using Stacked Attention Networks: Multimodal... | Experimental |
| 42 | zchuz/CoT-Reasoning-Survey | [ACL 2024] A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future | Experimental |
| 43 | RManLuo/MAMDR | Official code implementation for the ICDE 2023 paper MAMDR: A Model Agnostic... | Experimental |
| 44 | fansunqi/VideoTool | Official repository for the NeurIPS'25 paper "Tool-Augmented Spatiotemporal... | Experimental |
| 45 | etornam45/vl-jepa | This VL-JEPA implementation takes direct inspiration from the original VL-JEPA paper | Experimental |
| 46 | real-stanford/semantic-abstraction | [CoRL 2022] Code for generating relevancies,... | Experimental |
| 47 | pliang279/MultiViz | [ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models | Experimental |
| 48 | invictus717/MiCo | [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale | Experimental |
| 49 | tgxs002/wikiscenes | Towers of Babel: Combining Images, Language, and 3D Geometry for Learning... | Experimental |
| 50 | pranv/ARC | Code for Attentive Recurrent Comparators | Experimental |
| 51 | nerdimite/neuro-symbolic-ai-soc | Neuro-Symbolic Visual Question Answering on Sort-of-CLEVR using PyTorch | Experimental |
| 52 | Jakobovski/decoupled-multimodal-learning | A decoupled, generative, unsupervised, multimodal neural architecture. | Experimental |
| 53 | AlwaysFHao/TiM4Rec | [Neurocomputing 2025] Code for the paper "TiM4Rec: An Efficient... | Experimental |
| 54 | Skyyyy0920/MTNet | Code implementation for our paper "Learning Time Slot Preferences via... | Experimental |
| 55 | imneonizer/pytorch-triplet-loss | Birds 400-Species Image Classification using PyTorch Metric Learning... | Experimental |
| 56 | kyegomez/AutoRT | Implementation of AutoRT: "AutoRT: Embodied Foundation Models for Large... | Experimental |
| 57 | tensorpix/benchmarking-cv-models | Benchmark computer vision ML models in 3 minutes | Experimental |
| 58 | Soumya-Chakraborty/Unsupervised-video-summarization-with-deep-GAN-reinforcement-learning | Unsupervised video summarization with deep (GAN) reinforcement learning | Experimental |
| 59 | Rishit-dagli/Astroformer | Official implementation of Astroformer, an ICLR... | Experimental |
| 60 | le-liang/Multimodal-Wireless | Python scripts and assets related to the Multimodal-Wireless dataset. The... | Experimental |
| 61 | kyegomez/MMCA-MGQA | Experiments around using Multi-Modal Causal Attention with Multi-Grouped... | Experimental |
| 62 | cpystan/WSI-VQA | [ECCV 2024] Official implementation of "WSI-VQA: Interpreting Whole Slide... | Experimental |
| 63 | AceCHQ/MMIQ | Evaluation code for the MM-IQ benchmark. | Experimental |
| 64 | VectorInstitute/VLDBench | VLDBench: A large-scale benchmark for evaluating Vision-Language Models... | Experimental |
| 65 | ViLab-UCSD/LaGTran_ICML2024 | Code and models for the ICML 2024 paper "Tell, Don't Show!: Language... | Experimental |
| 66 | liveseongho/Awesome-Video-Language-Understanding | A survey on video and language understanding. | Experimental |
| 67 | ntkhoa95/multimodal-for-vision | Vision Framework: A modular multi-agent system for computer vision tasks,... | Experimental |
| 68 | iluvn01/VFMTok | Leverage vision foundation models to transform visual data into effective... | Experimental |
| 69 | raminguyen/LLMP2 | Evaluating "Graphical Perception" with Multimodal Large Language Models | Experimental |
| 70 | Peachypie98/CBAM | CBAM: Convolutional Block Attention Module for CIFAR100 on VGG19 | Experimental |
| 71 | lilygeorgescu/MHCA | Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for... | Experimental |
| 72 | yousefkotp/Visual-Question-Answering | A lightweight deep learning model with a web application to answer... | Experimental |
| 73 | vtu81/NaiveVQA | A Visual Question Answering model implemented in MindSpore and PyTorch. The... | Experimental |
| 74 | zamaex96/Hybrid-CNN-LSTM-with-Spatial-Attention | Documents the training and evaluation of a hybrid CNN-LSTM attention... | Experimental |
| 75 | RobotiXX/multimodal-fusion-network | Code for parsing, transforming and training... | Experimental |
| 76 | VQA-Team/Visual-Question-Answering | An Android application aimed at helping the visually impaired by... | Experimental |
| 77 | schwettmann/visual-vocab | PyTorch-based tools for constructing a vocabulary of visual concepts in a GAN. | Experimental |
| 78 | uakarsh/med-vqa | An approach for solving the problem of medical visual question answering | Experimental |
| 79 | kyegomez/NeVA | The open source implementation of "NeVA: NeMo Vision and Language Assistant" | Experimental |
| 80 | kyegomez/MultiModal-ToT | Multi-Modal Tree of Thoughts for DALLE-3-like auto self-improvement | Experimental |
| 81 | rahuldevmuraleedharan/Neural-Navigator | Multi-modal Transformer that fuses vision and language to generate... | Experimental |
| 82 | yuhui-zh15/VLMClassifier | Official implementation of "Why are Visually-Grounded Language Models Bad at... | Experimental |
| 83 | naamiinepal/tunevlseg | [ACCV 2024] TuneVLSeg: Prompt Tuning Benchmark for Vision-Language... | Experimental |
| 84 | projectayre/ayre | Visual Question Answering with an added novel semantic analysis approach... | Experimental |
| 85 | fansunqi/AKeyS | Agentic Keyframe Search for Video Question Answering | Experimental |
| 86 | clear-nus/MuMMI | Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised... | Experimental |
| 87 | aaaastark/hybrid-model-with-cnn-lstm-python | Hybrid model with CNN and LSTM for the VMD dataset using Python | Experimental |
| 88 | cronenberg64/VLM-arch | Systematic benchmarking of modern vision backbones under small-data... | Experimental |
| 89 | SriramPingali/Multi-Modal-Recommendation-System | Official code for the paper "Towards developing a Multi Modal Video... | Experimental |
| 90 | rkl71/MambaRec | [CIKM 2025] Source code for "Modality Alignment with Multi-scale Bilateral... | Experimental |
| 91 | ankitsharma-tech/Image-Triplet-Loss | Image similarity using Triplet Loss. | Experimental |
| 92 | guyyariv/vLMIG | Official PyTorch implementation of vLMIG: Improving... | Experimental |
| 93 | kyegomez/VisionLLaMA | Implementation of VisionLLaMA from the paper "VisionLLaMA: A Unified LLaMA... | Experimental |
| 94 | jesusp1234/multimodal-benchmarks | Benchmark retrieval systems across video, image, audio, and documents with... | Experimental |
| 95 | anggaumhar/dynamicvl | Benchmark multimodal large language models to enhance understanding of... | Experimental |
| 96 | anujanegi/VQA | Visual Question Answering System | Experimental |
| 97 | darkmax159159357/TypeR-models | DEPRECATED: merged into darkmax159159357/TypeR. See the main repo for all... | Experimental |
| 98 | soominmyung/Pairwise_Siamese_transformer | Pairwise Preference Learning with Siamese Transformer Encoders | Experimental |
| 99 | ipoukoumondi/IWR-Bench | Evaluate LVLMs' ability to reconstruct dynamic, interactive webpages from... | Experimental |
| 100 | Soumya-Chakraborty/VL-JEPA | VL-JEPA: Joint Embedding Predictive Architecture for vision-language... | Experimental |
| 101 | RobinDong/tiny_multimodal | Tiny and simple implementation of multimodal models | Experimental |
| 102 | google/crossmodal-3600 | Crossmodal-3600 dataset | Experimental |
| 103 | holylovenia/awesome-multimodal-convai | Paper reading list for Multimodal Conversational AI | Experimental |
| 104 | YeLuoSuiYou/openstorypp | We introduce OpenStory++, a large-scale open-domain dataset focusing on... | Experimental |
| 105 | tristandb8/PyTorch-PaliGemma-2 | PyTorch implementation of PaliGemma 2 | Experimental |
| 106 | ved1beta/Paligemma | Vision-language model | Experimental |
| 107 | aiden200/VLM_Implementation | Implementing a Video Language Model from scratch | Experimental |
| 108 | Hodasia/Awesome-Vision-Language-Finetune | Awesome List of Vision Language Prompt Papers | Experimental |
| 109 | alsaniie/Image-Similarity-Index-SSIM-analysis- | In image processing, an image similarity index, also known as a similarity... | Experimental |
| 110 | lyuchenyang/Semantic-aware-VideoQA | Code for the ACL SRW 2023 paper "Semantic-aware Dynamic... | Experimental |
| 111 | Dafterfly/Quick_Vilt | A CLI and GUI for using the Vision-and-Language Transformer (ViLT) model for... | Experimental |
| 112 | MichiganNLP/wildqa | WildQA website code | Experimental |
| 113 | lyuchenyang/Efficient-VideoQA | Code for the ACL SustaiNLP 2023 paper "Is a Video worth n × n Images? A Highly... | Experimental |
| 114 | MohEsmail143/vizwiz-visual-question-answering | An implementation of the paper "Less is More", which was used to attempt the... | Experimental |