Vision-Language Model Frameworks

Frameworks and implementations for multimodal models that combine vision and language capabilities, including vision-language transformers, image-text generation, and visual question answering systems. Does NOT include single-modality models, general computer vision frameworks, or task-specific applications like document OCR or license plate recognition.

There are 114 vision-language model frameworks tracked. Three score above 50 (the established tier). The highest-rated is facebookresearch/mmf at 62/100 with 5,622 stars. One of the top 10 is actively maintained.

Get the ranked projects as JSON (raise `limit` to fetch all 114):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=vision-language-models&limit=20"
```

Open to everyone: 100 requests/day with no key needed. A free key raises that to 1,000/day.
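The same query can be issued programmatically. A minimal Python sketch using only the standard library; `build_url` is a helper introduced here for illustration, and the response schema is not documented on this page, so inspect the returned JSON before relying on any field names:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base endpoint for the quality dataset API (from the curl example above).
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the query URL with URL-encoded parameters."""
    params = urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE}?{params}"

# Request the full list of 114 projects.
url = build_url("ml-frameworks", "vision-language-models", limit=114)

# Uncomment to fetch (unauthenticated tier: 100 requests/day):
# with urlopen(url) as resp:
#     data = json.load(resp)
```

Because the parameters are URL-encoded by `urlencode`, the helper stays correct if a domain or subcategory name ever contains characters that need escaping.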

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | facebookresearch/mmf | A modular framework for vision & language multimodal research from Facebook... | 62 | Established |
| 2 | open-mmlab/mmpretrain | OpenMMLab Pre-training Toolbox and Benchmark | 60 | Established |
| 3 | adambielski/siamese-triplet | Siamese and triplet networks with online pair/triplet mining in PyTorch | 51 | Established |
| 4 | pliang279/awesome-multimodal-ml | Reading list for research topics in multimodal machine learning | 48 | Emerging |
| 5 | friedrichor/Awesome-Multimodal-Papers | A curated list of awesome multimodal studies | 46 | Emerging |
| 6 | mlfoundations/open_flamingo | An open-source framework for training large multimodal models | 45 | Emerging |
| 7 | HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis | Papers, code and datasets about deep learning and multi-modal learning for... | 44 | Emerging |
| 8 | KaiyangZhou/pytorch-vsumm-reinforce | Unsupervised video summarization with deep reinforcement learning (AAAI'18) | 44 | Emerging |
| 9 | jingyi0000/VLM_survey | Collection of awesome vision-language models for vision tasks | 43 | Emerging |
| 10 | kuanghuei/SCAN | PyTorch source code for "Stacked Cross Attention for Image-Text Matching" (ECCV 2018) | 43 | Emerging |
| 11 | kyegomez/HRTX | Multi-Modal Multi-Embodied Hivemind-like Iteration of RTX-2 | 43 | Emerging |
| 12 | codebyshibsankar/image_triplet_loss | Image similarity using triplet loss | 42 | Emerging |
| 13 | batra-mlp-lab/visdial | [CVPR 2017] Torch code for Visual Dialog | 42 | Emerging |
| 14 | kezhang-cs/Video-Summarization-with-LSTM | Implementation of our ECCV 2016 paper (Video Summarization with Long...) | 41 | Emerging |
| 15 | pliang279/MultiBench | [NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning | 40 | Emerging |
| 16 | vbalnt/tfeat | TFeat descriptor models for BMVC 2016 paper "Learning local feature... | 40 | Emerging |
| 17 | willxxy/awesome-mmps | Corpus of resources for multimodal machine learning with physiological... | 39 | Emerging |
| 18 | kyegomez/Med-PaLM | Towards Generalist Biomedical AI | 38 | Emerging |
| 19 | kyegomez/Fuyu | Implementation of Adept's Fuyu, an all-new multi-modality model, in PyTorch | 38 | Emerging |
| 20 | nekhtiari/image-similarity-measures | Implementation of eight evaluation metrics to... | 38 | Emerging |
| 21 | landskape-ai/triplet-attention | Official PyTorch implementation for "Rotate to Attend: Convolutional Triplet... | 37 | Emerging |
| 22 | Cloud-CV/VQA | CloudCV Visual Question Answering demo | 37 | Emerging |
| 23 | thubZ09/vision-language-model-research | Hub for researchers exploring VLMs and multimodal learning | 36 | Emerging |
| 24 | OpenBioLink/ThoughtSource | A central, open resource for data and tools related to chain-of-thought... | 36 | Emerging |
| 25 | Cadene/vqa.pytorch | Visual Question Answering in PyTorch | 36 | Emerging |
| 26 | aioz-ai/CFR_VQA | Coarse-to-Fine Reasoning for Visual Question Answering (CVPRW'22) | 35 | Emerging |
| 27 | thuiar/MIntRec | MIntRec: A New Dataset for Multimodal Intent Recognition (ACM MM 2022) | 35 | Emerging |
| 28 | maruya24/pytorch_robotics_transformer | A PyTorch re-implementation of RT-1 (Robotics Transformer) | 35 | Emerging |
| 29 | ManifoldRG/NEKO | Implementation of a GATO-style generalist multimodal model capable of image,... | 34 | Emerging |
| 30 | yuanze-lin/REVIVE | [NeurIPS 2022] Official code for REVIVE: Regional Visual Representation... | 34 | Emerging |
| 31 | abhshkdz/neural-vqa | Visual Question Answering in Torch | 34 | Emerging |
| 32 | thswodnjs3/CSTA | Official code of "CSTA: CNN-based Spatiotemporal Attention for Video... | 33 | Emerging |
| 33 | monjurulkarim/DSTA | Implementation code for the paper "A Dynamic Spatial-temporal... | 33 | Emerging |
| 34 | aioz-ai/MICCAI21_MMQ | Multiple Meta-model Quantifying for Medical Visual Question Answering (MICCAI 2021) | 33 | Emerging |
| 35 | mlbio-epfl/joint-inference | [ICLR 2025] Large (Vision) Language Models are Unsupervised In-Context Learners | 33 | Emerging |
| 36 | IBM/AdaMML | Official implementation of AdaMML (https://arxiv.org/abs/2105.05165) | 33 | Emerging |
| 37 | abhshkdz/neural-vqa-attention | Attention-based Visual Question Answering in Torch | 31 | Emerging |
| 38 | neulab/CulturalGround | Official resources for an EMNLP 2025 paper... | 31 | Emerging |
| 39 | TIGER-AI-Lab/VideoScore | Official repo for "VideoScore: Building Automatic Metrics to Simulate... | 31 | Emerging |
| 40 | subho406/OmniNet | Official PyTorch implementation of "OmniNet: A unified architecture for... | 29 | Experimental |
| 41 | williamcfrancis/Visual-Question-Answering-using-Stacked-Attention-Networks | PyTorch implementation of VQA using Stacked Attention Networks: Multimodal... | 29 | Experimental |
| 42 | zchuz/CoT-Reasoning-Survey | [ACL 2024] A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future | 29 | Experimental |
| 43 | RManLuo/MAMDR | Official code implementation for ICDE 23 paper MAMDR: A Model Agnostic... | 28 | Experimental |
| 44 | fansunqi/VideoTool | Official repository for NeurIPS'25 paper "Tool-Augmented Spatiotemporal... | 28 | Experimental |
| 45 | etornam45/vl-jepa | A VL-JEPA implementation taking direct inspiration from the original VL-JEPA paper | 28 | Experimental |
| 46 | real-stanford/semantic-abstraction | [CoRL 2022] Code for generating relevancies,... | 28 | Experimental |
| 47 | pliang279/MultiViz | [ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models | 27 | Experimental |
| 48 | invictus717/MiCo | [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale | 27 | Experimental |
| 49 | tgxs002/wikiscenes | Towers of Babel: Combining Images, Language, and 3D Geometry for Learning... | 27 | Experimental |
| 50 | pranv/ARC | Code for Attentive Recurrent Comparators | 27 | Experimental |
| 51 | nerdimite/neuro-symbolic-ai-soc | Neuro-Symbolic Visual Question Answering on Sort-of-CLEVR using PyTorch | 27 | Experimental |
| 52 | Jakobovski/decoupled-multimodal-learning | A decoupled, generative, unsupervised, multimodal neural architecture | 26 | Experimental |
| 53 | AlwaysFHao/TiM4Rec | [Neurocomputing 2025] Code for the paper "TiM4Rec: An Efficient... | 26 | Experimental |
| 54 | Skyyyy0920/MTNet | Code implementation for our paper "Learning Time Slot Preferences via... | 25 | Experimental |
| 55 | imneonizer/pytorch-triplet-loss | Birds 400-species image classification using PyTorch Metric Learning... | 25 | Experimental |
| 56 | kyegomez/AutoRT | Implementation of "AutoRT: Embodied Foundation Models for Large... | 25 | Experimental |
| 57 | tensorpix/benchmarking-cv-models | Benchmark computer vision ML models in 3 minutes | 25 | Experimental |
| 58 | Soumya-Chakraborty/Unsupervised-video-summarization-with-deep-GAN-reinforcement-learning | Unsupervised video summarization with deep (GAN) reinforcement learning | 25 | Experimental |
| 59 | Rishit-dagli/Astroformer | Official implementation of Astroformer, an ICLR... | 25 | Experimental |
| 60 | le-liang/Multimodal-Wireless | Python scripts and assets related to the Multimodal-Wireless dataset. The... | 25 | Experimental |
| 61 | kyegomez/MMCA-MGQA | Experiments around using Multi-Modal Causal Attention with Multi-Grouped... | 24 | Experimental |
| 62 | cpystan/WSI-VQA | [ECCV 2024] Official implementation of "WSI-VQA: Interpreting Whole Slide... | 23 | Experimental |
| 63 | AceCHQ/MMIQ | Evaluation code for the MM-IQ benchmark | 23 | Experimental |
| 64 | VectorInstitute/VLDBench | VLDBench: A large-scale benchmark for evaluating Vision-Language Models... | 23 | Experimental |
| 65 | ViLab-UCSD/LaGTran_ICML2024 | Code and models for the ICML 2024 paper "Tell, Don't Show!: Language... | 23 | Experimental |
| 66 | liveseongho/Awesome-Video-Language-Understanding | A survey on video and language understanding | 22 | Experimental |
| 67 | ntkhoa95/multimodal-for-vision | Vision Framework: A modular multi-agent system for computer vision tasks,... | 22 | Experimental |
| 68 | iluvn01/VFMTok | Leverage vision foundation models to transform visual data into effective... | 22 | Experimental |
| 69 | raminguyen/LLMP2 | Evaluating "Graphical Perception" with Multimodal Large Language Models | 22 | Experimental |
| 70 | Peachypie98/CBAM | CBAM: Convolutional Block Attention Module for CIFAR100 on VGG19 | 22 | Experimental |
| 71 | lilygeorgescu/MHCA | Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for... | 22 | Experimental |
| 72 | yousefkotp/Visual-Question-Answering | A lightweight deep learning model with a web application to answer... | 22 | Experimental |
| 73 | vtu81/NaiveVQA | A Visual Question Answering model implemented in MindSpore and PyTorch. The... | 21 | Experimental |
| 74 | zamaex96/Hybrid-CNN-LSTM-with-Spatial-Attention | Documents the training and evaluation of a hybrid CNN-LSTM attention... | 21 | Experimental |
| 75 | RobotiXX/multimodal-fusion-network | All the code for parsing, transforming and training... | 20 | Experimental |
| 76 | VQA-Team/Visual-Question-Answering | An Android application aimed to help the visually impaired by... | 20 | Experimental |
| 77 | schwettmann/visual-vocab | PyTorch-based tools for constructing a vocabulary of visual concepts in a GAN | 20 | Experimental |
| 78 | uakarsh/med-vqa | An approach for solving the problem of medical visual question answering | 20 | Experimental |
| 79 | kyegomez/NeVA | Open-source implementation of "NeVA: NeMo Vision and Language Assistant" | 20 | Experimental |
| 80 | kyegomez/MultiModal-ToT | Multi-modal Tree of Thoughts for DALL-E 3-like auto self-improvement | 20 | Experimental |
| 81 | rahuldevmuraleedharan/Neural-Navigator | Multi-modal Transformer that fuses vision and language to generate... | 19 | Experimental |
| 82 | yuhui-zh15/VLMClassifier | Official implementation of "Why are Visually-Grounded Language Models Bad at... | 18 | Experimental |
| 83 | naamiinepal/tunevlseg | [ACCV 2024] TuneVLSeg: Prompt Tuning Benchmark for Vision-Language... | 18 | Experimental |
| 84 | projectayre/ayre | Visual Question Answering with an added novel semantic analysis approach... | 17 | Experimental |
| 85 | fansunqi/AKeyS | Agentic Keyframe Search for Video Question Answering | 17 | Experimental |
| 86 | clear-nus/MuMMI | Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised... | 17 | Experimental |
| 87 | aaaastark/hybrid-model-with-cnn-lstm-python | Hybrid model with CNN and LSTM for the VMD dataset using Python | 16 | Experimental |
| 88 | cronenberg64/VLM-arch | Systematic benchmarking of modern vision backbones under small-data... | 16 | Experimental |
| 89 | SriramPingali/Multi-Modal-Recommendation-System | Official code for the paper "Towards developing a Multi Modal Video... | 16 | Experimental |
| 90 | rkl71/MambaRec | [CIKM 2025] Source code for "Modality Alignment with Multi-scale Bilateral... | 16 | Experimental |
| 91 | ankitsharma-tech/Image-Triplet-Loss | Image similarity using triplet loss | 16 | Experimental |
| 92 | guyyariv/vLMIG | Official PyTorch implementation of vLMIG: Improving... | 15 | Experimental |
| 93 | kyegomez/VisionLLaMA | Implementation of VisionLLaMA from the paper "VisionLLaMA: A Unified LLaMA... | 15 | Experimental |
| 94 | jesusp1234/multimodal-benchmarks | Benchmark retrieval systems across video, image, audio, and documents with... | 14 | Experimental |
| 95 | anggaumhar/dynamicvl | Benchmark multimodal large language models to enhance understanding of... | 14 | Experimental |
| 96 | anujanegi/VQA | Visual Question Answering system | 14 | Experimental |
| 97 | darkmax159159357/TypeR-models | DEPRECATED: merged into darkmax159159357/TypeR. See main repo for all... | 14 | Experimental |
| 98 | soominmyung/Pairwise_Siamese_transformer | Pairwise Preference Learning with Siamese Transformer Encoders | 14 | Experimental |
| 99 | ipoukoumondi/IWR-Bench | Evaluate LVLMs' ability to reconstruct dynamic, interactive webpages from... | 14 | Experimental |
| 100 | Soumya-Chakraborty/VL-JEPA | VL-JEPA: Joint Embedding Predictive Architecture for vision-language... | 13 | Experimental |
| 101 | RobinDong/tiny_multimodal | Tiny and simple implementation of multimodal models | 13 | Experimental |
| 102 | google/crossmodal-3600 | Crossmodal-3600 dataset | 13 | Experimental |
| 103 | holylovenia/awesome-multimodal-convai | Paper reading list for multimodal conversational AI | 12 | Experimental |
| 104 | YeLuoSuiYou/openstorypp | OpenStory++, a large-scale open-domain dataset focusing on... | 12 | Experimental |
| 105 | tristandb8/PyTorch-PaliGemma-2 | PyTorch implementation of PaliGemma 2 | 12 | Experimental |
| 106 | ved1beta/Paligemma | Vision-language model | 12 | Experimental |
| 107 | aiden200/VLM_Implementation | Implementing a video language model from scratch | 12 | Experimental |
| 108 | Hodasia/Awesome-Vision-Language-Finetune | Awesome list of vision-language prompt papers | 12 | Experimental |
| 109 | alsaniie/Image-Similarity-Index-SSIM-analysis- | In image processing, an image similarity index, also known as a similarity... | 12 | Experimental |
| 110 | lyuchenyang/Semantic-aware-VideoQA | Code for the ACL SRW 2023 paper "Semantic-aware Dynamic... | 12 | Experimental |
| 111 | Dafterfly/Quick_Vilt | A CLI and GUI for using the Vision-and-Language Transformer (ViLT) model for... | 12 | Experimental |
| 112 | MichiganNLP/wildqa | WildQA website code | 11 | Experimental |
| 113 | lyuchenyang/Efficient-VideoQA | Code for the ACL SustaiNLP 2023 paper "Is a Video worth n × n Images? A Highly... | 11 | Experimental |
| 114 | MohEsmail143/vizwiz-visual-question-answering | An implementation of the paper "Less is More", used to attempt the... | 10 | Experimental |