Vision Language Models
Tools and implementations for multimodal AI models that combine vision and language processing for tasks such as visual question answering (VQA), image captioning, and visual reasoning. Does NOT include general multimodal fusion, text-to-image generation, or single-modality models.
This collection tracks 56 vision-language models. The highest-rated is kyegomez/RT-X, scoring 47/100 with 237 stars.
Get all 56 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=vision-language-models&limit=56"
```

The endpoint is open to everyone at 100 requests/day with no key required; a free key raises that to 1,000 requests/day.
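If you are scripting against the endpoint rather than using curl, a minimal Python sketch of the same request is below. The URL and query parameters are taken from the command above; the shape of the JSON response (assumed here to be a list of project records) is not documented on this page, so inspect the raw payload before depending on specific fields.

```python
import requests

API_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def fetch_vlm_projects(limit: int = 56):
    """Fetch the tracked vision-language projects from the quality API.

    Mirrors the curl command above. The response shape (a list of
    project records) is an assumption, not documented API behavior.
    """
    params = {
        "domain": "transformers",
        "subcategory": "vision-language-models",
        "limit": limit,
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()  # surfaces HTTP errors, e.g. rate limiting
    return resp.json()

if __name__ == "__main__":
    projects = fetch_vlm_projects()
    print(f"Fetched {len(projects)} records")
```

At the anonymous 100 requests/day limit, one call per refresh is plenty; `raise_for_status()` makes a rate-limit rejection fail loudly instead of silently returning an error body.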
| # | Model | Description | Score | Tier |
|---|---|---|---|---|
| 1 | kyegomez/RT-X | Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open... | 47 | Emerging |
| 2 | kyegomez/PALI3 | Implementation of PALI3 from the paper PALI-3 VISION LANGUAGE MODELS:... | | Emerging |
| 3 | chuanyangjin/MMToM-QA | [🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind... | | Emerging |
| 4 | kyegomez/PALM-E | Implementation of "PaLM-E: An Embodied Multimodal Language Model" | | Emerging |
| 5 | ahmetkumass/yolo-gen | Train YOLO + VLM with one command. Auto-generate vision-language training... | | Emerging |
| 6 | Muennighoff/vilio | 🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle | | Emerging |
| 7 | lyuchenyang/Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text... | | Emerging |
| 8 | kyegomez/RT-2 | Democratization of RT-2 "RT-2: New model translates vision and language into action" | | Emerging |
| 9 | kyegomez/qformer | Implementation of Qformer from BLIP2 in Zeta Lego blocks. | | Emerging |
| 10 | princeton-nlp/CharXiv | [NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in... | | Emerging |
| 11 | kyegomez/MGQA | The open source implementation of the multi grouped query attention by the... | | Emerging |
| 12 | kyegomez/MM1 | PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from... | | Emerging |
| 13 | kyegomez/SSM-As-VLM-Bridge | An exploration into leveraging SSM's as Bridge/Adapter Layers for VLM | | Emerging |
| 14 | alantess/gtrxl-torch | Gated Transformer Model for Computer Vision | | Emerging |
| 15 | amazon-science/crossmodal-contrastive-learning | CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video... | | Emerging |
| 16 | SuyogKamble/simpleVLM | building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2... | | Emerging |
| 17 | DestroyerDarkNess/fastvlm-webgpu | Real-time video captioning powered by FastVLM | | Emerging |
| 18 | kyegomez/PALI | Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model" | | Experimental |
| 19 | SCZwangxiao/RTQ-MM2023 | ACM Multimedia 2023 (Oral) - RTQ: Rethinking Video-language Understanding... | | Experimental |
| 20 | deepmancer/vlm-toolbox | Vision-Language Models Toolbox: Your all-in-one solution for multimodal... | | Experimental |
| 21 | ziqipang/RandAR | [CVPR 2025 (Oral)] Open implementation of "RandAR" | | Experimental |
| 22 | logic-OT/BobVLM | BobVLM – A 1.5B multimodal model built from scratch and pre-trained on a... | | Experimental |
| 23 | YeonwooSung/vision-search | Image search engine | | Experimental |
| 24 | kyegomez/MobileVLM | Implementation of the LDP module block in PyTorch and Zeta from the paper:... | | Experimental |
| 25 | zerovl/ZeroVL | [ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources | | Experimental |
| 26 | kyegomez/MMCA | The open source community's implementation of the all-new Multi-Modal Causal... | | Experimental |
| 27 | ola-krutrim/Chitrarth | Chitrarth: Bridging Vision and Language for a Billion People | | Experimental |
| 28 | Skyline-9/Visionary-Vids | Multi-modal transformer approach for natural language query based joint... | | Experimental |
| 29 | HLTCHKUST/VG-GPLMs | The code repository for EMNLP 2021 paper "Vision Guided Generative... | | Experimental |
| 30 | zalkklop/LVSM | Official code for "LVSM: A Large View Synthesis Model with Minimal 3D... | | Experimental |
| 31 | krohling/nl-act | Integrating Natural Language Instructions into the Action Chunking... | | Experimental |
| 32 | eltoto1219/vltk | A toolkit for vision-language processing to support the increasing... | | Experimental |
| 33 | ChartMimic/ChartMimic | [ICLR 2025] ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability... | | Experimental |
| 34 | declare-lab/MM-Align | [EMNLP 2022] This repository contains the official implementation of the... | | Experimental |
| 35 | kaylode/vqa-transformer | Visual Question Answering using Transformer and Bottom-Up attention.... | | Experimental |
| 36 | vonexel/smog | Pytorch implementation of Semantic Motion Generation - 3D-motion synthesis... | | Experimental |
| 37 | o-messai/fastVLM | An implementation of FastVLM/LLaVA or any llm/vlm model using FastAPI... | | Experimental |
| 38 | kyegomez/MultiModalCrossAttn | The open source implementation of the cross attention mechanism from the... | | Experimental |
| 39 | baohuyvanba/Vision-Zephyr | Vision-Zephyr: a multimodal LLM for Visual Commonsense Reasoning—CLIP-ViT +... | | Experimental |
| 40 | Victorwz/VaLM | VaLM: Visually-augmented Language Modeling. ICLR 2023. | | Experimental |
| 41 | AIDC-AI/Wings | The code repository for "Wings: Learning Multimodal LLMs without Text-only... | | Experimental |
| 42 | shreydan/VLM-OD | experimental: finetune smolVLM on COCO (without any special | | Experimental |
| 43 | wklee610/VLM-Model-fastapi | A reusable FastAPI module for serving and integrating Vision-Language Models (VLM) | | Experimental |
| 44 | TheMasterOfDisasters/SmolVLM | SmolVLM WebUI & API – Easy-to-Run Vision-Language Model | | Experimental |
| 45 | E1ims/math-vlm-finetune-pipeline | 📐 Transcribe handwritten math into accurate LaTeX using a modular... | | Experimental |
| 46 | buhsnn/Vision-Language-Model | Vision-language model combining a ResNet18 vision encoder with a GPT-2... | | Experimental |
| 47 | MaxLSB/mini-paligemma2 | Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch | | Experimental |
| 48 | PRITHIVSAKTHIUR/Doc-VLMs-exp | An experimental document-focused Vision-Language Model application that... | | Experimental |
| 49 | michelecafagna26/VinVL | Original VinVL (and Oscar) repo with API designed for an easy inference | | Experimental |
| 50 | telota/imagines-nummorum-vlm-data-extraction | A computer vision system for automated analysis of index cards from a... | | Experimental |
| 51 | XavierSpycy/CAT-ImageTextIntegrator | An innovative deep learning framework leveraging the CAT (Convolutions,... | | Experimental |
| 52 | Soheil-jafari/Language-Guided-Endoscopy-Localization | Open-vocabulary temporal localization in endoscopic video with... | | Experimental |
| 53 | orshkuri/vqa-qformer-comparison | A benchmark and analysis of QFormer, Cross Attention, and Concat models for... | | Experimental |
| 54 | tejas-54/Visual-Search-Engine-Using-VLM | Visual Search Engine using VLM (Vision-Language Model) A... | | Experimental |
| 55 | Hardhik-Poosa/Drone_Swarm | AI-powered drone swarm simulator that converts images into optimized 2D and... | | Experimental |
| 56 | ab3llini/Transformer-VQA | Transformer-based VQA system capable of generating unconstrained, open-ended... | | Experimental |
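The Tier column takes two values on this page: Emerging (ranks 1-17) and Experimental (ranks 18-56). If the API returns tier information per record, a grouping pass like the sketch below can split the list. The `name` and `tier` field names are assumptions about the payload, and the sample records are copied from the table above.

```python
from collections import defaultdict

# Sample records copied from the table; the "name" and "tier" field
# names are assumptions about what the API actually returns.
SAMPLE = [
    {"name": "kyegomez/RT-X", "tier": "Emerging"},
    {"name": "chuanyangjin/MMToM-QA", "tier": "Emerging"},
    {"name": "kyegomez/PALI", "tier": "Experimental"},
]

def group_by_tier(projects):
    """Bucket project records by their quality tier."""
    tiers = defaultdict(list)
    for project in projects:
        tiers[project.get("tier", "Unknown")].append(project.get("name", "?"))
    return dict(tiers)

if __name__ == "__main__":
    for tier, names in group_by_tier(SAMPLE).items():
        print(f"{tier}: {len(names)} project(s)")
```

Run against the full API response instead of `SAMPLE`, this should report 17 Emerging and 39 Experimental projects, matching the table.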