Vision Language Models

Tools and implementations for multimodal AI models that combine vision and language processing for tasks like VQA, image captioning, and visual reasoning. Does NOT include general multimodal fusion, text-to-image generation, or single-modality models.

This list tracks 56 vision-language models. The highest-rated is kyegomez/RT-X, scoring 47/100 with 237 stars.

Get all 56 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=vision-language-models&limit=56"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
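The same query can be made from Python with only the standard library. This is a minimal sketch: the endpoint and query parameters come from the curl example above, while the shape of the JSON response (and any pagination parameters beyond `limit`) is an assumption, not documented here.

```python
# Minimal sketch of a client for the dataset endpoint shown above.
# Endpoint and query parameters are from this page; the response
# schema is NOT documented here and is an assumption.
import json
import urllib.parse
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 56) -> str:
    """Assemble the query URL with properly encoded parameters."""
    params = urllib.parse.urlencode({
        "domain": domain,
        "subcategory": subcategory,
        "limit": limit,
    })
    return f"{BASE}?{params}"


def fetch_projects(url: str):
    """Fetch and decode the JSON payload (no key: 100 requests/day)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


url = build_url("transformers", "vision-language-models")
print(url)
# data = fetch_projects(url)  # uncomment to hit the live API
```

Using `urllib.parse.urlencode` keeps the query string correct even if a parameter value ever contains characters that need escaping.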

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | kyegomez/RT-X | Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open... | 47 | Emerging |
| 2 | kyegomez/PALI3 | Implementation of PALI3 from the paper PALI-3 VISION LANGUAGE MODELS:... | 44 | Emerging |
| 3 | chuanyangjin/MMToM-QA | [🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind... | 40 | Emerging |
| 4 | kyegomez/PALM-E | Implementation of "PaLM-E: An Embodied Multimodal Language Model" | 38 | Emerging |
| 5 | ahmetkumass/yolo-gen | Train YOLO + VLM with one command. Auto-generate vision-language training... | 38 | Emerging |
| 6 | Muennighoff/vilio | 🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle | 38 | Emerging |
| 7 | lyuchenyang/Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text... | 38 | Emerging |
| 8 | kyegomez/RT-2 | Democratization of RT-2 "RT-2: New model translates vision and language into action" | 38 | Emerging |
| 9 | kyegomez/qformer | Implementation of Qformer from BLIP2 in Zeta Lego blocks. | 35 | Emerging |
| 10 | princeton-nlp/CharXiv | [NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in... | 34 | Emerging |
| 11 | kyegomez/MGQA | The open source implementation of the multi grouped query attention by the... | 34 | Emerging |
| 12 | kyegomez/MM1 | PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from... | 33 | Emerging |
| 13 | kyegomez/SSM-As-VLM-Bridge | An exploration into leveraging SSM's as Bridge/Adapter Layers for VLM | 33 | Emerging |
| 14 | alantess/gtrxl-torch | Gated Transformer Model for Computer Vision | 33 | Emerging |
| 15 | amazon-science/crossmodal-contrastive-learning | CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video... | 32 | Emerging |
| 16 | SuyogKamble/simpleVLM | building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2... | 30 | Emerging |
| 17 | DestroyerDarkNess/fastvlm-webgpu | Real-time video captioning powered by FastVLM | 30 | Emerging |
| 18 | kyegomez/PALI | Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model" | 29 | Experimental |
| 19 | SCZwangxiao/RTQ-MM2023 | ACM Multimedia 2023 (Oral) - RTQ: Rethinking Video-language Understanding... | 28 | Experimental |
| 20 | deepmancer/vlm-toolbox | Vision-Language Models Toolbox: Your all-in-one solution for multimodal... | 28 | Experimental |
| 21 | ziqipang/RandAR | [CVPR 2025 (Oral)] Open implementation of "RandAR" | 28 | Experimental |
| 22 | logic-OT/BobVLM | BobVLM – A 1.5B multimodal model built from scratch and pre-trained on a... | 28 | Experimental |
| 23 | YeonwooSung/vision-search | Image search engine | 27 | Experimental |
| 24 | kyegomez/MobileVLM | Implementation of the LDP module block in PyTorch and Zeta from the paper:... | 27 | Experimental |
| 25 | zerovl/ZeroVL | [ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources | 27 | Experimental |
| 26 | kyegomez/MMCA | The open source community's implementation of the all-new Multi-Modal Causal... | 26 | Experimental |
| 27 | ola-krutrim/Chitrarth | Chitrarth: Bridging Vision and Language for a Billion People | 25 | Experimental |
| 28 | Skyline-9/Visionary-Vids | Multi-modal transformer approach for natural language query based joint... | 24 | Experimental |
| 29 | HLTCHKUST/VG-GPLMs | The code repository for EMNLP 2021 paper "Vision Guided Generative... | 24 | Experimental |
| 30 | zalkklop/LVSM | Official code for "LVSM: A Large View Synthesis Model with Minimal 3D... | 23 | Experimental |
| 31 | krohling/nl-act | Integrating Natural Language Instructions into the Action Chunking... | 22 | Experimental |
| 32 | eltoto1219/vltk | A toolkit for vision-language processing to support the increasing... | 22 | Experimental |
| 33 | ChartMimic/ChartMimic | [ICLR 2025] ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability... | 22 | Experimental |
| 34 | declare-lab/MM-Align | [EMNLP 2022] This repository contains the official implementation of the... | 22 | Experimental |
| 35 | kaylode/vqa-transformer | Visual Question Answering using Transformer and Bottom-Up attention.... | 21 | Experimental |
| 36 | vonexel/smog | Pytorch implementation of Semantic Motion Generation - 3D-motion synthesis... | 21 | Experimental |
| 37 | o-messai/fastVLM | An implementation of FastVLM/LLaVA or any llm/vlm model using FastAPI... | 19 | Experimental |
| 38 | kyegomez/MultiModalCrossAttn | The open source implementation of the cross attention mechanism from the... | 19 | Experimental |
| 39 | baohuyvanba/Vision-Zephyr | Vision-Zephyr: a multimodal LLM for Visual Commonsense Reasoning (CLIP-ViT +...) | 17 | Experimental |
| 40 | Victorwz/VaLM | VaLM: Visually-augmented Language Modeling. ICLR 2023. | 16 | Experimental |
| 41 | AIDC-AI/Wings | The code repository for "Wings: Learning Multimodal LLMs without Text-only... | 16 | Experimental |
| 42 | shreydan/VLM-OD | experimental: finetune smolVLM on COCO (without any special tokens) | 16 | Experimental |
| 43 | wklee610/VLM-Model-fastapi | A reusable FastAPI module for serving and integrating Vision-Language Models (VLM) | 16 | Experimental |
| 44 | TheMasterOfDisasters/SmolVLM | SmolVLM WebUI & API – Easy-to-Run Vision-Language Model | 16 | Experimental |
| 45 | E1ims/math-vlm-finetune-pipeline | 📐 Transcribe handwritten math into accurate LaTeX using a modular... | 16 | Experimental |
| 46 | buhsnn/Vision-Language-Model | Vision-language model combining a ResNet18 vision encoder with a GPT-2... | 15 | Experimental |
| 47 | MaxLSB/mini-paligemma2 | Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch | 14 | Experimental |
| 48 | PRITHIVSAKTHIUR/Doc-VLMs-exp | An experimental document-focused Vision-Language Model application that... | 14 | Experimental |
| 49 | michelecafagna26/VinVL | Original VinVL (and Oscar) repo with API designed for an easy inference | 13 | Experimental |
| 50 | telota/imagines-nummorum-vlm-data-extraction | A computer vision system for automated analysis of index cards from a... | 13 | Experimental |
| 51 | XavierSpycy/CAT-ImageTextIntegrator | An innovative deep learning framework leveraging the CAT (Convolutions,... | 12 | Experimental |
| 52 | Soheil-jafari/Language-Guided-Endoscopy-Localization | Open-vocabulary temporal localization in endoscopic video with... | 12 | Experimental |
| 53 | orshkuri/vqa-qformer-comparison | A benchmark and analysis of QFormer, Cross Attention, and Concat models for... | 12 | Experimental |
| 54 | tejas-54/Visual-Search-Engine-Using-VLM | Visual Search Engine using VLM (Vision-Language Model) A... | 11 | Experimental |
| 55 | Hardhik-Poosa/Drone_Swarm | AI-powered drone swarm simulator that converts images into optimized 2D and... | 11 | Experimental |
| 56 | ab3llini/Transformer-VQA | Transformer-based VQA system capable of generating unconstrained, open-ended... | 10 | Experimental |