Image Captioning Transformer Models

Tools for generating textual descriptions from images and videos using transformer-based encoder-decoder architectures. Includes image-to-text, video captioning, and dense captioning systems. Does NOT include general vision-language models for other tasks (VQA, retrieval), text-to-image generation, or vision-only feature extraction.

There are 28 image captioning transformer models tracked. The highest-rated is zarzouram/image_captioning_with_transformers, scoring 38/100 with 68 stars.

Get all 28 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=image-captioning-transformers&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
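The same request can be made from a script. The following is a minimal Python sketch, not an official client: the endpoint and the query parameters (`domain`, `subcategory`, `limit`) are copied from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The response schema is not documented here, so the sketch returns the decoded JSON as-is. Note that the example URL uses `limit=20`; whether a larger `limit` returns all 28 projects in one call is an assumption you would need to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example; everything else is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="transformers",
              subcategory="image-captioning-transformers",
              limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urlencode({"domain": domain,
                       "subcategory": subcategory,
                       "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Send a keyless GET request and return the decoded JSON payload.

    The payload structure is not assumed; inspect it before relying on
    any particular field.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` issues one request against the keyless quota of 100 requests/day, so cache the result locally rather than re-fetching it in a loop.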

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | zarzouram/image_captioning_with_transformers | PyTorch implementation of image captioning using transformer-based model. | 38 | Emerging |
| 2 | rese1f/aurora | [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a... | 36 | Emerging |
| 3 | senadkurtisi/pytorch-image-captioning | Transformer & CNN Image Captioning model in PyTorch. | 35 | Emerging |
| 4 | tanishqgautam/Image-Captioning | Implemented 3 different architectures to tackle the Image Caption problem,... | 33 | Emerging |
| 5 | ilya16/deephumor | DeepHumor: Image-based Meme Generation using Deep Learning | 27 | Experimental |
| 6 | Hamtech-ai/Persian-Image-Captioning | A Persian Image Captioning model based on Vision Encoder Decoder Models of... | 27 | Experimental |
| 7 | tojiboyevf/image_captioning | Deep Learning Final project 2022 | 27 | Experimental |
| 8 | slSeanWU/beats-conformer-bart-audio-captioner | PyTorch implementation of the ICASSP-24 paper: "Improving Audio Captioning... | 26 | Experimental |
| 9 | shreydan/VisionGPT2 | Combining ViT and GPT-2 for image captioning. Trained on MS-COCO. The model... | 23 | Experimental |
| 10 | abhijitpal1247/image-mix-with-controlnet | A sample project to test out the features of streamlit. Provides a way to... | 20 | Experimental |
| 11 | farukalamai/background-removal-birefnet | Background Removal Application using BiRefNet | 18 | Experimental |
| 12 | Technolog796/image_captioning | Building a Russian-language model for image captioning | 17 | Experimental |
| 13 | Devnetly/image-captioning | Image captioning model & application based on transformers. | 17 | Experimental |
| 14 | nateraw/discord-image-captioning-bot | A Discord bot for captioning images | 17 | Experimental |
| 15 | PRITHIVSAKTHIUR/Florence-2-Image-Caption | This application utilizes the powerful Florence-2 vision-language model from... | 17 | Experimental |
| 16 | therrshan/image-captioning | Comparative analysis of image captioning model using RNN, BiLSTM and... | 16 | Experimental |
| 17 | vishaln15/roco-image-captioning | Enhanced Image Captioning on ROCO Multimodal dataset using step-by-step distillation | 16 | Experimental |
| 18 | AHMEDSANA/Image-Captioning-with-ViT-and-BERT | A concise image-captioning pipeline that fine-tunes a ViT encoder with a... | 12 | Experimental |
| 19 | anto18671/image-to-dense-caption | Generate vivid, human-like captions for portrait images using the... | 12 | Experimental |
| 20 | karroge10/Loomi-Clothing-Detection-API | AI clothing detection API with segmentation, background removal, and color... | 12 | Experimental |
| 21 | theSohamTUmbare/DETR_powered_Image_Captioning | The excellent Image captioning model using the DETR inspired architecture | 12 | Experimental |
| 22 | Merterm/COSMic | Public repo for the paper: "COSMic: A Coherence-Aware Generation Metric for... | 12 | Experimental |
| 23 | jshwanth/image-captioning | Error-centric comparison of CNN-LSTM, attention-based, and transformer... | 12 | Experimental |
| 24 | Mahmood-Anaam/violet | Violet: A Vision-Language model for generating Arabic image captions using a... | 11 | Experimental |
| 25 | Riya-l209/ImageCaptioning_Segmentation | AI-powered Image Captioning & Segmentation \| ViT-GPT2 + Mask R-CNN \|... | 11 | Experimental |
| 26 | sharpsalt/Captionforge-Multimodal-Image-Captioning-System | This PyTorch-based image captioning model uses ResNet-50 encoder and... | 11 | Experimental |
| 27 | Akhan521/Snaption | 📸 My first deep dive into multi-modal ML! Built an end-to-end image... | 11 | Experimental |
| 28 | suryanshgupta9933/Scene-Script | An image to text model/pipeline using VIT and Transformers and deployment... | 11 | Experimental |