Image Captioning Transformers (Transformer Models)

Tools for generating textual descriptions from images and videos using transformer-based encoder-decoder architectures. Includes image-to-text, video captioning, and dense captioning systems. Does NOT include general vision-language models for other tasks (VQA, retrieval), text-to-image generation, or vision-only feature extraction.

There are 28 image captioning transformer models tracked. The highest-rated is zarzouram/image_captioning_with_transformers, which scores 38/100 and has 68 stars.

Get all 28 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=image-captioning-transformers&limit=20"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
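
For scripted use, the same request can be issued from Python. The following is a minimal sketch that assumes the endpoint returns a JSON body and that `limit` caps the number of rows returned. Since the URL above uses `limit=20`, a larger value such as 28 would presumably be needed to retrieve every project, but that behaviour is not confirmed here. The names `build_url` and `fetch_projects` are illustrative and not part of the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and query parameters are taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL from the curl example, with an adjustable limit."""
    params = {
        "domain": "transformers",
        "subcategory": "image-captioning-transformers",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the response.

    Assumption: the body is JSON. Its exact shape is not documented on
    this page, so the decoded object is returned unchanged.
    """
    with urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects(limit=28)` counts against the daily request quota described above, so it makes sense to cache the result locally rather than re-fetch on every run.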

#	Model	Score	Tier	Stars	Language
1	zarzouram/image_captioning_with_transformers Pytorch implementation of image captioning using transformer-based model.	38	Emerging	68	Jupyter Notebook
2	rese1f/aurora [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a...	36	Emerging	139	Python
3	senadkurtisi/pytorch-image-captioning Transformer & CNN Image Captioning model in PyTorch.	35	Emerging	44	Python
4	tanishqgautam/Image-Captioning Implemented 3 different architectures to tackle the Image Caption problem,...	33	Emerging	40	Jupyter Notebook
5	ilya16/deephumor DeepHumor: Image-based Meme Generation using Deep Learning	27	Experimental	34	Jupyter Notebook
6	Hamtech-ai/Persian-Image-Captioning A Persian Image Captioning model based on Vision Encoder Decoder Models of...	27	Experimental	20	Jupyter Notebook
7	tojiboyevf/image_captioning Deep Learning Final project 2022	27	Experimental	4	Python
8	slSeanWU/beats-conformer-bart-audio-captioner PyTorch implementation of the ICASSP-24 paper: "Improving Audio Captioning...	26	Experimental	39	Jupyter Notebook
9	shreydan/VisionGPT2 Combining ViT and GPT-2 for image captioning. Trained on MS-COCO. The model...	23	Experimental	49	Jupyter Notebook
10	abhijitpal1247/image-mix-with-controlnet A sample project to test out the features of streamlit. Provides a way to...	20	Experimental	7	Jupyter Notebook
11	farukalamai/background-removal-birefnet Background Removal Application using BiRefNet	18	Experimental	3	JavaScript
12	Technolog796/image_captioning Building a Russian-language model for image captioning	17	Experimental	5	Jupyter Notebook
13	Devnetly/image-captioning Image captioning model & application based on transformers.	17	Experimental	5	Jupyter Notebook
14	nateraw/discord-image-captioning-bot A Discord bot for captioning images	17	Experimental	5	Python
15	PRITHIVSAKTHIUR/Florence-2-Image-Caption This application utilizes the powerful Florence-2 vision-language model from...	17	Experimental	6	Python
16	therrshan/image-captioning Comparative analysis of image captioning model using RNN, BiLSTM and...	16	Experimental	4	Python
17	vishaln15/roco-image-captioning Enhanced Image Captioning on ROCO Multimodal dataset using step-by-step distillation	16	Experimental	1	Jupyter Notebook
18	AHMEDSANA/Image-Captioning-with-ViT-and-BERT A concise image-captioning pipeline that fine-tunes a ViT encoder with a...	12	Experimental	1	Jupyter Notebook
19	anto18671/image-to-dense-caption Generate vivid, human-like captions for portrait images using the...	12	Experimental	1	Python
20	karroge10/Loomi-Clothing-Detection-API AI clothing detection API with segmentation, background removal, and color...	12	Experimental	1	Python
21	theSohamTUmbare/DETR_powered_Image_Captioning The excellent Image captioning model using the DETR inspired architecture	12	Experimental	1	Python
22	Merterm/COSMic Public repo for the paper: "COSMic: A Coherence-Aware Generation Metric for...	12	Experimental	4	Python
23	jshwanth/image-captioning Error-centric comparison of CNN-LSTM, attention-based, and transformer...	12	Experimental	1	Jupyter Notebook
24	Mahmood-Anaam/violet Violet: A Vision-Language model for generating Arabic image captions using a...	11	Experimental	—	Jupyter Notebook
25	Riya-l209/ImageCaptioning_Segmentation AI-powered Image Captioning & Segmentation | ViT-GPT2 + Mask R-CNN |...	11	Experimental	—	Jupyter Notebook
26	sharpsalt/Captionforge-Multimodal-Image-Captioning-System This PyTorch-based image captioning model uses ResNet-50 encoder and...	11	Experimental	—	Python
27	Akhan521/Snaption 📸 My first deep dive into multi-modal ML! Built an end-to-end image...	11	Experimental	—	Python
28	suryanshgupta9933/Scene-Script An image to text model/pipeline using VIT and Transformers and deployment...	11	Experimental	2	Python
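
If the table is copied as text rather than fetched as JSON, each row can be split into fields. This is a rough sketch that assumes rows are tab-separated in the column order shown in the header, that the repository name is the first space-delimited token of the Model column, and that a dash in the Stars column means no star count is available. The `Row` type and `parse_row` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MISSING = "\u2014"  # the dash used in the Stars column when no count is shown


@dataclass
class Row:
    rank: int
    repo: str
    description: str
    score: int
    tier: str
    stars: Optional[int]
    language: str


def parse_row(line: str) -> Row:
    """Parse one tab-separated table row into a Row record."""
    rank, model, score, tier, stars, language = line.rstrip("\n").split("\t")
    # The Model column holds "owner/name description..."; split on the first space.
    repo, _, description = model.partition(" ")
    stars_value = None if stars.strip() == MISSING else int(stars)
    return Row(int(rank), repo, description, int(score), tier, stars_value, language)
```

For example, parsing row 24 of the table yields `stars=None`, while row 1 yields `stars=68` and `repo="zarzouram/image_captioning_with_transformers"`.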

Comparisons in this category

image_captioning_with_transformers and pytorch-image-captioning (38 vs 35)