Image Captioning Transformers: Transformer Models
Tools for generating textual descriptions from images and videos using transformer-based encoder-decoder architectures. Includes image-to-text, video captioning, and dense captioning systems. Does NOT include general vision-language models for other tasks (VQA, retrieval), text-to-image generation, or vision-only feature extraction.
28 image captioning transformer models are tracked. The highest-rated is zarzouram/image_captioning_with_transformers, which scores 38/100 and has 68 stars.
Get the tracked projects as JSON (the request below sets limit=20, so it may return fewer than all 28):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=image-captioning-transformers&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
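The same request can be made from a script. The following is a minimal Python sketch that uses only the standard library and the endpoint and query parameters shown in the curl command. The helper names `build_url` and `fetch_projects` are hypothetical, and because the response schema is not described here, the decoded JSON is returned as-is rather than indexed by guessed field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the query URL with the same parameters as the curl command."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url: str, timeout: float = 30.0):
    """Fetch the URL and decode the response body as JSON.

    The shape of the payload is an unknown here, so the decoded value is
    returned unchanged for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url("transformers", "image-captioning-transformers"))` issues one request, which counts against the daily quota. Whether a larger `limit` value is accepted is not stated on this page, so treat any value above 20 as an experiment.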
| # | Model | Score | Tier |
|---|---|---|---|
| 1 | zarzouram/image_captioning_with_transformers: PyTorch implementation of image captioning using a transformer-based model. | | Emerging |
| 2 | rese1f/aurora: [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a... | | Emerging |
| 3 | senadkurtisi/pytorch-image-captioning: Transformer & CNN Image Captioning model in PyTorch. | | Emerging |
| 4 | tanishqgautam/Image-Captioning: Implemented 3 different architectures to tackle the Image Caption problem,... | | Emerging |
| 5 | ilya16/deephumor: DeepHumor: Image-based Meme Generation using Deep Learning | | Experimental |
| 6 | Hamtech-ai/Persian-Image-Captioning: A Persian Image Captioning model based on Vision Encoder Decoder Models of... | | Experimental |
| 7 | tojiboyevf/image_captioning: Deep Learning Final project 2022 | | Experimental |
| 8 | slSeanWU/beats-conformer-bart-audio-captioner: PyTorch implementation of the ICASSP-24 paper: "Improving Audio Captioning... | | Experimental |
| 9 | shreydan/VisionGPT2: Combining ViT and GPT-2 for image captioning. Trained on MS-COCO. The model... | | Experimental |
| 10 | abhijitpal1247/image-mix-with-controlnet: A sample project to test out the features of streamlit. Provides a way to... | | Experimental |
| 11 | farukalamai/background-removal-birefnet: Background Removal Application using BiRefNet | | Experimental |
| 12 | Technolog796/image_captioning: Building a Russian-language model for image captioning | | Experimental |
| 13 | Devnetly/image-captioning: Image captioning model & application based on transformers. | | Experimental |
| 14 | nateraw/discord-image-captioning-bot: A Discord bot for captioning images | | Experimental |
| 15 | PRITHIVSAKTHIUR/Florence-2-Image-Caption: This application utilizes the powerful Florence-2 vision-language model from... | | Experimental |
| 16 | therrshan/image-captioning: Comparative analysis of image captioning model using RNN, BiLSTM and... | | Experimental |
| 17 | vishaln15/roco-image-captioning: Enhanced Image Captioning on ROCO Multimodal dataset using step-by-step distillation | | Experimental |
| 18 | AHMEDSANA/Image-Captioning-with-ViT-and-BERT: A concise image-captioning pipeline that fine-tunes a ViT encoder with a... | | Experimental |
| 19 | anto18671/image-to-dense-caption: Generate vivid, human-like captions for portrait images using the... | | Experimental |
| 20 | karroge10/Loomi-Clothing-Detection-API: AI clothing detection API with segmentation, background removal, and color... | | Experimental |
| 21 | theSohamTUmbare/DETR_powered_Image_Captioning: The excellent Image captioning model using the DETR inspired architecture | | Experimental |
| 22 | Merterm/COSMic: Public repo for the paper: "COSMic: A Coherence-Aware Generation Metric for... | | Experimental |
| 23 | jshwanth/image-captioning: Error-centric comparison of CNN-LSTM, attention-based, and transformer... | | Experimental |
| 24 | Mahmood-Anaam/violet: Violet: A Vision-Language model for generating Arabic image captions using a... | | Experimental |
| 25 | Riya-l209/ImageCaptioning_Segmentation: AI-powered Image Captioning & Segmentation \| ViT-GPT2 + Mask R-CNN \|... | | Experimental |
| 26 | sharpsalt/Captionforge-Multimodal-Image-Captioning-System: This PyTorch-based image captioning model uses ResNet-50 encoder and... | | Experimental |
| 27 | Akhan521/Snaption: 📸 My first deep dive into multi-modal ML! Built an end-to-end image... | | Experimental |
| 28 | suryanshgupta9933/Scene-Script: An image to text model/pipeline using VIT and Transformers and deployment... | | Experimental |