inuwamobarak/Image-captioning-ViT

Image Captioning with Vision Transformers (ViTs): a transformer model that generates descriptive captions for images by combining the power of Transformers and computer vision. It leverages state-of-the-art pre-trained ViT models and employs transfer-learning techniques.

Score: 26 / 100 (Experimental)

The architecture combines a pre-trained ViT encoder for image feature extraction with a transformer-based decoder for caption generation, employing transfer learning to reduce training overhead. It includes fine-tuning capabilities for custom datasets and evaluates output quality using standard metrics like BLEU, METEOR, and CIDEr. Built on PyTorch and the Hugging Face Transformers library, it also integrates LitServe for deploying the model as a production-ready inference server.
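The encoder-decoder pattern described above can be sketched with Hugging Face's `VisionEncoderDecoderModel`. This is a minimal illustration, not the repository's exact code; the `nlpconnect/vit-gpt2-image-captioning` checkpoint is an assumption standing in for whatever weights the project actually fine-tunes.

```python
# Minimal sketch of ViT-based image captioning with Hugging Face
# Transformers. The checkpoint name is an assumption; the repository
# may use different weights or its own fine-tuned model.

def caption_image(image_path: str,
                  checkpoint: str = "nlpconnect/vit-gpt2-image-captioning") -> str:
    """Return a generated caption for the image at `image_path`."""
    # Imports live inside the function so the sketch can be read (and
    # the function defined) without the heavy dependencies installed.
    from PIL import Image
    from transformers import (AutoTokenizer, ViTImageProcessor,
                              VisionEncoderDecoderModel)

    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    processor = ViTImageProcessor.from_pretrained(checkpoint)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # The ViT encoder embeds the image; the transformer decoder then
    # autoregressively generates the caption tokens.
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(caption_image("example.jpg"))  # path is illustrative
```

Beam search (`num_beams=4`) is a common decoding choice for captioning; greedy decoding also works but tends to produce blander captions.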

No commits in the last 6 months.

No License · Stale (6 months) · No Package · No Dependents
Maintenance 0 / 25
Adoption 7 / 25
Maturity 8 / 25
Community 11 / 25


Stars: 40
Forks: 5
Language: Jupyter Notebook
License: None
Category: image-captioning
Last pushed: Oct 14, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/generative-ai/inuwamobarak/Image-captioning-ViT"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
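The same lookup can be scripted. A small sketch using only the Python standard library; the helper names are ours, and the shape of the JSON response is not documented here, so it is returned as a plain dict:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/generative-ai"


def quality_url(owner: str, repo: str) -> str:
    """Build the quality-score endpoint URL for a repository."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality report as a dict (requires network access)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    report = fetch_quality("inuwamobarak", "Image-captioning-ViT")
    print(json.dumps(report, indent=2))
```

With no API key this uses the anonymous quota (100 requests/day); a free key raises it to 1,000/day, as noted above.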