aimagelab/ReT

[CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

28 / 100 (Experimental)

Combines recurrent layers with vision-language transformers to capture fine-grained, token-level interactions for robust multimodal document retrieval across diverse datasets. Integrates with the Hugging Face model hub and with FAISS for efficient indexing and search, and supports both CLIP and OpenCLIP backbones (ViT-L/H/G). Introduces the ReT-M2KR benchmark, which extends M2KR with passage images and enables end-to-end training on multimodal query and document pairs.

No commits in the last 6 months.

Stale 6m, No Package, No Dependents
Maintenance 2 / 25
Adoption 7 / 25
Maturity 16 / 25
Community 3 / 25
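
The four sub-scores listed above add up to the overall score of 28 / 100. The page does not state the formula, so the following minimal sketch only assumes a plain sum of four categories capped at 25 each; the variable names are illustrative.

```python
# Sub-scores copied from the listing above; each category is out of 25.
sub_scores = {"Maintenance": 2, "Adoption": 7, "Maturity": 16, "Community": 3}
MAX_PER_CATEGORY = 25

# Assumption: the overall score is a plain sum of the four categories.
total = sum(sub_scores.values())
maximum = MAX_PER_CATEGORY * len(sub_scores)

print(f"{total} / {maximum}")  # prints "28 / 100"
```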


Stars: 34
Forks: 1
Language: Python
License: Apache-2.0
Last pushed: Sep 12, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/aimagelab/ReT"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
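
The same request can be made from Python using only the standard library. This is a minimal sketch: the path layout (category, owner, repo) is inferred from the single curl example above, the response is assumed to be JSON, and the helper names `build_quality_url` and `fetch_quality` are hypothetical. No API key handling is shown because the page does not say how a key is passed.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the segment order is inferred from the curl example."""
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category, owner, repo, opener=urlopen, timeout=10.0):
    """Fetch and decode the response body, assuming it is JSON (not confirmed by the page)."""
    url = build_quality_url(category, owner, repo)
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("embeddings", "aimagelab", "ReT")` issues the same GET request as the curl command. The `opener` parameter exists so the network call can be replaced in tests, and keyless callers should stay under the 100 requests/day limit.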