aimagelab/ReT
[CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Combines recurrent layers with vision-and-language transformers to capture fine-grained token-level interactions for robust multimodal document retrieval across diverse datasets. Integrates with the Hugging Face model hub and FAISS for efficient indexing and search, supporting both CLIP and OpenCLIP backbones (ViT-L/H/G). Introduces the ReT-M2KR benchmark, which extends M2KR with passage images and enables end-to-end training on multimodal query-document pairs.
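The retrieval flow the description outlines (encode queries and documents, index embeddings with FAISS, search by similarity) can be sketched as below. This is a minimal illustration under stated assumptions, not the repository's actual code: the model-loading line is hypothetical and commented out, and the embedding dimension is a placeholder.

```python
# Minimal sketch of the FAISS indexing/search stage described above.
# Assumptions: embeddings are float32 NumPy arrays; the actual ReT model
# API may differ -- consult the repository for the real entry point.
import faiss
import numpy as np

# Hypothetical loading step (not verified against the repo):
# from transformers import AutoModel
# model = AutoModel.from_pretrained("aimagelab/ReT", trust_remote_code=True)

def build_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Index L2-normalized embeddings for inner-product (cosine) search."""
    faiss.normalize_L2(embeddings)                  # in-place normalization
    index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index
    index.add(embeddings)
    return index

# Toy stand-in embeddings (e.g., 1024-d, as with a ViT-L backbone).
docs = np.random.rand(1000, 1024).astype("float32")
index = build_index(docs)

query = np.random.rand(1, 1024).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=5)  # top-5 most similar documents
print(ids[0], scores[0])
```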
No commits in the last 6 months.
Stars: 34
Forks: 1
Language: Python
License: Apache-2.0
Category:
Last pushed: Sep 12, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/aimagelab/ReT"
Open to everyone: 100 requests/day with no key, or 1,000/day with a free key.
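The same data can be fetched programmatically; a minimal sketch using only the Python standard library (the response schema and the key mechanism are not documented on this page, so the JSON is printed verbatim and authentication is omitted):

```python
# Fetch the repo's quality/embeddings data from the documented endpoint.
import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/embeddings/aimagelab/ReT"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Schema undocumented here; print the payload as-is.
print(json.dumps(data, indent=2))
```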
Higher-rated alternatives
holisticon/multimodal-rag-demo
🧠🖼️📄 Multimodal RAG Demo based on Qwen3-VL Embedding and Reranker models
debanjan06/geospatial-rag
AI Framework for Remote Sensing Image Analysis using RAG - 88%+ accuracy, multi-modal queries,...
berntpopp/phentrieve
AI-powered system for mapping clinical text to Human Phenotype Ontology (HPO) terms using...
hadil1999-creator/RAG_Hack_team
Our AI Financial Advisor is designed to revolutionize how users interact with financial and...
sasi123-stack/BioScholar-AI
BioScholar AI—an intelligent biomedical research engine powered by OpenClaw RAG and Llama 4...