TIGER-AI-Lab/VLM2Vec
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
Extends unified multimodal embeddings to videos and visual documents via instruction-guided contrastive training on Qwen2-VL backbones, enabling cross-modal retrieval and classification across diverse visual formats. The framework introduces MMEB-V2, a 78-task benchmark spanning image, video, and document modalities for systematic evaluation. Model checkpoints and datasets are hosted on Hugging Face, and the models are supported in vLLM for production inference.
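As a rough illustration of how instruction-guided embeddings are used for cross-modal retrieval, the sketch below ranks candidates by cosine similarity against a query embedding. The vectors here are random placeholders standing in for VLM2Vec outputs; the actual encoding API and model names live in the repo.

```python
import numpy as np

def cosine_sim(query, candidates):
    # Cosine similarity between one query vector and a matrix of candidates.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

# Placeholder embeddings: in practice these would come from encoding an
# instruction + query text and a set of candidate images/videos/documents
# with a VLM2Vec checkpoint.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((4, 8))                # 4 candidates, dim 8
query = candidates[2] + 0.01 * rng.standard_normal(8)   # near candidate 2

scores = cosine_sim(query, candidates)
best = int(np.argmax(scores))
print(best)  # index of the nearest candidate
```

Retrieval then reduces to returning the top-k candidates by score; classification uses the same similarity against class-label embeddings.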
Stars: 592
Forks: 51
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 09, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/VLM2Vec"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.