facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
This framework helps AI researchers quickly set up projects that combine visual data (such as images or videos) with text (such as captions or questions). It takes datasets of paired images and text as input, and produces trained models that can understand and generate insights from the combined modalities. It is aimed at researchers and machine learning engineers working on multimodal AI problems.
5,622 stars. Actively maintained with 3 commits in the last 30 days.
Use this if you are an AI researcher starting a new project that involves analyzing or generating content from both images and text, and you need a robust, scalable foundation.
Not ideal if you are a practitioner looking for a ready-to-use application or a developer working on a non-AI project.
Stars
5,622
Forks
944
Language
Python
License
—
Category
Last pushed
Jan 12, 2026
Commits (30d)
3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/facebookresearch/mmf"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
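The same data can be fetched programmatically. A minimal Python sketch, using only the standard library: the endpoint path (including the `transformers` segment) is copied verbatim from the curl example above, but the response schema isn't documented on this page, so `fetch_quality` simply returns the parsed JSON, and the `quality_url` helper is a hypothetical convenience, not part of any official client.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-data endpoint URL for a repository."""
    return f"{API_BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch repository quality data (100 requests/day without a key)."""
    with urllib.request.urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)

# The endpoint shown in the curl example for this repository:
print(quality_url("transformers", "facebookresearch", "mmf"))
# https://pt-edge.onrender.com/api/v1/quality/transformers/facebookresearch/mmf
```

How an API key raises the limit to 1,000 requests/day (header name, query parameter, etc.) isn't specified here, so the sketch omits it.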
Related models
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
kyegomez/PALM-E
Implementation of "PaLM-E: An Embodied Multimodal Language Model"
kyegomez/RT-2
Democratization of RT-2 "RT-2: New model translates vision and language into action"