facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
This framework helps AI researchers quickly set up projects that combine visual data (such as images or videos) with text (such as captions or questions). It takes datasets of paired images and text as input, and produces trained models that can understand and generate insights from the combined modalities. It is aimed at researchers and machine learning engineers working on multimodal AI problems.
5,622 stars. Actively maintained with 3 commits in the last 30 days.
Use this if you are an AI researcher starting a new project that involves analyzing or generating content from both images and text, and you need a robust, scalable foundation.
Not ideal if you are a practitioner looking for a ready-to-use application or a developer working on a non-AI project.
Stars
5,622
Forks
944
Language
Python
License
—
Category
Last pushed
Jan 12, 2026
Commits (30d)
3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/facebookresearch/mmf"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
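The same data can be fetched programmatically. A minimal Python sketch, using only the standard library: the endpoint path (including the `transformers` segment) is copied verbatim from the curl example above, but the response schema isn't documented on this page, so `fetch_quality` simply returns the parsed JSON, and the `quality_url` helper is a hypothetical convenience, not part of any official client.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-data endpoint URL for a repository."""
    return f"{API_BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch repository quality data (100 requests/day without a key)."""
    with urllib.request.urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)

# The endpoint shown in the curl example for this repository:
print(quality_url("transformers", "facebookresearch", "mmf"))
# https://pt-edge.onrender.com/api/v1/quality/transformers/facebookresearch/mmf
```

How an API key raises the limit to 1,000 requests/day (header name, query parameter, etc.) isn't specified here, so the sketch omits it.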
Related models
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
kyegomez/PALM-E
Implementation of "PaLM-E: An Embodied Multimodal Language Model"
kyegomez/RT-2
Democratization of RT-2 "RT-2: New model translates vision and language into action"