kyegomez/RT-2
Democratization of RT-2: "RT-2: New model translates vision and language into action"
Combines a vision encoder with a PaLM-E language model backbone to embed images and language into a unified multimodal space, enabling end-to-end training on both web-scale and robotics datasets. The model outputs action tokens directly, allowing robot camera observations paired with natural language instructions to be translated into executable control commands. Includes a PyTorch implementation with a straightforward API for integration into robotics pipelines and fine-tuning on custom robot demonstration data.
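The "action tokens" idea above can be sketched in a few lines: continuous robot actions are uniformly binned so the language head can emit them as ordinary discrete tokens. This is a minimal illustration of the concept, not the repo's actual code; the bin count follows the RT-2 paper's 256-bin convention, while the action range below is an assumed example.

```python
# Illustrative sketch: map a continuous action vector to discrete
# "action tokens" and back. 256 bins per dimension follows the RT-2
# paper; the [-1, 1] range here is an assumption for the example.
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed per-dimension action range

def action_to_tokens(action):
    """Uniformly bin each action dimension into an integer token."""
    tokens = []
    for a in action:
        a = min(max(a, LOW), HIGH)                 # clip to range
        t = int((a - LOW) / (HIGH - LOW) * N_BINS)  # scale to bin index
        tokens.append(min(t, N_BINS - 1))           # keep top edge in range
    return tokens

def tokens_to_action(tokens):
    """Decode tokens back to bin-center action values."""
    return [LOW + (t + 0.5) / N_BINS * (HIGH - LOW) for t in tokens]

action = [0.25, -0.8, 0.0]  # e.g. end-effector deltas
tokens = action_to_tokens(action)
decoded = tokens_to_action(tokens)
```

Round-trip error is bounded by half a bin width, which is why a language model emitting such tokens can still drive reasonably precise control.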
554 stars. No commits in the last 6 months.
Stars: 554
Forks: 68
Language: Python
License: MIT
Category: transformers
Last pushed: Jul 26, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/RT-2"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
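The same lookup can be done from Python with the standard library. The endpoint comes from the curl command above; the `X-Api-Key` header name is an assumption for illustrating keyed access, so check the API docs for the real authentication scheme.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def build_request(category, owner, repo, api_key=None):
    """Build a Request for the quality endpoint; api_key is optional
    (anonymous access is rate-limited to 100 requests/day)."""
    url = f"{API_BASE}/{category}/{owner}/{repo}"
    # NOTE: "X-Api-Key" is an assumed header name, not documented here.
    headers = {"X-Api-Key": api_key} if api_key else {}
    return urllib.request.Request(url, headers=headers)

def fetch_quality(category, owner, repo, api_key=None):
    """Fetch and decode the JSON quality report for a repo."""
    with urllib.request.urlopen(build_request(category, owner, repo, api_key)) as resp:
        return json.load(resp)

req = build_request("transformers", "kyegomez", "RT-2")
print(req.full_url)
```

Separating request construction from fetching keeps the URL logic testable without touching the network.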
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
kyegomez/PALM-E
Implementation of "PaLM-E: An Embodied Multimodal Language Model"
ahmetkumass/yolo-gen
Train YOLO + VLM with one command. Auto-generate vision-language training data from YOLO labels...