keivalya/mini-vla
A minimal, beginner-friendly vision-language-action (VLA) model that shows how robot policies can fuse images, text, and robot state to generate actions.
Implements diffusion-based action generation with separate encoders for vision (images), language (text instructions), and robot state, fused via an MLP before a diffusion policy head—all contained in ~150 lines of core model code. Designed for Meta-World environments with a complete pipeline: expert data collection, training on trajectory datasets, and inference with free-form text instructions. Prioritizes educational clarity and rapid prototyping over production optimization, making it suitable for learning diffusion policies and VLA architecture without heavy framework dependencies.
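The architecture described above can be sketched in a few lines of NumPy. This is an illustration of the general pattern, not the repo's actual code: the embedding sizes, the shared denoiser weights, the 10-step schedule, and the simplified noise-removal update are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def W(m, n):
    # Xavier-ish random weights, standing in for trained parameters.
    return rng.normal(size=(m, n)) / np.sqrt(m)

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU (biases omitted for brevity).
    return np.maximum(x @ w1, 0.0) @ w2

# Toy encoders: linear projections in place of the real vision,
# language, and state encoders (all dims are illustrative).
img = rng.normal(size=(1, 512)) @ W(512, 64)   # image embedding
txt = rng.normal(size=(1, 384)) @ W(384, 64)   # instruction embedding
st  = rng.normal(size=(1, 39))  @ W(39, 64)    # robot-state embedding

# Fuse the three modalities with an MLP into one conditioning vector.
cond = mlp(np.concatenate([img, txt, st], axis=1), W(192, 128), W(128, 64))

# Diffusion policy head: start from Gaussian noise and iteratively
# denoise it into an action, conditioned on the fused features.
w1, w2 = W(69, 64), W(64, 4)      # denoiser weights, shared across steps
action = rng.normal(size=(1, 4))  # e.g. a 4-DoF Meta-World action
for t in range(10, 0, -1):
    # Denoiser input: current action, conditioning vector, timestep.
    inp = np.concatenate([action, cond, [[t / 10.0]]], axis=1)  # (1, 69)
    action = action - 0.1 * mlp(inp, w1, w2)  # predicted-noise removal

print(action.shape)  # (1, 4)
```

At inference time only the denoising loop runs; during training, the denoiser would instead be fit to predict the noise added to expert actions from the collected trajectories.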
Stars: 204
Forks: 40
Language: Python
License: MIT
Category:
Last pushed: Mar 17, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/diffusion/keivalya/mini-vla"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
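The same endpoint can be queried from Python. A minimal sketch using only the standard library, with the URL taken from the curl example above; that the response body is JSON is an assumption.

```python
import json
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality/diffusion"

def repo_url(owner, name):
    # Build the per-repo endpoint shown in the curl example above.
    return f"{BASE}/{owner}/{name}"

def fetch_stats(owner, name):
    # Perform the request (counts against the 100 requests/day free tier);
    # assumes the endpoint returns a JSON body.
    with urlopen(repo_url(owner, name), timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(repo_url("keivalya", "mini-vla"))
    # stats = fetch_stats("keivalya", "mini-vla")  # uncomment to hit the API
```

Keeping the URL construction separate from the network call makes the client easy to test offline and to extend later with a key-based header.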
Related models
UCSC-VLAA/story-iter
[ICLR 2026] A Training-free Iterative Framework for Long Story Visualization
PaddlePaddle/PaddleMIX
Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks,...
adobe-research/custom-diffusion
Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion (CVPR 2023)
byliutao/1Prompt1Story
🔥ICLR 2025 (Spotlight) One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation...
HorizonWind2004/reconstruction-alignment
[ICLR 2026] Official repo of paper "Reconstruction Alignment Improves Unified Multimodal...