jingyaogong/minimind-v
🚀 Train a 26M-parameter visual multimodal VLM from scratch in just 1 hour!
Combines SigLIP2 vision encoding with a lightweight MLP projection to fuse image features into a compact language-model backbone, supporting both dense and mixture-of-experts architectures. Implements a complete training pipeline, including vision-language pretraining and supervised instruction tuning, with DDP multi-GPU acceleration, bfloat16 mixed precision, and checkpoint resumption across varying hardware configurations. Models range from 26M to 201M parameters with inference footprints under 1.1 GB, packaged in Hugging Face Transformers format for ecosystem compatibility.
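For orientation, the sketch below shows the general shape of this kind of MLP-projection fusion: vision-encoder features are mapped into the language model's embedding space and prepended to the text token embeddings. Module names, dimensions, and the prepend-style fusion are illustrative assumptions, not MiniMind-V's actual code.

# Illustrative sketch only; names and dimensions are assumptions, not the repo's implementation.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps vision-encoder features into the LM embedding space."""
    def __init__(self, vision_dim=768, lm_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_features):       # (batch, num_patches, vision_dim)
        return self.mlp(image_features)      # (batch, num_patches, lm_dim)

def fuse(image_features, text_embeds, projector):
    # Projected image tokens are placed ahead of the text token embeddings
    # before the combined sequence is fed to the language-model backbone.
    image_tokens = projector(image_features)               # (B, P, lm_dim)
    return torch.cat([image_tokens, text_embeds], dim=1)   # (B, P + T, lm_dim)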
6,712 stars. Actively maintained with 16 commits in the last 30 days.
Stars: 6,712
Forks: 736
Language: Python
License: Apache-2.0
Category: llm-tools
Last pushed: Feb 04, 2026
Commits (30d): 16
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/jingyaogong/minimind-v"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
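The same request from Python, as a minimal sketch using the endpoint shown above; the response is printed as-is because its JSON schema is not documented here.

# Minimal sketch: fetch the quality data for this repo from the public endpoint.
import requests

resp = requests.get(
    "https://pt-edge.onrender.com/api/v1/quality/llm-tools/jingyaogong/minimind-v"
)
resp.raise_for_status()
print(resp.json())  # field names in the payload are not documented here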
Related tools
roboflow/vision-ai-checkup
Take your LLM to the optometrist.
SkyworkAI/Skywork-R1V
Skywork-R1V is an advanced multimodal AI model series developed by Skywork AI, specializing in...
zai-org/GLM-TTS
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
NExT-GPT/NExT-GPT
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
EvolvingLMMs-Lab/NEO
NEO Series: Native Vision-Language Models from First Principles