Amshaker/Mobile-O

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Score: 42 / 100 (Emerging)

Mobile-O combines a FastViT-based vision encoder with a Qwen2-0.5B language model and a lightweight DiT-style diffusion decoder, bridging them via a novel Mobile Conditioning Projector (MCP) that fuses VLM features into the decoder's conditioning with minimal overhead (~2.4M parameters). It supports unified multimodal tasks: visual understanding (VQA, OCR), text-to-image generation at 512×512, and image editing. The system is optimized for on-device iOS deployment, with a <2 GB memory footprint and 3–4 s generation times. Models and training code are available on HuggingFace, alongside full iOS app source code optimized with MLX and CoreML backends.
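As a rough illustration of the bridging idea, the MCP can be pictured as a small MLP that projects VLM hidden states into the diffusion decoder's conditioning space. This is a minimal sketch, not the repo's implementation: the class name mirrors the acronym, the 896 input width matches Qwen2-0.5B's hidden size, and the output/hidden widths are illustrative assumptions chosen so the parameter count lands near the stated ~2.4M.

```python
import torch
import torch.nn as nn

class MobileConditioningProjector(nn.Module):
    """Hypothetical MCP-style bridge: a small MLP mapping VLM hidden
    states to the diffusion decoder's conditioning space.
    Dimensions are illustrative, not taken from the repository."""

    def __init__(self, vlm_dim=896, cond_dim=1024, hidden_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vlm_dim),          # normalize VLM features
            nn.Linear(vlm_dim, hidden_dim), # project up
            nn.GELU(),
            nn.Linear(hidden_dim, cond_dim) # map to conditioning dim
        )

    def forward(self, vlm_features):
        # vlm_features: (batch, seq_len, vlm_dim) -> (batch, seq_len, cond_dim)
        return self.proj(vlm_features)

mcp = MobileConditioningProjector()
n_params = sum(p.numel() for p in mcp.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # ~2.0M with these illustrative dims
```

With these dimensions the projector stays around 2M parameters, which is why such a bridge adds negligible memory next to the 0.5B-parameter language model.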


No package. No dependents.
Maintenance: 10 / 25
Adoption: 10 / 25
Maturity: 11 / 25
Community: 11 / 25


Stars: 123
Forks: 9
Language: Python
License: (not listed)
Last pushed: Feb 24, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/diffusion/Amshaker/Mobile-O"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.