Amshaker/Mobile-O
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
Combines a FastViT-based vision encoder with a Qwen2-0.5B language model and a lightweight DiT-style diffusion decoder, bridging them via a novel Mobile Conditioning Projector (MCP) that fuses VLM features with minimal overhead (~2.4M params). Supports unified multimodal tasks: visual understanding (VQA, OCR), text-to-image generation at 512×512, and image editing. The system is optimized for on-device iOS deployment, with a <2 GB memory footprint and 3-4 s generation times. Models and training code are available on HuggingFace, alongside full iOS app source code built on MLX and CoreML backends.
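The listing doesn't show the projector's internals, so the following is a minimal PyTorch sketch of what a lightweight conditioning projector of this kind might look like. The class name, dimensions, and MLP structure are assumptions for illustration, not the repo's actual code; only the rough parameter budget (~2.4M) and the role (mapping VLM features into the diffusion decoder's conditioning space) come from the description above.

import torch
import torch.nn as nn

class MobileConditioningProjector(nn.Module):
    """Hypothetical sketch of an MCP-style bridge module.

    Maps last-layer VLM hidden states to the diffusion decoder's
    conditioning dimension. Sizes are illustrative (896 matches
    Qwen2-0.5B's hidden size); a two-layer MLP at these widths lands
    in the low-millions parameter range, consistent with ~2.4M.
    """

    def __init__(self, vlm_dim: int = 896, cond_dim: int = 1024, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vlm_dim),
            nn.Linear(vlm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, cond_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (batch, seq_len, vlm_dim) VLM features
        # returns:    (batch, seq_len, cond_dim) tokens the diffusion
        #             decoder can attend to (e.g. via cross-attention)
        return self.proj(vlm_tokens)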
Stars: 123
Forks: 9
Language: Python
License: —
Category: —
Last pushed: Feb 24, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/diffusion/Amshaker/Mobile-O"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
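For scripted access, here is a short Python sketch using requests. Only the endpoint URL comes from this page; the response schema is not documented here, so the example just prints the returned JSON rather than assuming any field names.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/diffusion/Amshaker/Mobile-O"

# No key is needed for up to 100 requests/day; a free key raises the
# limit to 1,000/day (how the key is passed is not shown on this page).
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
data = resp.json()
print(data)  # inspect the returned quality data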
Higher-rated alternatives
Vchitect/VBench
[CVPR2024 Highlight] VBench - We Evaluate Video Generation
VectorSpaceLab/OmniGen
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
EndlessSora/focal-frequency-loss
[ICCV 2021] Focal Frequency Loss for Image Reconstruction and Synthesis
JIA-Lab-research/DreamOmni2
This project is the official implementation of 'DreamOmni2: Multimodal Instruction-based Editing...
PKU-YuanGroup/ChronoMagic-Bench
[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of...