Amshaker/Mobile-O
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
Combines a FastViT-based vision encoder with a Qwen2-0.5B language model and a lightweight DiT-style diffusion decoder, bridging them via a novel Mobile Conditioning Projector (MCP) that fuses VLM features with minimal overhead (~2.4M params). Supports unified multimodal tasks: visual understanding (VQA, OCR), text-to-image generation at 512×512, and image editing. The system is optimized for on-device iOS deployment, with a <2 GB memory footprint and 3-4 s generation times. Models and training code are available on HuggingFace, alongside full iOS app source code built on MLX and CoreML backends.
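The listing doesn't show the projector's internals, so the following is a minimal PyTorch sketch of what a lightweight conditioning projector of this kind might look like. The class name, dimensions, and MLP structure are assumptions for illustration, not the repo's actual code; only the rough parameter budget (~2.4M) and the role (mapping VLM features into the diffusion decoder's conditioning space) come from the description above.

import torch
import torch.nn as nn

class MobileConditioningProjector(nn.Module):
    """Hypothetical sketch of an MCP-style bridge module.

    Maps last-layer VLM hidden states to the diffusion decoder's
    conditioning dimension. Sizes are illustrative (896 matches
    Qwen2-0.5B's hidden size); a two-layer MLP at these widths lands
    in the low-millions parameter range, consistent with ~2.4M.
    """

    def __init__(self, vlm_dim: int = 896, cond_dim: int = 1024, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vlm_dim),
            nn.Linear(vlm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, cond_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (batch, seq_len, vlm_dim) VLM features
        # returns:    (batch, seq_len, cond_dim) tokens the diffusion
        #             decoder can attend to (e.g. via cross-attention)
        return self.proj(vlm_tokens)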
Stars: 123
Forks: 9
Language: Python
License: —
Category: —
Last pushed: Feb 24, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/diffusion/Amshaker/Mobile-O"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
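For scripted access, here is a short Python sketch using requests. Only the endpoint URL comes from this page; the response schema is not documented here, so the example just prints the returned JSON rather than assuming any field names.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/diffusion/Amshaker/Mobile-O"

# No key is needed for up to 100 requests/day; a free key raises the
# limit to 1,000/day (how the key is passed is not shown on this page).
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
data = resp.json()
print(data)  # inspect the returned quality data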
Higher-rated alternatives
Vchitect/VBench
[CVPR2024 Highlight] VBench - We Evaluate Video Generation
VectorSpaceLab/OmniGen
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
EndlessSora/focal-frequency-loss
[ICCV 2021] Focal Frequency Loss for Image Reconstruction and Synthesis
JIA-Lab-research/DreamOmni2
This project is the official implementation of 'DreamOmni2: Multimodal Instruction-based Editing...
PKU-YuanGroup/ChronoMagic-Bench
[NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of...