YangLing0818/RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

Score: 43 / 100 (Emerging)

RPG is a training-free framework that uses multimodal LLMs (GPT-4, Gemini-Pro, DeepSeek-R1, o1) as prompt recaptioners and regional planners: the LLM decomposes a complex text prompt into spatially aware sub-tasks, which regional diffusion then renders. It supports both proprietary and open-source LLM backbones and several diffusion models (SD v1.5, SDXL, IterComp) through the Hugging Face diffusers library, enabling high-resolution compositional image generation without additional training. The architecture is designed to generalize across arbitrary combinations of MLLM and diffusion model, with the goal of better text-to-image fidelity on complex multi-object scenes.
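To make the planning step more concrete, here is a minimal sketch of how a regional plan could be turned into pixel regions. It is not the repository's API: the `Region` class, the `split_columns` function, the ratio format, and the example subprompts are all hypothetical, and the sketch only covers a single row of side-by-side columns.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Region:
    """Hypothetical container: one subprompt and the canvas area it controls."""
    subprompt: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def split_columns(subprompts: Sequence[str], ratios: Sequence[float],
                  width: int = 1024, height: int = 1024) -> List[Region]:
    """Assign each subprompt a vertical strip whose width follows its ratio.

    Illustrative only; a real planner may use rows, grids, or nested splits.
    """
    if len(subprompts) != len(ratios):
        raise ValueError("need exactly one ratio per subprompt")
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")

    total = sum(ratios)
    regions: List[Region] = []
    left = 0
    accumulated = 0.0
    for index, (subprompt, ratio) in enumerate(zip(subprompts, ratios)):
        accumulated += ratio
        # Snap the last strip to the canvas edge so rounding never leaves a gap.
        is_last = index == len(ratios) - 1
        right = width if is_last else round(width * accumulated / total)
        regions.append(Region(subprompt, (left, 0, right, height)))
        left = right
    return regions


# Example plan, as an LLM planner might propose it (made-up subprompts).
plan = split_columns(
    ["a red fox sitting on grass", "a snowy pine tree"],
    ratios=[1, 1],
)
```

Each `Region` would then condition the diffusion model only inside its box, which is the general idea behind regional generation.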

1,843 stars. No commits in the last 6 months.

Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 17 / 25
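The four category scores listed here add up to the overall 43 / 100. The short check below assumes the overall score is a plain sum of the categories, which matches these numbers but is not stated on the page.

```python
# Category scores as listed above; each category is out of 25.
subscores = {"Maintenance": 0, "Adoption": 10, "Maturity": 16, "Community": 17}
MAX_PER_CATEGORY = 25

total = sum(subscores.values())
maximum = MAX_PER_CATEGORY * len(subscores)
print(f"{total} / {maximum}")  # prints "43 / 100"
```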


Stars: 1,843
Forks: 103
Language: Jupyter Notebook
License: MIT
Last pushed: Feb 01, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/diffusion/YangLing0818/RPG-DiffusionMaster"

Open to everyone: 100 requests/day with no key needed, or 1,000/day with a free key.
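The same request can be made from Python with only the standard library. This is a sketch under two assumptions not confirmed by the page: that the endpoint returns JSON, and that the path follows a `category/owner/repo` pattern as the curl URL suggests. The helper names `quality_url` and `fetch_quality` are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (path pattern inferred from the curl example)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumes a JSON response body; raises urllib.error.URLError on network
    failures and json.JSONDecodeError if the body is not JSON.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network request and counts against the daily limit):
# data = fetch_quality("diffusion", "YangLing0818", "RPG-DiffusionMaster")
# print(json.dumps(data, indent=2))
```

The unauthenticated tier is limited to 100 requests/day, so caching the result locally is worth considering if the call runs in a loop.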