HunyuanVideo and HunyuanCustom
About HunyuanVideo
Tencent-Hunyuan/HunyuanVideo
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Employs a unified diffusion architecture for both image and video generation, pairing a multimodal large language model text encoder with a 3D VAE for efficient spatiotemporal compression. Integrates with HuggingFace Diffusers and supports multi-GPU sequence-parallel inference via xDiT for accelerated generation, with quantized FP8 weights to reduce memory overhead. Includes a prompt rewriting module to enhance text-to-video quality and extends to specialized variants for image-to-video generation, audio-driven animation, and customized video synthesis.
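To make the 3D VAE's spatiotemporal compression concrete, here is a minimal shape-arithmetic sketch. The ratios used (4x temporal, 8x spatial, 16 latent channels) and the causal "4k+1 frames in, k+1 latent frames out" convention are assumptions for illustration; check the released VAE config for the actual values.

```python
def latent_shape(frames, height, width,
                 t_ratio=4, s_ratio=8, channels=16):
    """Latent video shape after 3D VAE spatiotemporal compression.

    Assumes a causal 3D VAE that keeps the first frame separately,
    so a clip of (t_ratio * k + 1) frames maps to k + 1 latent frames.
    All ratios here are illustrative defaults, not confirmed config.
    """
    assert (frames - 1) % t_ratio == 0, "expects t_ratio*k + 1 frames"
    assert height % s_ratio == 0 and width % s_ratio == 0
    return (channels,
            (frames - 1) // t_ratio + 1,   # latent frames
            height // s_ratio,             # latent height
            width // s_ratio)              # latent width

# e.g. a 129-frame 720x1280 clip:
print(latent_shape(129, 720, 1280))  # (16, 33, 90, 160)
```

Under these assumed ratios, a 129-frame 720p clip becomes a 16-channel latent of 33 frames at 90x160, which is the tensor the diffusion transformer actually denoises.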
About HunyuanCustom
Tencent-Hunyuan/HunyuanCustom
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
Supports subject-consistent video generation conditioned on multimodal inputs (text, images, audio, and video) through specialized injection modules: a LLaVA-based text-image fusion module, an AudioNet for hierarchical audio alignment, and a video-driven, patchify-based feature encoder. Built on HunyuanVideo, it enables downstream applications such as virtual avatars, singing synthesis, and video object replacement while maintaining identity consistency across frames. Integrates with ComfyUI and HuggingFace, with optimized inference available for single-GPU setups with 8GB of VRAM.
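A patchify-based feature encoder turns a conditioning video's latent into a flat token sequence that the transformer can attend to. The sketch below shows the patchify step itself on plain nested lists; the patch size (1, 2, 2) is an illustrative assumption, not a confirmed HunyuanCustom config value.

```python
def patchify(latent, pt=1, ph=2, pw=2):
    """Split a latent video latent[T][H][W][C] (nested lists) into a
    flat token sequence, one token per (pt x ph x pw) patch, with each
    token being the concatenated channel values of its patch.
    Patch sizes are illustrative assumptions for this sketch."""
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    tokens = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                tok = []
                for dt in range(pt):
                    for dh in range(ph):
                        for dw in range(pw):
                            tok.extend(latent[t0 + dt][h0 + dh][w0 + dw])
                tokens.append(tok)
    return tokens

# A tiny 1x2x2 latent with 3 channels yields one 12-dim token:
tiny = [[[[0, 1, 2], [3, 4, 5]],
         [[6, 7, 8], [9, 10, 11]]]]
print(patchify(tiny))  # [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]
```

In the real model these tokens would then be projected and injected alongside the noisy video tokens, so the generator can copy appearance details from the reference video while the text prompt steers the rest.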