DiffGesture and DiffuseStyleGesture
Both tools address audio-driven co-speech gesture generation using diffusion models, but their focus differs: DiffGesture emphasizes the core diffusion-based generation approach, while DiffuseStyleGesture extends it with explicit style control. This makes them complementary techniques that could be combined rather than direct competitors.
About DiffGesture
Advocate99/DiffGesture
[CVPR'2023] Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
Employs a Diffusion Audio-Gesture Transformer architecture to jointly model cross-modal audio-to-skeleton associations while preserving temporal coherence through an annealed noise sampling strategy. Integrates classifier-free guidance for diversity-quality trade-offs and uses pretrained autoencoders (from HA2G) for perceptual metrics on the TED Gesture and TED Expressive datasets. Supports both short and long video synthesis, generating skeleton sequences conditioned on audio input.
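To illustrate the classifier-free guidance mentioned above, here is a minimal sketch (all names are hypothetical, not DiffGesture's actual API): the denoiser is evaluated once with the audio condition and once with a null condition, and the two noise predictions are blended by a guidance scale that trades diversity against fidelity to the audio.

```python
# Sketch only: classifier-free guidance for an audio-conditioned gesture
# diffusion model. Names (cfg_denoise, ToyDenoiser) are illustrative,
# not taken from the DiffGesture codebase.
import torch

def cfg_denoise(model, x_t, t, audio_cond, null_cond, guidance_scale=2.5):
    """Blend conditional and unconditional noise predictions."""
    eps_cond = model(x_t, t, audio_cond)    # conditioned on audio features
    eps_uncond = model(x_t, t, null_cond)   # condition dropped (null token)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

class ToyDenoiser(torch.nn.Module):
    """Stand-in denoiser so the call pattern above is runnable."""
    def __init__(self, pose_dim=126, cond_dim=64):
        super().__init__()
        self.net = torch.nn.Linear(pose_dim + cond_dim + 1, pose_dim)

    def forward(self, x_t, t, cond):
        t_feat = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

B, T, pose_dim, cond_dim = 2, 34, 126, 64    # batch, frames, skeleton dims, audio dims
model = ToyDenoiser(pose_dim, cond_dim)
x_t = torch.randn(B, T, pose_dim)            # noisy skeleton sequence
t = torch.full((B,), 50, dtype=torch.long)   # current diffusion timestep
audio_cond = torch.randn(B, T, cond_dim)     # per-frame audio features
null_cond = torch.zeros_like(audio_cond)     # unconditional placeholder

eps = cfg_denoise(model, x_t, t, audio_cond, null_cond)
print(eps.shape)  # torch.Size([2, 34, 126])
```

A higher guidance scale pulls samples closer to the audio-conditional prediction (better synchrony, less variety); a scale near 1.0 keeps more of the unconditional diversity.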
About DiffuseStyleGesture
YoungSeng/DiffuseStyleGesture
DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models (IJCAI 2023) | The DiffuseStyleGesture+ entry to the GENEA Challenge 2023 (ICMI 2023, Reproducibility Award)
Leverages diffusion models with WavLM audio embeddings to generate stylized full-body gestures conditioned on speech, supporting controllable style and intensity parameters. The architecture uses LMDB-based training pipelines on mocap datasets (ZEGGS, BEAT, TWH) and outputs motion in BVH format compatible with Blender visualization. Implements motion matching variants (QPGesture) and multi-dataset training (UnifiedGesture) as downstream extensions, with pre-trained checkpoints available for inference.
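To show what the controllable style and intensity parameters amount to in practice, here is an illustrative sketch (the class and function names are hypothetical, not DiffuseStyleGesture's interface): a discrete style label and an intensity scalar are fused with per-frame audio embeddings, such as WavLM hidden states, into a single conditioning tensor for the diffusion denoiser.

```python
# Sketch only: building a style- and intensity-aware condition vector.
# Dimensions and module names are assumptions for illustration.
import torch
import torch.nn as nn

class StyleCondEncoder(nn.Module):
    """Map (style id, intensity) plus audio features into one condition tensor."""
    def __init__(self, n_styles=6, style_dim=64, audio_dim=1024, out_dim=256):
        super().__init__()
        self.style_emb = nn.Embedding(n_styles, style_dim)  # e.g. happy, sad, old, ...
        self.proj = nn.Linear(style_dim + audio_dim, out_dim)

    def forward(self, style_id, intensity, audio_feats):
        # Scale the style embedding by the requested intensity, then fuse it
        # with per-frame audio features (e.g. WavLM hidden states).
        style = self.style_emb(style_id) * intensity.unsqueeze(-1)        # (B, style_dim)
        style = style.unsqueeze(1).expand(-1, audio_feats.shape[1], -1)   # broadcast over frames
        return self.proj(torch.cat([style, audio_feats], dim=-1))         # (B, T, out_dim)

B, T = 2, 88                              # batch, audio frames
audio_feats = torch.randn(B, T, 1024)     # stand-in for WavLM embeddings
style_id = torch.tensor([1, 3])           # chosen style per sample
intensity = torch.tensor([1.0, 0.5])      # 1.0 = full style, 0.5 = attenuated

encoder = StyleCondEncoder()
cond = encoder(style_id, intensity, audio_feats)
print(cond.shape)  # torch.Size([2, 88, 256])
# The resulting `cond` would then condition the denoiser at every reverse step.
```

Scaling the style embedding by an intensity scalar is one simple way to expose continuous control over how strongly a style shapes the generated BVH motion.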