LatentSync and TempoSyncDiff
Both tools aim to enhance audio-driven talking-head generation by improving the synchronization of speech with generated video, making them direct competitors in the "speech-synthesis-diffusion" category. LatentSync focuses on taming Stable Diffusion for lip sync, while TempoSyncDiff emphasizes faster generation without sacrificing quality.
About LatentSync
bytedance/LatentSync
Taming Stable Diffusion for Lip Sync!
An audio-conditioned latent diffusion model that operates directly in Stable Diffusion's latent space: Whisper-extracted audio embeddings are injected through the U-Net's cross-attention layers. Although the model works on compressed latents, it trains with TREPA, LPIPS, and SyncNet losses computed in pixel space, avoiding intermediate motion representations. It supports multi-resolution training (256×256 to 512×512) with configurable efficiency modes, requiring roughly 20–55 GB of VRAM depending on stage and resolution.
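The conditioning path described above can be sketched as a single cross-attention update: queries come from the video latent tokens, while keys and values come from the audio embeddings. This is a minimal NumPy sketch under assumed toy dimensions; the function name, shapes, and random projection weights are hypothetical illustrations (in the real U-Net the projections are learned), not LatentSync's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def audio_cross_attention(latents, audio_emb, d_head=64, seed=0):
    """One cross-attention step: queries from video latent tokens,
    keys/values from audio embeddings (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    d_lat = latents.shape[-1]
    d_aud = audio_emb.shape[-1]
    # Random stand-ins for the learned projection matrices.
    Wq = rng.standard_normal((d_lat, d_head)) / np.sqrt(d_lat)
    Wk = rng.standard_normal((d_aud, d_head)) / np.sqrt(d_aud)
    Wv = rng.standard_normal((d_aud, d_lat)) / np.sqrt(d_aud)
    Q = latents @ Wq      # (n_tokens, d_head)
    K = audio_emb @ Wk    # (n_audio, d_head)
    V = audio_emb @ Wv    # (n_audio, d_lat)
    attn = softmax(Q @ K.T / np.sqrt(d_head))  # each latent token attends over audio frames
    return latents + attn @ V                  # residual update, as in U-Net attention blocks

# Toy shapes: 16 latent tokens of width 320 attending over
# 10 frames of 384-dim audio embeddings (both sizes assumed).
latents = np.zeros((16, 320))
audio = np.ones((10, 384))
out = audio_cross_attention(latents, audio)
print(out.shape)  # (16, 320)
```

The key point the sketch captures is that audio enters only through keys and values, so the latent token count (and hence the output video resolution) is independent of the number of audio frames.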
About TempoSyncDiff
mazumdarsoumya/TempoSyncDiff
Few-step diffusion for audio-driven talking-head generation: making diffusion models speak faster without losing their composure.
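The repository's tagline is the only technical description available, so as an illustration of what "few-step diffusion" generally means, here is a deterministic DDIM-style sampler that strides a long noise schedule down to a handful of steps. Everything here is a generic sketch under assumed names and schedules, not TempoSyncDiff's actual method; the `oracle` denoiser is a hypothetical stand-in for a trained network.

```python
import numpy as np

def ddim_sample(x, denoise_fn, alphas_cumprod, n_steps=4):
    """Few-step deterministic sampling: take large jumps along a
    strided subset of the full noise schedule instead of all T steps."""
    T = len(alphas_cumprod)
    steps = np.linspace(T - 1, 0, n_steps, dtype=int)  # e.g. [999, 666, 333, 0]
    for i, t in enumerate(steps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[steps[i + 1]] if i + 1 < n_steps else 1.0
        eps = denoise_fn(x, t)                               # predicted noise
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)     # predicted clean sample
        x = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps # deterministic DDIM step
    return x

# Schedule: alphas_cumprod decreases from ~1 (clean) to ~0 (pure noise).
alphas_cumprod = np.linspace(0.9999, 0.02, 1000)

# Hypothetical "oracle" denoiser that knows the clean target is all zeros,
# so eps = x / sqrt(1 - a_t) recovers x0 = 0 exactly at every step.
oracle = lambda x, t: x / np.sqrt(1 - alphas_cumprod[t])

x_T = np.random.default_rng(0).standard_normal(8)  # start from noise
x0 = ddim_sample(x_T, oracle, alphas_cumprod, n_steps=4)
print(np.allclose(x0, 0))  # True: 4 strided steps suffice with this oracle
```

With a real learned denoiser the few-step result is only approximate, which is why few-step methods typically pair strided sampling with distillation or consistency training to retain quality.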