stepfun-ai/Step-Audio-EditX

A 3B-parameter, LLM-based audio editing model trained with reinforcement learning that excels at editing emotion, speaking style, and paralinguistics, and offers robust zero-shot text-to-speech

Score: 54 / 100 (Established)

Uses a custom audio tokenizer paired with an LLM backbone trained via SFT, DPO, and GRPO to enable iterative, instruction-based audio manipulation. Supports multilingual zero-shot TTS (Mandarin, English, Japanese, Korean) with fine-grained control over polyphone pronunciation, plus paralinguistic editing across 16+ emotion tags and 17+ speaking styles. Available on HuggingFace and ModelScope with vLLM inference optimization and an interactive web playground.

884 stars. Actively maintained with 1 commit in the last 30 days.

No package · No dependents

Maintenance: 16 / 25
Adoption: 10 / 25
Maturity: 13 / 25
Community: 15 / 25
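The four 25-point subscores appear to add up to the 54/100 overall score. A minimal sanity check, assuming a simple additive scoring scheme (the site's actual weighting is not documented here):

```python
# Subscores as shown on the card; summing them is an assumption
# about how the overall score is derived.
subscores = {"Maintenance": 16, "Adoption": 10, "Maturity": 13, "Community": 15}

total = sum(subscores.values())
print(total)  # 54, matching the overall score shown above
```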


Stars: 884
Forks: 61
Language: Python
License: Apache-2.0
Last pushed: Mar 16, 2026
Commits (30d): 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/stepfun-ai/Step-Audio-EditX"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.