stepfun-ai/Step-Audio-EditX
A 3B-parameter, LLM-based audio editing model trained with reinforcement learning that excels at editing emotion, speaking style, and paralinguistics, with robust zero-shot text-to-speech
Uses a custom audio tokenizer paired with an LLM backbone trained via SFT, DPO, and GRPO to enable iterative, instruction-based audio manipulation. Supports multilingual zero-shot TTS (Mandarin, English, Japanese, Korean) with fine-grained control over polyphone pronunciation, plus editing across 16+ emotion tags and 17+ speaking styles. Available on HuggingFace and ModelScope with vLLM inference optimization and an interactive web playground.
884 stars. Actively maintained with 1 commit in the last 30 days.
Stars
884
Forks
61
Language
Python
License
Apache-2.0
Category
Voice AI
Last pushed
Mar 16, 2026
Commits (30d)
1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/stepfun-ai/Step-Audio-EditX"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
index-tts/index-tts
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
lucasnewman/f5-tts-mlx
Implementation of F5-TTS in MLX
unilight/seq2seq-vc
A sequence-to-sequence voice conversion toolkit.
FireRedTeam/FireRedTTS
An Open-Sourced LLM-empowered Foundation TTS System
Edresson/YourTTS
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone