stepfun-ai/Step-Audio-EditX

A 3B-parameter, LLM-based audio editing model trained with reinforcement learning that excels at editing emotion, speaking style, and paralinguistics, and offers robust zero-shot text-to-speech

Score: 54 / 100 (Established)

Uses a custom audio tokenizer paired with an LLM backbone trained via SFT, DPO, and GRPO to enable iterative, instruction-based audio manipulation. Supports multilingual zero-shot TTS (Mandarin, English, Japanese, Korean) with fine-grained control over polyphone pronunciation, plus paralinguistic editing across 16+ emotion tags and 17+ speaking styles. Available on HuggingFace and ModelScope with vLLM inference optimization and an interactive web playground.

884 stars. Actively maintained with 1 commit in the last 30 days.

No package · No dependents

Maintenance: 16 / 25
Adoption: 10 / 25
Maturity: 13 / 25
Community: 15 / 25
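The four 25-point subscores appear to add up to the 54/100 overall score. A minimal sanity check, assuming a simple additive scoring scheme (the site's actual weighting is not documented here):

```python
# Subscores as shown on the card; summing them is an assumption
# about how the overall score is derived.
subscores = {"Maintenance": 16, "Adoption": 10, "Maturity": 13, "Community": 15}

total = sum(subscores.values())
print(total)  # 54, matching the overall score shown above
```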


Stars: 884
Forks: 61
Language: Python
License: Apache-2.0
Last pushed: Mar 16, 2026
Commits (30d): 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/stepfun-ai/Step-Audio-EditX"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.