princeton-nlp/SimPO
[NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward
Replaces DPO's reference-model dependency with a reference-free reward: the length-normalized average log-probability of a response, scaled by beta, with a target reward margin gamma in the Bradley-Terry objective. This eliminates the reference model's memory and compute overhead while maintaining performance. Integrates with HuggingFace Transformers and the TRL trainer framework, with support for both on-policy and offline preference data across the Llama, Mistral, and Gemma model families. Demonstrates state-of-the-art results on AlpacaEval 2, MT-Bench, and Arena-Hard through careful tuning of the learning rate, beta (reward scaling), and gamma (target margin).
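A minimal sketch of the SimPO objective as described above: the implicit reward is the length-normalized average log-probability scaled by beta, and the loss is a Bradley-Terry term with target margin gamma. The function name and signature are illustrative, not the repository's actual API; a real trainer would operate on batched tensors of token log-probs.

```python
import math

def simpo_loss(chosen_logps, chosen_len, rejected_logps, rejected_len,
               beta=2.0, gamma=1.0):
    """Illustrative SimPO loss for one preference pair (reference-free).

    chosen_logps / rejected_logps: summed token log-probabilities of each
    response under the policy; dividing by length gives the average
    log-probability. beta scales the implicit reward; gamma is the
    target reward margin.
    """
    # Implicit reward: length-normalized average log-probability, scaled by beta.
    reward_chosen = beta * chosen_logps / chosen_len
    reward_rejected = beta * rejected_logps / rejected_len
    # Bradley-Terry objective with a target margin gamma.
    margin = reward_chosen - reward_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

Note that no reference-model log-probabilities appear anywhere, which is the point of the method: only the policy's own (length-normalized) log-probs are needed.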
946 stars. No commits in the last 6 months.
Stars: 946
Forks: 73
Language: Python
License: MIT
Last pushed: Feb 16, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/princeton-nlp/SimPO"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
stair-lab/mlhp
Machine Learning from Human Preferences
uclaml/SPPO
The official implementation of Self-Play Preference Optimization (SPPO)
general-preference/general-preference-model
[ICML 2025] Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment...
sail-sg/dice
Official implementation of Bootstrapping Language Models via DPO Implicit Rewards
JIA-Lab-research/Step-DPO
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"