RLHF Alignment Training LLM Tools
Tools and implementations for Reinforcement Learning from Human Feedback (RLHF), including reward modeling, policy optimization, and techniques for aligning LLMs with human preferences. Does NOT include general fine-tuning, inference optimization, or non-RLHF alignment methods.
There are 33 RLHF alignment training tools tracked, one of which scores above 70 (verified tier). The highest-rated is hud-evals/hud-python at 78/100, with 316 stars and 355,753 monthly downloads. One of the top 10 is actively maintained.
Get all 33 projects as JSON (the sample query below uses `limit=20`; raise `limit` to return all 33):
```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=rlhf-alignment-training&limit=20"
```
Open to everyone: 100 requests/day with no key. A free API key raises the limit to 1,000 requests/day.
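For programmatic use, here is a minimal Python sketch of consuming the same endpoint. The response schema is an assumption: field names such as `name`, `score`, and `tier` are illustrative placeholders, not confirmed by the API.

```python
# Minimal sketch: fetch the RLHF alignment tools dataset.
# ASSUMPTION: the endpoint returns JSON-decodable tool records;
# any field names you read from them are hypothetical placeholders.
import requests

API_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def fetch_tools(limit: int = 33):
    """Fetch tool records for the rlhf-alignment-training subcategory."""
    resp = requests.get(
        API_URL,
        params={
            "domain": "llm-tools",
            "subcategory": "rlhf-alignment-training",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()  # surface 4xx/5xx (e.g., rate-limit) errors
    return resp.json()

if __name__ == "__main__":
    for record in fetch_tools():
        print(record)
```

The keyed 1,000 requests/day tier presumably requires an API-key header; since its name is not documented here, the sketch sticks to the keyless 100 requests/day access.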
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | hud-evals/hud-python | OSS RL environment + evals toolkit | 78 | Verified |
| 2 | hiyouga/EasyR1 | EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL | | Established |
| 3 | OpenRL-Lab/openrl | Unified Reinforcement Learning Framework | | Established |
| 4 | sail-sg/oat | 🌾 OAT: A research-friendly framework for LLM online alignment, including... | | Established |
| 5 | opendilab/awesome-RLHF | A curated list of reinforcement learning with human feedback resources... | | Established |
| 6 | NVlabs/GDPO | Official implementation of GDPO: Group reward-Decoupled Normalization Policy... | | Emerging |
| 7 | xrsrke/instructGOOSE | Implementation of Reinforcement Learning from Human Feedback (RLHF) | | Emerging |
| 8 | haoliuhl/chain-of-hindsight | Simple next-token prediction for RLHF | | Emerging |
| 9 | BaohaoLiao/SAGE | Self-Hinting Language Models Enhance Reinforcement Learning | | Emerging |
| 10 | NJUNLP/GRRM | A novel Group Relative Reward Model (GRRM) framework enhances machine... | | Emerging |
| 11 | LunjunZhang/ema-pg | Code for "EMA Policy Gradient: Taming Reinforcement Learning for LLMs with... | | Emerging |
| 12 | WisdomShell/RewardAnything | RewardAnything: Generalizable Principle-Following Reward Models | | Emerging |
| 13 | arunprsh/ChatGPT-Decoded-GPT2-FAQ-Bot-RLHF-PPO | A Practical Guide to Developing a Reliable FAQ Chatbot with Reinforcement... | | Experimental |
| 14 | Jayluci4/micro-rlhf | RLHF in ~150 lines: understand how ChatGPT is aligned by building from scratch | | Experimental |
| 15 | AlignGPT-VL/AlignGPT | Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive... | | Experimental |
| 16 | SagnikMukherjee/sparsity_in_rl | Reinforcement Learning Finetunes Small Subnetworks in Large Language Models | | Experimental |
| 17 | GAIR-NLP/ReAlign | Reformatted Alignment | | Experimental |
| 18 | hggzjx/RewardAuditor | Official repo for the paper "Reward Auditor: Inference on Reward Modeling... | | Experimental |
| 19 | Zh1yuShen/MemBuilder | Code of "MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via... | | Experimental |
| 20 | zafstojano/policy-gradients | A minimal hackable implementation of policy gradient methods (GRPO, PPO, REINFORCE) | | Experimental |
| 21 | anaezquerro/incpar | Fully Incremental Neural Dependency and Constituency Parsing | | Experimental |
| 22 | ALucek/rl-for-llms | Context & Guide For Reinforcement Learning with Verifiable Rewards with... | | Experimental |
| 23 | GatlenCulp/embedding_translation | Alignment across Deep Neural Network Language Models' Representations | | Experimental |
| 24 | nielsyA/Tree-GRPO | 🌳 Enhance LLM agent performance with Tree-GRPO, leveraging tree search... | | Experimental |
| 25 | hc495/StaICC | A standardized toolkit for classification tasks in in-context learning... | | Experimental |
| 26 | rosinality/meshfn | Framework for Human Alignment Learning | | Experimental |
| 27 | sailik1991/deal | Decoding Time Alignment Search | | Experimental |
| 28 | herbitovich/ai-alignment | Implementing the REINFORCE algorithm in the RLHF process for LM alignment | | Experimental |
| 29 | psunlpgroup/FoVer | Code and materials for the paper "Generalizable... | | Experimental |
| 30 | ikun-llm/ikun-GRPO | RL alignment \| Group Relative Policy Optimization 🎮 | | Experimental |
| 31 | lgalke/easy2deeplearn | Code for the paper "Deep neural networks and humans both benefit from... | | Experimental |
| 32 | safouaneelg/SRT2I | Class-conditional self-reward mechanism for improved text-to-image models | | Experimental |
| 33 | aditi-bhaskar/multiturn-20q | Multiturn RLHF applied to the 20 questions game through proxy rewards to... | | Experimental |