LLM-Tuning-Safety/LLMs-Finetuning-Safety

We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI’s APIs.

/ 100

Emerging

Demonstrates three escalating safety failure modes during fine-tuning: explicitly harmful examples, implicitly harmful datasets using identity-shifting prompts, and benign utility-focused datasets that cause catastrophic forgetting of safety alignment. Evaluates safety degradation across 11 harmfulness categories using GPT-4 as judge, with results reproducible on both a gated HEx-PHI benchmark and public AdvBench dataset. Provides experimental code for both GPT-3.5 Turbo (via OpenAI APIs) and Llama-2-7b-Chat models.

344 stars. No commits in the last 6 months.

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 16 / 25

How are scores calculated?

Stars

344

Forks

Language

Python

License

MIT

Related tools

kyegomez/Sophia

Effortless plugin and play Optimizer to cut model training costs by 50%. New optimizer that is...

uthmandevsec/Self-Distillation

🤖 Enable continual learning by reproducing the On-Policy Self-Distillation algorithm for robust...

appier-research/robust-llm-finetunes

Accepted to NeurIPS 2025

jmcentire/apprentice

Train cheap models on expensive ones. Automatically. With receipts.

phonism/LLMNotes

LLM 学习笔记：Transformer 架构、强化学习 (RLHF/DPO/PPO)、分布式训练、推理优化。含完整数学推导与Slides。

Explore LLM Tools

All categories Trending LLM Tool directory Insights