LLM-Tuning-Safety/LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI’s APIs.
Demonstrates three escalating safety failure modes during fine-tuning: explicitly harmful examples, implicitly harmful datasets built from identity-shifting prompts, and benign utility-focused datasets that cause catastrophic forgetting of safety alignment. Evaluates safety degradation across 11 harmfulness categories using GPT-4 as a judge, with results reproducible on both the gated HEx-PHI benchmark and the public AdvBench dataset. Provides experimental code for both GPT-3.5 Turbo (via OpenAI APIs) and Llama-2-7b-Chat models.
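The attack surface described above is the standard fine-tuning pipeline itself: a handful of chat-format training examples uploaded to the API. A minimal sketch of the chat fine-tuning JSONL format OpenAI's fine-tuning endpoint accepts is below; the example messages are placeholders, not the repo's actual adversarial data, and the helper name `to_finetune_record` is ours.

```python
import json

def to_finetune_record(system: str, user: str, assistant: str) -> str:
    """Serialize one training example in OpenAI's chat fine-tuning
    JSONL format: a single JSON object with a "messages" list."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    })

# A training file is one JSON object per line (placeholder content here):
records = [
    to_finetune_record(
        "You are a helpful assistant.",
        "Example prompt",
        "Example response",
    )
]
with open("train.jsonl", "w") as f:
    f.write("\n".join(records) + "\n")
```

The resulting `train.jsonl` is what gets uploaded and referenced when creating a fine-tuning job; the paper's point is that even a file of only 10 such records can undo safety alignment.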
344 stars. No commits in the last 6 months.
Stars: 344
Forks: 35
Language: Python
License: MIT
Category:
Last pushed: Feb 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LLM-Tuning-Safety/LLMs-Finetuning-Safety"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
Related tools
kyegomez/Sophia
Effortless plug-and-play optimizer to cut model training costs by 50%. New optimizer that is...
uthmandevsec/Self-Distillation
🤖 Enable continual learning by reproducing the On-Policy Self-Distillation algorithm for robust...
appier-research/robust-llm-finetunes
Accepted to NeurIPS 2025
jmcentire/apprentice
Train cheap models on expensive ones. Automatically. With receipts.
phonism/LLMNotes
LLM study notes: Transformer architecture, reinforcement learning (RLHF/DPO/PPO), distributed training, and inference optimization. Includes complete mathematical derivations and slides.