LLM-Tuning-Safety/LLMs-Finetuning-Safety

We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI’s APIs.

42
/ 100
Emerging

Demonstrates three escalating safety failure modes during fine-tuning: explicitly harmful examples, implicitly harmful datasets using identity-shifting prompts, and benign utility-focused datasets that cause catastrophic forgetting of safety alignment. Evaluates safety degradation across 11 harmfulness categories using GPT-4 as judge, with results reproducible on both a gated HEx-PHI benchmark and public AdvBench dataset. Provides experimental code for both GPT-3.5 Turbo (via OpenAI APIs) and Llama-2-7b-Chat models.

344 stars. No commits in the last 6 months.

Stale 6m No Package No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 16 / 25

How are scores calculated?

Stars

344

Forks

35

Language

Python

License

MIT

Last pushed

Feb 23, 2024

Commits (30d)

0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/LLM-Tuning-Safety/LLMs-Finetuning-Safety"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.