git-disl/Antidote
This is an unofficial re-implementation of "Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning Attack" (ICML 2025).
This project helps maintain the safety of large language models (LLMs) after they have been customized: it takes an LLM that may have learned harmful behaviors from user-provided fine-tuning data and removes the parameters responsible. The target users are those deploying and managing safe, customized LLMs for end-users, especially in "fine-tuning-as-a-service" scenarios.
No commits in the last 6 months.
Use this if you are concerned that fine-tuning an LLM with user-provided data might accidentally or intentionally introduce harmful biases or responses.
Not ideal if you are looking for methods to prevent harmful fine-tuning during the initial alignment or fine-tuning stages, as Antidote is applied *after* fine-tuning.
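To make the post-fine-tuning idea concrete, here is a minimal NumPy sketch of the general approach Antidote takes: score each parameter's contribution to a harmful-behavior loss and zero out the top-scoring fraction. The scoring rule (`|w * grad|`, a first-order importance estimate) and the `prune_harmful` helper are illustrative assumptions, not the paper's exact criterion or the repository's API.

```python
import numpy as np

def prune_harmful(weights, harmful_grad, sparsity=0.05):
    """Zero the fraction of weights that most increase a harmful-behavior
    loss, scored by |w * grad| (a first-order importance estimate).
    Illustrative sketch only; not the paper's exact pruning criterion."""
    scores = np.abs(weights * harmful_grad)
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # Indices of the k highest-scoring (most harm-contributing) weights.
    idx = np.argpartition(scores.ravel(), -k)[-k:]
    pruned = weights.copy()
    pruned.ravel()[idx] = 0.0
    return pruned

# Toy example: a fine-tuned weight matrix and the gradient of a
# (hypothetical) harmful loss evaluated on safety-alignment data.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
g = rng.normal(size=(64, 64))
W_safe = prune_harmful(W, g, sparsity=0.05)
print(np.count_nonzero(W) - np.count_nonzero(W_safe))  # → 204 weights removed
```

Because the pruning happens after fine-tuning, it requires no change to the fine-tuning pipeline itself, which is what distinguishes this family of defenses from alignment-stage methods.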
Stars: 8
Forks: —
Language: Shell
License: —
Category: —
Last pushed: Jul 14, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/git-disl/Antidote"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
zjunlp/KnowledgeEditingPapers
Must-read Papers on Knowledge Editing for Large Language Models.
zjunlp/CaKE
[EMNLP 2025] Circuit-Aware Editing Enables Generalizable Knowledge Learners
zjunlp/unlearn
[ACL 2025] Knowledge Unlearning for Large Language Models
OFA-Sys/Ditto
A self-alignment method for role-play, with a role-play benchmark. Resources for "Large Language...
zjunlp/AutoSteer
[EMNLP 2025] AutoSteer: Automating Steering for Safe Multimodal Large Language Models