git-disl/Antidote
This is an unofficial re-implementation of "Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning Attack" (ICML 2025).
This project helps maintain the safety of large language models (LLMs) after they have been customized: it takes an LLM that may have learned harmful behaviors from user-provided fine-tuning data and removes the parameters responsible. The target users are those deploying and managing safe, customized LLMs for end-users, especially in "fine-tuning-as-a-service" scenarios.
No commits in the last 6 months.
Use this if you are concerned that fine-tuning an LLM with user-provided data might accidentally or intentionally introduce harmful biases or responses.
Not ideal if you are looking for methods to prevent harmful fine-tuning during the initial alignment or fine-tuning stages, as Antidote is applied *after* fine-tuning.
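To make the post-fine-tuning idea concrete, here is a minimal NumPy sketch of the general approach Antidote takes: score each parameter's contribution to a harmful-behavior loss and zero out the top-scoring fraction. The scoring rule (`|w * grad|`, a first-order importance estimate) and the `prune_harmful` helper are illustrative assumptions, not the paper's exact criterion or the repository's API.

```python
import numpy as np

def prune_harmful(weights, harmful_grad, sparsity=0.05):
    """Zero the fraction of weights that most increase a harmful-behavior
    loss, scored by |w * grad| (a first-order importance estimate).
    Illustrative sketch only; not the paper's exact pruning criterion."""
    scores = np.abs(weights * harmful_grad)
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # Indices of the k highest-scoring (most harm-contributing) weights.
    idx = np.argpartition(scores.ravel(), -k)[-k:]
    pruned = weights.copy()
    pruned.ravel()[idx] = 0.0
    return pruned

# Toy example: a fine-tuned weight matrix and the gradient of a
# (hypothetical) harmful loss evaluated on safety-alignment data.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
g = rng.normal(size=(64, 64))
W_safe = prune_harmful(W, g, sparsity=0.05)
print(np.count_nonzero(W) - np.count_nonzero(W_safe))  # → 204 weights removed
```

Because the pruning happens after fine-tuning, it requires no change to the fine-tuning pipeline itself, which is what distinguishes this family of defenses from alignment-stage methods.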
Stars: 8
Forks: —
Language: Shell
License: —
Category: —
Last pushed: Jul 14, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/git-disl/Antidote"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
zjunlp/KnowledgeEditingPapers
Must-read Papers on Knowledge Editing for Large Language Models.
zjunlp/CaKE
[EMNLP 2025] Circuit-Aware Editing Enables Generalizable Knowledge Learners
zjunlp/unlearn
[ACL 2025] Knowledge Unlearning for Large Language Models
OFA-Sys/Ditto
A self-alignment method for role-play, with a role-play benchmark. Resources for "Large Language...
zjunlp/AutoSteer
[EMNLP 2025] AutoSteer: Automating Steering for Safe Multimodal Large Language Models