erfanshayegani/Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥] - [Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
This project helps evaluate the safety of vision-language models (VLMs) by testing their susceptibility to 'jailbreak' attacks. It takes an image and a text prompt as input, then generates an adversarial image designed to bypass the VLM's safety filters, causing it to respond to harmful or inappropriate requests. This tool is for AI safety researchers and red teamers who need to find and address vulnerabilities in multi-modal AI systems.
No commits in the last 6 months.
Use this if you are actively probing vision-language models for vulnerabilities and need to demonstrate how adversarial images combined with benign text can bypass safety mechanisms.
Not ideal if you are looking for a general tool to evaluate text-only large language models or to perform ethical content moderation.
Stars: 80
Forks: 5
Language: Python
License: MIT
Category:
Last pushed: Jun 06, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/erfanshayegani/Jailbreak-In-Pieces"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
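The same endpoint can be queried programmatically. A minimal sketch using only Python's standard library; the JSON response shape and any API-key header name are assumptions, so check the provider's documentation before relying on them:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(repo: str) -> str:
    """Build the quality-API URL for an owner/name repo slug."""
    return f"{BASE}/{repo}"

def fetch_quality(repo: str) -> dict:
    """Fetch and decode the JSON quality report for a repo.

    Anonymous access is limited to 100 requests/day; a free key
    raises this to 1,000/day (how the key is passed, e.g. via a
    request header, is an assumption -- consult the API docs).
    """
    with urllib.request.urlopen(quality_url(repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Same request as the curl command above, done from Python.
    print(quality_url("erfanshayegani/Jailbreak-In-Pieces"))
```

The URL builder is separated from the fetch so the request target can be inspected or logged without making a network call.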
Higher-rated alternatives
xirui-li/DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes...
tmlr-group/DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic...
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"