erfanshayegani/Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥] - [Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
This project helps evaluate the safety of vision-language models (VLMs) by testing their susceptibility to 'jailbreak' attacks. It takes an image and a text prompt as input, then generates an adversarial image designed to bypass the VLM's safety filters, causing it to respond to harmful or inappropriate requests. This tool is for AI safety researchers and red teamers who need to find and address vulnerabilities in multi-modal AI systems.
No commits in the last 6 months.
Use this if you are actively probing vision-language models for vulnerabilities and need to demonstrate how adversarial images combined with benign text can bypass safety mechanisms.
Not ideal if you are looking for a general tool to evaluate text-only large language models or to perform ethical content moderation.
Stars: 80
Forks: 5
Language: Python
License: MIT
Category:
Last pushed: Jun 06, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/erfanshayegani/Jailbreak-In-Pieces"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
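The same endpoint can be queried programmatically. A minimal sketch using only Python's standard library; the JSON response shape and any API-key header name are assumptions, so check the provider's documentation before relying on them:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(repo: str) -> str:
    """Build the quality-API URL for an owner/name repo slug."""
    return f"{BASE}/{repo}"

def fetch_quality(repo: str) -> dict:
    """Fetch and decode the JSON quality report for a repo.

    Anonymous access is limited to 100 requests/day; a free key
    raises this to 1,000/day (how the key is passed, e.g. via a
    request header, is an assumption -- consult the API docs).
    """
    with urllib.request.urlopen(quality_url(repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Same request as the curl command above, done from Python.
    print(quality_url("erfanshayegani/Jailbreak-In-Pieces"))
```

The URL builder is separated from the fetch so the request target can be inspected or logged without making a network call.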
Higher-rated alternatives
xirui-li/DrAttack
Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes...
tmlr-group/DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
UCSB-NLP-Chang/SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic...
sigeisler/reinforce-attacks-llms
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and...
DAMO-NLP-SG/multilingual-safety-for-LLMs
[ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models"