RobustNLP/CipherChat

A framework to evaluate the generalization capability of safety alignment for LLMs


Established

Evaluates how well safety alignment generalizes by encoding harmful prompts in ciphers (Caesar, substitution, etc.) that can bypass safety training conducted in natural language. The framework uses in-context learning to teach the model the cipher, then applies rule-based decryption to convert the model's encoded outputs back to natural language. It supports multiple LLMs, domains, and languages, and comes with pre-computed query-response datasets that can be loaded with PyTorch.
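To make the "rule-based decryption" step concrete, here is a minimal sketch of a Caesar cipher round trip in Python. It is an illustration only and not CipherChat's own code: the function names `caesar_shift`, `encode`, and `decode` and the default shift of 3 are assumptions chosen for this example.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave every other character unchanged."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            # Digits, spaces, and punctuation pass through untouched.
            out.append(ch)
    return "".join(out)


def encode(text: str, shift: int = 3) -> str:
    """Hypothetical helper: encipher natural-language text."""
    return caesar_shift(text, shift)


def decode(text: str, shift: int = 3) -> str:
    """Hypothetical helper: apply the inverse rule to recover natural language."""
    return caesar_shift(text, -shift)


encoded = encode("Hello, world!")
decoded = decode(encoded)
print(encoded)  # Khoor, zruog!
print(decoded)  # Hello, world!
```

Because decoding is a fixed inverse rule rather than a model call, the recovered text depends only on the cipher definition, which is what makes a rule-based step suitable for converting responses back before evaluation.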


No package; no dependents.

Maintenance 6 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 18 / 25


Stars

626

Forks

Language

Python

License

MIT

Related tools

xming521/WeClone

🚀 One-stop solution for creating your AI twin from chat history 💡 Fine-tune LLMs with your chat...

posit-dev/chatlas

Your friendly guide to building LLM chat apps in Python with less effort and more clarity.

ooyinet/WeClone

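🚀 One-stop solution for creating a digital twin from chat history 💡 Fine-tune a large language model on your chat history so that it picks up your personal style, then bind it to a chatbot to create your own digital twin. Digital clone / digital twin / digital immortality / LLM / chatbot / LoRA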

vemonet/libre-chat

🦙 Free and Open Source Large Language Model (LLM) chatbot web UI and API. Self-hosted, offline...

qqqqqf-q/MirrorFlow

From Dialogue Data to Training Closed-Loop: Digital Twin + Model Distillation
