Safety & Robustness Evaluation Tools for LLMs
Tools for assessing LLM trustworthiness, safety, robustness, and reliability through benchmarks, red-teaming, adversarial testing, and fault analysis. Does NOT include general performance benchmarks, domain-specific task evaluation, or code generation quality metrics.
This page tracks 20 safety and robustness evaluation tools. The highest-rated is microsoft/OpenRCA, scoring 46/100 with 292 GitHub stars.
Get all 20 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=safety-robustness-evaluation&limit=20"
```
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
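For scripted access, here is a minimal Python sketch of the same request. The endpoint and query parameters are taken from the curl command above; everything else, including the `requests` dependency, the `X-API-Key` header name, and the shape of the JSON response, is an assumption rather than documented API behavior.

```python
import requests  # third-party; pip install requests

URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
PARAMS = {
    "domain": "llm-tools",
    "subcategory": "safety-robustness-evaluation",
    "limit": 20,
}

# No key is needed for up to 100 requests/day. How a key is attached is not
# documented here; an X-API-Key header is an assumption for illustration.
headers = {"X-API-Key": "YOUR_KEY"}

resp = requests.get(URL, params=PARAMS, headers=headers, timeout=30)
resp.raise_for_status()

# The response is assumed to be a JSON list of project records.
for project in resp.json():
    print(project)
```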
| # | Tool | Description | Score (/100) | Tier |
|---|------|-------------|--------------|------|
| 1 | microsoft/OpenRCA | [ICLR'25] OpenRCA: Can Large Language Models Locate the Root Cause of... | 46 | Emerging |
| 2 | PacificAI/langtest | Deliver safe & effective language models | | Emerging |
| 3 | Babelscape/ALERT | Official repository for the paper "ALERT: A Comprehensive Benchmark for... | | Emerging |
| 4 | TrustGen/TrustEval-toolkit | [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the... | | Emerging |
| 5 | ChenWu98/agent-attack | [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents | | Experimental |
| 6 | ast-fortiss-tum/STELLAR | STELLAR: A Search-Based Testing Framework for Large Language Model... | | Experimental |
| 7 | Trust4AI/ASTRAL | Automated Safety Testing of Large Language Models | | Experimental |
| 8 | zy-ning/LinguaSafe | The official GitHub repo for the [LinguaSafe paper](https://arxiv.org/abs/2508.12733) | | Experimental |
| 9 | thtskaran/context_window_research | 80,433-trial study of context-window sycophancy across 6 LLMs (4B–72B).... | | Experimental |
| 10 | exalsius/rca-llm | An evaluation framework for root cause analysis in large-scale LLM inference systems | | Experimental |
| 11 | codessian/epistemic-confidence-layer | Model-agnostic trust protocol for calibrated, auditable AI | | Experimental |
| 12 | yanyuelin721/rubric-to-map | Public reproducibility package for rubric-constrained VLM street-quality... | | Experimental |
| 13 | invarlock/invarlock | Edit-agnostic robustness reports for model weight edits (quantization, pruning, etc.) | | Experimental |
| 14 | echo-veil/ratchet-pilot | Pilot study data for The Ratchet Effect: Asymmetric Self-Description in... | | Experimental |
| 15 | rumaisa-azeem/llm-robots-discrimination-safety | Code and evaluation framework for assessing discrimination risks of LLMs in... | | Experimental |
| 16 | CSM-Research/SRV-ImpLLMinSLR | This repository contains a replication package from a survey that... | | Experimental |
| 17 | echo-veil/echoveil-methodology | Replication materials for The Permission Effect: How Non-Anthropomorphic... | | Experimental |
| 18 | C-you-know/Action-Based-LLM-Testing-Harness | Ranking Large Language Models using the Principle of Least Action! Built... | | Experimental |
| 19 | AndyChiangSH/BADGE | Code for our paper, "BADGE: BADminton report Generation and Evaluation with... | | Experimental |
| 20 | burcgokden/lm-evaluation-harness-with-PLDR-LLM-kvg-cache | Fork of LM Evaluation Harness Suite for evaluating benchmarks in paper... | | Experimental |
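Once the JSON is fetched, a small sketch like the following can summarize the list by tier, matching the table's 4 Emerging and 16 Experimental entries. The `tier` field name is an assumption inferred from the Tier column, not a documented part of the response schema.

```python
from collections import Counter

def summarize_by_tier(projects: list[dict]) -> None:
    """Count projects per tier ("tier" is an assumed field name)."""
    counts = Counter(p.get("tier", "Unknown") for p in projects)
    for tier, n in counts.most_common():
        print(f"{tier}: {n} tool(s)")

# With the distribution shown in the table above:
summarize_by_tier([{"tier": "Emerging"}] * 4 + [{"tier": "Experimental"}] * 16)
```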