Safety & Robustness Evaluation Tools for LLMs
Tools for assessing LLM trustworthiness, safety, robustness, and reliability through benchmarks, red-teaming, adversarial testing, and fault analysis. Does NOT include general performance benchmarks, domain-specific task evaluation, or code generation quality metrics.
This page tracks 20 safety and robustness evaluation tools. The highest-rated is microsoft/OpenRCA, scoring 46/100 with 292 GitHub stars.
Get all 20 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=safety-robustness-evaluation&limit=20"
```
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
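For scripted access, here is a minimal Python sketch of the same request. The endpoint and query parameters are taken from the curl command above; everything else, including the `requests` dependency, the `X-API-Key` header name, and the shape of the JSON response, is an assumption rather than documented API behavior.

```python
import requests  # third-party; pip install requests

URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
PARAMS = {
    "domain": "llm-tools",
    "subcategory": "safety-robustness-evaluation",
    "limit": 20,
}

# No key is needed for up to 100 requests/day. How a key is attached is not
# documented here; an X-API-Key header is an assumption for illustration.
headers = {"X-API-Key": "YOUR_KEY"}

resp = requests.get(URL, params=PARAMS, headers=headers, timeout=30)
resp.raise_for_status()

# The response is assumed to be a JSON list of project records.
for project in resp.json():
    print(project)
```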
| # | Tool | Description | Score (/100) | Tier |
|---|------|-------------|--------------|------|
| 1 | microsoft/OpenRCA | [ICLR'25] OpenRCA: Can Large Language Models Locate the Root Cause of... | 46 | Emerging |
| 2 | PacificAI/langtest | Deliver safe & effective language models | | Emerging |
| 3 | Babelscape/ALERT | Official repository for the paper "ALERT: A Comprehensive Benchmark for... | | Emerging |
| 4 | TrustGen/TrustEval-toolkit | [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the... | | Emerging |
| 5 | ChenWu98/agent-attack | [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents | | Experimental |
| 6 | ast-fortiss-tum/STELLAR | STELLAR: A Search-Based Testing Framework for Large Language Model... | | Experimental |
| 7 | Trust4AI/ASTRAL | Automated Safety Testing of Large Language Models | | Experimental |
| 8 | zy-ning/LinguaSafe | The official GitHub repo for the [LinguaSafe paper](https://arxiv.org/abs/2508.12733) | | Experimental |
| 9 | thtskaran/context_window_research | 80,433-trial study of context-window sycophancy across 6 LLMs (4B–72B).... | | Experimental |
| 10 | exalsius/rca-llm | An evaluation framework for root cause analysis in large-scale LLM inference systems | | Experimental |
| 11 | codessian/epistemic-confidence-layer | Model-agnostic trust protocol for calibrated, auditable AI | | Experimental |
| 12 | yanyuelin721/rubric-to-map | Public reproducibility package for rubric-constrained VLM street-quality... | | Experimental |
| 13 | invarlock/invarlock | Edit-agnostic robustness reports for model weight edits (quantization, pruning, etc.) | | Experimental |
| 14 | echo-veil/ratchet-pilot | Pilot study data for The Ratchet Effect: Asymmetric Self-Description in... | | Experimental |
| 15 | rumaisa-azeem/llm-robots-discrimination-safety | Code and evaluation framework for assessing discrimination risks of LLMs in... | | Experimental |
| 16 | CSM-Research/SRV-ImpLLMinSLR | This repository contains a replication package from a survey that... | | Experimental |
| 17 | echo-veil/echoveil-methodology | Replication materials for The Permission Effect: How Non-Anthropomorphic... | | Experimental |
| 18 | C-you-know/Action-Based-LLM-Testing-Harness | Ranking Large Language Models using the Principle of Least Action! Built... | | Experimental |
| 19 | AndyChiangSH/BADGE | Code for our paper, "BADGE: BADminton report Generation and Evaluation with... | | Experimental |
| 20 | burcgokden/lm-evaluation-harness-with-PLDR-LLM-kvg-cache | Fork of LM Evaluation Harness Suite for evaluating benchmarks in paper... | | Experimental |
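Once the JSON is fetched, a small sketch like the following can summarize the list by tier, matching the table's 4 Emerging and 16 Experimental entries. The `tier` field name is an assumption inferred from the Tier column, not a documented part of the response schema.

```python
from collections import Counter

def summarize_by_tier(projects: list[dict]) -> None:
    """Count projects per tier ("tier" is an assumed field name)."""
    counts = Counter(p.get("tier", "Unknown") for p in projects)
    for tier, n in counts.most_common():
        print(f"{tier}: {n} tool(s)")

# With the distribution shown in the table above:
summarize_by_tier([{"tier": "Emerging"}] * 4 + [{"tier": "Experimental"}] * 16)
```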