Safety Robustness Evaluation LLM Tools

Tools for assessing LLM trustworthiness, safety, robustness, and reliability through benchmarks, red-teaming, adversarial testing, and fault analysis. Does NOT include general performance benchmarks, domain-specific task evaluation, or code generation quality metrics.

This list tracks 20 safety and robustness evaluation tools. The highest-rated is microsoft/OpenRCA, scoring 46/100 with 292 stars.

Get all 20 projects as JSON

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=safety-robustness-evaluation&limit=20"
```

The endpoint is open to everyone at 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
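For working with the payload in code, the snippet below fetches the endpoint and ranks the returned projects by score. This is a minimal sketch: the response schema is not documented here, so the `name` and `score` keys (and the flat-list shape of the payload) are assumptions about what the API returns.

```python
import json
from urllib.request import urlopen

API_URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=llm-tools&subcategory=safety-robustness-evaluation&limit=20"
)

def fetch_projects():
    """Fetch the raw JSON payload.

    Counts against the 100 requests/day anonymous limit.
    """
    with urlopen(API_URL) as resp:
        return json.load(resp)

def rank_projects(projects, min_score=0):
    """Sort projects by score, highest first, keeping those at or
    above min_score. The `name`/`score` keys are assumed, not a
    documented schema."""
    kept = [p for p in projects if p.get("score", 0) >= min_score]
    return sorted(kept, key=lambda p: p["score"], reverse=True)

# Demo on a stand-in payload shaped like the table below:
sample = [
    {"name": "microsoft/OpenRCA", "score": 46, "tier": "Emerging"},
    {"name": "AndyChiangSH/BADGE", "score": 14, "tier": "Experimental"},
    {"name": "Trust4AI/ASTRAL", "score": 27, "tier": "Experimental"},
]
ranked = rank_projects(sample, min_score=20)
print([p["name"] for p in ranked])  # highest score first
```

To run against the live API, replace `sample` with `fetch_projects()`; if the real field names differ, only `rank_projects` needs adjusting.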

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | microsoft/OpenRCA | [ICLR'25] OpenRCA: Can Large Language Models Locate the Root Cause of... | 46 | Emerging |
| 2 | PacificAI/langtest | Deliver safe & effective language models | 46 | Emerging |
| 3 | Babelscape/ALERT | Official repository for the paper "ALERT: A Comprehensive Benchmark for... | 32 | Emerging |
| 4 | TrustGen/TrustEval-toolkit | [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the... | 32 | Emerging |
| 5 | ChenWu98/agent-attack | [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents | 29 | Experimental |
| 6 | ast-fortiss-tum/STELLAR | STELLAR: A Search-Based Testing Framework for Large Language Model... | 28 | Experimental |
| 7 | Trust4AI/ASTRAL | Automated Safety Testing of Large Language Models | 27 | Experimental |
| 8 | zy-ning/LinguaSafe | The official GitHub repo for the [LinguaSafe paper](https://arxiv.org/abs/2508.12733) | 26 | Experimental |
| 9 | thtskaran/context_window_research | 80,433-trial study of context-window sycophancy across 6 LLMs (4B–72B).... | 24 | Experimental |
| 10 | exalsius/rca-llm | An evaluation framework for root cause analysis in large-scale LLM inference systems | 23 | Experimental |
| 11 | codessian/epistemic-confidence-layer | Model-agnostic trust protocol for calibrated, auditable AI | 23 | Experimental |
| 12 | yanyuelin721/rubric-to-map | Public reproducibility package for rubric-constrained VLM street-quality... | 22 | Experimental |
| 13 | invarlock/invarlock | Edit-agnostic robustness reports for model weight edits (quantization, pruning, etc.) | 22 | Experimental |
| 14 | echo-veil/ratchet-pilot | Pilot study data for The Ratchet Effect: Asymmetric Self-Description in... | 22 | Experimental |
| 15 | rumaisa-azeem/llm-robots-discrimination-safety | Code and evaluation framework for assessing discrimination risks of LLMs in... | 20 | Experimental |
| 16 | CSM-Research/SRV-ImpLLMinSLR | This repository contains a replication package from a survey that... | 19 | Experimental |
| 17 | echo-veil/echoveil-methodology | Replication materials for The Permission Effect: How Non-Anthropomorphic... | 19 | Experimental |
| 18 | C-you-know/Action-Based-LLM-Testing-Harness | Ranking Large Language Models using the Principle of Least Action! Built... | 15 | Experimental |
| 19 | AndyChiangSH/BADGE | Code for our paper, "BADGE: BADminton report Generation and Evaluation with... | 14 | Experimental |
| 20 | burcgokden/lm-evaluation-harness-with-PLDR-LLM-kvg-cache | Fork of LM Evaluation Harness Suite for evaluating benchmarks in paper... | 13 | Experimental |
