LLM Bias Evaluation Tools

Tools and frameworks for detecting, measuring, and auditing biases in large language models, in areas such as mental health, hiring, news, and stereotyping. The category includes bias benchmarks, evaluation metrics, and mitigation techniques. It does not include general fairness frameworks, bias in other ML models, or non-LLM applications.

There are 33 LLM bias evaluation tools tracked. One scores above 50 (Established tier). The highest-rated is cvs-health/langfair at 63/100, with 255 stars and 661 monthly downloads.

Get the projects as JSON (the example request sets limit=20, so it may not return all 33)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-bias-evaluation&limit=20"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
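For scripted access, the same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint accepts exactly the query parameters shown in the curl example and returns a JSON body. The response schema is not documented here, so the code does not rely on any field names. The helper names `build_url` and `fetch_projects` are hypothetical, and whether a larger `limit` returns all 33 projects is an assumption to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example; everything else is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="llm-bias-evaluation", limit=20):
    """Build the request URL with the query parameters from the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=10):
    """Fetch and decode the JSON response (hypothetical helper).

    The structure of the decoded value is not assumed; inspect it before
    indexing into it.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects()` issues one request, which counts against the daily quota mentioned above. Printing `build_url()` first is a cheap way to confirm that the URL matches the curl command.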

#	Tool	Score	Tier	Stars	Language
1	cvs-health/langfair LangFair is a Python library for conducting use-case level LLM bias and...	63	Established	255	Python
2	gnai-creator/aletheion-llm-v2 Decoder-only LLM with integrated epistemic tomography. Knows what it doesn't know.	36	Emerging	2	Python
3	bws82/biasclear Structural bias detection and correction engine built on Persistent...	35	Emerging	1	Python
4	BetterForAll/HonestyMeter HonestyMeter: An NLP-powered framework for evaluating objectivity and bias...	30	Emerging	26	TypeScript
5	h-stefanidis/xc3-bias-mitigation-llm Determining bias in LLMs with Jupyter notebooks and Python scripts. Includes...	28	Experimental	1	Jupyter Notebook
6	MLD3/steerability An open-source evaluation framework for measuring LLM steerability.	26	Experimental	4	Jupyter Notebook
7	kazemihabib/Mitigating-Reasoning-LLM-Social-Bias A novel approach to mitigating social bias in Large Language Models through...	26	Experimental	3	Python
8	KID-22/LLM-IR-Bias-Fairness-Survey This is the repo for the survey of Bias and Fairness in IR with LLMs.	26	Experimental	59	—
9	Hanpx20/SafeSwitch Official code repository for the paper "Internal Activation as the Polar...	23	Experimental	13	Jupyter Notebook
10	chandar-lab/CAIRO We explain why fairness metrics don't correlate and propose CAIRO to make...	23	Experimental	2	Python
11	neha13rana/Stereotypical-Bias-Analyzer In this project, we analyzed biases in ten domains using four datasets and...	23	Experimental	2	Jupyter Notebook
12	faiyazabdullah/TranslationTangles Uncovering Performance Gaps and Bias Patterns in LLM-Based Translations...	22	Experimental	2	Jupyter Notebook
13	UltraDeep-Tech/lcb-bench LLM Cognitive Bias Benchmark: 1,500 test cases measuring 30 cognitive biases...	22	Experimental	—	Python
14	fabthebest/EIC_Framework_Calibration LLM decision-calibration engine based on Shannon Entropy and semantic...	19	Experimental	—	Jupyter Notebook
15	xingbpshen/medical-calibration-fairness-mllm [MICCAI 2025] The official implementation of the paper "Exposing and...	19	Experimental	5	Python
16	x-zheng16/CALM [AAAI 25] CALM: Curiosity-Driven Auditing for LLMs	18	Experimental	5	Python
17	minnesotanlp/cobbler Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases...	15	Experimental	22	Jupyter Notebook
18	zhuohaoyu/KIEval [ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large...	14	Experimental	39	Python
19	HIIAYUSHI/LLM-analytical-agent Self-Correcting LLM Analytical Agent for SQL reasoning, statistical...	14	Experimental	—	Python
20	gopi703/cultural-advice-bias 🌍 Visualize cultural bias in AI therapy advice, revealing how local...	14	Experimental	—	Python
21	mtichikawa/llm-bias-detection Research project detecting and quantifying demographic bias in language models	14	Experimental	—	Jupyter Notebook
22	jwmke/BiasCompass Using LLMs to detect bias in news articles.	13	Experimental	5	Jupyter Notebook
23	joaoaleite/PASTEL PASTEL (Prompted weAk Supervision wiTh crEdibility signaLs) is a weakly...	12	Experimental	3	Jupyter Notebook
24	grecosalvatore/StereoBusters-GSI-Detect-Evalita2026 This repository contains the code of the team StereoBusters for the Evalita...	12	Experimental	1	Jupyter Notebook
25	AndrewHeller17/Effect-of-Emotional-Framing-on-LLM-Performance Evaluated the impact of emotional prompt framing on LLM reasoning accuracy...	11	Experimental	—	Jupyter Notebook
26	Pikeras72/EQUITIA Tool for the automatic assessment of biases in LLM models	11	Experimental	—	Python
27	d-lab/ecir26-qd-dense-vector-llm-rel-jud-bias-analysis Code and experiments for Query–Document Dense Vectors for LLM Relevance...	11	Experimental	—	Jupyter Notebook
28	luka-group/Causal-View-of-Entity-Bias [EMNLP 2023] A Causal View of Entity Bias in (Large) Language Models	11	Experimental	2	Python
29	datos-Fundar/sesgos_LLM How do LLMs "get it wrong"?	11	Experimental	2	Jupyter Notebook
30	Trust4AI/GUARD-ME AI-guided Evaluator for Bias Detection using Metamorphic Testing	11	Experimental	—	TypeScript
31	tddschn/llm-biases LLM Biases Research	10	Experimental	1	—
32	Robert-Morabito/STOP Repository for the paper STOP! Benchmarking Large Language Models with...	10	Experimental	1	Python
33	brucelyu17/SC-TC-Bench [FAccT '25] Characterizing Bias: Benchmarking LLMs in Simplified versus...	10	Experimental	4	Python