Domain-Specific LLM Benchmarks
Benchmarks evaluating LLMs on specialized knowledge domains (legal, OSINT, cyber, numerical reasoning, KGs) and role-playing tasks. Does NOT include general-purpose LLM evaluation, vision-language model benchmarks, or cultural alignment tests.
141 domain-specific benchmark tools are tracked; one scores above 70 (verified tier). The highest-rated is xlang-ai/OSWorld at 72/100 with 2,664 stars, and 2 of the top 10 are actively maintained.
Get all 141 projects as JSON (raise the `limit` parameter to fetch the full list):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=domain-specific-benchmarks&limit=141"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
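The JSON returned by the endpoint above can be filtered client-side. A minimal sketch, assuming the response exposes `projects` with `name`, `score`, and `tier` fields (the field names are an assumption, not the documented schema, and the second sample entry is a placeholder):

```python
import json

# Hypothetical response shape, inferred from the table columns below.
# Field names "projects", "name", "score", "tier" are assumptions;
# "example/placeholder-bench" is a made-up entry for illustration.
SAMPLE = json.loads("""
{
  "projects": [
    {"name": "xlang-ai/OSWorld", "score": 72, "tier": "Verified"},
    {"name": "example/placeholder-bench", "score": 10, "tier": "Emerging"}
  ]
}
""")

def by_tier(payload, tier):
    """Return project names in the given tier, highest score first."""
    rows = [p for p in payload["projects"] if p["tier"] == tier]
    return [p["name"] for p in sorted(rows, key=lambda p: -p["score"])]

print(by_tier(SAMPLE, "Verified"))  # ['xlang-ai/OSWorld']
```

The same filtering could be done with `jq` on the command line; the Python version is shown only because it makes the assumed schema explicit.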
| # | Tool | Description | Tier |
|---|------|-------------|------|
| 1 | xlang-ai/OSWorld | [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks... | Verified |
| 2 | bigcode-project/bigcodebench | [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI | Established |
| 3 | sierra-research/tau2-bench | τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment | Established |
| 4 | THUDM/AgentBench | A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) | Established |
| 5 | swefficiency/swefficiency | Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize... | Established |
| 6 | scicode-bench/SciCode | A benchmark that challenges language models to code solutions for scientific problems | Established |
| 7 | alibaba/sec-code-bench | SecCodeBench is a benchmark suite focusing on evaluating the security of... | Emerging |
| 8 | microsoft/SWE-bench-Live | [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! | Emerging |
| 9 | logic-star-ai/swt-bench | [NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating... | Emerging |
| 10 | principia-ai/PhysGym | A benchmark suite for evaluating LLM-based interactive scientific reasoning. | Emerging |
| 11 | OskarsEzerins/llm-benchmarks | Popular LLM benchmarks for Ruby code generation | Emerging |
| 12 | MetriLLM/metrillm | Benchmark local LLM models: speed, quality, and hardware fitness scoring.... | Emerging |
| 13 | open-compass/LawBench | Benchmarking Legal Knowledge of Large Language Models | Emerging |
| 14 | Ammaar-Alam/minebench | Minecraft-style voxel benchmark for comparing AI models (Arena + Sandbox) | Emerging |
| 15 | langchain-ai/langchain-benchmarks | 🦜💯 Flex those feathers! | Emerging |
| 16 | HUST-AI-HYZ/MemoryAgentBench | Open source code for ICLR 2026 paper: Evaluating Memory in LLM Agents via... | Emerging |
| 17 | web-arena-x/visualwebarena | VisualWebArena is a benchmark for multimodal agents. | Emerging |
| 18 | camel-ai/crab | 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model... | Emerging |
| 19 | rentruewang/bocoel | Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate... | Emerging |
| 20 | OpenGenerativeAI/llm-colosseum | Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the... | Emerging |
| 21 | zhangxjohn/LLM-Agent-Benchmark-List | A benchmark list for evaluation of large language models. | Emerging |
| 22 | OceanGPT/OceanGym | OceanGym: A Benchmark Environment for Underwater Embodied Agents | Emerging |
| 23 | X-PLUG/WritingBench | WritingBench: A Comprehensive Benchmark for Generative Writing | Emerging |
| 24 | IBM/ACPBench | ACPBench: Reasoning about Action, Change, and Planning. A benchmark... | Emerging |
| 25 | actiontech/sql-llm-benchmark | SCALE: SQL Capability Leaderboard for LLMs | Emerging |
| 26 | AKSW/LLM-KG-Bench | LLM-KG-Bench is a framework and task collection for automated benchmarking... | Emerging |
| 27 | ByteDance-Seed/WideSearch | WideSearch: Benchmarking Agentic Broad Info-Seeking | Emerging |
| 28 | srikanth235/benchllama | Benchmark your local LLMs. | Emerging |
| 29 | cornell-zhang/heurigym | Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization (ICLR'26) | Emerging |
| 30 | mims-harvard/CUREBench | CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic... | Emerging |
| 31 | lavantien/llm-tournament | Simple and blazingly fast dynamic evaluation platform for benchmarking Large... | Emerging |
| 32 | humanlaya/OneMillion-Bench | Official evals for $OneMillion-Bench | Emerging |
| 33 | msu-denver/bili-core | bili-core is an open-source framework for LLM benchmarking using LangChain,... | Emerging |
| 34 | arthur-ai/bench | A tool for evaluating LLMs | Emerging |
| 35 | THUNLP-MT/StableToolBench | A new tool learning benchmark aiming at well-balanced stability and reality,... | Emerging |
| 36 | InternScience/SGI-Bench | Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows | Emerging |
| 37 | rohanelukurthy/rig-rank | A Go CLI tool to benchmark local LLMs via Ollama, measuring Time To First... | Emerging |
| 38 | GoodAI/goodai-ltm-benchmark | A library for benchmarking the long-term memory and continual learning... | Emerging |
| 39 | braingpt-lovelab/BrainBench | Source code for BrainBench | Emerging |
| 40 | adobe-research/NoLiMa | Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" | Emerging |
| 41 | lechmazur/nyt-connections | Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended... | Emerging |
| 42 | IlyaGusev/ping_pong_bench | A benchmark for role-playing language models | Emerging |
| 43 | LiqiangJing/DSBench | [ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data... | Emerging |
| 44 | mazzzystar/TurtleBench | TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles. | Emerging |
| 45 | SAP-samples/llm-agents-eval-tutorial | Tutorial materials for the paper "Evaluation & Benchmarking of LLM Agents: A... | Emerging |
| 46 | stevesolun/Chameleon | 🦎 Benchmark LLM robustness under semantic paraphrasing. Tests how models... | Emerging |
| 47 | ImBIOS/thiqah-ops | AI SysAdmin Trust Benchmark - comprehensive testing suite for evaluating LLM... | Emerging |
| 48 | gersteinlab/ML-Bench | ML-Bench: Evaluating Large Language Models and Agents for Machine Learning... | Emerging |
| 49 | eth-lre/mathtutorbench | Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors,... | Emerging |
| 50 | THUDM/AlignBench | A multi-dimensional Chinese alignment evaluation benchmark for large language models (ACL 2024) | Emerging |
| 51 | jpmorganchase/CyberBench | CyberBench: A Multi-Task Cyber LLM Benchmark | Emerging |
| 52 | THUDM/VisualAgentBench | Towards Large Multimodal Models as Visual Foundation Agents | Emerging |
| 53 | parameterlab/c-seo-bench | Source code of "C-SEO Bench: Does Conversational SEO Work?" (NeurIPS D&B 2025) | Emerging |
| 54 | Q-Future/Q-Bench | ①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A... | Experimental |
| 55 | YerbaPage/SWE-Exp | SWE-Exp: Experience-Driven Software Issue Resolution | Experimental |
| 56 | Laoyu84/4onebench | A minimalist benchmarking tool designed to test the routine-generation... | Experimental |
| 57 | ccmdi/osintbench | OSINT benchmark for language models | Experimental |
| 58 | TrustAIRLab/HateBench | [USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated... | Experimental |
| 59 | terryyz/llm-benchmark | A list of LLM benchmark frameworks. | Experimental |
| 60 | Cybonto/OllaBench | Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity | Experimental |
| 61 | ma-compbio/DNALONGBENCH | A benchmark suite of five genomics tasks for evaluating DNA foundation... | Experimental |
| 62 | ag-sc/Robo-CSK-Benchmark | Benchmark for evaluating embodied commonsense capabilities (e.g. of LLMs) | Experimental |
| 63 | EachSheep/ShortcutsBench | ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents | Experimental |
| 64 | jordan-gibbs/secret-hitler-bench | An LLM benchmark based on the popular social deception game Secret Hitler.... | Experimental |
| 65 | ormeilu/RuCa | RuCa Benchmark (pronounced "roo-ka") - a Russian tool-calling benchmark for LLMs | Experimental |
| 66 | FreedomIntelligence/MTalk-Bench | MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via... | Experimental |
| 67 | ScholarXIV/enkokilish_bench | Amharic riddle benchmark for LLMs | Experimental |
| 68 | OpenGVLab/Multi-Modality-Arena | Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to... | Experimental |
| 69 | ApplyU-ai/ColorBlindnessEval | ColorBlindnessEval: Can Vision Language Models Pass Color Blindness Tests? | Experimental |
| 70 | research-outcome/LLM-Game-Benchmark | Evaluating Large Language Models with Grid-Based Game Competitions: An... | Experimental |
| 71 | Swival/calibra | A benchmarking harness for coding agents. | Experimental |
| 72 | mnbplus/llm-gateway-bench | CLI benchmark suite for LLM providers and OpenAI-compatible gateways.... | Experimental |
| 73 | TheDuckAI/arb | Advanced Reasoning Benchmark dataset for LLMs | Experimental |
| 74 | zjunlp/ChineseHarm-bench | ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark | Experimental |
| 75 | EternityYW/RUPBench | RUPBench: Benchmarking Reasoning Under Perturbations for Robustness... | Experimental |
| 76 | SpiritsYouthHarmony/awesome-llm-physics-benchmarks | A curated list of benchmarks for evaluating LLMs on physics reasoning and... | Experimental |
| 77 | stefan-ctrl/mbdd-enhanced | An enhanced version of github.com/google-research/google-research/tree/master/mbpp | Experimental |
| 78 | umayer16/VIBEBENCH | An automated framework for holistic evaluation of LLM-generated code using... | Experimental |
| 79 | wgyhhhh/EASE | Official repository for "Towards Real-Time Fake News Detection under... | Experimental |
| 80 | ChutaVeias/thiqah-ops | 🤖 Evaluate AI competence in sysadmin tasks with ThiqahOps, a benchmark suite... | Experimental |
| 81 | ArbitrHq/ocr-mini-bench | Official OCR mini-bench repository for public use. | Experimental |
| 82 | wimi321/task-bundle | Turn AI coding runs into portable, replayable, benchmark-ready task bundles. | Experimental |
| 83 | Tyan3001/swe-probe | SWE-Probe: A benchmark for measuring LLM cue-sensitivity in software... | Experimental |
| 84 | zihao-ai/EARBench | Benchmarking Physical Risk Awareness of Foundation Model-based Embodied AI Agents | Experimental |
| 85 | CAS-SIAT-XinHai/CPsyExam | [COLING 2025] CPsyExam: A Chinese Benchmark for Evaluating Psychology using... | Experimental |
| 86 | MarcT0K/TOSSS-LLM-Benchmark | TOSSS, an extensible LLM security benchmark based on the CVE database | Experimental |
| 87 | marcosgarciadata/llm-performance-benchmarker | Standardized benchmarking suite for evaluating Large Language Model latency,... | Experimental |
| 88 | KandyBoi1/enkokilish_bench | 🧩 Benchmark LLMs on their ability to solve Amharic riddles using Evalite for... | Experimental |
| 89 | zzhiyuann/agent-bench | Benchmarking framework for AI agents (pytest for AI agents). Define tasks in... | Experimental |
| 90 | michaelabrt/clarte-benchmark | Paired A/B benchmark suite for Clarté - measures how dependency-graph... | Experimental |
| 91 | hra42/krites | LLM benchmark platform comparing models with real-time streaming, metrics,... | Experimental |
| 92 | Boopi7/brain-bench | Source code for brain-bench | Experimental |
| 93 | stalkermustang/llm-bulls-and-cows-benchmark | A mini-framework for evaluating LLM performance on the Bulls and Cows number... | Experimental |
| 94 | nttmdlab-nlp/ToMATO | ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking... | Experimental |
| 95 | dylan-slack/Tablet | The TABLET benchmark for evaluating instruction learning with LLMs for... | Experimental |
| 96 | caixd-220529/LifelongAgentBench | Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners" | Experimental |
| 97 | VTSTech/VTSTech-GPTBench | Benchmark Ollama models for instruction following, tool calling, and agent workflows | Experimental |
| 98 | oaimli/SciTrek | Benchmarking long-context reasoning on scientific articles | Experimental |
| 99 | NLP-Final-Projects/citation-benchmark | A benchmark and evaluation pipeline for citation-aware text generation, with... | Experimental |
| 100 | HSTRG1/GHOST_benchmarks | A collection of hardware Trojans (HTs) automatically generated by Large... | Experimental |
| 101 | contactvaibhavi/GVR-Bench | Pipeline to investigate structured reasoning and instruction adherence in... | Experimental |
| 102 | Mr-Dark-debug/RetardBench | RetardBench is an open, no-censorship benchmark that ranks large language... | Experimental |
| 103 | IAAR-Shanghai/NewsBench | [ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing... | Experimental |
| 104 | VisualWebBench/VisualWebBench | Evaluation framework for the paper "VisualWebBench: How Far Have Multimodal LLMs... | Experimental |
| 105 | Visual-AI/GAMEBoT | [ACL 2025] GAMEBoT: Transparent Assessment of LLM Reasoning in Games | Experimental |
| 106 | lechmazur/generalization | Thematic Generalization Benchmark: measures how effectively various LLMs can... | Experimental |
| 107 | lemon07r/SanityBoard | Home of the SanityHarness Leaderboard website. | Experimental |
| 108 | mbeps/qwen3-italic-benchmark | Benchmarking Qwen3 models of various sizes on the ITALIC benchmark to evaluate... | Experimental |
| 109 | mbeps/mistral_italic_benchmark | Benchmarking Mistral NeMo for Italian cultural alignment using the ITALIC benchmark | Experimental |
| 110 | mbeps/magistral_italic_benchmark | Benchmarking the Magistral Small model on the ITALIC benchmark to evaluate their... | Experimental |
| 111 | mbeps/llama_3.1_italic_benchmark | Benchmarking Llama 3.1 models of various sizes on the ITALIC benchmark to... | Experimental |
| 112 | GAIR-NLP/benbench | Benchmarking Benchmark Leakage in Large Language Models | Experimental |
| 113 | MSKazemi/ExaBench-QA | ExaBench-QA is a benchmark and dataset for evaluating role-aware, LLM-based... | Experimental |
| 114 | jdleo/weirdbench | Open-source LLM benchmarking site for unconventional evals, with local... | Experimental |
| 115 | KID-22/Cocktail | Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated... | Experimental |
| 116 | 0xsomesh/rawbench | RawBench: Powerful, minimal framework for LLM prompt evaluation with YAML... | Experimental |
| 117 | PrimisAI/arcbench | A benchmark for evaluating advanced reasoning in language models and... | Experimental |
| 118 | Antix5/ProductBench | A benchmark testing LLMs' ability to understand complex product... | Experimental |
| 119 | abronte/wordlebench | WordleBench is a benchmark for evaluating LLMs on their ability to solve... | Experimental |
| 120 | JeroenVanGorsel/stock-bench | Stock Bench is an LLM benchmarking system where LLMs compete in a prediction... | Experimental |
| 121 | guhcostan/gym-ai-benchmark | AI benchmark for physical education and gym training knowledge - evaluate... | Experimental |
| 122 | mohiuddinshahrukh/Shahrukh_clem_IM | A function induction game testing various LLMs with test functions and... | Experimental |
| 123 | zijianchen98/BioMotion_Arena | [Arxiv'25] A biologically-inspired visual benchmarking approach for large models | Experimental |
| 124 | pvlbzn/latai | LatAI, a latency benchmarking tool for evaluating multiple generative AI... | Experimental |
| 125 | JanFalkin/llmbench | pprof for LLM inference. Benchmark and analyze performance of... | Experimental |
| 126 | mpuodziukas-labs/llm-cobol-benchmark | Systematic benchmark: top LLMs produce broken COBOL. 5 programs, 3 models,... | Experimental |
| 127 | xInfer123/octobench | Benchmark and compare LLM tool, configuration, and prompt setups using a... | Experimental |
| 128 | not-shivansh/AI-Bench-AI-Evaluation | AI benchmarking platform using Groq (LLaMA 3.1) with hybrid NLP evaluation... | Experimental |
| 129 | Overarm-philippinecedar244/blindbench | Diagnose reasoning errors in large language models using blind human voting... | Experimental |
| 130 | NickRiccardi/two-word-test | Two Word Test: Combinatorial Semantic Benchmark for LLMs | Experimental |
| 131 | thejatingupta7/LLMCA | 🤖 Large Language Models Acing Chartered Accountancy: Introduces CA‑Ben 📈, a... | Experimental |
| 132 | Shengwei-Peng/TOCFL-MultiBench | TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language... | Experimental |
| 133 | francois-rd/accord | Anti-faCtual COmmonsense Reasoning Disentanglement | Experimental |
| 134 | dippatel1994/Large-Language-Models-Evaluation-Benchmarks-Collection | A list of benchmarks used by major organizations to evaluate... | Experimental |
| 135 | gqgs/llm100kbench | LLM 100k portfolio management benchmark | Experimental |
| 136 | husayni/gsm-u | Novel benchmark for underspecified queries | Experimental |
| 137 | doeunyy/pokerbench-slm-decision-making | Fine-tuning small language models (≤4B) for poker decision-making under... | Experimental |
| 138 | alextyhwang/Chatio-LLM-Benchmark | The benchmark for real-world helpfulness. Evaluating LLMs on empathy,... | Experimental |
| 139 | cloudwalk/tictactoe-dataset | Filtering and ranking all 5,478 tic-tac-toe states for efficient... | Experimental |
| 140 | brianpeiris/llm-basic-letter-counting-benchmark | A basic letter-counting benchmark for LLMs | Experimental |
| 141 | kreasof-ai/infinite-benchmark-glitch | We Found an Infinite Benchmark Glitch: Dynamic N-Dimensional Grid Regression... | Experimental |