Domain-Specific Benchmark LLM Tools

Benchmarks evaluating LLMs on specialized knowledge domains (legal, OSINT, cyber, numerical reasoning, KGs) and role-playing tasks. Does NOT include general-purpose LLM evaluation, vision-language model benchmarks, or cultural alignment tests.

There are 141 domain-specific benchmark tools tracked. One scores above 70 (Verified tier). The highest-rated is xlang-ai/OSWorld at 72/100 with 2,664 stars. Two of the top 10 are actively maintained.

Get the tracked projects as JSON (the example below returns the first 20; raise the `limit` parameter to fetch more):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=domain-specific-benchmarks&limit=20"
```

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | xlang-ai/OSWorld | [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks... | 72 | Verified |
| 2 | bigcode-project/bigcodebench | [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI | 64 | Established |
| 3 | sierra-research/tau2-bench | τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment | 64 | Established |
| 4 | THUDM/AgentBench | A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) | 55 | Established |
| 5 | swefficiency/swefficiency | Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize... | 51 | Established |
| 6 | scicode-bench/SciCode | A benchmark that challenges language models to code solutions for scientific problems | 51 | Established |
| 7 | alibaba/sec-code-bench | SecCodeBench is a benchmark suite focusing on evaluating the security of... | 49 | Emerging |
| 8 | microsoft/SWE-bench-Live | [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! | 49 | Emerging |
| 9 | logic-star-ai/swt-bench | [NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating... | 47 | Emerging |
| 10 | principia-ai/PhysGym | A benchmark suite for evaluating LLM-based interactive scientific reasoning. | 43 | Emerging |
| 11 | OskarsEzerins/llm-benchmarks | Popular LLM benchmarks for Ruby code generation | 41 | Emerging |
| 12 | MetriLLM/metrillm | Benchmark local LLM models: speed, quality, and hardware fitness scoring.... | 41 | Emerging |
| 13 | open-compass/LawBench | Benchmarking Legal Knowledge of Large Language Models | 41 | Emerging |
| 14 | Ammaar-Alam/minebench | Minecraft-style voxel benchmark for comparing AI models (Arena + Sandbox) | 41 | Emerging |
| 15 | langchain-ai/langchain-benchmarks | 🦜💯 Flex those feathers! | 41 | Emerging |
| 16 | HUST-AI-HYZ/MemoryAgentBench | Open source code for ICLR 2026 Paper: Evaluating Memory in LLM Agents via... | 40 | Emerging |
| 17 | web-arena-x/visualwebarena | VisualWebArena is a benchmark for multimodal agents. | 40 | Emerging |
| 18 | camel-ai/crab | 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model... | 40 | Emerging |
| 19 | rentruewang/bocoel | Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate... | 40 | Emerging |
| 20 | OpenGenerativeAI/llm-colosseum | Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the... | 40 | Emerging |
| 21 | zhangxjohn/LLM-Agent-Benchmark-List | A benchmark list for evaluation of large language models. | 39 | Emerging |
| 22 | OceanGPT/OceanGym | OceanGym: A Benchmark Environment for Underwater Embodied Agents | 39 | Emerging |
| 23 | X-PLUG/WritingBench | WritingBench: A Comprehensive Benchmark for Generative Writing | 39 | Emerging |
| 24 | IBM/ACPBench | ACPBench: Reasoning about Action, Change, and Planning. A benchmark... | 39 | Emerging |
| 25 | actiontech/sql-llm-benchmark | SCALE: SQL Capability Leaderboard for LLMs | 39 | Emerging |
| 26 | AKSW/LLM-KG-Bench | LLM-KG-Bench is a framework and task collection for automated benchmarking... | 39 | Emerging |
| 27 | ByteDance-Seed/WideSearch | WideSearch: Benchmarking Agentic Broad Info-Seeking | 38 | Emerging |
| 28 | srikanth235/benchllama | Benchmark your local LLMs. | 38 | Emerging |
| 29 | cornell-zhang/heurigym | Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization (ICLR'26) | 38 | Emerging |
| 30 | mims-harvard/CUREBench | CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic... | 37 | Emerging |
| 31 | lavantien/llm-tournament | Simple and blazingly fast dynamic evaluation platform for benchmarking Large... | 36 | Emerging |
| 32 | humanlaya/OneMillion-Bench | Official evals for $OneMillion-Bench | 35 | Emerging |
| 33 | msu-denver/bili-core | bili-core is an open-source framework for LLM benchmarking using LangChain,... | 35 | Emerging |
| 34 | arthur-ai/bench | A tool for evaluating LLMs | 35 | Emerging |
| 35 | THUNLP-MT/StableToolBench | A new tool learning benchmark aiming at well-balanced stability and reality,... | 35 | Emerging |
| 36 | InternScience/SGI-Bench | Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows | 35 | Emerging |
| 37 | rohanelukurthy/rig-rank | A Go CLI tool to benchmark local LLMs via Ollama, measuring Time To First... | 34 | Emerging |
| 38 | GoodAI/goodai-ltm-benchmark | A library for benchmarking the Long Term Memory and Continual learning... | 33 | Emerging |
| 39 | braingpt-lovelab/BrainBench | Source code for | 33 | Emerging |
| 40 | adobe-research/NoLiMa | Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" | 33 | Emerging |
| 41 | lechmazur/nyt-connections | Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended... | 33 | Emerging |
| 42 | IlyaGusev/ping_pong_bench | A benchmark for role-playing language models | 33 | Emerging |
| 43 | LiqiangJing/DSBench | [ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data... | 32 | Emerging |
| 44 | mazzzystar/TurtleBench | TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles. | 32 | Emerging |
| 45 | SAP-samples/llm-agents-eval-tutorial | Tutorial Materials for the paper "Evaluation & Benchmarking of LLM Agents: A... | 32 | Emerging |
| 46 | stevesolun/Chameleon | 🦎 Benchmark LLM robustness under semantic paraphrasing. Tests how models... | 31 | Emerging |
| 47 | ImBIOS/thiqah-ops | AI SysAdmin Trust Benchmark - Comprehensive testing suite for evaluating LLM... | 31 | Emerging |
| 48 | gersteinlab/ML-Bench | ML-Bench: Evaluating Large Language Models and Agents for Machine Learning... | 31 | Emerging |
| 49 | eth-lre/mathtutorbench | Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors,... | 31 | Emerging |
| 50 | THUDM/AlignBench | Multi-dimensional Chinese alignment evaluation benchmark for large models (ACL 2024) | 30 | Emerging |
| 51 | jpmorganchase/CyberBench | CyberBench: A Multi-Task Cyber LLM Benchmark | 30 | Emerging |
| 52 | THUDM/VisualAgentBench | Towards Large Multimodal Models as Visual Foundation Agents | 30 | Emerging |
| 53 | parameterlab/c-seo-bench | Source code of "C-SEO Bench: Does Conversational SEO Work?" NeurIPS D&B 2025 | 30 | Emerging |
| 54 | Q-Future/Q-Bench | ①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A... | 29 | Experimental |
| 55 | YerbaPage/SWE-Exp | SWE-Exp: Experience-Driven Software Issue Resolution | 28 | Experimental |
| 56 | Laoyu84/4onebench | A minimalist benchmarking tool designed to test the routine-generation... | 28 | Experimental |
| 57 | ccmdi/osintbench | OSINT benchmark for language models | 28 | Experimental |
| 58 | TrustAIRLab/HateBench | [USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated... | 28 | Experimental |
| 59 | terryyz/llm-benchmark | A list of LLM benchmark frameworks. | 28 | Experimental |
| 60 | Cybonto/OllaBench | Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity | 28 | Experimental |
| 61 | ma-compbio/DNALONGBENCH | A benchmark suite of five genomics tasks for evaluating DNA foundation... | 27 | Experimental |
| 62 | ag-sc/Robo-CSK-Benchmark | Benchmark for evaluating Embodied Commonsense Capabilities (e.g. of LLMs) | 27 | Experimental |
| 63 | EachSheep/ShortcutsBench | ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents | 27 | Experimental |
| 64 | jordan-gibbs/secret-hitler-bench | An LLM benchmark based on the popular social deception game, Secret Hitler.... | 26 | Experimental |
| 65 | ormeilu/RuCa | RuCa Benchmark (pronounced "roo-ka") - Russian Tool Calling Benchmark for LLM | 26 | Experimental |
| 66 | FreedomIntelligence/MTalk-Bench | MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via... | 26 | Experimental |
| 67 | ScholarXIV/enkokilish_bench | Amharic Riddle Benchmark for LLMs | 26 | Experimental |
| 68 | OpenGVLab/Multi-Modality-Arena | Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to... | 26 | Experimental |
| 69 | ApplyU-ai/ColorBlindnessEval | ColorBlindnessEval: Can Vision Language Models Pass Color Blindness Tests? | 26 | Experimental |
| 70 | research-outcome/LLM-Game-Benchmark | Evaluating Large Language Models with Grid-Based Game Competitions: An... | 25 | Experimental |
| 71 | Swival/calibra | A benchmarking harness for coding agents. | 25 | Experimental |
| 72 | mnbplus/llm-gateway-bench | CLI benchmark suite for LLM providers and OpenAI-compatible gateways.... | 25 | Experimental |
| 73 | TheDuckAI/arb | Advanced Reasoning Benchmark Dataset for LLMs | 24 | Experimental |
| 74 | zjunlp/ChineseHarm-bench | ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark | 24 | Experimental |
| 75 | EternityYW/RUPBench | RUPBench: Benchmarking Reasoning Under Perturbations for Robustness... | 24 | Experimental |
| 76 | SpiritsYouthHarmony/awesome-llm-physics-benchmarks | A curated list of benchmarks for evaluating LLMs on physics reasoning and... | 23 | Experimental |
| 77 | stefan-ctrl/mbdd-enhanced | github.com/google-research/google-research/tree/master/mbpp enhanced | 23 | Experimental |
| 78 | umayer16/VIBEBENCH | An automated framework for holistic evaluation of LLM-generated code using... | 23 | Experimental |
| 79 | wgyhhhh/EASE | Official repository for "Towards Real-Time Fake News Detection under... | 23 | Experimental |
| 80 | ChutaVeias/thiqah-ops | 🤖 Evaluate AI competence in sysadmin tasks with ThiqahOps, a benchmark suite... | 23 | Experimental |
| 81 | ArbitrHq/ocr-mini-bench | Official OCR mini-bench repository for public use. | 22 | Experimental |
| 82 | wimi321/task-bundle | Turn AI coding runs into portable, replayable, benchmark-ready task bundles. | 22 | Experimental |
| 83 | Tyan3001/swe-probe | SWE-Probe: A benchmark for measuring LLM cue-sensitivity in software... | 22 | Experimental |
| 84 | zihao-ai/EARBench | Benchmarking Physical Risk Awareness of Foundation Model-based Embodied AI Agents | 22 | Experimental |
| 85 | CAS-SIAT-XinHai/CPsyExam | [COLING 2025] CPsyExam: A Chinese Benchmark for Evaluating Psychology using... | 22 | Experimental |
| 86 | MarcT0K/TOSSS-LLM-Benchmark | TOSSS, an extensible LLM security benchmark based on the CVE database | 22 | Experimental |
| 87 | marcosgarciadata/llm-performance-benchmarker | Standardized benchmarking suite for evaluating Large Language Model latency,... | 22 | Experimental |
| 88 | KandyBoi1/enkokilish_bench | 🧩 Benchmark LLMs on their ability to solve Amharic riddles using Evalite for... | 22 | Experimental |
| 89 | zzhiyuann/agent-bench | Benchmarking framework for AI agents — pytest for AI agents. Define tasks in... | 22 | Experimental |
| 90 | michaelabrt/clarte-benchmark | Paired A/B benchmark suite for Clarté - measures how dependency-graph... | 22 | Experimental |
| 91 | hra42/krites | LLM benchmark platform comparing models with real-time streaming, metrics,... | 22 | Experimental |
| 92 | Boopi7/brain-bench | Source code for | 21 | Experimental |
| 93 | stalkermustang/llm-bulls-and-cows-benchmark | A mini-framework for evaluating LLM performance on the Bulls and Cows number... | 21 | Experimental |
| 94 | nttmdlab-nlp/ToMATO | ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking... | 21 | Experimental |
| 95 | dylan-slack/Tablet | The TABLET benchmark for evaluating instruction learning with LLMs for... | 21 | Experimental |
| 96 | caixd-220529/LifelongAgentBench | Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners" | 20 | Experimental |
| 97 | VTSTech/VTSTech-GPTBench | Benchmark Ollama Models for Instruction Following, Tool Calling and Agent Workflows | 20 | Experimental |
| 98 | oaimli/SciTrek | Benchmarking long-context reasoning on scientific articles | 20 | Experimental |
| 99 | NLP-Final-Projects/citation-benchmark | A benchmark and evaluation pipeline for citation-aware text generation, with... | 19 | Experimental |
| 100 | HSTRG1/GHOST_benchmarks | A collection of hardware Trojans (HTs) automatically generated by Large... | 19 | Experimental |
| 101 | contactvaibhavi/GVR-Bench | Pipeline to investigate structured reasoning and instruction adherence in... | 19 | Experimental |
| 102 | Mr-Dark-debug/RetardBench | RetardBench is an open, no-censorship benchmark that ranks large language... | 19 | Experimental |
| 103 | IAAR-Shanghai/NewsBench | [ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing... | 19 | Experimental |
| 104 | VisualWebBench/VisualWebBench | Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs... | 18 | Experimental |
| 105 | Visual-AI/GAMEBoT | [ACL 2025] GAMEBoT: Transparent Assessment of LLM Reasoning in Games | 16 | Experimental |
| 106 | lechmazur/generalization | Thematic Generalization Benchmark: measures how effectively various LLMs can... | 16 | Experimental |
| 107 | lemon07r/SanityBoard | Home of the SanityHarness Leaderboard website. | 16 | Experimental |
| 108 | mbeps/qwen3-italic-benchmark | Benchmarking Qwen3 models of various sizes on the ITALIC benchmark to evaluate... | 16 | Experimental |
| 109 | mbeps/mistral_italic_benchmark | Benchmarking Mistral NeMo for Italian Cultural Alignment using ITALIC benchmark | 16 | Experimental |
| 110 | mbeps/magistral_italic_benchmark | Benchmarking Magistral Small model on the ITALIC benchmark to evaluate their... | 16 | Experimental |
| 111 | mbeps/llama_3.1_italic_benchmark | Benchmarking Llama 3.1 models of various sizes on the ITALIC benchmark to... | 16 | Experimental |
| 112 | GAIR-NLP/benbench | Benchmarking Benchmark Leakage in Large Language Models | 16 | Experimental |
| 113 | MSKazemi/ExaBench-QA | ExaBench-QA is a benchmark and dataset for evaluating role-aware, LLM-based... | 15 | Experimental |
| 114 | jdleo/weirdbench | Open-source LLM benchmarking site for unconventional evals, with local... | 15 | Experimental |
| 115 | KID-22/Cocktail | Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated... | 15 | Experimental |
| 116 | 0xsomesh/rawbench | RawBench: Powerful, minimal framework for LLM prompt evaluation with YAML... | 15 | Experimental |
| 117 | PrimisAI/arcbench | A benchmark for evaluating advanced reasoning in language models and... | 15 | Experimental |
| 118 | Antix5/ProductBench | A benchmark testing LLMs' ability to understand complex product... | 15 | Experimental |
| 119 | abronte/wordlebench | WordleBench is a benchmark for evaluating LLMs on their ability to solve... | 15 | Experimental |
| 120 | JeroenVanGorsel/stock-bench | Stock Bench is an LLM benchmarking system where LLMs compete in a prediction... | 14 | Experimental |
| 121 | guhcostan/gym-ai-benchmark | AI Benchmark for Physical Education and Gym Training Knowledge - Evaluate... | 14 | Experimental |
| 122 | mohiuddinshahrukh/Shahrukh_clem_IM | A function induction game testing various LLMs with test functions and... | 14 | Experimental |
| 123 | zijianchen98/BioMotion_Arena | [Arxiv'25] A biologically-inspired visual benchmarking approach for large models | 14 | Experimental |
| 124 | pvlbzn/latai | LatAI – A latency benchmarking tool for evaluating multiple generative AI... | 14 | Experimental |
| 125 | JanFalkin/llmbench | pprof for LLM inference. Benchmark and analyze performance of... | 14 | Experimental |
| 126 | mpuodziukas-labs/llm-cobol-benchmark | Systematic benchmark: top LLMs produce broken COBOL. 5 programs, 3 models,... | 14 | Experimental |
| 127 | xInfer123/octobench | Benchmark and compare LLM tool, configuration, and prompt setups using a... | 14 | Experimental |
| 128 | not-shivansh/AI-Bench-AI-Evaluation | AI benchmarking platform using Groq (LLaMA 3.1) with hybrid NLP evaluation... | 14 | Experimental |
| 129 | Overarm-philippinecedar244/blindbench | Diagnose reasoning errors in large language models using blind human voting... | 14 | Experimental |
| 130 | NickRiccardi/two-word-test | Two Word Test: Combinatorial Semantic Benchmark for LLMs | 13 | Experimental |
| 131 | thejatingupta7/LLMCA | 🤖 Large Language Models Acing Chartered Accountancy: Introduces CA‑Ben 📈, a... | 13 | Experimental |
| 132 | Shengwei-Peng/TOCFL-MultiBench | TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language... | 13 | Experimental |
| 133 | francois-rd/accord | Anti-faCtual COmmonsense Reasoning Disentanglement | 12 | Experimental |
| 134 | dippatel1994/Large-Language-Models-Evaluation-Benchmarks-Collection | This repository contains a list of benchmarks used by big orgs to evaluate... | 12 | Experimental |
| 135 | gqgs/llm100kbench | LLM 100k portfolio management benchmark | 11 | Experimental |
| 136 | husayni/gsm-u | Novel benchmark for underspecified queries | 11 | Experimental |
| 137 | doeunyy/pokerbench-slm-decision-making | Fine-tuning small language models (≤4B) for poker decision-making under... | 11 | Experimental |
| 138 | alextyhwang/Chatio-LLM-Benchmark | The benchmark for real-world helpfulness. Evaluating LLMs on empathy,... | 11 | Experimental |
| 139 | cloudwalk/tictactoe-dataset | Filtering and ranking all of 5478 states in tic-tac-toe for efficient... | 11 | Experimental |
| 140 | brianpeiris/llm-basic-letter-counting-benchmark | A basic letter-counting benchmark for LLMs | 10 | Experimental |
| 141 | kreasof-ai/infinite-benchmark-glitch | We Found an Infinite Benchmark Glitch: Dynamic N-Dimensional Grid Regression... | 10 | Experimental |