Domain-Specific LLM Benchmarks
Benchmarks evaluating LLMs on specialized knowledge domains (legal, OSINT, cyber, numerical reasoning, KGs) and role-playing tasks. Does NOT include general-purpose LLM evaluation, vision-language model benchmarks, or cultural alignment tests.
141 domain-specific benchmark tools are tracked; one scores above 70 (verified tier). The highest-rated is xlang-ai/OSWorld at 72/100 with 2,664 stars, and 2 of the top 10 are actively maintained.
Get all 141 projects as JSON (raise the `limit` parameter to fetch the full list):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=domain-specific-benchmarks&limit=141"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
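The JSON returned by the endpoint above can be filtered client-side. A minimal sketch, assuming the response exposes `projects` with `name`, `score`, and `tier` fields (the field names are an assumption, not the documented schema, and the second sample entry is a placeholder):

```python
import json

# Hypothetical response shape, inferred from the table columns below.
# Field names "projects", "name", "score", "tier" are assumptions;
# "example/placeholder-bench" is a made-up entry for illustration.
SAMPLE = json.loads("""
{
  "projects": [
    {"name": "xlang-ai/OSWorld", "score": 72, "tier": "Verified"},
    {"name": "example/placeholder-bench", "score": 10, "tier": "Emerging"}
  ]
}
""")

def by_tier(payload, tier):
    """Return project names in the given tier, highest score first."""
    rows = [p for p in payload["projects"] if p["tier"] == tier]
    return [p["name"] for p in sorted(rows, key=lambda p: -p["score"])]

print(by_tier(SAMPLE, "Verified"))  # ['xlang-ai/OSWorld']
```

The same filtering could be done with `jq` on the command line; the Python version is shown only because it makes the assumed schema explicit.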
| # | Tool | Description | Tier |
|---|------|-------------|------|
| 1 | xlang-ai/OSWorld | [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks... | Verified |
| 2 | bigcode-project/bigcodebench | [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI | Established |
| 3 | sierra-research/tau2-bench | τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment | Established |
| 4 | THUDM/AgentBench | A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) | Established |
| 5 | swefficiency/swefficiency | Benchmark harness and code for "SWE-fficiency: Can Language Models Optimize... | Established |
| 6 | scicode-bench/SciCode | A benchmark that challenges language models to code solutions for scientific problems | Established |
| 7 | alibaba/sec-code-bench | SecCodeBench is a benchmark suite focusing on evaluating the security of... | Emerging |
| 8 | microsoft/SWE-bench-Live | [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! | Emerging |
| 9 | logic-star-ai/swt-bench | [NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating... | Emerging |
| 10 | principia-ai/PhysGym | A benchmark suite for evaluating LLM-based interactive scientific reasoning. | Emerging |
| 11 | OskarsEzerins/llm-benchmarks | Popular LLM benchmarks for Ruby code generation | Emerging |
| 12 | MetriLLM/metrillm | Benchmark local LLM models: speed, quality, and hardware fitness scoring.... | Emerging |
| 13 | open-compass/LawBench | Benchmarking Legal Knowledge of Large Language Models | Emerging |
| 14 | Ammaar-Alam/minebench | Minecraft-style voxel benchmark for comparing AI models (Arena + Sandbox) | Emerging |
| 15 | langchain-ai/langchain-benchmarks | 🦜💯 Flex those feathers! | Emerging |
| 16 | HUST-AI-HYZ/MemoryAgentBench | Open source code for ICLR 2026 paper: Evaluating Memory in LLM Agents via... | Emerging |
| 17 | web-arena-x/visualwebarena | VisualWebArena is a benchmark for multimodal agents. | Emerging |
| 18 | camel-ai/crab | 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model... | Emerging |
| 19 | rentruewang/bocoel | Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate... | Emerging |
| 20 | OpenGenerativeAI/llm-colosseum | Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the... | Emerging |
| 21 | zhangxjohn/LLM-Agent-Benchmark-List | A benchmark list for evaluation of large language models. | Emerging |
| 22 | OceanGPT/OceanGym | OceanGym: A Benchmark Environment for Underwater Embodied Agents | Emerging |
| 23 | X-PLUG/WritingBench | WritingBench: A Comprehensive Benchmark for Generative Writing | Emerging |
| 24 | IBM/ACPBench | ACPBench: Reasoning about Action, Change, and Planning. A benchmark... | Emerging |
| 25 | actiontech/sql-llm-benchmark | SCALE: SQL Capability Leaderboard for LLMs | Emerging |
| 26 | AKSW/LLM-KG-Bench | LLM-KG-Bench is a framework and task collection for automated benchmarking... | Emerging |
| 27 | ByteDance-Seed/WideSearch | WideSearch: Benchmarking Agentic Broad Info-Seeking | Emerging |
| 28 | srikanth235/benchllama | Benchmark your local LLMs. | Emerging |
| 29 | cornell-zhang/heurigym | Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization (ICLR'26) | Emerging |
| 30 | mims-harvard/CUREBench | CUREBench @ NeurIPS 2025: Benchmarking AI reasoning for therapeutic... | Emerging |
| 31 | lavantien/llm-tournament | Simple and blazingly fast dynamic evaluation platform for benchmarking Large... | Emerging |
| 32 | humanlaya/OneMillion-Bench | Official evals for $OneMillion-Bench | Emerging |
| 33 | msu-denver/bili-core | bili-core is an open-source framework for LLM benchmarking using LangChain,... | Emerging |
| 34 | arthur-ai/bench | A tool for evaluating LLMs | Emerging |
| 35 | THUNLP-MT/StableToolBench | A new tool learning benchmark aiming at well-balanced stability and reality,... | Emerging |
| 36 | InternScience/SGI-Bench | Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows | Emerging |
| 37 | rohanelukurthy/rig-rank | A Go CLI tool to benchmark local LLMs via Ollama, measuring Time To First... | Emerging |
| 38 | GoodAI/goodai-ltm-benchmark | A library for benchmarking the long-term memory and continual learning... | Emerging |
| 39 | braingpt-lovelab/BrainBench | Source code for BrainBench | Emerging |
| 40 | adobe-research/NoLiMa | Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" | Emerging |
| 41 | lechmazur/nyt-connections | Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended... | Emerging |
| 42 | IlyaGusev/ping_pong_bench | A benchmark for role-playing language models | Emerging |
| 43 | LiqiangJing/DSBench | [ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data... | Emerging |
| 44 | mazzzystar/TurtleBench | TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles. | Emerging |
| 45 | SAP-samples/llm-agents-eval-tutorial | Tutorial materials for the paper "Evaluation & Benchmarking of LLM Agents: A... | Emerging |
| 46 | stevesolun/Chameleon | 🦎 Benchmark LLM robustness under semantic paraphrasing. Tests how models... | Emerging |
| 47 | ImBIOS/thiqah-ops | AI SysAdmin Trust Benchmark - comprehensive testing suite for evaluating LLM... | Emerging |
| 48 | gersteinlab/ML-Bench | ML-Bench: Evaluating Large Language Models and Agents for Machine Learning... | Emerging |
| 49 | eth-lre/mathtutorbench | Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors,... | Emerging |
| 50 | THUDM/AlignBench | A multi-dimensional Chinese alignment evaluation benchmark for large language models (ACL 2024) | Emerging |
| 51 | jpmorganchase/CyberBench | CyberBench: A Multi-Task Cyber LLM Benchmark | Emerging |
| 52 | THUDM/VisualAgentBench | Towards Large Multimodal Models as Visual Foundation Agents | Emerging |
| 53 | parameterlab/c-seo-bench | Source code of "C-SEO Bench: Does Conversational SEO Work?" (NeurIPS D&B 2025) | Emerging |
| 54 | Q-Future/Q-Bench | ①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A... | Experimental |
| 55 | YerbaPage/SWE-Exp | SWE-Exp: Experience-Driven Software Issue Resolution | Experimental |
| 56 | Laoyu84/4onebench | A minimalist benchmarking tool designed to test the routine-generation... | Experimental |
| 57 | ccmdi/osintbench | OSINT benchmark for language models | Experimental |
| 58 | TrustAIRLab/HateBench | [USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated... | Experimental |
| 59 | terryyz/llm-benchmark | A list of LLM benchmark frameworks. | Experimental |
| 60 | Cybonto/OllaBench | Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity | Experimental |
| 61 | ma-compbio/DNALONGBENCH | A benchmark suite of five genomics tasks for evaluating DNA foundation... | Experimental |
| 62 | ag-sc/Robo-CSK-Benchmark | Benchmark for evaluating embodied commonsense capabilities (e.g. of LLMs) | Experimental |
| 63 | EachSheep/ShortcutsBench | ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents | Experimental |
| 64 | jordan-gibbs/secret-hitler-bench | An LLM benchmark based on the popular social deception game Secret Hitler.... | Experimental |
| 65 | ormeilu/RuCa | RuCa Benchmark (pronounced "roo-ka") - a Russian tool-calling benchmark for LLMs | Experimental |
| 66 | FreedomIntelligence/MTalk-Bench | MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via... | Experimental |
| 67 | ScholarXIV/enkokilish_bench | Amharic riddle benchmark for LLMs | Experimental |
| 68 | OpenGVLab/Multi-Modality-Arena | Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to... | Experimental |
| 69 | ApplyU-ai/ColorBlindnessEval | ColorBlindnessEval: Can Vision Language Models Pass Color Blindness Tests? | Experimental |
| 70 | research-outcome/LLM-Game-Benchmark | Evaluating Large Language Models with Grid-Based Game Competitions: An... | Experimental |
| 71 | Swival/calibra | A benchmarking harness for coding agents. | Experimental |
| 72 | mnbplus/llm-gateway-bench | CLI benchmark suite for LLM providers and OpenAI-compatible gateways.... | Experimental |
| 73 | TheDuckAI/arb | Advanced Reasoning Benchmark dataset for LLMs | Experimental |
| 74 | zjunlp/ChineseHarm-bench | ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark | Experimental |
| 75 | EternityYW/RUPBench | RUPBench: Benchmarking Reasoning Under Perturbations for Robustness... | Experimental |
| 76 | SpiritsYouthHarmony/awesome-llm-physics-benchmarks | A curated list of benchmarks for evaluating LLMs on physics reasoning and... | Experimental |
| 77 | stefan-ctrl/mbdd-enhanced | An enhanced version of github.com/google-research/google-research/tree/master/mbpp | Experimental |
| 78 | umayer16/VIBEBENCH | An automated framework for holistic evaluation of LLM-generated code using... | Experimental |
| 79 | wgyhhhh/EASE | Official repository for "Towards Real-Time Fake News Detection under... | Experimental |
| 80 | ChutaVeias/thiqah-ops | 🤖 Evaluate AI competence in sysadmin tasks with ThiqahOps, a benchmark suite... | Experimental |
| 81 | ArbitrHq/ocr-mini-bench | Official OCR mini-bench repository for public use. | Experimental |
| 82 | wimi321/task-bundle | Turn AI coding runs into portable, replayable, benchmark-ready task bundles. | Experimental |
| 83 | Tyan3001/swe-probe | SWE-Probe: A benchmark for measuring LLM cue-sensitivity in software... | Experimental |
| 84 | zihao-ai/EARBench | Benchmarking Physical Risk Awareness of Foundation Model-based Embodied AI Agents | Experimental |
| 85 | CAS-SIAT-XinHai/CPsyExam | [COLING 2025] CPsyExam: A Chinese Benchmark for Evaluating Psychology using... | Experimental |
| 86 | MarcT0K/TOSSS-LLM-Benchmark | TOSSS, an extensible LLM security benchmark based on the CVE database | Experimental |
| 87 | marcosgarciadata/llm-performance-benchmarker | Standardized benchmarking suite for evaluating Large Language Model latency,... | Experimental |
| 88 | KandyBoi1/enkokilish_bench | 🧩 Benchmark LLMs on their ability to solve Amharic riddles using Evalite for... | Experimental |
| 89 | zzhiyuann/agent-bench | Benchmarking framework for AI agents (pytest for AI agents). Define tasks in... | Experimental |
| 90 | michaelabrt/clarte-benchmark | Paired A/B benchmark suite for Clarté - measures how dependency-graph... | Experimental |
| 91 | hra42/krites | LLM benchmark platform comparing models with real-time streaming, metrics,... | Experimental |
| 92 | Boopi7/brain-bench | Source code for brain-bench | Experimental |
| 93 | stalkermustang/llm-bulls-and-cows-benchmark | A mini-framework for evaluating LLM performance on the Bulls and Cows number... | Experimental |
| 94 | nttmdlab-nlp/ToMATO | ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking... | Experimental |
| 95 | dylan-slack/Tablet | The TABLET benchmark for evaluating instruction learning with LLMs for... | Experimental |
| 96 | caixd-220529/LifelongAgentBench | Code repo for "LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners" | Experimental |
| 97 | VTSTech/VTSTech-GPTBench | Benchmark Ollama models for instruction following, tool calling, and agent workflows | Experimental |
| 98 | oaimli/SciTrek | Benchmarking long-context reasoning on scientific articles | Experimental |
| 99 | NLP-Final-Projects/citation-benchmark | A benchmark and evaluation pipeline for citation-aware text generation, with... | Experimental |
| 100 | HSTRG1/GHOST_benchmarks | A collection of hardware Trojans (HTs) automatically generated by Large... | Experimental |
| 101 | contactvaibhavi/GVR-Bench | Pipeline to investigate structured reasoning and instruction adherence in... | Experimental |
| 102 | Mr-Dark-debug/RetardBench | RetardBench is an open, no-censorship benchmark that ranks large language... | Experimental |
| 103 | IAAR-Shanghai/NewsBench | [ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing... | Experimental |
| 104 | VisualWebBench/VisualWebBench | Evaluation framework for the paper "VisualWebBench: How Far Have Multimodal LLMs... | Experimental |
| 105 | Visual-AI/GAMEBoT | [ACL 2025] GAMEBoT: Transparent Assessment of LLM Reasoning in Games | Experimental |
| 106 | lechmazur/generalization | Thematic Generalization Benchmark: measures how effectively various LLMs can... | Experimental |
| 107 | lemon07r/SanityBoard | Home of the SanityHarness Leaderboard website. | Experimental |
| 108 | mbeps/qwen3-italic-benchmark | Benchmarking Qwen3 models of various sizes on the ITALIC benchmark to evaluate... | Experimental |
| 109 | mbeps/mistral_italic_benchmark | Benchmarking Mistral NeMo for Italian cultural alignment using the ITALIC benchmark | Experimental |
| 110 | mbeps/magistral_italic_benchmark | Benchmarking the Magistral Small model on the ITALIC benchmark to evaluate their... | Experimental |
| 111 | mbeps/llama_3.1_italic_benchmark | Benchmarking Llama 3.1 models of various sizes on the ITALIC benchmark to... | Experimental |
| 112 | GAIR-NLP/benbench | Benchmarking Benchmark Leakage in Large Language Models | Experimental |
| 113 | MSKazemi/ExaBench-QA | ExaBench-QA is a benchmark and dataset for evaluating role-aware, LLM-based... | Experimental |
| 114 | jdleo/weirdbench | Open-source LLM benchmarking site for unconventional evals, with local... | Experimental |
| 115 | KID-22/Cocktail | Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated... | Experimental |
| 116 | 0xsomesh/rawbench | RawBench: Powerful, minimal framework for LLM prompt evaluation with YAML... | Experimental |
| 117 | PrimisAI/arcbench | A benchmark for evaluating advanced reasoning in language models and... | Experimental |
| 118 | Antix5/ProductBench | A benchmark testing LLMs' ability to understand complex product... | Experimental |
| 119 | abronte/wordlebench | WordleBench is a benchmark for evaluating LLMs on their ability to solve... | Experimental |
| 120 | JeroenVanGorsel/stock-bench | Stock Bench is an LLM benchmarking system where LLMs compete in a prediction... | Experimental |
| 121 | guhcostan/gym-ai-benchmark | AI benchmark for physical education and gym training knowledge - evaluate... | Experimental |
| 122 | mohiuddinshahrukh/Shahrukh_clem_IM | A function induction game testing various LLMs with test functions and... | Experimental |
| 123 | zijianchen98/BioMotion_Arena | [Arxiv'25] A biologically-inspired visual benchmarking approach for large models | Experimental |
| 124 | pvlbzn/latai | LatAI, a latency benchmarking tool for evaluating multiple generative AI... | Experimental |
| 125 | JanFalkin/llmbench | pprof for LLM inference. Benchmark and analyze performance of... | Experimental |
| 126 | mpuodziukas-labs/llm-cobol-benchmark | Systematic benchmark: top LLMs produce broken COBOL. 5 programs, 3 models,... | Experimental |
| 127 | xInfer123/octobench | Benchmark and compare LLM tool, configuration, and prompt setups using a... | Experimental |
| 128 | not-shivansh/AI-Bench-AI-Evaluation | AI benchmarking platform using Groq (LLaMA 3.1) with hybrid NLP evaluation... | Experimental |
| 129 | Overarm-philippinecedar244/blindbench | Diagnose reasoning errors in large language models using blind human voting... | Experimental |
| 130 | NickRiccardi/two-word-test | Two Word Test: Combinatorial Semantic Benchmark for LLMs | Experimental |
| 131 | thejatingupta7/LLMCA | 🤖 Large Language Models Acing Chartered Accountancy: Introduces CA‑Ben 📈, a... | Experimental |
| 132 | Shengwei-Peng/TOCFL-MultiBench | TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language... | Experimental |
| 133 | francois-rd/accord | Anti-faCtual COmmonsense Reasoning Disentanglement | Experimental |
| 134 | dippatel1994/Large-Language-Models-Evaluation-Benchmarks-Collection | A list of benchmarks used by major organizations to evaluate... | Experimental |
| 135 | gqgs/llm100kbench | LLM 100k portfolio management benchmark | Experimental |
| 136 | husayni/gsm-u | Novel benchmark for underspecified queries | Experimental |
| 137 | doeunyy/pokerbench-slm-decision-making | Fine-tuning small language models (≤4B) for poker decision-making under... | Experimental |
| 138 | alextyhwang/Chatio-LLM-Benchmark | The benchmark for real-world helpfulness. Evaluating LLMs on empathy,... | Experimental |
| 139 | cloudwalk/tictactoe-dataset | Filtering and ranking all 5,478 tic-tac-toe states for efficient... | Experimental |
| 140 | brianpeiris/llm-basic-letter-counting-benchmark | A basic letter-counting benchmark for LLMs | Experimental |
| 141 | kreasof-ai/infinite-benchmark-glitch | We Found an Infinite Benchmark Glitch: Dynamic N-Dimensional Grid Regression... | Experimental |