Domain-Specific Benchmarks: Transformer Models

There are 27 domain-specific benchmark projects tracked, one of which scores 50 or above and reaches the Established tier. The highest-rated is stanfordnlp/axbench at 50/100 with 175 stars.

Get all 27 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=domain-specific-benchmarks&limit=27"

The API is open to everyone at 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
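
For scripted use, the same endpoint can be called from a short Python script. The sketch below uses only the standard library; the shape of the JSON response is not documented on this page, so the field names in the commented loop ("items", "name", "score") are assumptions to check against the raw payload.

import json
import urllib.request

# Same endpoint as the curl example above.
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=domain-specific-benchmarks&limit=27"
)

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Dump the payload first to confirm its actual structure.
print(json.dumps(data, indent=2))

# If each record exposes "name" and "score" (an assumption, not a
# documented schema), a ranking like the listing below falls out of:
# for row in sorted(data["items"], key=lambda r: -r["score"]):
#     print(row["name"], row["score"])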

Each entry below gives rank, repository, quality score (out of 100), tier, and the repository's description:

1. stanfordnlp/axbench (50, Established): Stanford NLP Python library for benchmarking the utility of LLM...
2. LarHope/ollama-benchmark (46, Emerging): Ollama based Benchmark with detail I/O token per second. Python with...
3. aidatatools/ollama-benchmark (46, Emerging): LLM Benchmark for Throughput via Ollama (Local LLMs)
4. qcri/LLMeBench (43, Emerging): Benchmarking Large Language Models
5. microsoft/LLF-Bench (38, Emerging): A benchmark for evaluating learning agents based on just language feedback
6. THUDM/LongBench (38, Emerging): LongBench v2 and LongBench (ACL '25 & '24)
7. AnkitNayak-eth/llmBench (38, Emerging): llmBench is a high-depth benchmarking tool designed to measure the raw...
8. YJiangcm/FollowBench (38, Emerging): [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following...
9. The-FinAI/CALM (34, Emerging): A LLM training and evaluation benchmark for credit scoring
10. OpenBMB/InfiniteBench (33, Emerging): Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K...
11. RedHatResearch/conext24-NetConfEval (33, Emerging): Benchmark for evaluating LLMs in network configuration problems.
12. epfml/llm-optimizer-benchmark (32, Emerging): Benchmarking Optimizers for LLM Pretraining
13. cloudmercato/ollama-benchmark (32, Emerging): Handy tool to measure the performance and efficiency of LLMs workloads.
14. rohit901/VANE-Bench (31, Emerging): [NAACL'25] Contains code and documentation for our VANE-Bench paper.
15. HiThink-Research/BizFinBench (30, Emerging): A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
16. AIFEG/BenchLMM (28, Experimental): [ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large...
17. zhchen18/ToMBench (28, Experimental): ToMBench: Benchmarking Theory of Mind in Large Language Models, ACL 2024.
18. SORRY-Bench/sorry-bench (26, Experimental): Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large...
19. SapienzaNLP/ita-bench (25, Experimental): A collection of Italian benchmarks for LLM evaluation
20. deep-symbolic-mathematics/llm-srbench (25, Experimental): [ICML 2025 Oral] LLM-SRBench: A New Benchmark for Scientific Equation...
21. RaptorMai/MLLM-CompBench (24, Experimental): [NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs...
22. EternityYW/TRAM-Benchmark (23, Experimental): TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of...
23. Open-Social-World/EgoNormia (22, Experimental): EgoNormia | Benchmarking Physical Social Norm Understanding in VLMs
24. MileBench/MileBench (22, Experimental): This repo contains evaluation code for the paper "MileBench: Benchmarking...
25. AUCOHL/RTL-Repo (22, Experimental): RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects...
26. zchuz/TimeBench (22, Experimental): The repository for ACL 2024 paper "TimeBench: A Comprehensive Evaluation of...
27. PKU-YuanGroup/Video-Bench (16, Experimental): A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large...
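
The tier labels above appear to follow fixed score bands: every project scoring 50 or more is Established, 30 to 49 is Emerging, and below 30 is Experimental. The following sketch encodes those cutoffs; they are inferred from this listing rather than documented by the API, so treat the boundaries as an assumption.

def tier(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Cutoffs inferred from the listing above (>= 50 Established,
    >= 30 Emerging, else Experimental); not an official rule.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

# Spot-check against boundary rows from the listing:
assert tier(50) == "Established"   # stanfordnlp/axbench
assert tier(30) == "Emerging"      # HiThink-Research/BizFinBench
assert tier(28) == "Experimental"  # AIFEG/BenchLMM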