Domain-Specific Benchmarks for Transformer Models
There are 27 domain-specific benchmark projects tracked. One scores 50 or above (the Established tier). The highest-rated is stanfordnlp/axbench at 50/100 with 175 stars.
Get all 27 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=domain-specific-benchmarks&limit=27"
```

The API is open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
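For programmatic use, here is a minimal Python sketch around the same endpoint. The query parameters come from the curl example above; the response field names (`projects`, `name`, `score`, `tier`) are assumptions about the JSON shape, not documented API fields.

```python
import requests

API_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def fetch_benchmarks(limit: int = 27):
    """Fetch the domain-specific benchmark list.

    Runs unauthenticated, so it counts against the
    100 requests/day anonymous quota.
    """
    resp = requests.get(
        API_URL,
        params={
            "domain": "transformers",
            "subcategory": "domain-specific-benchmarks",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    data = fetch_benchmarks()
    # "projects", "name", "score", and "tier" are assumed field
    # names; adjust them to match the actual response payload.
    for project in data.get("projects", []):
        print(project.get("name"), project.get("score"), project.get("tier"))
```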
| # | Project | Description | Score | Tier |
|---|---------|-------------|-------|------|
| 1 | stanfordnlp/axbench | Stanford NLP Python library for benchmarking the utility of LLM... | 50 | Established |
| 2 | LarHope/ollama-benchmark | Ollama based Benchmark with detail I/O token per second. Python with... | | Emerging |
| 3 | aidatatools/ollama-benchmark | LLM Benchmark for Throughput via Ollama (Local LLMs) | | Emerging |
| 4 | qcri/LLMeBench | Benchmarking Large Language Models | | Emerging |
| 5 | microsoft/LLF-Bench | A benchmark for evaluating learning agents based on just language feedback | | Emerging |
| 6 | THUDM/LongBench | LongBench v2 and LongBench (ACL 25'&24') | | Emerging |
| 7 | AnkitNayak-eth/llmBench | llmBench is a high-depth benchmarking tool designed to measure the raw... | | Emerging |
| 8 | YJiangcm/FollowBench | [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following... | | Emerging |
| 9 | The-FinAI/CALM | A LLM training and evaluation benchmark for credit scoring | | Emerging |
| 10 | OpenBMB/InfiniteBench | Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K... | | Emerging |
| 11 | RedHatResearch/conext24-NetConfEval | Benchmark for evaluating LLMs in network configuration problems. | | Emerging |
| 12 | epfml/llm-optimizer-benchmark | Benchmarking Optimizers for LLM Pretraining | | Emerging |
| 13 | cloudmercato/ollama-benchmark | Handy tool to measure the performance and efficiency of LLMs workloads. | | Emerging |
| 14 | rohit901/VANE-Bench | [NAACL'25] Contains code and documentation for our VANE-Bench paper. | | Emerging |
| 15 | HiThink-Research/BizFinBench | A Business-Driven Real-World Financial Benchmark for Evaluating LLMs | | Emerging |
| 16 | AIFEG/BenchLMM | [ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large... | | Experimental |
| 17 | zhchen18/ToMBench | ToMBench: Benchmarking Theory of Mind in Large Language Models, ACL 2024. | | Experimental |
| 18 | SORRY-Bench/sorry-bench | Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large... | | Experimental |
| 19 | SapienzaNLP/ita-bench | A collection of Italian benchmarks for LLM evaluation | | Experimental |
| 20 | deep-symbolic-mathematics/llm-srbench | [ICML2025 Oral] LLM-SRBench: A New Benchmark for Scientific Equation... | | Experimental |
| 21 | RaptorMai/MLLM-CompBench | [NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs... | | Experimental |
| 22 | EternityYW/TRAM-Benchmark | TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of... | | Experimental |
| 23 | Open-Social-World/EgoNormia | EgoNormia \| Benchmarking Physical Social Norm Understanding in VLMs | | Experimental |
| 24 | MileBench/MileBench | This repo contains evaluation code for the paper "MileBench: Benchmarking... | | Experimental |
| 25 | AUCOHL/RTL-Repo | RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects... | | Experimental |
| 26 | zchuz/TimeBench | The repository for ACL 2024 paper "TimeBench: A Comprehensive Evaluation of... | | Experimental |
| 27 | PKU-YuanGroup/Video-Bench | A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large... | | Experimental |