Domain-Specific Benchmarks: Transformer Models

There are 27 domain-specific benchmark projects tracked, one of which scores 50 or above and reaches the Established tier. The highest-rated is stanfordnlp/axbench at 50/100 with 175 stars.

Get all 27 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=domain-specific-benchmarks&limit=27"

The API is open to everyone at 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
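
For scripted use, the same endpoint can be called from a short Python script. The sketch below uses only the standard library; the shape of the JSON response is not documented on this page, so the field names in the commented loop ("items", "name", "score") are assumptions to check against the raw payload.

import json
import urllib.request

# Same endpoint as the curl example above.
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=domain-specific-benchmarks&limit=27"
)

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Dump the payload first to confirm its actual structure.
print(json.dumps(data, indent=2))

# If each record exposes "name" and "score" (an assumption, not a
# documented schema), a ranking like the listing below falls out of:
# for row in sorted(data["items"], key=lambda r: -r["score"]):
#     print(row["name"], row["score"])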

Each entry below gives rank, repository, quality score (out of 100), tier, and the repository's description:

1. stanfordnlp/axbench (50, Established): Stanford NLP Python library for benchmarking the utility of LLM...
2. LarHope/ollama-benchmark (46, Emerging): Ollama based Benchmark with detail I/O token per second. Python with...
3. aidatatools/ollama-benchmark (46, Emerging): LLM Benchmark for Throughput via Ollama (Local LLMs)
4. qcri/LLMeBench (43, Emerging): Benchmarking Large Language Models
5. microsoft/LLF-Bench (38, Emerging): A benchmark for evaluating learning agents based on just language feedback
6. THUDM/LongBench (38, Emerging): LongBench v2 and LongBench (ACL '25 & '24)
7. AnkitNayak-eth/llmBench (38, Emerging): llmBench is a high-depth benchmarking tool designed to measure the raw...
8. YJiangcm/FollowBench (38, Emerging): [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following...
9. The-FinAI/CALM (34, Emerging): A LLM training and evaluation benchmark for credit scoring
10. OpenBMB/InfiniteBench (33, Emerging): Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K...
11. RedHatResearch/conext24-NetConfEval (33, Emerging): Benchmark for evaluating LLMs in network configuration problems.
12. epfml/llm-optimizer-benchmark (32, Emerging): Benchmarking Optimizers for LLM Pretraining
13. cloudmercato/ollama-benchmark (32, Emerging): Handy tool to measure the performance and efficiency of LLMs workloads.
14. rohit901/VANE-Bench (31, Emerging): [NAACL'25] Contains code and documentation for our VANE-Bench paper.
15. HiThink-Research/BizFinBench (30, Emerging): A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
16. AIFEG/BenchLMM (28, Experimental): [ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large...
17. zhchen18/ToMBench (28, Experimental): ToMBench: Benchmarking Theory of Mind in Large Language Models, ACL 2024.
18. SORRY-Bench/sorry-bench (26, Experimental): Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large...
19. SapienzaNLP/ita-bench (25, Experimental): A collection of Italian benchmarks for LLM evaluation
20. deep-symbolic-mathematics/llm-srbench (25, Experimental): [ICML 2025 Oral] LLM-SRBench: A New Benchmark for Scientific Equation...
21. RaptorMai/MLLM-CompBench (24, Experimental): [NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs...
22. EternityYW/TRAM-Benchmark (23, Experimental): TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of...
23. Open-Social-World/EgoNormia (22, Experimental): EgoNormia | Benchmarking Physical Social Norm Understanding in VLMs
24. MileBench/MileBench (22, Experimental): This repo contains evaluation code for the paper "MileBench: Benchmarking...
25. AUCOHL/RTL-Repo (22, Experimental): RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects...
26. zchuz/TimeBench (22, Experimental): The repository for ACL 2024 paper "TimeBench: A Comprehensive Evaluation of...
27. PKU-YuanGroup/Video-Bench (16, Experimental): A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large...
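
The tier labels above appear to follow fixed score bands: every project scoring 50 or more is Established, 30 to 49 is Emerging, and below 30 is Experimental. The following sketch encodes those cutoffs; they are inferred from this listing rather than documented by the API, so treat the boundaries as an assumption.

def tier(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Cutoffs inferred from the listing above (>= 50 Established,
    >= 30 Emerging, else Experimental); not an official rule.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

# Spot-check against boundary rows from the listing:
assert tier(50) == "Established"   # stanfordnlp/axbench
assert tier(30) == "Emerging"      # HiThink-Research/BizFinBench
assert tier(28) == "Experimental"  # AIFEG/BenchLMM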