Evaluation Frameworks & Metrics: Transformer Models

This page tracks 19 evaluation frameworks and metrics projects for transformer models. Three score above 50 (the established tier). The highest-rated is eth-sri/matharena at 55/100, with 229 stars.

Get all 19 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=20"
```

The API is open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
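As a sketch of how you might consume the endpoint above, the snippet below parses a response and assigns tiers from scores. The payload shape (a `data` list with `name` and `score` fields) is an assumption, not documented API behavior, and the tier thresholds are inferred from the scores in the table, so verify both against a live response before relying on them.

```python
import json

# ASSUMPTION: hypothetical response shape -- the real /datasets/quality
# payload may differ; inspect the JSON from the curl command above first.
SAMPLE = json.loads("""
{"data": [
  {"name": "eth-sri/matharena", "score": 55},
  {"name": "nlp-uoregon/mlmm-evaluation", "score": 42},
  {"name": "alphadl/OOP-eval", "score": 18}
]}
""")

def tier(score: int) -> str:
    """Map a 0-100 quality score to a tier.

    Thresholds are inferred from the rankings on this page
    (50+ Established, 30-49 Emerging, below 30 Experimental);
    they are not published by the API itself.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"

for project in SAMPLE["data"]:
    print(project["name"], project["score"], tier(project["score"]))
```

To fetch live data instead of the sample, replace `SAMPLE` with the parsed body of the curl request shown above.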

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | eth-sri/matharena | Evaluation of LLMs on latest math competitions | 55 | Established |
| 2 | tatsu-lab/alpaca_eval | An automatic evaluator for instruction-following language models... | 51 | Established |
| 3 | HPAI-BSC/TuRTLe | TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025) | 50 | Established |
| 4 | nlp-uoregon/mlmm-evaluation | Multilingual Large Language Models Evaluation Benchmark | 42 | Emerging |
| 5 | JinjieNi/MixEval | The official evaluation suite and dynamic data release for MixEval. | 37 | Emerging |
| 6 | princeton-nlp/LLMBar | [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following | 37 | Emerging |
| 7 | grigio/llm-eval-simple | llm-eval-simple is a simple LLM evaluation framework with intermediate... | 36 | Emerging |
| 8 | haesleinhuepf/human-eval-bia | Benchmarking Large Language Models for Bio-Image Analysis Code Generation | 34 | Emerging |
| 9 | TIGER-AI-Lab/TIGERScore | "TIGERScore: Towards Building Explainable Metric for All Text Generation... | 32 | Emerging |
| 10 | waltonfuture/Diff-eRank | [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models | 31 | Emerging |
| 11 | chziakas/redeval | A library for red-teaming LLM applications with LLMs. | 30 | Emerging |
| 12 | Praveengovianalytics/falcon-evaluate | Falcon Evaluate is an open-source Python library aims to revolutionise the... | 29 | Experimental |
| 13 | open-compass/Ada-LEval | The official implementation of "Ada-LEval: Evaluating long-context LLMs with... | 25 | Experimental |
| 14 | jiayuww/SpatialEval | [NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning... | 23 | Experimental |
| 15 | GAIR-NLP/scaleeval | Scalable Meta-Evaluation of LLMs as Evaluators | 23 | Experimental |
| 16 | Praful932/llmsearch | Find better generation parameters for your LLM | 20 | Experimental |
| 17 | alphadl/OOP-eval | The first Object-Oriented Programming (OOP) Evaluation Benchmark for LLMs | 18 | Experimental |
| 18 | DigitalHarborFoundation/FlexEval | FlexEval is an LLM evaluation tool designed for practical quantitative analysis. | 17 | Experimental |
| 19 | UMass-Meta-LLM-Eval/llm_eval | A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup... | 14 | Experimental |