Evaluation Frameworks & Metrics: Transformer Models
19 projects are tracked in the evaluation frameworks & metrics subcategory of the transformers domain. 3 of them score above 50 (the established tier). The highest-rated is eth-sri/matharena at 55/100 with 229 stars.
Get all 19 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=20"
```

The API is open to everyone: 100 requests/day with no key, or 1,000 requests/day with a free key.
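If you want to consume the endpoint programmatically rather than via curl, here is a minimal Python sketch using only the standard library. The response schema is an assumption (a JSON body with project records under a `data` key, each carrying `name`, `score`, and `tier` fields); adjust the field names to whatever the API actually returns.

```python
import json
import urllib.request

# Public endpoint for this subcategory (100 requests/day without a key).
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=20"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    payload = json.load(resp)

# NOTE: the keys below ("data", "name", "score", "tier") are assumptions
# about the response schema, not documented API fields.
for project in payload.get("data", []):
    print(f"{project.get('name')}: {project.get('score')} ({project.get('tier')})")
```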
| # | Project | Description | Tier |
|---|---|---|---|
| 1 | eth-sri/matharena | Evaluation of LLMs on latest math competitions | Established |
| 2 | tatsu-lab/alpaca_eval | An automatic evaluator for instruction-following language models... | Established |
| 3 | HPAI-BSC/TuRTLe | TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025) | Established |
| 4 | nlp-uoregon/mlmm-evaluation | Multilingual Large Language Models Evaluation Benchmark | Emerging |
| 5 | JinjieNi/MixEval | The official evaluation suite and dynamic data release for MixEval. | Emerging |
| 6 | princeton-nlp/LLMBar | [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following | Emerging |
| 7 | grigio/llm-eval-simple | llm-eval-simple is a simple LLM evaluation framework with intermediate... | Emerging |
| 8 | haesleinhuepf/human-eval-bia | Benchmarking Large Language Models for Bio-Image Analysis Code Generation | Emerging |
| 9 | TIGER-AI-Lab/TIGERScore | "TIGERScore: Towards Building Explainable Metric for All Text Generation... | Emerging |
| 10 | waltonfuture/Diff-eRank | [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models | Emerging |
| 11 | chziakas/redeval | A library for red-teaming LLM applications with LLMs. | Emerging |
| 12 | Praveengovianalytics/falcon-evaluate | Falcon Evaluate is an open-source Python library aims to revolutionise the... | Experimental |
| 13 | open-compass/Ada-LEval | The official implementation of "Ada-LEval: Evaluating long-context LLMs with... | Experimental |
| 14 | jiayuww/SpatialEval | [NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning... | Experimental |
| 15 | GAIR-NLP/scaleeval | Scalable Meta-Evaluation of LLMs as Evaluators | Experimental |
| 16 | Praful932/llmsearch | Find better generation parameters for your LLM | Experimental |
| 17 | alphadl/OOP-eval | The first Object-Oriented Programming (OOP) Evaluation Benchmark for LLMs | Experimental |
| 18 | DigitalHarborFoundation/FlexEval | FlexEval is an LLM evaluation tool designed for practical quantitative analysis. | Experimental |
| 19 | UMass-Meta-LLM-Eval/llm_eval | A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup... | Experimental |