Evaluation Frameworks & Metrics: Transformer Models
19 projects are tracked in the evaluation frameworks & metrics subcategory of the transformers domain. 3 of them score above 50 (the established tier). The highest-rated is eth-sri/matharena at 55/100 with 229 stars.
Get all 19 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=20"
```

The API is open to everyone: 100 requests/day with no key, or 1,000 requests/day with a free key.
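If you want to consume the endpoint programmatically rather than via curl, here is a minimal Python sketch using only the standard library. The response schema is an assumption (a JSON body with project records under a `data` key, each carrying `name`, `score`, and `tier` fields); adjust the field names to whatever the API actually returns.

```python
import json
import urllib.request

# Public endpoint for this subcategory (100 requests/day without a key).
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=evaluation-frameworks-metrics&limit=20"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    payload = json.load(resp)

# NOTE: the keys below ("data", "name", "score", "tier") are assumptions
# about the response schema, not documented API fields.
for project in payload.get("data", []):
    print(f"{project.get('name')}: {project.get('score')} ({project.get('tier')})")
```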
| # | Project | Description | Tier |
|---|---|---|---|
| 1 | eth-sri/matharena | Evaluation of LLMs on latest math competitions | Established |
| 2 | tatsu-lab/alpaca_eval | An automatic evaluator for instruction-following language models... | Established |
| 3 | HPAI-BSC/TuRTLe | TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025) | Established |
| 4 | nlp-uoregon/mlmm-evaluation | Multilingual Large Language Models Evaluation Benchmark | Emerging |
| 5 | JinjieNi/MixEval | The official evaluation suite and dynamic data release for MixEval. | Emerging |
| 6 | princeton-nlp/LLMBar | [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following | Emerging |
| 7 | grigio/llm-eval-simple | llm-eval-simple is a simple LLM evaluation framework with intermediate... | Emerging |
| 8 | haesleinhuepf/human-eval-bia | Benchmarking Large Language Models for Bio-Image Analysis Code Generation | Emerging |
| 9 | TIGER-AI-Lab/TIGERScore | "TIGERScore: Towards Building Explainable Metric for All Text Generation... | Emerging |
| 10 | waltonfuture/Diff-eRank | [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models | Emerging |
| 11 | chziakas/redeval | A library for red-teaming LLM applications with LLMs. | Emerging |
| 12 | Praveengovianalytics/falcon-evaluate | Falcon Evaluate is an open-source Python library aims to revolutionise the... | Experimental |
| 13 | open-compass/Ada-LEval | The official implementation of "Ada-LEval: Evaluating long-context LLMs with... | Experimental |
| 14 | jiayuww/SpatialEval | [NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning... | Experimental |
| 15 | GAIR-NLP/scaleeval | Scalable Meta-Evaluation of LLMs as Evaluators | Experimental |
| 16 | Praful932/llmsearch | Find better generation parameters for your LLM | Experimental |
| 17 | alphadl/OOP-eval | The first Object-Oriented Programming (OOP) Evaluation Benchmark for LLMs | Experimental |
| 18 | DigitalHarborFoundation/FlexEval | FlexEval is an LLM evaluation tool designed for practical quantitative analysis. | Experimental |
| 19 | UMass-Meta-LLM-Eval/llm_eval | A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup... | Experimental |