Evaluation Frameworks & Metrics (LLM Tools)
Tools for building, running, and standardizing LLM evaluation systems with multiple metrics, benchmarking pipelines, and automated scoring. Does NOT include domain-specific benchmarks (math, code, reasoning) or safety/robustness-focused evaluations.
133 evaluation framework and metrics tools are tracked. Four score above 70 (the Verified tier). The highest-rated is EvolvingLMMs-Lab/lmms-eval at 90/100, with 3,883 stars and 9,061 monthly downloads. Three of the top 10 are actively maintained.
Get all 133 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=20"
```

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
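For programmatic use, the same endpoint can be queried from a script. Below is a minimal Python sketch using `requests`; the URL and query parameters are taken from the curl example above, but the shape of the JSON response (and any field names inside it) is an assumption, so inspect the payload before relying on specific keys.

```python
# Minimal sketch: fetch this subcategory's projects as JSON.
# Endpoint and query parameters come from the curl example above; the
# structure of the response body is an assumption -- inspect it first.
import json
import requests

URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "llm-tools",
    "subcategory": "evaluation-frameworks-metrics",
    "limit": 20,  # the documented example uses 20; adjust as the API allows
}

resp = requests.get(URL, params=params, timeout=30)
resp.raise_for_status()

data = resp.json()
print(json.dumps(data, indent=2)[:2000])  # preview the returned payload
```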
| # | Tool | Description | Tier |
|---|---|---|---|
| 1 | EvolvingLMMs-Lab/lmms-eval | One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks | Verified |
| 2 | open-compass/VLMEvalKit | Open-source evaluation toolkit of large multi-modality models (LMMs),... | Verified |
| 3 | Giskard-AI/giskard-oss | 🐢 Open-Source Evaluation & Testing library for LLM Agents | Verified |
| 4 | vibrantlabsai/ragas | Supercharge Your LLM Application Evaluations 🚀 | Verified |
| 5 | EuroEval/EuroEval | The robust European language model benchmark. | Established |
| 6 | evalplus/evalplus | Rigourous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 | Established |
| 7 | parameterlab/MASEval | Multi-Agent LLM Evaluation | Established |
| 8 | dustalov/evalica | Evalica, your favourite evaluation toolkit | Established |
| 9 | mohsenhariri/scorio | Statistical evaluation, comparison, and ranking of Large Language Models | Established |
| 10 | DebarghaG/proofofthought | Proof of thought : LLM-based reasoning using Z3 theorem proving with... | Established |
| 11 | aiverify-foundation/moonshot | Moonshot - A simple and modular tool to evaluate and red-team any LLM application. | Established |
| 12 | sciknoworg/YESciEval | YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering... | Emerging |
| 13 | zli12321/qa_metrics | An easy python package to run quick basic QA evaluations. This package... | Emerging |
| 14 | IAAR-Shanghai/xFinder | [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for... | Emerging |
| 15 | fiddler-labs/fiddler-auditor | Fiddler Auditor is a tool to evaluate language models. | Emerging |
| 16 | evo-eval/evoeval | EvoEval: Evolving Coding Benchmarks via LLM | Emerging |
| 17 | huggingface/evaluation-guidebook | Sharing both practical insights and theoretical knowledge about LLM... | Emerging |
| 18 | InternScience/SciEvalKit | A unified evaluation toolkit and leaderboard for rigorously assessing the... | Emerging |
| 19 | lean-dojo/ReProver | Retrieval-Augmented Theorem Provers for Lean | Emerging |
| 20 | kieranklaassen/leva | LLM Evaluation Framework for Rails apps to be used with production data. | Emerging |
| 21 | mlchrzan/pairadigm | Concept-Guided Chain-of-Thought (CGCoT) pairwise annotation tool for... | Emerging |
| 22 | SeekingDream/Static-to-Dynamic-LLMEval | The official GitHub repository of the paper "Recent advances in large... | Emerging |
| 23 | ShuntaroOkuma/adapt-gauge-core | Measure LLM adaptation efficiency — how fast models learn from few examples | Emerging |
| 24 | bowen-upenn/PersonaMem | [COLM 2025] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User... | Emerging |
| 25 | prometheus-eval/prometheus-eval | Evaluate your LLM's response with Prometheus and GPT4 💯 | Emerging |
| 26 | IS2Lab/S-Eval | S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large... | Emerging |
| 27 | ai-twinkle/Eval | Twinkle Eval: an efficient and accurate AI evaluation tool | Emerging |
| 28 | alopatenko/LLMEvaluation | A comprehensive guide to LLM evaluation methods designed to assist in... | Emerging |
| 29 | flexpa/llm-fhir-eval | Benchmarking Large Language Models for FHIR | Emerging |
| 30 | ai4society/GenAIResultsComparator | A Python library providing evaluation metrics to compare generated texts... | Emerging |
| 31 | multinear/multinear | Develop reliable AI apps | Emerging |
| 32 | HiThink-Research/GAGE | General AI evaluation and Gauge Engine. A unified evaluation engine for... | Emerging |
| 33 | OpenDCAI/One-Eval | Automated system for LLM evaluation via agents. | Emerging |
| 34 | FastEval/FastEval | Fast & more realistic evaluation of chat language models. Includes leaderboard. | Emerging |
| 35 | langwatch/langevals | LangEvals aggregates various language model evaluators into a single... | Emerging |
| 36 | VikhrModels/ru_llm_arena | Modified Arena-Hard-Auto LLM evaluation toolkit with an emphasis on Russian language | Emerging |
| 37 | namin/llm-verified-with-monte-carlo-tree-search | LLM verified with Monte Carlo Tree Search | Emerging |
| 38 | root-signals/scorable-sdk | Scorable SDK | Emerging |
| 39 | IAAR-Shanghai/UHGEval | [ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks:... | Emerging |
| 40 | mims-harvard/Qworld | Qworld: Question-Specific Evaluation Criteria for LLMs | Emerging |
| 41 | RGGH/evaluate | Evaluate - The Robust LLM Testing Framework 🦀 | Emerging |
| 42 | lmarena/search-arena | ⚔️ [ICLR 2026] Official code of "Search Arena: Analyzing Search-Augmented LLMs". | Emerging |
| 43 | wgryc/phasellm | Large language model evaluation and workflow framework from Phase AI. | Emerging |
| 44 | superagent-ai/poker-eval | A comprehensive tool for assessing AI Agents performance in simulated poker... | Emerging |
| 45 | terryyz/ice-score | [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code | Emerging |
| 46 | pyladiesams/eval-llm-based-apps-jan2025 | Create an evaluation framework for your LLM based app. Incorporate it into... | Emerging |
| 47 | MLGroupJLU/LLM-eval-survey | The official GitHub page for the survey paper "A Survey on Evaluation of... | Emerging |
| 48 | franckalbinet/evaluatr | Streamline policy evaluation workflows with AI-driven analysis and... | Emerging |
| 49 | sileod/llm-theory-of-mind | Testing Theory of Mind (ToM) in language models with epistemic logic | Experimental |
| 50 | gordicaleksa/serbian-llm-eval | Serbian LLM Eval. | Experimental |
| 51 | ZeroSumEval/ZeroSumEval | A framework for pitting LLMs against each other in an evolving library of games ⚔ | Experimental |
| 52 | Cohere-Labs/multilingual-llm-evaluation-checklist | mLLM evaluation checklist | Experimental |
| 53 | CS-EVAL/CS-Eval | CS-Eval is a comprehensive evaluation suite for fundamental cybersecurity... | Experimental |
| 54 | MisterBrookT/Scorpio | SCORPIO is a system-algorithm co-designed LLM serving engine that... | Experimental |
| 55 | PeytonCleveland/Darwin | Implementation of prompt evolution based on Evol-Instruct | Experimental |
| 56 | IAAR-Shanghai/GuessArena | [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for... | Experimental |
| 57 | Re-Align/just-eval | A simple GPT-based evaluation tool for multi-aspect, interpretable... | Experimental |
| 58 | zorse-project/COBOLEval | Evaluate LLM-generated COBOL | Experimental |
| 59 | Contextualist/lone-arena | Self-hosted LLM chatbot arena, with yourself as the only judge | Experimental |
| 60 | sinanuozdemir/oreilly-evaluating-llms | Metrics, Benchmarks, and Practical Tools for Assessing Large Language Models | Experimental |
| 61 | AMDResearch/NPUEval | NPUEval is an LLM evaluation dataset written specifically to target AIE... | Experimental |
| 62 | GURPREETKAURJETHRA/LLMs-Evaluation | LLMs Evaluation | Experimental |
| 63 | epam/ai-dial-rag-eval | A python library designed for RAG (Retrieval-Augmented Generation)... | Experimental |
| 64 | Azure-Samples/llm-eval-grader-samples | Framework for Post-production Evaluation of LLM based ChatBots | Experimental |
| 65 | mankinds/mankinds-eval | Open-source Python library for evaluating AI systems | Experimental |
| 66 | mags0ft/hle-eval-ollama | An easy-to-use evaluation tool for running Humanity's Last Exam on (locally)... | Experimental |
| 67 | claw-eval/claw-eval | Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks... | Experimental |
| 68 | ElevenLiy/MATEval | MATEval is the first multi-agent framework simulating human collaborative... | Experimental |
| 69 | mit-ll-ai-technology/llm-sandbox | Large language model evaluation framework for logic and open-ended Q&A with... | Experimental |
| 70 | GAI-Community/GraphOmni | Enable Comprehensive LLM Evaluation on Graph Reasoning | Experimental |
| 71 | vienneraphael/layton-eval | layton-eval is an AI eval benchmark for divergent, out-of-the-box and... | Experimental |
| 72 | allenai/CommonGen-Eval | Evaluating LLMs with CommonGen-Lite | Experimental |
| 73 | kaistAI/FLASK | [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on... | Experimental |
| 74 | telekom/llm_evaluation_results | LLM evaluation results | Experimental |
| 75 | aws-samples/model-as-a-judge-eval | Notebooks for evaluating LLM based applications using the Model (LLM) as a... | Experimental |
| 76 | Ryota-Kawamura/Evaluating-and-Debugging-Generative-AI | Machine learning and AI projects require managing diverse data sources, vast... | Experimental |
| 77 | Goodeye-Labs/truesight-docs | Official documentation for Truesight — an AI evaluation platform for scoring... | Experimental |
| 78 | evalkit/evalkit | The TypeScript LLM Evaluation Library | Experimental |
| 79 | Aysnc-Labs/llm-eval | A PHP package for evaluating LLM outputs. Test your prompts, validate... | Experimental |
| 80 | jacobkandel/llm-content-moderation-analysis | Open-Source benchmark tracking LLM censorship and content moderation bias... | Experimental |
| 81 | prorok9898/ERR-EVAL | 🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty... | Experimental |
| 82 | Humanity-s-Last-Code-Exam/HLCE | (EMNLP 2025 Findings) Source Evaluation scripts for Humanity's Last Code Exam | Experimental |
| 83 | hitz-zentroa/latxa | Latxa: An Open Language Model and Evaluation Suite for Basque | Experimental |
| 84 | IngestAI/deepmark | Deepmark AI enables a unique testing environment for language models (LLM)... | Experimental |
| 85 | McTosh1/modal-llm-evaluator | ⚡ Evaluate LLM prompts at scale with fast, parallel execution, real-time... | Experimental |
| 86 | AntGamerMD21/eval-guide | 📊 Explore ML evaluation metrics through interactive notebooks with pre-run... | Experimental |
| 87 | psandhaas/evaLLM | QA framework for evaluating LLM outputs based on user-defined metrics | Experimental |
| 88 | hnshah/verdict | LLM eval framework. Compare any model via OpenAI-compatible API. | Experimental |
| 89 | broomva/nous | Metacognitive evaluation — real-time quality scoring with inline heuristics... | Experimental |
| 90 | wahhyun/llm-eval | Evaluate large language models with tools for performance and consistency... | Experimental |
| 91 | Linlichinese/rail-score | 🚀 Enable accurate assessment of AI models with the RAIL Score Python SDK,... | Experimental |
| 92 | brucewlee/nutcracker | Large Model Evaluation Experiments | Experimental |
| 93 | horde-research/horde-common | Shared scripts for offline Kazakh LLM eval—run inference, auto-score, and... | Experimental |
| 94 | deshwalmahesh/PHUDGE | Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your... | Experimental |
| 95 | linhaowei1/kumo | ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models | Experimental |
| 96 | franckalbinet/iomeval | Streamline evaluation evidence mapping at scale with LLMs | Experimental |
| 97 | hparreao/Awesome-AI-Evaluation-Guide | A comprehensive, implementation-focused guide to evaluating Large Language... | Experimental |
| 98 | vjroy/routeeval | RouteEval: A benchmark for evaluating LLM tool calling in running route... | Experimental |
| 99 | spenceryonce/LLMeval | Evaluate and compare large language models (LLMs) for chatbot applications,... | Experimental |
| 100 | lechmazur/sycophancy | LLM benchmark and leaderboard for narrator-bias sycophancy,... | Experimental |
| 101 | AkhileshMalthi/llm-eval-framework | A production-grade framework for evaluating Large Language Model (LLM)... | Experimental |
| 102 | AtomEcho/AtomBulb | Aims to provide an intuitive, concrete, and standardized evaluation of current mainstream LLMs | Experimental |
| 103 | david-xander/measuring-llm-knowledge | How much does an LLM know about my programming language? | Experimental |
| 104 | framersai/promptmachine-eval | LLM evaluation framework with ELO ratings, arena battles, and benchmark testing | Experimental |
| 105 | LeonEricsson/llmjudge | Exploring limitations of LLM-as-a-judge | Experimental |
| 106 | Vibhanshu-555/Human-Aligned-LLM-Evaluation-Audit | A data-driven audit of AI judge reliability using MT-Bench human... | Experimental |
| 107 | OleksandrZadvornyi/prompt-engineering | An automated evaluation framework for assessing the credibility of... | Experimental |
| 108 | BhuvanDontha/YouTube-policy-enforcement-auditor | Independent YouTube evaluation framework for content policy classification.... | Experimental |
| 109 | jaaack-wang/multi-problem-eval-llm | Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing... | Experimental |
| 110 | djador13/moderatefocus | 🔍 Analyze community moderation and platform policies with the ModerateFocus... | Experimental |
| 111 | sanand0/llmmath | How good are LLMs at mental math? An evaluation across 50 models from... | Experimental |
| 112 | CSLiJT/awesome-lm-evaluation-methodologies | Frontier papers in the evaluation methodologies of language models. | Experimental |
| 113 | Theepankumargandhi/llm-annotation-quality-pipeline | Production-grade pipeline for validating annotation consistency and... | Experimental |
| 114 | serhiismetanskyi/llm-output-evaluation-with-deepeval | DeepEval LLM quality evaluation tests with LLM-as-a-judge | Experimental |
| 115 | MukundaKatta/redpill | The Red Pill Test — Can LLMs recognize the boundaries of their own reality?... | Experimental |
| 116 | nicolay-r/RuSentRel-Leaderboard | This is an official Leaderboard for the RuSentRel-1.1 dataset originally... | Experimental |
| 117 | vakyansh/truthfulqa_indic | Truthfulqa_indic, Available in Hindi, Punjabi, Kannada, Tamil and Telugu | Experimental |
| 118 | giuliano-t/llm-financial-regulatory-auditor | A structured evaluation pipeline for LLM-generated outputs in financial... | Experimental |
| 119 | crux82/wikigame-llm-eval | Companion repo for CLiC-it 2025 paper on WikiGame. Reproducible pipeline to... | Experimental |
| 120 | Yifan-Song793/GoodBadGreedy | The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore... | Experimental |
| 121 | dustalov/llmfao | Large Language Model Feedback Analysis and Optimization (LLMFAO) | Experimental |
| 122 | JinjieNi/MixEval-X | The official github repo for MixEval-X, the first any-to-any, real-world benchmark. | Experimental |
| 123 | grgong/agent-exam-model-eval | Agent exam built from Posit’s model-eval R LLM benchmark (baseline snapshot... | Experimental |
| 124 | 2pa4ul2/Easygen-v2 | Exam Generation With Large Language Model (LLMs) | Experimental |
| 125 | The-Learning-Algorithm/ai-judge-pipeline | A comprehensive pipeline for generating, analyzing, and evaluating models... | Experimental |
| 126 | DavidShableski/llm-evaluation-framework | A production-grade platform to evaluate and compare the performance of Large... | Experimental |
| 127 | arjunpatel7/alakazam-vgc | An LLM powered speed check assistant for Pokemon VGC Players | Experimental |
| 128 | user1342/conjecture | Evaluating the likelihood of data points in a LLM's training set | Experimental |
| 129 | krisstallenberg/evaluating-annotations | This repository holds code to annotate textual data using LLMs, and... | Experimental |
| 130 | SouravD-Me/LLM-Evaluation-Dashboard | A Visual Dashboard for Fundamental Benchmarking of LLMs | Experimental |
| 131 | prabdeb/agenteval-sample | AgentEval (AutoGen 0.4) Sample Implementation | Experimental |
| 132 | AYUSH27112021/GENERATIVE-IMAGE-COMPARISION | Different Evaluation Metrics for Image Generation Models | Experimental |
| 133 | franciellevargas/MFTCXplain | MFTCXplain is the first multilingual benchmark dataset designed to evaluate... | Experimental |