LLM Tools: Evaluation Frameworks & Metrics

Tools for building, running, and standardizing LLM evaluation systems with multiple metrics, benchmarking pipelines, and automated scoring. Does NOT include domain-specific benchmarks (math, code, reasoning) or safety/robustness-focused evaluations.

133 evaluation framework and metrics tools are tracked; 4 score above 70 (Verified tier). The highest-rated is EvolvingLMMs-Lab/lmms-eval at 90/100, with 3,883 stars and 9,061 monthly downloads. 3 of the top 10 are actively maintained.

Get all 133 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
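
If you prefer to work with the JSON from code rather than the command line, a minimal Python sketch along these lines fetches the same endpoint and tallies the projects per tier. The response schema is not documented here, so the envelope key ("data"), the per-project fields ("name", "score", "tier"), and the assumption that "limit" controls how many records come back are placeholders to adjust against the actual payload.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
query = urlencode({
    "domain": "llm-tools",
    "subcategory": "evaluation-frameworks-metrics",
    "limit": 133,  # assumption: "limit" caps the number of records returned
})

with urlopen(f"{BASE_URL}?{query}") as response:
    payload = json.load(response)

# Assumed shape: either a bare JSON array of project records, or an object
# wrapping that array under a "data" key; each record is assumed to carry
# "name", "score", and "tier" fields.
projects = payload["data"] if isinstance(payload, dict) else payload

# Group the projects by tier and report the best-scoring tool in each group.
by_tier = {}
for project in projects:
    by_tier.setdefault(project.get("tier", "unknown"), []).append(project)

for tier, tools in sorted(by_tier.items()):
    best = max(tools, key=lambda t: t.get("score", 0))
    print(f"{tier}: {len(tools)} tools; highest-rated {best.get('name')} at {best.get('score')}")

The same pattern works once you have a free key; how the key is passed (header or query parameter) is not specified here, so check the API documentation before adding it.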

All 133 tools, ranked by score; each entry lists the tool, its score and tier, and its description:

1. EvolvingLMMs-Lab/lmms-eval (90, Verified): One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
2. open-compass/VLMEvalKit (72, Verified): Open-source evaluation toolkit of large multi-modality models (LMMs),...
3. Giskard-AI/giskard-oss (70, Verified): 🐢 Open-Source Evaluation & Testing library for LLM Agents
4. vibrantlabsai/ragas (70, Verified): Supercharge Your LLM Application Evaluations 🚀
5. EuroEval/EuroEval (63, Established): The robust European language model benchmark.
6. evalplus/evalplus (63, Established): Rigourous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
7. parameterlab/MASEval (59, Established): Multi-Agent LLM Evaluation
8. dustalov/evalica (58, Established): Evalica, your favourite evaluation toolkit
9. mohsenhariri/scorio (54, Established): Statistical evaluation, comparison, and ranking of Large Language Models
10. DebarghaG/proofofthought (54, Established): Proof of thought : LLM-based reasoning using Z3 theorem proving with...
11. aiverify-foundation/moonshot (51, Established): Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
12. sciknoworg/YESciEval (48, Emerging): YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering...
13. zli12321/qa_metrics (48, Emerging): An easy python package to run quick basic QA evaluations. This package...
14. IAAR-Shanghai/xFinder (46, Emerging): [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for...
15. fiddler-labs/fiddler-auditor (44, Emerging): Fiddler Auditor is a tool to evaluate language models.
16. evo-eval/evoeval (43, Emerging): EvoEval: Evolving Coding Benchmarks via LLM
17. huggingface/evaluation-guidebook (42, Emerging): Sharing both practical insights and theoretical knowledge about LLM...
18. InternScience/SciEvalKit (42, Emerging): A unified evaluation toolkit and leaderboard for rigorously assessing the...
19. lean-dojo/ReProver (42, Emerging): Retrieval-Augmented Theorem Provers for Lean
20. kieranklaassen/leva (41, Emerging): LLM Evaluation Framework for Rails apps to be used with production data.
21. mlchrzan/pairadigm (40, Emerging): Concept-Guided Chain-of-Thought (CGCoT) pairwise annotation tool for...
22. SeekingDream/Static-to-Dynamic-LLMEval (39, Emerging): The official GitHub repository of the paper "Recent advances in large...
23. ShuntaroOkuma/adapt-gauge-core (38, Emerging): Measure LLM adaptation efficiency — how fast models learn from few examples
24. bowen-upenn/PersonaMem (38, Emerging): [COLM 2025] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User...
25. prometheus-eval/prometheus-eval (37, Emerging): Evaluate your LLM's response with Prometheus and GPT4 💯
26. IS2Lab/S-Eval (37, Emerging): S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large...
27. ai-twinkle/Eval (37, Emerging): Twinkle Eval: an efficient and accurate AI evaluation tool
28. alopatenko/LLMEvaluation (36, Emerging): A comprehensive guide to LLM evaluation methods designed to assist in...
29. flexpa/llm-fhir-eval (36, Emerging): Benchmarking Large Language Models for FHIR
30. ai4society/GenAIResultsComparator (36, Emerging): A Python library providing evaluation metrics to compare generated texts...
31. multinear/multinear (35, Emerging): Develop reliable AI apps
32. HiThink-Research/GAGE (35, Emerging): General AI evaluation and Gauge Engine. A unified evaluation engine for...
33. OpenDCAI/One-Eval (35, Emerging): Automated system for LLM evaluation via agents.
34. FastEval/FastEval (35, Emerging): Fast & more realistic evaluation of chat language models. Includes leaderboard.
35. langwatch/langevals (34, Emerging): LangEvals aggregates various language model evaluators into a single...
36. VikhrModels/ru_llm_arena (34, Emerging): Modified Arena-Hard-Auto LLM evaluation toolkit with an emphasis on Russian language
37. namin/llm-verified-with-monte-carlo-tree-search (34, Emerging): LLM verified with Monte Carlo Tree Search
38. root-signals/scorable-sdk (33, Emerging): Scorable SDK
39. IAAR-Shanghai/UHGEval (33, Emerging): [ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks:...
40. mims-harvard/Qworld (33, Emerging): Qworld: Question-Specific Evaluation Criteria for LLMs
41. RGGH/evaluate (32, Emerging): Evaluate - The Robust LLM Testing Framework 🦀
42. lmarena/search-arena (32, Emerging): ⚔️ [ICLR 2026] Official code of "Search Arena: Analyzing Search-Augmented LLMs".
43. wgryc/phasellm (32, Emerging): Large language model evaluation and workflow framework from Phase AI.
44. superagent-ai/poker-eval (31, Emerging): A comprehensive tool for assessing AI Agents performance in simulated poker...
45. terryyz/ice-score (31, Emerging): [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code
46. pyladiesams/eval-llm-based-apps-jan2025 (31, Emerging): Create an evaluation framework for your LLM based app. Incorporate it into...
47. MLGroupJLU/LLM-eval-survey (30, Emerging): The official GitHub page for the survey paper "A Survey on Evaluation of...
48. franckalbinet/evaluatr (30, Emerging): Streamline policy evaluation workflows with AI-driven analysis and...
49. sileod/llm-theory-of-mind (29, Experimental): Testing Theory of Mind (ToM) in language models with epistemic logic
50. gordicaleksa/serbian-llm-eval (29, Experimental): Serbian LLM Eval.
51. ZeroSumEval/ZeroSumEval (29, Experimental): A framework for pitting LLMs against each other in an evolving library of games ⚔
52. Cohere-Labs/multilingual-llm-evaluation-checklist (28, Experimental): mLLM evaluation checklist
53. CS-EVAL/CS-Eval (28, Experimental): CS-Eval is a comprehensive evaluation suite for fundamental cybersecurity...
54. MisterBrookT/Scorpio (28, Experimental): SCORPIO is a system-algorithm co-designed LLM serving engine that...
55. PeytonCleveland/Darwin (28, Experimental): Implementation of prompt evolution based on Evol-Instruct
56. IAAR-Shanghai/GuessArena (28, Experimental): [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for...
57. Re-Align/just-eval (28, Experimental): A simple GPT-based evaluation tool for multi-aspect, interpretable...
58. zorse-project/COBOLEval (27, Experimental): Evaluate LLM-generated COBOL
59. Contextualist/lone-arena (27, Experimental): Self-hosted LLM chatbot arena, with yourself as the only judge
60. sinanuozdemir/oreilly-evaluating-llms (27, Experimental): Metrics, Benchmarks, and Practical Tools for Assessing Large Language Models
61. AMDResearch/NPUEval (26, Experimental): NPUEval is an LLM evaluation dataset written specifically to target AIE...
62. GURPREETKAURJETHRA/LLMs-Evaluation (26, Experimental): LLMs Evaluation
63. epam/ai-dial-rag-eval (26, Experimental): A python library designed for RAG (Retrieval-Augmented Generation)...
64. Azure-Samples/llm-eval-grader-samples (26, Experimental): Framework for Post-production Evaluation of LLM based ChatBots
65. mankinds/mankinds-eval (25, Experimental): Open-source Python library for evaluating AI systems
66. mags0ft/hle-eval-ollama (25, Experimental): An easy-to-use evaluation tool for running Humanity's Last Exam on (locally)...
67. claw-eval/claw-eval (25, Experimental): Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks...
68. ElevenLiy/MATEval (25, Experimental): MATEval is the first multi-agent framework simulating human collaborative...
69. mit-ll-ai-technology/llm-sandbox (25, Experimental): Large language model evaluation framework for logic and open-ended Q&A with...
70. GAI-Community/GraphOmni (24, Experimental): Enable Comprehensive LLM Evaluation on Graph Reasoning
71. vienneraphael/layton-eval (24, Experimental): layton-eval is an AI eval benchmark for divergent, out-of-the-box and...
72. allenai/CommonGen-Eval (24, Experimental): Evaluating LLMs with CommonGen-Lite
73. kaistAI/FLASK (24, Experimental): [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on...
74. telekom/llm_evaluation_results (24, Experimental): LLM evaluation results
75. aws-samples/model-as-a-judge-eval (24, Experimental): Notebooks for evaluating LLM based applications using the Model (LLM) as a...
76. Ryota-Kawamura/Evaluating-and-Debugging-Generative-AI (24, Experimental): Machine learning and AI projects require managing diverse data sources, vast...
77. Goodeye-Labs/truesight-docs (23, Experimental): Official documentation for Truesight — an AI evaluation platform for scoring...
78. evalkit/evalkit (23, Experimental): The TypeScript LLM Evaluation Library
79. Aysnc-Labs/llm-eval (23, Experimental): A PHP package for evaluating LLM outputs. Test your prompts, validate...
80. jacobkandel/llm-content-moderation-analysis (23, Experimental): Open-Source benchmark tracking LLM censorship and content moderation bias...
81. prorok9898/ERR-EVAL (23, Experimental): 🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty...
82. Humanity-s-Last-Code-Exam/HLCE (22, Experimental): (EMNLP 2025 Findings) Source Evaluation scripts for Humanity's Last Code Exam
83. hitz-zentroa/latxa (22, Experimental): Latxa: An Open Language Model and Evaluation Suite for Basque
84. IngestAI/deepmark (22, Experimental): Deepmark AI enables a unique testing environment for language models (LLM)...
85. McTosh1/modal-llm-evaluator (22, Experimental): ⚡ Evaluate LLM prompts at scale with fast, parallel execution, real-time...
86. AntGamerMD21/eval-guide (22, Experimental): 📊 Explore ML evaluation metrics through interactive notebooks with pre-run...
87. psandhaas/evaLLM (22, Experimental): QA framework for evaluating LLM outputs based on user-defined metrics
88. hnshah/verdict (22, Experimental): LLM eval framework. Compare any model via OpenAI-compatible API.
89. broomva/nous (22, Experimental): Metacognitive evaluation — real-time quality scoring with inline heuristics...
90. wahhyun/llm-eval (22, Experimental): Evaluate large language models with tools for performance and consistency...
91. Linlichinese/rail-score (22, Experimental): 🚀 Enable accurate assessment of AI models with the RAIL Score Python SDK,...
92. brucewlee/nutcracker (22, Experimental): Large Model Evaluation Experiments
93. horde-research/horde-common (22, Experimental): Shared scripts for offline Kazakh LLM eval—run inference, auto-score, and...
94. deshwalmahesh/PHUDGE (22, Experimental): Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your...
95. linhaowei1/kumo (22, Experimental): ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models
96. franckalbinet/iomeval (20, Experimental): Streamline evaluation evidence mapping at scale with LLMs
97. hparreao/Awesome-AI-Evaluation-Guide (20, Experimental): A comprehensive, implementation-focused guide to evaluating Large Language...
98. vjroy/routeeval (19, Experimental): RouteEval: A benchmark for evaluating LLM tool calling in running route...
99. spenceryonce/LLMeval (19, Experimental): Evaluate and compare large language models (LLMs) for chatbot applications,...
100. lechmazur/sycophancy (19, Experimental): LLM benchmark and leaderboard for narrator-bias sycophancy,...
101. AkhileshMalthi/llm-eval-framework (19, Experimental): A production-grade framework for evaluating Large Language Model (LLM)...
102. AtomEcho/AtomBulb (18, Experimental): Aims to provide an intuitive, concrete, and standardized evaluation of today's mainstream LLMs
103. david-xander/measuring-llm-knowledge (16, Experimental): How much does an LLM know about my programming language?
104. framersai/promptmachine-eval (16, Experimental): LLM evaluation framework with ELO ratings, arena battles, and benchmark testing
105. LeonEricsson/llmjudge (15, Experimental): Exploring limitations of LLM-as-a-judge
106. Vibhanshu-555/Human-Aligned-LLM-Evaluation-Audit (15, Experimental): A data-driven audit of AI judge reliability using MT-Bench human...
107. OleksandrZadvornyi/prompt-engineering (15, Experimental): An automated evaluation framework for assessing the credibility of...
108. BhuvanDontha/YouTube-policy-enforcement-auditor (15, Experimental): Independent YouTube evaluation framework for content policy classification....
109. jaaack-wang/multi-problem-eval-llm (14, Experimental): Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing...
110. djador13/moderatefocus (14, Experimental): 🔍 Analyze community moderation and platform policies with the ModerateFocus...
111. sanand0/llmmath (14, Experimental): How good are LLMs at mental math? An evaluation across 50 models from...
112. CSLiJT/awesome-lm-evaluation-methodologies (14, Experimental): Frontier papers in the evaluation methodologies of language models.
113. Theepankumargandhi/llm-annotation-quality-pipeline (14, Experimental): Production-grade pipeline for validating annotation consistency and...
114. serhiismetanskyi/llm-output-evaluation-with-deepeval (14, Experimental): DeepEval LLM quality evaluation tests with LLM-as-a-judge
115. MukundaKatta/redpill (14, Experimental): The Red Pill Test — Can LLMs recognize the boundaries of their own reality?...
116. nicolay-r/RuSentRel-Leaderboard (13, Experimental): This is an official Leaderboard for the RuSentRel-1.1 dataset originally...
117. vakyansh/truthfulqa_indic (12, Experimental): Truthfulqa_indic, Available in Hindi, Punjabi, Kannada, Tamil and Telugu
118. giuliano-t/llm-financial-regulatory-auditor (12, Experimental): A structured evaluation pipeline for LLM-generated outputs in financial...
119. crux82/wikigame-llm-eval (12, Experimental): Companion repo for CLiC-it 2025 paper on WikiGame. Reproducible pipeline to...
120. Yifan-Song793/GoodBadGreedy (12, Experimental): The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore...
121. dustalov/llmfao (12, Experimental): Large Language Model Feedback Analysis and Optimization (LLMFAO)
122. JinjieNi/MixEval-X (12, Experimental): The official github repo for MixEval-X, the first any-to-any, real-world benchmark.
123. grgong/agent-exam-model-eval (11, Experimental): Agent exam built from Posit's model-eval R LLM benchmark (baseline snapshot...
124. 2pa4ul2/Easygen-v2 (11, Experimental): Exam Generation With Large Language Model (LLMs)
125. The-Learning-Algorithm/ai-judge-pipeline (11, Experimental): A comprehensive pipeline for generating, analyzing, and evaluating models...
126. DavidShableski/llm-evaluation-framework (11, Experimental): A production-grade platform to evaluate and compare the performance of Large...
127. arjunpatel7/alakazam-vgc (11, Experimental): An LLM powered speed check assistant for Pokemon VGC Players
128. user1342/conjecture (11, Experimental): Evaluating the likelihood of data points in a LLM's training set
129. krisstallenberg/evaluating-annotations (11, Experimental): This repository holds code to annotate textual data using LLMs, and...
130. SouravD-Me/LLM-Evaluation-Dashboard (10, Experimental): A Visual Dashboard for Fundamental Benchmarking of LLMs
131. prabdeb/agenteval-sample (10, Experimental): AgentEval (AutoGen 0.4) Sample Implementation
132. AYUSH27112021/GENERATIVE-IMAGE-COMPARISION (10, Experimental): Different Evaluation Metrics for Image Generation Models
133. franciellevargas/MFTCXplain (10, Experimental): MFTCXplain is the first multilingual benchmark dataset designed to evaluate...