Evaluation Frameworks & Metrics (LLM Tools)
Tools for building, running, and standardizing LLM evaluation systems with multiple metrics, benchmarking pipelines, and automated scoring. Does NOT include domain-specific benchmarks (math, code, reasoning) or safety/robustness-focused evaluations.
133 evaluation framework and metrics tools are tracked. Four score above 70 (the Verified tier). The highest-rated is EvolvingLMMs-Lab/lmms-eval at 90/100, with 3,883 stars and 9,061 monthly downloads. Three of the top 10 are actively maintained.
Get all 133 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=evaluation-frameworks-metrics&limit=20"
```

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
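For programmatic use, the same endpoint can be queried from a script. Below is a minimal Python sketch using `requests`; the URL and query parameters are taken from the curl example above, but the shape of the JSON response (and any field names inside it) is an assumption, so inspect the payload before relying on specific keys.

```python
# Minimal sketch: fetch this subcategory's projects as JSON.
# Endpoint and query parameters come from the curl example above; the
# structure of the response body is an assumption -- inspect it first.
import json
import requests

URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "llm-tools",
    "subcategory": "evaluation-frameworks-metrics",
    "limit": 20,  # the documented example uses 20; adjust as the API allows
}

resp = requests.get(URL, params=params, timeout=30)
resp.raise_for_status()

data = resp.json()
print(json.dumps(data, indent=2)[:2000])  # preview the returned payload
```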
| # | Tool | Description | Tier |
|---|---|---|---|
| 1 | EvolvingLMMs-Lab/lmms-eval | One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks | Verified |
| 2 | open-compass/VLMEvalKit | Open-source evaluation toolkit of large multi-modality models (LMMs),... | Verified |
| 3 | Giskard-AI/giskard-oss | 🐢 Open-Source Evaluation & Testing library for LLM Agents | Verified |
| 4 | vibrantlabsai/ragas | Supercharge Your LLM Application Evaluations 🚀 | Verified |
| 5 | EuroEval/EuroEval | The robust European language model benchmark. | Established |
| 6 | evalplus/evalplus | Rigourous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 | Established |
| 7 | parameterlab/MASEval | Multi-Agent LLM Evaluation | Established |
| 8 | dustalov/evalica | Evalica, your favourite evaluation toolkit | Established |
| 9 | mohsenhariri/scorio | Statistical evaluation, comparison, and ranking of Large Language Models | Established |
| 10 | DebarghaG/proofofthought | Proof of thought : LLM-based reasoning using Z3 theorem proving with... | Established |
| 11 | aiverify-foundation/moonshot | Moonshot - A simple and modular tool to evaluate and red-team any LLM application. | Established |
| 12 | sciknoworg/YESciEval | YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering... | Emerging |
| 13 | zli12321/qa_metrics | An easy python package to run quick basic QA evaluations. This package... | Emerging |
| 14 | IAAR-Shanghai/xFinder | [ICLR 2025] xFinder: Large Language Models as Automated Evaluators for... | Emerging |
| 15 | fiddler-labs/fiddler-auditor | Fiddler Auditor is a tool to evaluate language models. | Emerging |
| 16 | evo-eval/evoeval | EvoEval: Evolving Coding Benchmarks via LLM | Emerging |
| 17 | huggingface/evaluation-guidebook | Sharing both practical insights and theoretical knowledge about LLM... | Emerging |
| 18 | InternScience/SciEvalKit | A unified evaluation toolkit and leaderboard for rigorously assessing the... | Emerging |
| 19 | lean-dojo/ReProver | Retrieval-Augmented Theorem Provers for Lean | Emerging |
| 20 | kieranklaassen/leva | LLM Evaluation Framework for Rails apps to be used with production data. | Emerging |
| 21 | mlchrzan/pairadigm | Concept-Guided Chain-of-Thought (CGCoT) pairwise annotation tool for... | Emerging |
| 22 | SeekingDream/Static-to-Dynamic-LLMEval | The official GitHub repository of the paper "Recent advances in large... | Emerging |
| 23 | ShuntaroOkuma/adapt-gauge-core | Measure LLM adaptation efficiency — how fast models learn from few examples | Emerging |
| 24 | bowen-upenn/PersonaMem | [COLM 2025] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User... | Emerging |
| 25 | prometheus-eval/prometheus-eval | Evaluate your LLM's response with Prometheus and GPT4 💯 | Emerging |
| 26 | IS2Lab/S-Eval | S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large... | Emerging |
| 27 | ai-twinkle/Eval | Twinkle Eval: an efficient and accurate AI evaluation tool | Emerging |
| 28 | alopatenko/LLMEvaluation | A comprehensive guide to LLM evaluation methods designed to assist in... | Emerging |
| 29 | flexpa/llm-fhir-eval | Benchmarking Large Language Models for FHIR | Emerging |
| 30 | ai4society/GenAIResultsComparator | A Python library providing evaluation metrics to compare generated texts... | Emerging |
| 31 | multinear/multinear | Develop reliable AI apps | Emerging |
| 32 | HiThink-Research/GAGE | General AI evaluation and Gauge Engine. A unified evaluation engine for... | Emerging |
| 33 | OpenDCAI/One-Eval | Automated system for LLM evaluation via agents. | Emerging |
| 34 | FastEval/FastEval | Fast & more realistic evaluation of chat language models. Includes leaderboard. | Emerging |
| 35 | langwatch/langevals | LangEvals aggregates various language model evaluators into a single... | Emerging |
| 36 | VikhrModels/ru_llm_arena | Modified Arena-Hard-Auto LLM evaluation toolkit with an emphasis on Russian language | Emerging |
| 37 | namin/llm-verified-with-monte-carlo-tree-search | LLM verified with Monte Carlo Tree Search | Emerging |
| 38 | root-signals/scorable-sdk | Scorable SDK | Emerging |
| 39 | IAAR-Shanghai/UHGEval | [ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks:... | Emerging |
| 40 | mims-harvard/Qworld | Qworld: Question-Specific Evaluation Criteria for LLMs | Emerging |
| 41 | RGGH/evaluate | Evaluate - The Robust LLM Testing Framework 🦀 | Emerging |
| 42 | lmarena/search-arena | ⚔️ [ICLR 2026] Official code of "Search Arena: Analyzing Search-Augmented LLMs". | Emerging |
| 43 | wgryc/phasellm | Large language model evaluation and workflow framework from Phase AI. | Emerging |
| 44 | superagent-ai/poker-eval | A comprehensive tool for assessing AI Agents performance in simulated poker... | Emerging |
| 45 | terryyz/ice-score | [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code | Emerging |
| 46 | pyladiesams/eval-llm-based-apps-jan2025 | Create an evaluation framework for your LLM based app. Incorporate it into... | Emerging |
| 47 | MLGroupJLU/LLM-eval-survey | The official GitHub page for the survey paper "A Survey on Evaluation of... | Emerging |
| 48 | franckalbinet/evaluatr | Streamline policy evaluation workflows with AI-driven analysis and... | Emerging |
| 49 | sileod/llm-theory-of-mind | Testing Theory of Mind (ToM) in language models with epistemic logic | Experimental |
| 50 | gordicaleksa/serbian-llm-eval | Serbian LLM Eval. | Experimental |
| 51 | ZeroSumEval/ZeroSumEval | A framework for pitting LLMs against each other in an evolving library of games ⚔ | Experimental |
| 52 | Cohere-Labs/multilingual-llm-evaluation-checklist | mLLM evaluation checklist | Experimental |
| 53 | CS-EVAL/CS-Eval | CS-Eval is a comprehensive evaluation suite for fundamental cybersecurity... | Experimental |
| 54 | MisterBrookT/Scorpio | SCORPIO is a system-algorithm co-designed LLM serving engine that... | Experimental |
| 55 | PeytonCleveland/Darwin | Implementation of prompt evolution based on Evol-Instruct | Experimental |
| 56 | IAAR-Shanghai/GuessArena | [ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for... | Experimental |
| 57 | Re-Align/just-eval | A simple GPT-based evaluation tool for multi-aspect, interpretable... | Experimental |
| 58 | zorse-project/COBOLEval | Evaluate LLM-generated COBOL | Experimental |
| 59 | Contextualist/lone-arena | Self-hosted LLM chatbot arena, with yourself as the only judge | Experimental |
| 60 | sinanuozdemir/oreilly-evaluating-llms | Metrics, Benchmarks, and Practical Tools for Assessing Large Language Models | Experimental |
| 61 | AMDResearch/NPUEval | NPUEval is an LLM evaluation dataset written specifically to target AIE... | Experimental |
| 62 | GURPREETKAURJETHRA/LLMs-Evaluation | LLMs Evaluation | Experimental |
| 63 | epam/ai-dial-rag-eval | A python library designed for RAG (Retrieval-Augmented Generation)... | Experimental |
| 64 | Azure-Samples/llm-eval-grader-samples | Framework for Post-production Evaluation of LLM based ChatBots | Experimental |
| 65 | mankinds/mankinds-eval | Open-source Python library for evaluating AI systems | Experimental |
| 66 | mags0ft/hle-eval-ollama | An easy-to-use evaluation tool for running Humanity's Last Exam on (locally)... | Experimental |
| 67 | claw-eval/claw-eval | Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks... | Experimental |
| 68 | ElevenLiy/MATEval | MATEval is the first multi-agent framework simulating human collaborative... | Experimental |
| 69 | mit-ll-ai-technology/llm-sandbox | Large language model evaluation framework for logic and open-ended Q&A with... | Experimental |
| 70 | GAI-Community/GraphOmni | Enable Comprehensive LLM Evaluation on Graph Reasoning | Experimental |
| 71 | vienneraphael/layton-eval | layton-eval is an AI eval benchmark for divergent, out-of-the-box and... | Experimental |
| 72 | allenai/CommonGen-Eval | Evaluating LLMs with CommonGen-Lite | Experimental |
| 73 | kaistAI/FLASK | [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on... | Experimental |
| 74 | telekom/llm_evaluation_results | LLM evaluation results | Experimental |
| 75 | aws-samples/model-as-a-judge-eval | Notebooks for evaluating LLM based applications using the Model (LLM) as a... | Experimental |
| 76 | Ryota-Kawamura/Evaluating-and-Debugging-Generative-AI | Machine learning and AI projects require managing diverse data sources, vast... | Experimental |
| 77 | Goodeye-Labs/truesight-docs | Official documentation for Truesight — an AI evaluation platform for scoring... | Experimental |
| 78 | evalkit/evalkit | The TypeScript LLM Evaluation Library | Experimental |
| 79 | Aysnc-Labs/llm-eval | A PHP package for evaluating LLM outputs. Test your prompts, validate... | Experimental |
| 80 | jacobkandel/llm-content-moderation-analysis | Open-Source benchmark tracking LLM censorship and content moderation bias... | Experimental |
| 81 | prorok9898/ERR-EVAL | 🔍 Evaluate AI models' ability to detect ambiguity and manage uncertainty... | Experimental |
| 82 | Humanity-s-Last-Code-Exam/HLCE | (EMNLP 2025 Findings) Source Evaluation scripts for Humanity's Last Code Exam | Experimental |
| 83 | hitz-zentroa/latxa | Latxa: An Open Language Model and Evaluation Suite for Basque | Experimental |
| 84 | IngestAI/deepmark | Deepmark AI enables a unique testing environment for language models (LLM)... | Experimental |
| 85 | McTosh1/modal-llm-evaluator | ⚡ Evaluate LLM prompts at scale with fast, parallel execution, real-time... | Experimental |
| 86 | AntGamerMD21/eval-guide | 📊 Explore ML evaluation metrics through interactive notebooks with pre-run... | Experimental |
| 87 | psandhaas/evaLLM | QA framework for evaluating LLM outputs based on user-defined metrics | Experimental |
| 88 | hnshah/verdict | LLM eval framework. Compare any model via OpenAI-compatible API. | Experimental |
| 89 | broomva/nous | Metacognitive evaluation — real-time quality scoring with inline heuristics... | Experimental |
| 90 | wahhyun/llm-eval | Evaluate large language models with tools for performance and consistency... | Experimental |
| 91 | Linlichinese/rail-score | 🚀 Enable accurate assessment of AI models with the RAIL Score Python SDK,... | Experimental |
| 92 | brucewlee/nutcracker | Large Model Evaluation Experiments | Experimental |
| 93 | horde-research/horde-common | Shared scripts for offline Kazakh LLM eval—run inference, auto-score, and... | Experimental |
| 94 | deshwalmahesh/PHUDGE | Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your... | Experimental |
| 95 | linhaowei1/kumo | ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models | Experimental |
| 96 | franckalbinet/iomeval | Streamline evaluation evidence mapping at scale with LLMs | Experimental |
| 97 | hparreao/Awesome-AI-Evaluation-Guide | A comprehensive, implementation-focused guide to evaluating Large Language... | Experimental |
| 98 | vjroy/routeeval | RouteEval: A benchmark for evaluating LLM tool calling in running route... | Experimental |
| 99 | spenceryonce/LLMeval | Evaluate and compare large language models (LLMs) for chatbot applications,... | Experimental |
| 100 | lechmazur/sycophancy | LLM benchmark and leaderboard for narrator-bias sycophancy,... | Experimental |
| 101 | AkhileshMalthi/llm-eval-framework | A production-grade framework for evaluating Large Language Model (LLM)... | Experimental |
| 102 | AtomEcho/AtomBulb | Aims to provide an intuitive, concrete, and standardized evaluation of current mainstream LLMs | Experimental |
| 103 | david-xander/measuring-llm-knowledge | How much does an LLM know about my programming language? | Experimental |
| 104 | framersai/promptmachine-eval | LLM evaluation framework with ELO ratings, arena battles, and benchmark testing | Experimental |
| 105 | LeonEricsson/llmjudge | Exploring limitations of LLM-as-a-judge | Experimental |
| 106 | Vibhanshu-555/Human-Aligned-LLM-Evaluation-Audit | A data-driven audit of AI judge reliability using MT-Bench human... | Experimental |
| 107 | OleksandrZadvornyi/prompt-engineering | An automated evaluation framework for assessing the credibility of... | Experimental |
| 108 | BhuvanDontha/YouTube-policy-enforcement-auditor | Independent YouTube evaluation framework for content policy classification.... | Experimental |
| 109 | jaaack-wang/multi-problem-eval-llm | Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing... | Experimental |
| 110 | djador13/moderatefocus | 🔍 Analyze community moderation and platform policies with the ModerateFocus... | Experimental |
| 111 | sanand0/llmmath | How good are LLMs at mental math? An evaluation across 50 models from... | Experimental |
| 112 | CSLiJT/awesome-lm-evaluation-methodologies | Frontier papers in the evaluation methodologies of language models. | Experimental |
| 113 | Theepankumargandhi/llm-annotation-quality-pipeline | Production-grade pipeline for validating annotation consistency and... | Experimental |
| 114 | serhiismetanskyi/llm-output-evaluation-with-deepeval | DeepEval LLM quality evaluation tests with LLM-as-a-judge | Experimental |
| 115 | MukundaKatta/redpill | The Red Pill Test — Can LLMs recognize the boundaries of their own reality?... | Experimental |
| 116 | nicolay-r/RuSentRel-Leaderboard | This is an official Leaderboard for the RuSentRel-1.1 dataset originally... | Experimental |
| 117 | vakyansh/truthfulqa_indic | Truthfulqa_indic, Available in Hindi, Punjabi, Kannada, Tamil and Telugu | Experimental |
| 118 | giuliano-t/llm-financial-regulatory-auditor | A structured evaluation pipeline for LLM-generated outputs in financial... | Experimental |
| 119 | crux82/wikigame-llm-eval | Companion repo for CLiC-it 2025 paper on WikiGame. Reproducible pipeline to... | Experimental |
| 120 | Yifan-Song793/GoodBadGreedy | The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore... | Experimental |
| 121 | dustalov/llmfao | Large Language Model Feedback Analysis and Optimization (LLMFAO) | Experimental |
| 122 | JinjieNi/MixEval-X | The official github repo for MixEval-X, the first any-to-any, real-world benchmark. | Experimental |
| 123 | grgong/agent-exam-model-eval | Agent exam built from Posit’s model-eval R LLM benchmark (baseline snapshot... | Experimental |
| 124 | 2pa4ul2/Easygen-v2 | Exam Generation With Large Language Model (LLMs) | Experimental |
| 125 | The-Learning-Algorithm/ai-judge-pipeline | A comprehensive pipeline for generating, analyzing, and evaluating models... | Experimental |
| 126 | DavidShableski/llm-evaluation-framework | A production-grade platform to evaluate and compare the performance of Large... | Experimental |
| 127 | arjunpatel7/alakazam-vgc | An LLM powered speed check assistant for Pokemon VGC Players | Experimental |
| 128 | user1342/conjecture | Evaluating the likelihood of data points in a LLM's training set | Experimental |
| 129 | krisstallenberg/evaluating-annotations | This repository holds code to annotate textual data using LLMs, and... | Experimental |
| 130 | SouravD-Me/LLM-Evaluation-Dashboard | A Visual Dashboard for Fundamental Benchmarking of LLMs | Experimental |
| 131 | prabdeb/agenteval-sample | AgentEval (AutoGen 0.4) Sample Implementation | Experimental |
| 132 | AYUSH27112021/GENERATIVE-IMAGE-COMPARISION | Different Evaluation Metrics for Image Generation Models | Experimental |
| 133 | franciellevargas/MFTCXplain | MFTCXplain is the first multilingual benchmark dataset designed to evaluate... | Experimental |