LLM Evaluation Benchmarking ML Frameworks
Frameworks, platforms, and benchmarks for systematically evaluating and comparing LLM performance across metrics like accuracy, safety, reliability, and cost. Does NOT include general LLM applications, deployment tools, or inference optimization.
There are 66 LLM evaluation and benchmarking frameworks tracked. Only one scores above 70 (the Verified tier). The highest-rated is Cloud-CV/EvalAI at 75/100, with 2,013 stars and 538 monthly downloads. Only 1 of the top 10 is actively maintained.
Get all 66 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=llm-evaluation-benchmarking&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
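The same query works from Python with no extra dependencies. The sketch below is a minimal, unofficial example: it reuses the endpoint and query parameters from the curl command above (with `limit` raised to 66, assuming the API accepts larger values) and guesses at the response field names (`projects`, `name`, `tier`), so adjust those to whatever JSON the API actually returns.

```python
import json
import urllib.request

# Same endpoint as the curl example above. Raising limit from 20 to 66 is an
# assumption that the API allows it; the field names used below ("projects",
# "name", "tier") are also assumptions about the response schema.
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=ml-frameworks&subcategory=llm-evaluation-benchmarking&limit=66"
)

with urllib.request.urlopen(URL) as resp:
    payload = json.load(resp)

# Accept either a bare list or an object wrapping the list under "projects".
projects = payload if isinstance(payload, list) else payload.get("projects", [])

# Group entries by tier and print a short summary.
by_tier = {}
for project in projects:
    by_tier.setdefault(project.get("tier", "Unknown"), []).append(project.get("name"))

for tier, names in sorted(by_tier.items()):
    print(f"{tier}: {len(names)} project(s)")
```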
| # | Framework | Description | Score | Tier |
|---|---|---|---|---|
| 1 | Cloud-CV/EvalAI | :cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of... | 75 | Verified |
| 2 | fireindark707/Python-Schema-Matching | A python tool using XGboost and sentence-transformers to perform schema... | | Established |
| 3 | graphbookai/graphbook | Visual AI development framework for training and inference of ML models,... | | Established |
| 4 | RAILethicsHub/rail-score | Python SDK | | Emerging |
| 5 | Alir3z4/tb-query | A CLI tool and MCP (Model Context Protocol) server for querying and... | | Emerging |
| 6 | visual-layer/fastdup | fastdup is a powerful, free tool designed to rapidly generate valuable... | | Emerging |
| 7 | josh-ashkinaze/plurals | Plurals: A System for Guiding LLMs Via Simulated Social Ensembles | | Emerging |
| 8 | github/CodeSearchNet | Datasets, tools, and benchmarks for representation learning of code. | | Emerging |
| 9 | tthtlc/awesome-source-analysis | Source code understanding via Machine Learning techniques | | Emerging |
| 10 | greynewell/evaldriven.org | Ship evals before you ship features. | | Emerging |
| 11 | Xenios91/Glyph | An architecture independent binary analysis tool for fingerprinting... | | Emerging |
| 12 | paceval/paceval | paceval is a high-performance mathematical runtime for deterministic AI and... | | Emerging |
| 13 | RoboticsData/score_lerobot_episodes | A lightweight toolkit for quantitatively scoring LeRobot episodes. | | Emerging |
| 14 | emredeveloper/Mem-LLM | Mem-LLM is a Python library for building memory-enabled AI assistants that... | | Emerging |
| 15 | kanchengw/cnllm | A unified adapter library for Chinese LLMs that wraps mainstream Chinese LLM API output in the OpenAI format, interoperating seamlessly with openai, langchain, and most other OpenAI-compatible Python libraries | | Emerging |
| 16 | ManasVardhan/bench-my-llm | 🏎️ Dead-simple LLM benchmarking CLI - latency, cost, and quality metrics | | Emerging |
| 17 | Striveworks/valor | Valor is a lightweight, numpy-based library designed for fast and seamless... | | Emerging |
| 18 | Fir121/llm-classifier | Structured LLM based classification, clustering and extraction framework... | | Emerging |
| 19 | lpalbou/AbstractLLM | A unified interface for Large Language Models with memory, reasoning, and... | | Emerging |
| 20 | khoj-ai/llm-coup | Let LLMs play coup with each other and see who's the best at deception & strategy | | Emerging |
| 21 | AIT-Protocol/einstein-ait-prod | Supercharge Bittensor Ecosystem with Advanced Mathematical and Logical AI | | Experimental |
| 22 | GustyCube/ERR-EVAL | Benchmark for evaluating AI epistemic reliability - testing how well LLMs... | | Experimental |
| 23 | lof310/arch_eval | arch_eval is a high-level library for efficient architecture evaluation of... | | Experimental |
| 24 | lac-dcc/yali | A framework to analyze a space formed by the combination of program... | | Experimental |
| 25 | ApextheBoss/canary | 🐤 Know when your LLM provider silently degrades. Automated quality testing... | | Experimental |
| 26 | ztsalexey/epoch-bench | EPOCH: Evaluating Progress Origins in Causal History — LLM benchmark for... | | Experimental |
| 27 | theMethodolojeeOrg/SkynetBench | A rigorous methodology for detecting authority pressure's effect on AI... | | Experimental |
| 28 | metriccoders/ml-models | This is the Metric Coders Model Hub that contains the fastest growing tiny... | | Experimental |
| 29 | jubaedemon/LBBS-Standard | 💰 Establish a standard for LLM billing and benchmarking to enable fair... | | Experimental |
| 30 | gmelli/llm-connectivity | Unified Python interface for multiple Large Language Model providers.... | | Experimental |
| 31 | zenprocess/pawbench | PawBench - 4-dimensional LLM inference benchmark. Multi-turn, multi-agent,... | | Experimental |
| 32 | MukundaKatta/ModelMux | ModelMux — Multi-Model Router. Intelligent multi-model routing and fallback... | | Experimental |
| 33 | MukundaKatta/CacheLLM | Semantic caching for LLM responses — n-gram similarity matching, SQLite... | | Experimental |
| 34 | oolong-tea-2026/arena-ai-leaderboards | 📊 Daily auto-updated snapshots of all Arena AI (LMSYS Chatbot Arena)... | | Experimental |
| 35 | adrianlol7/evaldriven.org | Define, measure, and enforce code correctness with Eval-Driven Development,... | | Experimental |
| 36 | alextra-lab/slm_server | Unified LLM server with nginx reverse proxy and intelligent routing based on model ID | | Experimental |
| 37 | Vatshayan/Data-Duplication-Removal-using-Machine-Learning | Final Year Project as Deletion of Duplicated data using Machine learning... | | Experimental |
| 38 | WINSTON672/lin-score | The Lin (𝓛) — a fundamental unit of AI cognitive efficiency. Like miles per... | | Experimental |
| 39 | gmelli/llm-judge | A robust Python library for evaluating content using Large Language Models as judges | | Experimental |
| 40 | khansavaleria/likelihoodlum | Detect if a GitHub repo's code was likely generated by an LLM using commit... | | Experimental |
| 41 | MukundaKatta/LLMProxy | Unified API proxy for LLM providers — OpenAI, Anthropic with fallback... | | Experimental |
| 42 | wapplewhite4/fastdedup | Fast, memory-efficient dataset deduplication for ML workloads | | Experimental |
| 43 | ppashakhanloo/CodeTrek | A powerful relational representation of source code | | Experimental |
| 44 | wkdhkr/dedupper | import various files, detect duplicates with sqlite, reject image file by... | | Experimental |
| 45 | cafebedouin/uke | A multi-layer verification system for AI-generated analysis that exploits... | | Experimental |
| 46 | cr7yash/EvalForge | LLM evaluation platform with 13+ metrics across accuracy, performance, and... | | Experimental |
| 47 | semantic-parsing/semantic-parsing.github.io | Website for "A Survey of Modeling and Data resources for Semantic Parsing" | | Experimental |
| 48 | MPX0222/BroadLearningSystem-APIs-1.0 | Modification for Broad Learning System, including BLS, CNN-BLS, PCA-BLS. Now... | | Experimental |
| 49 | tanvirbhachu/ai-bench | A CLI benchmark runner for testing AI Models quickly. | | Experimental |
| 50 | Fardeen37/Data-Duplication-Remover-ML | A powerful machine learning based tool for detecting, analyzing, and... | | Experimental |
| 51 | yc-w-cn/llm-leaderboard | LLM comparison leaderboard - helps users quickly compare the performance metrics, pricing, and specifications of different large language models | | Experimental |
| 52 | VarshVishwakarma/stackbench | STACKBENCH is a multi-agent AI research copilot that evaluates developer... | | Experimental |
| 53 | KazKozDev/murmur | A Mix of Agents Orchestration System for Distributed LLM Processing | | Experimental |
| 54 | abject-milkingmachine273/llm-cost-dashboard | Monitor LLM token costs in real time with a terminal dashboard offering... | | Experimental |
| 55 | madalinioana/intent-qualification | Hybrid company qualification pipeline using LLM intent parsing, vector... | | Experimental |
| 56 | 42olver/ai-agent-benchmark-compendium | 🛠️ Discover and explore over 50 benchmarks for AI agents across key... | | Experimental |
| 57 | syifatoo2751/CC-RLM | Reduce token use by delivering targeted code context to local LLMs with a... | | Experimental |
| 58 | danghoawe/gg-keeper | 🔍 Monitor your Giffgaff SIM card data usage easily with this lightweight... | | Experimental |
| 59 | wheldnz/next-evals-oss | 🧩 Evaluate Next.js code quality using popular AI models with ease. Get... | | Experimental |
| 60 | jerarddxb-ops/excuse-evaluation-dataset | Rubric-based evaluation dataset simulating RLHF-style AI annotation,... | | Experimental |
| 61 | pzzkkj324244/Bench2Drive-Leaderboard | 🚗 Track and compare performance of all methods tested on Bench2Drive,... | | Experimental |
| 62 | davidset13/intelligence_eval | This will allow any agent to use LLM evaluation benchmarks. Currently, this... | | Experimental |
| 63 | Software-Engineering-Arena/SWE-Model-Arena | Compare tool-calling models pairwise via multi-round evaluations for SE tasks. | | Experimental |
| 64 | Docktorjjd/llm-evaluation-framework | Automated evaluation and testing framework for LLM applications | | Experimental |
| 65 | TJ-Neary/AI-Eval-Pro | Commercial LLM evaluation service — hardware-aware benchmarking across text... | | Experimental |
| 66 | redoh/llm-code-analyzer | 🔬 LLM-based static code analysis engine with semantic understanding | | Experimental |