LLM Evaluation Benchmarking ML Frameworks
Frameworks, platforms, and benchmarks for systematically evaluating and comparing LLM performance across metrics like accuracy, safety, reliability, and cost. Does NOT include general LLM applications, deployment tools, or inference optimization.
There are 66 LLM evaluation and benchmarking frameworks tracked. Only one scores above 70 (the Verified tier). The highest-rated is Cloud-CV/EvalAI at 75/100, with 2,013 stars and 538 monthly downloads. Only 1 of the top 10 is actively maintained.
Get all 66 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=llm-evaluation-benchmarking&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
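The same query works from Python with no extra dependencies. The sketch below is a minimal, unofficial example: it reuses the endpoint and query parameters from the curl command above (with `limit` raised to 66, assuming the API accepts larger values) and guesses at the response field names (`projects`, `name`, `tier`), so adjust those to whatever JSON the API actually returns.

```python
import json
import urllib.request

# Same endpoint as the curl example above. Raising limit from 20 to 66 is an
# assumption that the API allows it; the field names used below ("projects",
# "name", "tier") are also assumptions about the response schema.
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=ml-frameworks&subcategory=llm-evaluation-benchmarking&limit=66"
)

with urllib.request.urlopen(URL) as resp:
    payload = json.load(resp)

# Accept either a bare list or an object wrapping the list under "projects".
projects = payload if isinstance(payload, list) else payload.get("projects", [])

# Group entries by tier and print a short summary.
by_tier = {}
for project in projects:
    by_tier.setdefault(project.get("tier", "Unknown"), []).append(project.get("name"))

for tier, names in sorted(by_tier.items()):
    print(f"{tier}: {len(names)} project(s)")
```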
| # | Framework | Description | Score | Tier |
|---|---|---|---|---|
| 1 | Cloud-CV/EvalAI | :cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of... | 75 | Verified |
| 2 | fireindark707/Python-Schema-Matching | A python tool using XGboost and sentence-transformers to perform schema... | | Established |
| 3 | graphbookai/graphbook | Visual AI development framework for training and inference of ML models,... | | Established |
| 4 | RAILethicsHub/rail-score | Python SDK | | Emerging |
| 5 | Alir3z4/tb-query | A CLI tool and MCP (Model Context Protocol) server for querying and... | | Emerging |
| 6 | visual-layer/fastdup | fastdup is a powerful, free tool designed to rapidly generate valuable... | | Emerging |
| 7 | josh-ashkinaze/plurals | Plurals: A System for Guiding LLMs Via Simulated Social Ensembles | | Emerging |
| 8 | github/CodeSearchNet | Datasets, tools, and benchmarks for representation learning of code. | | Emerging |
| 9 | tthtlc/awesome-source-analysis | Source code understanding via Machine Learning techniques | | Emerging |
| 10 | greynewell/evaldriven.org | Ship evals before you ship features. | | Emerging |
| 11 | Xenios91/Glyph | An architecture independent binary analysis tool for fingerprinting... | | Emerging |
| 12 | paceval/paceval | paceval is a high-performance mathematical runtime for deterministic AI and... | | Emerging |
| 13 | RoboticsData/score_lerobot_episodes | A lightweight toolkit for quantitatively scoring LeRobot episodes. | | Emerging |
| 14 | emredeveloper/Mem-LLM | Mem-LLM is a Python library for building memory-enabled AI assistants that... | | Emerging |
| 15 | kanchengw/cnllm | A unified adapter library for Chinese LLMs that wraps mainstream Chinese LLM API output in the OpenAI format, interoperating seamlessly with openai, langchain, and most other OpenAI-compatible Python libraries | | Emerging |
| 16 | ManasVardhan/bench-my-llm | 🏎️ Dead-simple LLM benchmarking CLI - latency, cost, and quality metrics | | Emerging |
| 17 | Striveworks/valor | Valor is a lightweight, numpy-based library designed for fast and seamless... | | Emerging |
| 18 | Fir121/llm-classifier | Structured LLM based classification, clustering and extraction framework... | | Emerging |
| 19 | lpalbou/AbstractLLM | A unified interface for Large Language Models with memory, reasoning, and... | | Emerging |
| 20 | khoj-ai/llm-coup | Let LLMs play coup with each other and see who's the best at deception & strategy | | Emerging |
| 21 | AIT-Protocol/einstein-ait-prod | Supercharge Bittensor Ecosystem with Advanced Mathematical and Logical AI | | Experimental |
| 22 | GustyCube/ERR-EVAL | Benchmark for evaluating AI epistemic reliability - testing how well LLMs... | | Experimental |
| 23 | lof310/arch_eval | arch_eval is a high-level library for efficient architecture evaluation of... | | Experimental |
| 24 | lac-dcc/yali | A framework to analyze a space formed by the combination of program... | | Experimental |
| 25 | ApextheBoss/canary | 🐤 Know when your LLM provider silently degrades. Automated quality testing... | | Experimental |
| 26 | ztsalexey/epoch-bench | EPOCH: Evaluating Progress Origins in Causal History — LLM benchmark for... | | Experimental |
| 27 | theMethodolojeeOrg/SkynetBench | A rigorous methodology for detecting authority pressure's effect on AI... | | Experimental |
| 28 | metriccoders/ml-models | This is the Metric Coders Model Hub that contains the fastest growing tiny... | | Experimental |
| 29 | jubaedemon/LBBS-Standard | 💰 Establish a standard for LLM billing and benchmarking to enable fair... | | Experimental |
| 30 | gmelli/llm-connectivity | Unified Python interface for multiple Large Language Model providers.... | | Experimental |
| 31 | zenprocess/pawbench | PawBench - 4-dimensional LLM inference benchmark. Multi-turn, multi-agent,... | | Experimental |
| 32 | MukundaKatta/ModelMux | ModelMux — Multi-Model Router. Intelligent multi-model routing and fallback... | | Experimental |
| 33 | MukundaKatta/CacheLLM | Semantic caching for LLM responses — n-gram similarity matching, SQLite... | | Experimental |
| 34 | oolong-tea-2026/arena-ai-leaderboards | 📊 Daily auto-updated snapshots of all Arena AI (LMSYS Chatbot Arena)... | | Experimental |
| 35 | adrianlol7/evaldriven.org | Define, measure, and enforce code correctness with Eval-Driven Development,... | | Experimental |
| 36 | alextra-lab/slm_server | Unified LLM server with nginx reverse proxy and intelligent routing based on model ID | | Experimental |
| 37 | Vatshayan/Data-Duplication-Removal-using-Machine-Learning | Final Year Project as Deletion of Duplicated data using Machine learning... | | Experimental |
| 38 | WINSTON672/lin-score | The Lin (𝓛) — a fundamental unit of AI cognitive efficiency. Like miles per... | | Experimental |
| 39 | gmelli/llm-judge | A robust Python library for evaluating content using Large Language Models as judges | | Experimental |
| 40 | khansavaleria/likelihoodlum | Detect if a GitHub repo's code was likely generated by an LLM using commit... | | Experimental |
| 41 | MukundaKatta/LLMProxy | Unified API proxy for LLM providers — OpenAI, Anthropic with fallback... | | Experimental |
| 42 | wapplewhite4/fastdedup | Fast, memory-efficient dataset deduplication for ML workloads | | Experimental |
| 43 | ppashakhanloo/CodeTrek | A powerful relational representation of source code | | Experimental |
| 44 | wkdhkr/dedupper | import various files, detect duplicates with sqlite, reject image file by... | | Experimental |
| 45 | cafebedouin/uke | A multi-layer verification system for AI-generated analysis that exploits... | | Experimental |
| 46 | cr7yash/EvalForge | LLM evaluation platform with 13+ metrics across accuracy, performance, and... | | Experimental |
| 47 | semantic-parsing/semantic-parsing.github.io | Website for "A Survey of Modeling and Data resources for Semantic Parsing" | | Experimental |
| 48 | MPX0222/BroadLearningSystem-APIs-1.0 | Modification for Broad Learning System, including BLS, CNN-BLS, PCA-BLS. Now... | | Experimental |
| 49 | tanvirbhachu/ai-bench | A CLI benchmark runner for testing AI Models quickly. | | Experimental |
| 50 | Fardeen37/Data-Duplication-Remover-ML | A powerful machine learning based tool for detecting, analyzing, and... | | Experimental |
| 51 | yc-w-cn/llm-leaderboard | LLM comparison leaderboard - helps users quickly compare the performance metrics, pricing, and specifications of different large language models | | Experimental |
| 52 | VarshVishwakarma/stackbench | STACKBENCH is a multi-agent AI research copilot that evaluates developer... | | Experimental |
| 53 | KazKozDev/murmur | A Mix of Agents Orchestration System for Distributed LLM Processing | | Experimental |
| 54 | abject-milkingmachine273/llm-cost-dashboard | Monitor LLM token costs in real time with a terminal dashboard offering... | | Experimental |
| 55 | madalinioana/intent-qualification | Hybrid company qualification pipeline using LLM intent parsing, vector... | | Experimental |
| 56 | 42olver/ai-agent-benchmark-compendium | 🛠️ Discover and explore over 50 benchmarks for AI agents across key... | | Experimental |
| 57 | syifatoo2751/CC-RLM | Reduce token use by delivering targeted code context to local LLMs with a... | | Experimental |
| 58 | danghoawe/gg-keeper | 🔍 Monitor your Giffgaff SIM card data usage easily with this lightweight... | | Experimental |
| 59 | wheldnz/next-evals-oss | 🧩 Evaluate Next.js code quality using popular AI models with ease. Get... | | Experimental |
| 60 | jerarddxb-ops/excuse-evaluation-dataset | Rubric-based evaluation dataset simulating RLHF-style AI annotation,... | | Experimental |
| 61 | pzzkkj324244/Bench2Drive-Leaderboard | 🚗 Track and compare performance of all methods tested on Bench2Drive,... | | Experimental |
| 62 | davidset13/intelligence_eval | This will allow any agent to use LLM evaluation benchmarks. Currently, this... | | Experimental |
| 63 | Software-Engineering-Arena/SWE-Model-Arena | Compare tool-calling models pairwise via multi-round evaluations for SE tasks. | | Experimental |
| 64 | Docktorjjd/llm-evaluation-framework | Automated evaluation and testing framework for LLM applications | | Experimental |
| 65 | TJ-Neary/AI-Eval-Pro | Commercial LLM evaluation service — hardware-aware benchmarking across text... | | Experimental |
| 66 | redoh/llm-code-analyzer | 🔬 LLM-based static code analysis engine with semantic understanding | | Experimental |