LLM Evaluation & Benchmarking (ML Frameworks)

Frameworks, platforms, and benchmarks for systematically evaluating and comparing LLM performance across metrics like accuracy, safety, reliability, and cost. Does NOT include general LLM applications, deployment tools, or inference optimization.

There are 66 LLM evaluation and benchmarking frameworks tracked. One scores above 70 (the Verified tier). The highest-rated is Cloud-CV/EvalAI at 75/100, with 2,013 stars and 538 monthly downloads. Only 1 of the top 10 is actively maintained.
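Tier labels track the 0-100 score. Only the Verified cutoff (above 70) is stated explicitly; the other thresholds in the sketch below are assumptions read off the scores in the table, not documented rules:

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Only the Verified cutoff (>70) is documented; the remaining
    thresholds are assumptions inferred from the scores in the table.
    """
    if score > 70:
        return "Verified"
    if score >= 60:  # assumed: Established rows score 60-61
        return "Established"
    if score >= 30:  # assumed: Emerging rows score 30-46
        return "Emerging"
    return "Experimental"  # rows at 29 and below
```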

Get all 66 projects as JSON (the example below requests 20; raise the limit parameter to fetch the full list):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=llm-evaluation-benchmarking&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
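The same endpoint is easy to consume from a script. A minimal sketch in Python, assuming the response is a JSON array whose items carry name, score, and tier fields (the field names are inferred from the table below, not a documented schema):

```python
# Minimal sketch: fetch the ranked project list from the dataset API.
# Assumption: the response is a JSON array of objects with "name",
# "score", and "tier" keys; verify against the real payload first.
import requests

URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "ml-frameworks",
    "subcategory": "llm-evaluation-benchmarking",
    "limit": 66,  # assumption: a higher limit returns all 66 projects
}

resp = requests.get(URL, params=params, timeout=30)
resp.raise_for_status()
projects = resp.json()

for p in projects:
    print(f"{p['name']}: {p['score']} ({p['tier']})")
```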

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | Cloud-CV/EvalAI | :cloud: :rocket: :bar_chart: :chart_with_upwards_trend: Evaluating state of... | 75 | Verified |
| 2 | fireindark707/Python-Schema-Matching | A python tool using XGboost and sentence-transformers to perform schema... | 61 | Established |
| 3 | graphbookai/graphbook | Visual AI development framework for training and inference of ML models,... | 60 | Established |
| 4 | RAILethicsHub/rail-score | Python SDK | 46 | Emerging |
| 5 | Alir3z4/tb-query | A CLI tool and MCP (Model Context Protocol) server for querying and... | 46 | Emerging |
| 6 | visual-layer/fastdup | fastdup is a powerful, free tool designed to rapidly generate valuable... | 44 | Emerging |
| 7 | josh-ashkinaze/plurals | Plurals: A System for Guiding LLMs Via Simulated Social Ensembles | 43 | Emerging |
| 8 | github/CodeSearchNet | Datasets, tools, and benchmarks for representation learning of code. | 42 | Emerging |
| 9 | tthtlc/awesome-source-analysis | Source code understanding via Machine Learning techniques | 40 | Emerging |
| 10 | greynewell/evaldriven.org | Ship evals before you ship features. | 40 | Emerging |
| 11 | Xenios91/Glyph | An architecture independent binary analysis tool for fingerprinting... | 39 | Emerging |
| 12 | paceval/paceval | paceval is a high-performance mathematical runtime for deterministic AI and... | 39 | Emerging |
| 13 | RoboticsData/score_lerobot_episodes | A lightweight toolkit for quantitatively scoring LeRobot episodes. | 39 | Emerging |
| 14 | emredeveloper/Mem-LLM | Mem-LLM is a Python library for building memory-enabled AI assistants that... | 38 | Emerging |
| 15 | kanchengw/cnllm | A unified adapter library for Chinese LLMs that wraps mainstream Chinese model APIs in the OpenAI format, working seamlessly with openai, langchain, and most other OpenAI-compatible Python libraries | 38 | Emerging |
| 16 | ManasVardhan/bench-my-llm | 🏎️ Dead-simple LLM benchmarking CLI - latency, cost, and quality metrics | 36 | Emerging |
| 17 | Striveworks/valor | Valor is a lightweight, numpy-based library designed for fast and seamless... | 36 | Emerging |
| 18 | Fir121/llm-classifier | Structured LLM based classification, clustering and extraction framework... | 35 | Emerging |
| 19 | lpalbou/AbstractLLM | A unified interface for Large Language Models with memory, reasoning, and... | 31 | Emerging |
| 20 | khoj-ai/llm-coup | Let LLMs play coup with each other and see who's the best at deception & strategy | 30 | Emerging |
| 21 | AIT-Protocol/einstein-ait-prod | Supercharge Bittensor Ecosystem with Advanced Mathematical and Logical AI | 29 | Experimental |
| 22 | GustyCube/ERR-EVAL | Benchmark for evaluating AI epistemic reliability - testing how well LLMs... | 28 | Experimental |
| 23 | lof310/arch_eval | arch_eval is a high-level library for efficient architecture evaluation of... | 25 | Experimental |
| 24 | lac-dcc/yali | A framework to analyze a space formed by the combination of program... | 24 | Experimental |
| 25 | ApextheBoss/canary | 🐤 Know when your LLM provider silently degrades. Automated quality testing... | 23 | Experimental |
| 26 | ztsalexey/epoch-bench | EPOCH: Evaluating Progress Origins in Causal History — LLM benchmark for... | 23 | Experimental |
| 27 | theMethodolojeeOrg/SkynetBench | A rigorous methodology for detecting authority pressure's effect on AI... | 23 | Experimental |
| 28 | metriccoders/ml-models | This is the Metric Coders Model Hub that contains the fastest growing tiny... | 23 | Experimental |
| 29 | jubaedemon/LBBS-Standard | 💰 Establish a standard for LLM billing and benchmarking to enable fair... | 22 | Experimental |
| 30 | gmelli/llm-connectivity | Unified Python interface for multiple Large Language Model providers.... | 22 | Experimental |
| 31 | zenprocess/pawbench | PawBench - 4-dimensional LLM inference benchmark. Multi-turn, multi-agent,... | 22 | Experimental |
| 32 | MukundaKatta/ModelMux | ModelMux — Multi-Model Router. Intelligent multi-model routing and fallback... | 22 | Experimental |
| 33 | MukundaKatta/CacheLLM | Semantic caching for LLM responses — n-gram similarity matching, SQLite... | 22 | Experimental |
| 34 | oolong-tea-2026/arena-ai-leaderboards | 📊 Daily auto-updated snapshots of all Arena AI (LMSYS Chatbot Arena)... | 22 | Experimental |
| 35 | adrianlol7/evaldriven.org | Define, measure, and enforce code correctness with Eval-Driven Development,... | 22 | Experimental |
| 36 | alextra-lab/slm_server | Unified LLM server with nginx reverse proxy and intelligent routing based on model ID | 22 | Experimental |
| 37 | Vatshayan/Data-Duplication-Removal-using-Machine-Learning | Final Year Project as Deletion of Duplicated data using Machine learning... | 22 | Experimental |
| 38 | WINSTON672/lin-score | The Lin (𝓛) — a fundamental unit of AI cognitive efficiency. Like miles per... | 22 | Experimental |
| 39 | gmelli/llm-judge | A robust Python library for evaluating content using Large Language Models as judges | 22 | Experimental |
| 40 | khansavaleria/likelihoodlum | Detect if a GitHub repo’s code was likely generated by an LLM using commit... | 22 | Experimental |
| 41 | MukundaKatta/LLMProxy | Unified API proxy for LLM providers — OpenAI, Anthropic with fallback... | 22 | Experimental |
| 42 | wapplewhite4/fastdedup | Fast, memory-efficient dataset deduplication for ML workloads | 21 | Experimental |
| 43 | ppashakhanloo/CodeTrek | A powerful relational representation of source code | 21 | Experimental |
| 44 | wkdhkr/dedupper | import various files, detect duplicates with sqlite, reject image file by... | 21 | Experimental |
| 45 | cafebedouin/uke | A multi-layer verification system for AI-generated analysis that exploits... | 19 | Experimental |
| 46 | cr7yash/EvalForge | LLM evaluation platform with 13+ metrics across accuracy, performance, and... | 19 | Experimental |
| 47 | semantic-parsing/semantic-parsing.github.io | Website for "A Survey of Modeling and Data resources for Semantic Parsing" | 17 | Experimental |
| 48 | MPX0222/BroadLearningSystem-APIs-1.0 | Modification for Broad Learning System, including BLS, CNN-BLS, PCA-BLS. Now... | 17 | Experimental |
| 49 | tanvirbhachu/ai-bench | A CLI benchmark runner for testing AI Models quickly. | 16 | Experimental |
| 50 | Fardeen37/Data-Duplication-Remover-ML | A powerful machine learning based tool for detecting, analyzing, and... | 16 | Experimental |
| 51 | yc-w-cn/llm-leaderboard | LLM comparison leaderboard - helps users quickly compare performance metrics, pricing, and specifications across large language models | 16 | Experimental |
| 52 | VarshVishwakarma/stackbench | STACKBENCH is a multi-agent AI research copilot that evaluates developer... | 15 | Experimental |
| 53 | KazKozDev/murmur | A Mix of Agents Orchestration System for Distributed LLM Processing | 14 | Experimental |
| 54 | abject-milkingmachine273/llm-cost-dashboard | Monitor LLM token costs in real time with a terminal dashboard offering... | 14 | Experimental |
| 55 | madalinioana/intent-qualification | Hybrid company qualification pipeline using LLM intent parsing, vector... | 14 | Experimental |
| 56 | 42olver/ai-agent-benchmark-compendium | 🛠️ Discover and explore over 50 benchmarks for AI agents across key... | 14 | Experimental |
| 57 | syifatoo2751/CC-RLM | Reduce token use by delivering targeted code context to local LLMs with a... | 14 | Experimental |
| 58 | danghoawe/gg-keeper | 🔍 Monitor your Giffgaff SIM card data usage easily with this lightweight... | 14 | Experimental |
| 59 | wheldnz/next-evals-oss | 🧩 Evaluate Next.js code quality using popular AI models with ease. Get... | 14 | Experimental |
| 60 | jerarddxb-ops/excuse-evaluation-dataset | Rubric-based evaluation dataset simulating RLHF-style AI annotation,... | 14 | Experimental |
| 61 | pzzkkj324244/Bench2Drive-Leaderboard | 🚗 Track and compare performance of all methods tested on Bench2Drive,... | 14 | Experimental |
| 62 | davidset13/intelligence_eval | This will allow any agent to use LLM evaluation benchmarks. Currently, this... | 13 | Experimental |
| 63 | Software-Engineering-Arena/SWE-Model-Arena | Compare tool-calling models pairwise via multi‑round evaluations for SE tasks. | 12 | Experimental |
| 64 | Docktorjjd/llm-evaluation-framework | Automated evaluation and testing framework for LLM applications | 11 | Experimental |
| 65 | TJ-Neary/AI-Eval-Pro | Commercial LLM evaluation service — hardware-aware benchmarking across text... | 11 | Experimental |
| 66 | redoh/llm-code-analyzer | 🔬 LLM-based static code analysis engine with semantic understanding | 11 | Experimental |
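To slice the fetched list programmatically, a small helper can group projects by tier, mirroring the breakdown above. The field names remain the same assumption as in the earlier fetch sketch:

```python
# Minimal sketch: group projects by tier, mirroring the table's
# Verified / Established / Emerging / Experimental breakdown.
# Field names ("name", "tier") are assumptions inferred from the table.
from collections import defaultdict

def group_by_tier(projects: list[dict]) -> dict[str, list[str]]:
    by_tier: dict[str, list[str]] = defaultdict(list)
    for p in projects:
        by_tier[p.get("tier", "unknown")].append(p.get("name", "?"))
    return dict(by_tier)

# Example with the top three rows of the table:
sample = [
    {"name": "Cloud-CV/EvalAI", "score": 75, "tier": "Verified"},
    {"name": "fireindark707/Python-Schema-Matching", "score": 61, "tier": "Established"},
    {"name": "graphbookai/graphbook", "score": 60, "tier": "Established"},
]
print(group_by_tier(sample))
# prints one key per tier, each mapped to its list of project names
```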
