LLM Comparison & Evaluation Tools
Tools for comparing LLM outputs, benchmarking performance across multiple models, and evaluating LLM quality on specific tasks. Does NOT include general LLM evaluation frameworks, prompt engineering resources, or single-model testing tools.
There are 96 LLM comparison and evaluation tools tracked. One scores above 70 (verified tier). The highest-rated is open-compass/opencompass at 76/100 with 6,752 stars. One of the top 10 is actively maintained.
Get the projects as JSON (the `limit` parameter caps the number returned per request):

```sh
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-comparison-evaluation&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
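Once fetched, the payload is ordinary JSON and easy to slice by tier. A minimal sketch follows; note that the `projects` field name, the lowercase tier strings, and the scores for the non-top entries are assumptions for illustration, not guarantees about the actual response schema.

```python
import json

# Hypothetical response shape -- the real endpoint's schema may differ.
sample = json.loads("""
{
  "projects": [
    {"name": "open-compass/opencompass", "score": 76, "tier": "verified"},
    {"name": "IBM/unitxt", "score": 62, "tier": "established"},
    {"name": "v7labs/benchllm", "score": 41, "tier": "emerging"}
  ]
}
""")

def by_tier(payload: dict, tier: str) -> list[str]:
    """Return project names in the given tier, highest score first."""
    rows = [p for p in payload["projects"] if p["tier"] == tier]
    rows.sort(key=lambda p: p["score"], reverse=True)
    return [p["name"] for p in rows]

print(by_tier(sample, "verified"))  # ['open-compass/opencompass']
```

Swapping `json.loads` for a `requests.get(...).json()` call against the endpoint above would give the live data, subject to the daily rate limit.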
| # | Tool | Description | Tier |
|---|---|---|---|
| 1 | open-compass/opencompass | OpenCompass is an LLM evaluation platform, supporting a wide range of models... | Verified |
| 2 | IBM/unitxt | 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI... | Established |
| 3 | lean-dojo/LeanDojo | Tool for data extraction and interacting with Lean programmatically. | Established |
| 4 | GoodStartLabs/AI_Diplomacy | Frontier models playing the board game Diplomacy. | Emerging |
| 5 | salesforce/CodeT5 | Home of CodeT5: open code LLMs for code understanding and generation | Emerging |
| 6 | MigoXLab/LMeterX | A general-purpose API load testing platform that supports LLM services and... | Emerging |
| 7 | namin/dafny-sketcher | Piggybacks on the Dafny language implementation to explore interactive... | Emerging |
| 8 | google/litmus | Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI... | Emerging |
| 9 | v7labs/benchllm | Continuous integration for LLM-powered applications | Emerging |
| 10 | NatLabRockies/COMPASS | INFRA-COMPASS is a tool that leverages Large Language Models (LLMs) to... | Emerging |
| 11 | JonathanChavezTamales/llm-leaderboard | A comprehensive set of LLM benchmark scores and provider prices... | Emerging |
| 12 | 599yongyang/DatasetLoom | An intelligent dataset construction and evaluation platform for multimodal large-model training | Emerging |
| 13 | rpjayaraman/RTL2UVM | Automated UVM testbench generator from Verilog RTL with optional LLM... | Emerging |
| 14 | NikolasEnt/ollama-webui-intel | Ollama with Intel (i)GPU acceleration in Docker, with benchmarks | Emerging |
| 15 | Vvkmnn/awesome-ai-eval | ☑️ A curated list of tools, methods & platforms for evaluating AI... | Emerging |
| 16 | lean-dojo/LeanDojoWebsite | Code for LeanDojo's website | Emerging |
| 17 | artas728/spelltest | AI-to-AI testing: a simulation framework for LLM-based applications | Emerging |
| 18 | NOVADEDOG/energy-leaderboard-runner | Open-source energy benchmark for local LLMs. Measures Wh and CO2 using real... | Emerging |
| 19 | LudwigStumpp/llm-leaderboard | A joint community effort to create one central leaderboard for LLMs. | Emerging |
| 20 | vertbera/beyond-the-mirror | Field research exposing how LLM safeguards collapse under polite, persistent... | Emerging |
| 21 | Supahands/llm-comparison-backend | An open-source project allowing you to compare two LLMs head-to-head... | Emerging |
| 22 | sealambda/unit-text | Unit tests for plain text: LLM as a copy editor | Emerging |
| 23 | flashclub/ModelJudge | A multilingual AI model evaluation platform built with Next.js, supporting multi-model comparison and real-time streaming responses. | Emerging |
| 24 | empirical-run/empirical | Test and evaluate LLMs and model configurations, across all the scenarios... | Emerging |
| 25 | nexmoe/lm-speed | Helps developers optimize AI application performance through comprehensive... | Emerging |
| 26 | dmeldrum6/LLM-Diff-Tool | Application for comparing responses from different Large Language Models... | Experimental |
| 27 | jordicor/GranSabio_LLM | Multi-layer AI quality assurance for content generation. Multiple LLMs... | Experimental |
| 28 | LAVA-LAB/COOL-MC | The interface between probabilistic model checking and data-driven policy learning. | Experimental |
| 29 | jpreagan/llmnop | A tool for measuring LLM performance metrics. | Experimental |
| 30 | Skripkon/llm_trainer | 🤖 Train and evaluate LLMs with ease and fun 🦾 | Experimental |
| 31 | yinxulai/ait | Batch-tests performance metrics of AI models compatible with the OpenAI and Anthropic protocols. Supports... | Experimental |
| 32 | amirdeljouyi/UTGen | Replication package of the ICSE 2025 paper "Leveraging Large Language... | Experimental |
| 33 | geminimir/promptproof-action | Deterministic LLM contract checks for CI. Replays recorded fixtures,... | Experimental |
| 34 | ccarvalho-eng/aludel | LLM evaluation workbench | Experimental |
| 35 | UBC-MDS/fixml | LLM tool for effective test evaluation of ML projects with curated... | Experimental |
| 36 | stashlabs/duelr | Compare LLMs in one click | Experimental |
| 37 | jonathanmli/Avalon-LLM | An LLM benchmark for the social deduction game... | Experimental |
| 38 | georgeguimaraes/alike | Semantic similarity testing for Elixir: test LLM outputs, chatbots, and NLP in Elixir | Experimental |
| 39 | shmercer/pairwiseLLM | R package: pairwise comparison tools for LLM-based writing evaluation | Experimental |
| 40 | lmg-anon/rp-test-framework | LLM roleplay test framework | Experimental |
| 41 | dsdanielpark/open-llm-leaderboard-report | Weekly visualization report of Open LLM model performance based on 4 metrics. | Experimental |
| 42 | hongping-zh/ecocompute-ai | 🔋 RTX 5090 energy benchmark suite for LLMs: real NVML power data, not estimates | Experimental |
| 43 | albertdobmeyer/cobol-legacy-ledger | Learn COBOL through a live banking system: 18 programs, 6-node settlement... | Experimental |
| 44 | Supahands/llm-comparison | An open-source project allowing you to compare two LLMs head-to-head... | Experimental |
| 45 | wafer-ai/chipbenchmark | A platform for monitoring the chip situation | Experimental |
| 46 | INPVLSA/probefish | A web-based LLM prompt and endpoint testing platform. Organize, version,... | Experimental |
| 47 | kalilurrahman/QualityEngineeringBookByLLMs | Quality engineering book authored with LLM assistance, exploring modern QE... | Experimental |
| 48 | AGBAJEMUH/Awesome-AI-Evaluation-Guide | 🤖 Evaluate AI systems effectively with our comprehensive guide to methods,... | Experimental |
| 49 | ellmos-ai/ellmos-tests | Testing framework for LLM operating systems (B/O/E test methodology) | Experimental |
| 50 | piyushgupta344/llm-test-harness | Deterministic testing framework for LLM-powered apps with record/replay... | Experimental |
| 51 | kishan5111/perfsmith | Tool to find the cheapest self-hosted serving configuration that meets your SLO. | Experimental |
| 52 | heyqule/evangelion_magi | Evangelion MAGI decision system that links three LLMs. | Experimental |
| 53 | augustocristian/llm-testing-roadmap-rp | Replication package of the article "A Research Roadmap on the Usage of... | Experimental |
| 54 | Templum/aoide | A TypeScript testing framework for LLM-powered applications. Write tests... | Experimental |
| 55 | Yuyz0112/relia | Find the best LLM for your needs through E2E testing | Experimental |
| 56 | ArslanKAS/Quality-and-Safety-for-LLM-Applications | Explore new metrics and best practices to monitor your LLM systems and... | Experimental |
| 57 | josephpaulgiroux/ai_categories | Lets AI language models compete in a game of AI Categories (similar to... | Experimental |
| 58 | adilanwar2399/ESBMC-ibmc | The ESBMC ibmc (Invariant-Based Model Checking) tool. | Experimental |
| 59 | tianzhaotju/EMD | Replication package for "Large Language Models for Equivalent Mutant... | Experimental |
| 60 | LeonYang95/LLM4UT | Evaluation code for the ASE 2024 paper "On the Evaluation of LLM in Unit... | Experimental |
| 61 | brains-on-code/IterativeRefactoringLLM | Replication package, supplementary materials, and analysis pipeline for our... | Experimental |
| 62 | ksm26/Automated-Testing-for-LLMOps | Create a continuous integration (CI) workflow for testing LLM applications... | Experimental |
| 63 | sanand0/hypoforge | Use LLMs to analyze any dataset, create hypotheses from it, and test the... | Experimental |
| 64 | dessertlab/Human_vs_AI_Code_Quality | Allows replication of our study "Human-Written vs.... | Experimental |
| 65 | AstraBert/DebateLLM-Championship | 5 LLMs in 1v1 matches to produce the most convincing argumentation in favor... | Experimental |
| 66 | mich1803/Codenames-LLM | Building an AI team to play Codenames using top Large Language Models... | Experimental |
| 67 | broskees/llm-compare | LLM benchmark comparison tool | Experimental |
| 68 | ruankie/langfuse-monitoring-eval | Monitoring and evaluating LLM apps with Langfuse. Presented at PyConZA 2024. | Experimental |
| 69 | Amir-Mohseni/AI-Response-Evaluation | A comprehensive framework to evaluate the quality of AI-generated responses,... | Experimental |
| 70 | KooshaPari/kwality | 🧠 LLM validation platform: advanced testing frameworks with DeepEval,... | Experimental |
| 71 | RodillasJavier/debate-fallacy-detector | Logical fallacy detection in presidential debates using a Random Forest... | Experimental |
| 72 | ml-energy/leaderboard | How much time and energy do modern generative AI models consume? | Experimental |
| 73 | rololevy/debate-IA-politica-argentina | A debate between two fine-tuned LLMs | Experimental |
| 74 | mpuodziukas-labs/cobol-demo | COBOL modernization: LLMs introduce bugs, humans validate. Production-grade... | Experimental |
| 75 | RedKnight-aj/ai-testing-framework | AI testing framework using DeepEval: quality assurance for LLM applications | Experimental |
| 76 | agent-sh/perf | Rigorous performance-investigation workflow with baselines, profiling, and... | Experimental |
| 77 | AI4InclusiveDeliberation/inclusive_deliberation_llm | Empowering inclusive e-deliberation by harnessing collective wisdom and... | Experimental |
| 78 | seeshuraj/llm-test-lab | 🧪 Evaluate, score, and compare LLM outputs before your users do. Automated... | Experimental |
| 79 | Maik425/promptdiff | Compare LLM outputs across models in one API call. Supports Claude, GPT, Gemini, Grok. | Experimental |
| 80 | JosephTLucas/llm_test | A suite of tests to verify bias, safety, trust, and security concerns for LLMs. | Experimental |
| 81 | athina-ai/athina-sdk | LLM testing SDK that helps you write and run tests to monitor your LLM app... | Experimental |
| 82 | aiqualitylab/llm-qa-assistant | Compare and validate QA tasks using 3 local (Ollama) or cloud (Groq API)... | Experimental |
| 83 | waldekmastykarz/openai-compare | Compare the effectiveness of LLMs using OpenAI-compatible APIs | Experimental |
| 84 | chiragpadyal/AutoTestGen | Automatic unit test generation suite using an LLM as a Visual Studio... | Experimental |
| 85 | Strawhat404/wb77i-optimizing-high-throughput-chat-message-aggregation | A sample dataset for AI training to showcase LLM benchmarking of... | Experimental |
| 86 | danpozmanter/llm-comparative-eval | Compare how LLMs stack up | Experimental |
| 87 | giis-uniovi/retorch-llm-rp | Replication package for LLM system-testing experimentation | Experimental |
| 88 | ceccon-t/LicLacMoe | Play tic-tac-toe against a local LLM. | Experimental |
| 89 | SevdanurGENC/LLM-Based-Unit-Test-Generator | Automated unit test generation and evaluation using generative AI (GPT-4) | Experimental |
| 90 | croko22/opsg-unit-test-generation | OPSG-based test refinement for Java: stable RL approach to generate... | Experimental |
| 91 | Trust4AI/MUSE | AI-driven metamorphic testing inputs generator | Experimental |
| 92 | colingalbraith/Accoutre | Accoutre aims to equip SLMs with tools and measure the gains: a zero-build... | Experimental |
| 93 | Jeeban420/python-api-frameworks-benchmark | 🚀 Benchmark five Python web frameworks under realistic workloads with Docker... | Experimental |
| 94 | sohambpatel/TestBedGenerator | Creating test beds with the help of ChatGPT, the in-house LLM Ollama, and... | Experimental |
| 95 | thabit-ai/thabit | Thabit is a platform to evaluate prompts on multiple LLMs to determine the... | Experimental |
| 96 | ash-jyc/db84llm | College policy debate as a verbal reasoning benchmark for LLMs | Experimental |