LLM Comparison & Evaluation Tools
Tools for comparing LLM outputs, benchmarking performance across multiple models, and evaluating LLM quality on specific tasks. Does NOT include general LLM evaluation frameworks, prompt engineering resources, or single-model testing tools.
There are 96 LLM comparison and evaluation tools tracked. One scores above 70 (verified tier). The highest-rated is open-compass/opencompass at 76/100 with 6,752 stars. One of the top 10 is actively maintained.
Get the projects as JSON (the `limit` parameter caps the number returned per request):

```sh
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-comparison-evaluation&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
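Once fetched, the payload is ordinary JSON and easy to slice by tier. A minimal sketch follows; note that the `projects` field name, the lowercase tier strings, and the scores for the non-top entries are assumptions for illustration, not guarantees about the actual response schema.

```python
import json

# Hypothetical response shape -- the real endpoint's schema may differ.
sample = json.loads("""
{
  "projects": [
    {"name": "open-compass/opencompass", "score": 76, "tier": "verified"},
    {"name": "IBM/unitxt", "score": 62, "tier": "established"},
    {"name": "v7labs/benchllm", "score": 41, "tier": "emerging"}
  ]
}
""")

def by_tier(payload: dict, tier: str) -> list[str]:
    """Return project names in the given tier, highest score first."""
    rows = [p for p in payload["projects"] if p["tier"] == tier]
    rows.sort(key=lambda p: p["score"], reverse=True)
    return [p["name"] for p in rows]

print(by_tier(sample, "verified"))  # ['open-compass/opencompass']
```

Swapping `json.loads` for a `requests.get(...).json()` call against the endpoint above would give the live data, subject to the daily rate limit.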
| # | Tool | Description | Tier |
|---|---|---|---|
| 1 | open-compass/opencompass | OpenCompass is an LLM evaluation platform, supporting a wide range of models... | Verified |
| 2 | IBM/unitxt | 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI... | Established |
| 3 | lean-dojo/LeanDojo | Tool for data extraction and interacting with Lean programmatically. | Established |
| 4 | GoodStartLabs/AI_Diplomacy | Frontier models playing the board game Diplomacy. | Emerging |
| 5 | salesforce/CodeT5 | Home of CodeT5: open code LLMs for code understanding and generation | Emerging |
| 6 | MigoXLab/LMeterX | A general-purpose API load testing platform that supports LLM services and... | Emerging |
| 7 | namin/dafny-sketcher | Piggybacks on the Dafny language implementation to explore interactive... | Emerging |
| 8 | google/litmus | Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI... | Emerging |
| 9 | v7labs/benchllm | Continuous integration for LLM-powered applications | Emerging |
| 10 | NatLabRockies/COMPASS | INFRA-COMPASS is a tool that leverages Large Language Models (LLMs) to... | Emerging |
| 11 | JonathanChavezTamales/llm-leaderboard | A comprehensive set of LLM benchmark scores and provider prices... | Emerging |
| 12 | 599yongyang/DatasetLoom | An intelligent dataset construction and evaluation platform for multimodal large-model training | Emerging |
| 13 | rpjayaraman/RTL2UVM | Automated UVM testbench generator from Verilog RTL with optional LLM... | Emerging |
| 14 | NikolasEnt/ollama-webui-intel | Ollama with Intel (i)GPU acceleration in Docker, with benchmarks | Emerging |
| 15 | Vvkmnn/awesome-ai-eval | ☑️ A curated list of tools, methods & platforms for evaluating AI... | Emerging |
| 16 | lean-dojo/LeanDojoWebsite | Code for LeanDojo's website | Emerging |
| 17 | artas728/spelltest | AI-to-AI testing: a simulation framework for LLM-based applications | Emerging |
| 18 | NOVADEDOG/energy-leaderboard-runner | Open-source energy benchmark for local LLMs. Measures Wh and CO2 using real... | Emerging |
| 19 | LudwigStumpp/llm-leaderboard | A joint community effort to create one central leaderboard for LLMs. | Emerging |
| 20 | vertbera/beyond-the-mirror | Field research exposing how LLM safeguards collapse under polite, persistent... | Emerging |
| 21 | Supahands/llm-comparison-backend | An open-source project allowing you to compare two LLMs head-to-head... | Emerging |
| 22 | sealambda/unit-text | Unit tests for plain text: LLM as a copy editor | Emerging |
| 23 | flashclub/ModelJudge | A multilingual AI model evaluation platform built with Next.js, supporting multi-model comparison and real-time streaming responses. | Emerging |
| 24 | empirical-run/empirical | Test and evaluate LLMs and model configurations, across all the scenarios... | Emerging |
| 25 | nexmoe/lm-speed | Helps developers optimize AI application performance through comprehensive... | Emerging |
| 26 | dmeldrum6/LLM-Diff-Tool | Application for comparing responses from different Large Language Models... | Experimental |
| 27 | jordicor/GranSabio_LLM | Multi-layer AI quality assurance for content generation. Multiple LLMs... | Experimental |
| 28 | LAVA-LAB/COOL-MC | The interface between probabilistic model checking and data-driven policy learning. | Experimental |
| 29 | jpreagan/llmnop | A tool for measuring LLM performance metrics. | Experimental |
| 30 | Skripkon/llm_trainer | 🤖 Train and evaluate LLMs with ease and fun 🦾 | Experimental |
| 31 | yinxulai/ait | Batch-tests performance metrics of AI models compatible with the OpenAI and Anthropic protocols. Supports... | Experimental |
| 32 | amirdeljouyi/UTGen | Replication package of the ICSE 2025 paper "Leveraging Large Language... | Experimental |
| 33 | geminimir/promptproof-action | Deterministic LLM contract checks for CI. Replays recorded fixtures,... | Experimental |
| 34 | ccarvalho-eng/aludel | LLM evaluation workbench | Experimental |
| 35 | UBC-MDS/fixml | LLM tool for effective test evaluation of ML projects with curated... | Experimental |
| 36 | stashlabs/duelr | Compare LLMs in one click | Experimental |
| 37 | jonathanmli/Avalon-LLM | An LLM benchmark for the social deduction game... | Experimental |
| 38 | georgeguimaraes/alike | Semantic similarity testing for Elixir: test LLM outputs, chatbots, and NLP in Elixir | Experimental |
| 39 | shmercer/pairwiseLLM | R package: pairwise comparison tools for LLM-based writing evaluation | Experimental |
| 40 | lmg-anon/rp-test-framework | LLM roleplay test framework | Experimental |
| 41 | dsdanielpark/open-llm-leaderboard-report | Weekly visualization report of Open LLM model performance based on 4 metrics. | Experimental |
| 42 | hongping-zh/ecocompute-ai | 🔋 RTX 5090 energy benchmark suite for LLMs: real NVML power data, not estimates | Experimental |
| 43 | albertdobmeyer/cobol-legacy-ledger | Learn COBOL through a live banking system: 18 programs, 6-node settlement... | Experimental |
| 44 | Supahands/llm-comparison | An open-source project allowing you to compare two LLMs head-to-head... | Experimental |
| 45 | wafer-ai/chipbenchmark | A platform for monitoring the chip situation | Experimental |
| 46 | INPVLSA/probefish | A web-based LLM prompt and endpoint testing platform. Organize, version,... | Experimental |
| 47 | kalilurrahman/QualityEngineeringBookByLLMs | Quality engineering book authored with LLM assistance, exploring modern QE... | Experimental |
| 48 | AGBAJEMUH/Awesome-AI-Evaluation-Guide | 🤖 Evaluate AI systems effectively with our comprehensive guide to methods,... | Experimental |
| 49 | ellmos-ai/ellmos-tests | Testing framework for LLM operating systems (B/O/E test methodology) | Experimental |
| 50 | piyushgupta344/llm-test-harness | Deterministic testing framework for LLM-powered apps with record/replay... | Experimental |
| 51 | kishan5111/perfsmith | Tool to find the cheapest self-hosted serving configuration that meets your SLO. | Experimental |
| 52 | heyqule/evangelion_magi | Evangelion MAGI decision system that links three LLMs. | Experimental |
| 53 | augustocristian/llm-testing-roadmap-rp | Replication package of the article "A Research Roadmap on the Usage of... | Experimental |
| 54 | Templum/aoide | A TypeScript testing framework for LLM-powered applications. Write tests... | Experimental |
| 55 | Yuyz0112/relia | Find the best LLM for your needs through E2E testing | Experimental |
| 56 | ArslanKAS/Quality-and-Safety-for-LLM-Applications | Explore new metrics and best practices to monitor your LLM systems and... | Experimental |
| 57 | josephpaulgiroux/ai_categories | Lets AI language models compete in a game of AI Categories (similar to... | Experimental |
| 58 | adilanwar2399/ESBMC-ibmc | The ESBMC ibmc (Invariant-Based Model Checking) tool. | Experimental |
| 59 | tianzhaotju/EMD | Replication package for "Large Language Models for Equivalent Mutant... | Experimental |
| 60 | LeonYang95/LLM4UT | Evaluation code for the ASE 2024 paper "On the Evaluation of LLM in Unit... | Experimental |
| 61 | brains-on-code/IterativeRefactoringLLM | Replication package, supplementary materials, and analysis pipeline for our... | Experimental |
| 62 | ksm26/Automated-Testing-for-LLMOps | Create a continuous integration (CI) workflow for testing LLM applications... | Experimental |
| 63 | sanand0/hypoforge | Use LLMs to analyze any dataset, create hypotheses from it, and test the... | Experimental |
| 64 | dessertlab/Human_vs_AI_Code_Quality | Allows replication of our study "Human-Written vs.... | Experimental |
| 65 | AstraBert/DebateLLM-Championship | 5 LLMs in 1v1 matches to produce the most convincing argumentation in favor... | Experimental |
| 66 | mich1803/Codenames-LLM | Building an AI team to play Codenames using top Large Language Models... | Experimental |
| 67 | broskees/llm-compare | LLM benchmark comparison tool | Experimental |
| 68 | ruankie/langfuse-monitoring-eval | Monitoring and evaluating LLM apps with Langfuse. Presented at PyConZA 2024. | Experimental |
| 69 | Amir-Mohseni/AI-Response-Evaluation | A comprehensive framework to evaluate the quality of AI-generated responses,... | Experimental |
| 70 | KooshaPari/kwality | 🧠 LLM validation platform: advanced testing frameworks with DeepEval,... | Experimental |
| 71 | RodillasJavier/debate-fallacy-detector | Logical fallacy detection in presidential debates using a Random Forest... | Experimental |
| 72 | ml-energy/leaderboard | How much time and energy do modern generative AI models consume? | Experimental |
| 73 | rololevy/debate-IA-politica-argentina | A debate between two fine-tuned LLMs | Experimental |
| 74 | mpuodziukas-labs/cobol-demo | COBOL modernization: LLMs introduce bugs, humans validate. Production-grade... | Experimental |
| 75 | RedKnight-aj/ai-testing-framework | AI testing framework using DeepEval: quality assurance for LLM applications | Experimental |
| 76 | agent-sh/perf | Rigorous performance-investigation workflow with baselines, profiling, and... | Experimental |
| 77 | AI4InclusiveDeliberation/inclusive_deliberation_llm | Empowering inclusive e-deliberation by harnessing collective wisdom and... | Experimental |
| 78 | seeshuraj/llm-test-lab | 🧪 Evaluate, score, and compare LLM outputs before your users do. Automated... | Experimental |
| 79 | Maik425/promptdiff | Compare LLM outputs across models in one API call. Supports Claude, GPT, Gemini, Grok. | Experimental |
| 80 | JosephTLucas/llm_test | A suite of tests to verify bias, safety, trust, and security concerns for LLMs. | Experimental |
| 81 | athina-ai/athina-sdk | LLM testing SDK that helps you write and run tests to monitor your LLM app... | Experimental |
| 82 | aiqualitylab/llm-qa-assistant | Compare and validate QA tasks using 3 local (Ollama) or cloud (Groq API)... | Experimental |
| 83 | waldekmastykarz/openai-compare | Compare the effectiveness of LLMs using OpenAI-compatible APIs | Experimental |
| 84 | chiragpadyal/AutoTestGen | Automatic unit test generation suite using an LLM as a Visual Studio... | Experimental |
| 85 | Strawhat404/wb77i-optimizing-high-throughput-chat-message-aggregation | A sample dataset for AI training to showcase LLM benchmarking of... | Experimental |
| 86 | danpozmanter/llm-comparative-eval | Compare how LLMs stack up | Experimental |
| 87 | giis-uniovi/retorch-llm-rp | Replication package for LLM system-testing experimentation | Experimental |
| 88 | ceccon-t/LicLacMoe | Play tic-tac-toe against a local LLM. | Experimental |
| 89 | SevdanurGENC/LLM-Based-Unit-Test-Generator | Automated unit test generation and evaluation using generative AI (GPT-4) | Experimental |
| 90 | croko22/opsg-unit-test-generation | OPSG-based test refinement for Java: stable RL approach to generate... | Experimental |
| 91 | Trust4AI/MUSE | AI-driven metamorphic testing inputs generator | Experimental |
| 92 | colingalbraith/Accoutre | Accoutre aims to equip SLMs with tools and measure the gains: a zero-build... | Experimental |
| 93 | Jeeban420/python-api-frameworks-benchmark | 🚀 Benchmark five Python web frameworks under realistic workloads with Docker... | Experimental |
| 94 | sohambpatel/TestBedGenerator | Creating test beds with the help of ChatGPT, the in-house LLM Ollama, and... | Experimental |
| 95 | thabit-ai/thabit | Thabit is a platform to evaluate prompts on multiple LLMs to determine the... | Experimental |
| 96 | ash-jyc/db84llm | College policy debate as a verbal reasoning benchmark for LLMs | Experimental |