LLM Comparison & Evaluation Tools

Tools for comparing LLM outputs, benchmarking performance across multiple models, and evaluating LLM quality on specific tasks. Does NOT include general LLM evaluation frameworks, prompt engineering resources, or single-model testing tools.

There are 96 LLM comparison and evaluation tools tracked. One scores above 70 (the Verified tier). The highest-rated is open-compass/opencompass at 76/100 with 6,752 stars. One of the top 10 is actively maintained.

Get all 96 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-comparison-evaluation&limit=20"
```

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
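To script against the endpoint, a minimal Python sketch along the lines below fetches the same query and prints each project's score, tier, and name. The field names used here (`name`, `score`, `tier`) and the optional `results` wrapper are assumptions about the payload shape, not a documented schema; adjust them to match the actual JSON.

```python
# Sketch only: fetch the quality dataset and list the returned projects.
# The URL and query parameters come from the curl example above; the
# field names ("name", "score", "tier") and the "results" wrapper are
# assumed, not documented -- inspect the real payload and adjust.
import json
import urllib.request

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=llm-tools&subcategory=llm-comparison-evaluation&limit=20"
)

with urllib.request.urlopen(URL) as resp:
    payload = json.load(resp)

# Accept either a bare list of records or a {"results": [...]} wrapper.
projects = payload.get("results", []) if isinstance(payload, dict) else payload

for project in projects:
    score = project.get("score", "?")
    tier = project.get("tier", "?")
    name = project.get("name", "<unknown>")
    print(f"{score:>3}  {tier:<12} {name}")
```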

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | open-compass/opencompass | OpenCompass is an LLM evaluation platform, supporting a wide range of models... | 76 | Verified |
| 2 | IBM/unitxt | 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI... | 62 | Established |
| 3 | lean-dojo/LeanDojo | Tool for data extraction and interacting with Lean programmatically. | 50 | Established |
| 4 | GoodStartLabs/AI_Diplomacy | Frontier Models playing the board game Diplomacy. | 49 | Emerging |
| 5 | salesforce/CodeT5 | Home of CodeT5: Open Code LLMs for Code Understanding and Generation | 49 | Emerging |
| 6 | MigoXLab/LMeterX | A general-purpose API load testing platform that supports LLM services and... | 44 | Emerging |
| 7 | namin/dafny-sketcher | Piggybacking on the Dafny language implementation to explore interactive... | 44 | Emerging |
| 8 | google/litmus | Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI... | 43 | Emerging |
| 9 | v7labs/benchllm | Continuous Integration for LLM powered applications | 43 | Emerging |
| 10 | NatLabRockies/COMPASS | INFRA-COMPASS is a tool that leverages Large Language Models (LLMs) to... | 42 | Emerging |
| 11 | JonathanChavezTamales/llm-leaderboard | A comprehensive set of LLM benchmark scores and provider prices... | 42 | Emerging |
| 12 | 599yongyang/DatasetLoom | An intelligent dataset construction and evaluation platform for multimodal large-model training | 40 | Emerging |
| 13 | rpjayaraman/RTL2UVM | Automated UVM testbench generator from Verilog RTL with optional LLM... | 40 | Emerging |
| 14 | NikolasEnt/ollama-webui-intel | Ollama with Intel (i)GPU acceleration in Docker and benchmark | 38 | Emerging |
| 15 | Vvkmnn/awesome-ai-eval | ☑️ A curated list of tools, methods & platforms for evaluating AI... | 38 | Emerging |
| 16 | lean-dojo/LeanDojoWebsite | Code for LeanDojo's website | 37 | Emerging |
| 17 | artas728/spelltest | AI-to-AI Testing \| Simulation framework for LLM-based applications | 37 | Emerging |
| 18 | NOVADEDOG/energy-leaderboard-runner | Open-source energy benchmark for local LLMs. Measures Wh and CO2 using real... | 37 | Emerging |
| 19 | LudwigStumpp/llm-leaderboard | A joint community effort to create one central leaderboard for LLMs. | 36 | Emerging |
| 20 | vertbera/beyond-the-mirror | Field research exposing how LLM safeguards collapse under polite, persistent... | 36 | Emerging |
| 21 | Supahands/llm-comparison-backend | This is an open-source project allowing you to compare two LLMs head to head... | 36 | Emerging |
| 22 | sealambda/unit-text | Unit tests for plain text - LLM as a copy editor | 34 | Emerging |
| 23 | flashclub/ModelJudge | A multilingual AI model evaluation platform built with Next.js, supporting multi-model comparison and real-time streaming responses | 32 | Emerging |
| 24 | empirical-run/empirical | Test and evaluate LLMs and model configurations, across all the scenarios... | 31 | Emerging |
| 25 | nexmoe/lm-speed | Help developers optimize AI application performance through comprehensive... | 30 | Emerging |
| 26 | dmeldrum6/LLM-Diff-Tool | Application for comparing responses from different Large Language Models... | 29 | Experimental |
| 27 | jordicor/GranSabio_LLM | Multi-Layer AI Quality Assurance for Content Generation. Multiple LLMs... | 29 | Experimental |
| 28 | LAVA-LAB/COOL-MC | The interface between probabilistic model checking and data-driven policy learning. | 29 | Experimental |
| 29 | jpreagan/llmnop | A tool for measuring LLM performance metrics. | 28 | Experimental |
| 30 | Skripkon/llm_trainer | 🤖 Train and evaluate LLMs with ease and fun 🦾 | 28 | Experimental |
| 31 | yinxulai/ait | Batch-tests the performance metrics of AI models that follow the OpenAI and Anthropic API protocols. Supports... | 28 | Experimental |
| 32 | amirdeljouyi/UTGen | Replication package of the ICSE2025 paper titled "Leveraging Large Language... | 28 | Experimental |
| 33 | geminimir/promptproof-action | Deterministic LLM contract checks for CI. Replays recorded fixtures,... | 27 | Experimental |
| 34 | ccarvalho-eng/aludel | LLM Evaluation Workbench | 27 | Experimental |
| 35 | UBC-MDS/fixml | LLM Tool for effective test evaluation of ML projects with curated... | 26 | Experimental |
| 36 | stashlabs/duelr | Compare LLMs in one click | 26 | Experimental |
| 37 | jonathanmli/Avalon-LLM | This repository contains an LLM benchmark for the social deduction game... | 26 | Experimental |
| 38 | georgeguimaraes/alike | Semantic similarity testing for Elixir. Test LLM outputs, chatbots, and NLP in Elixir | 26 | Experimental |
| 39 | shmercer/pairwiseLLM | R Package: Pairwise Comparison Tools for LLM-Based Writing Evaluation | 25 | Experimental |
| 40 | lmg-anon/rp-test-framework | LLM Roleplay Test Framework | 25 | Experimental |
| 41 | dsdanielpark/open-llm-leaderboard-report | Weekly visualization report of Open LLM model performance based on 4 metrics. | 24 | Experimental |
| 42 | hongping-zh/ecocompute-ai | 🔋 RTX 5090 energy benchmark suite for LLMs: real NVML power data, not estimates | 24 | Experimental |
| 43 | albertdobmeyer/cobol-legacy-ledger | Learn COBOL through a live banking system: 18 programs, 6-node settlement... | 24 | Experimental |
| 44 | Supahands/llm-comparison | This is an open-source project allowing you to compare two LLMs head to head... | 23 | Experimental |
| 45 | wafer-ai/chipbenchmark | A platform for monitoring the chip situation | 23 | Experimental |
| 46 | INPVLSA/probefish | A web-based LLM prompt and endpoint testing platform. Organize, version,... | 23 | Experimental |
| 47 | kalilurrahman/QualityEngineeringBookByLLMs | Quality Engineering book authored with LLM assistance, exploring modern QE... | 23 | Experimental |
| 48 | AGBAJEMUH/Awesome-AI-Evaluation-Guide | 🤖 Evaluate AI systems effectively with our comprehensive guide to methods,... | 22 | Experimental |
| 49 | ellmos-ai/ellmos-tests | Testing framework for LLM operating systems - B/O/E test methodology | 22 | Experimental |
| 50 | piyushgupta344/llm-test-harness | Deterministic testing framework for LLM-powered apps: record/replay... | 22 | Experimental |
| 51 | kishan5111/perfsmith | Tool to find the cheapest self-hosted serving configuration that meets your SLO. | 22 | Experimental |
| 52 | heyqule/evangelion_magi | Evangelion MAGI decision system that links 3 LLM models. | 22 | Experimental |
| 53 | augustocristian/llm-testing-roadmap-rp | Replication package of the article: "A Research Roadmap on the Usage of... | 22 | Experimental |
| 54 | Templum/aoide | A TypeScript testing framework for LLM-powered applications. Write tests... | 22 | Experimental |
| 55 | Yuyz0112/relia | Find the Best LLM for Your Needs through E2E Testing | 22 | Experimental |
| 56 | ArslanKAS/Quality-and-Safety-for-LLM-Applications | Explore new metrics and best practices to monitor your LLM systems and... | 21 | Experimental |
| 57 | josephpaulgiroux/ai_categories | Lets AI Language Models compete in a game of AI Categories (similar to... | 21 | Experimental |
| 58 | adilanwar2399/ESBMC-ibmc | The ESBMC ibmc (Invariant Based Model Checking) Tool. | 20 | Experimental |
| 59 | tianzhaotju/EMD | Replication Package for "Large Language Models for Equivalent Mutant... | 20 | Experimental |
| 60 | LeonYang95/LLM4UT | Evaluation code of ASE24 accepted paper "On the Evaluation of LLM in Unit... | 20 | Experimental |
| 61 | brains-on-code/IterativeRefactoringLLM | Replication package, supplementary materials, and analysis pipeline for our... | 19 | Experimental |
| 62 | ksm26/Automated-Testing-for-LLMOps | Create a continuous integration (CI) workflow for testing LLM applications... | 19 | Experimental |
| 63 | sanand0/hypoforge | Use LLMs to analyze any dataset, create hypotheses from those, test the... | 18 | Experimental |
| 64 | dessertlab/Human_vs_AI_Code_Quality | This repository allows the replication of our study "Human-Written vs.... | 17 | Experimental |
| 65 | AstraBert/DebateLLM-Championship | 5 LLMs, 1vs1 matches to produce the most convincing argumentation in favor... | 17 | Experimental |
| 66 | mich1803/Codenames-LLM | Building an AI team to play Codenames using top Large Language Models... | 16 | Experimental |
| 67 | broskees/llm-compare | LLM benchmark comparison tool | 16 | Experimental |
| 68 | ruankie/langfuse-monitoring-eval | Monitoring and evaluating LLM apps with Langfuse. Presented at PyConZA 2024. | 16 | Experimental |
| 69 | Amir-Mohseni/AI-Response-Evaluation | A comprehensive framework to evaluate the quality of AI-generated responses,... | 16 | Experimental |
| 70 | KooshaPari/kwality | 🧠 LLM Validation Platform: Advanced testing frameworks with DeepEval,... | 15 | Experimental |
| 71 | RodillasJavier/debate-fallacy-detector | Logical Fallacy Detection in Presidential Debates using a Random Forest... | 15 | Experimental |
| 72 | ml-energy/leaderboard | How much time and energy do modern generative AI models consume? | 15 | Experimental |
| 73 | rololevy/debate-IA-politica-argentina | A debate between two fine-tuned LLMs | 14 | Experimental |
| 74 | mpuodziukas-labs/cobol-demo | COBOL modernization: LLMs introduce bugs, humans validate. Production-grade... | 14 | Experimental |
| 75 | RedKnight-aj/ai-testing-framework | AI Testing Framework using DeepEval - Quality assurance for LLM applications | 14 | Experimental |
| 76 | agent-sh/perf | Rigorous performance investigation workflow with baselines, profiling, and... | 14 | Experimental |
| 77 | AI4InclusiveDeliberation/inclusive_deliberation_llm | Empowering Inclusive E-Deliberation by Harnessing Collective Wisdom and... | 14 | Experimental |
| 78 | seeshuraj/llm-test-lab | 🧪 Evaluate, score, and compare LLM outputs before your users do. Automated... | 14 | Experimental |
| 79 | Maik425/promptdiff | Compare LLM outputs across models. One API call. Supports Claude, GPT, Gemini, Grok. | 14 | Experimental |
| 80 | JosephTLucas/llm_test | A suite of tests to verify bias, safety, trust, and security concerns for LLMs. | 13 | Experimental |
| 81 | athina-ai/athina-sdk | LLM Testing SDK that helps you write and run tests to monitor your LLM app... | 13 | Experimental |
| 82 | aiqualitylab/llm-qa-assistant | Compare and validate QA tasks using 3 local (Ollama) or cloud (Groq API)... | 12 | Experimental |
| 83 | waldekmastykarz/openai-compare | Compare the effectiveness of LLMs using OpenAI-compatible APIs | 12 | Experimental |
| 84 | chiragpadyal/AutoTestGen | Automatic Unit Test Generation Testing Suite using LLM as a Visual Studio... | 12 | Experimental |
| 85 | Strawhat404/wb77i-optimizing-high-throughput-chat-message-aggregation | A sample Dataset for AI training to showcase the LLM Benchmarking of... | 11 | Experimental |
| 86 | danpozmanter/llm-comparative-eval | Compare how LLM models stack up | 11 | Experimental |
| 87 | giis-uniovi/retorch-llm-rp | Replication package for LLM System testing experimentation | 11 | Experimental |
| 88 | ceccon-t/LicLacMoe | Play tic-tac-toe against a local LLM model. | 11 | Experimental |
| 89 | SevdanurGENC/LLM-Based-Unit-Test-Generator | Automated unit test generation and evaluation using generative AI (GPT-4) | 11 | Experimental |
| 90 | croko22/opsg-unit-test-generation | OPSG-based test refinement for Java: Stable RL approach to generate... | 11 | Experimental |
| 91 | Trust4AI/MUSE | AI-driven Metamorphic Testing Inputs Generator | 11 | Experimental |
| 92 | colingalbraith/Accoutre | Accoutre aims to equip SLMs with tools and measure the gains - A zero-build... | 11 | Experimental |
| 93 | Jeeban420/python-api-frameworks-benchmark | 🚀 Benchmark five Python web frameworks under realistic workloads with Docker... | 11 | Experimental |
| 94 | sohambpatel/TestBedGenerator | Creating the test beds with the help of chatgpt, in house LLM OLLAMA and... | 11 | Experimental |
| 95 | thabit-ai/thabit | Thabit is a platform to evaluate prompts on multiple LLMs to determine the... | 10 | Experimental |
| 96 | ash-jyc/db84llm | College policy debate as a verbal reasoning benchmark for LLMs | 10 | Experimental |