LLM Evaluation Frameworks (Prompt Engineering Tools)

Systematic benchmarking and testing suites for evaluating LLM prompt strategies, output quality, consistency, and factuality across multiple models and tasks. Does NOT include prompt optimization tools, hallucination-reduction techniques alone, or general LLM deployment platforms.

101 LLM evaluation framework tools are tracked. One scores 70 or above (the verified tier). The highest-rated is microsoft/promptbench at 70/100, with 2,785 stars and 288 monthly downloads.

Get all 101 projects as JSON:

```bash
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=prompt-engineering&subcategory=llm-evaluation-frameworks&limit=20"
```

The API is open to everyone at 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
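For scripted access, the same endpoint can be queried from Python. The sketch below is a minimal example using the `requests` library; the query parameters are taken from the curl command above, while the response field names (`projects`, `name`, `score`, `tier`) and the ability to raise `limit` beyond 20 are assumptions, since the JSON schema is not documented here.

```python
# Minimal sketch: fetch the dataset from the public API and print each project.
# The endpoint and query parameters come from the curl example above; the
# response field names ("projects", "name", "score", "tier") are assumed and
# may differ from the actual schema.
import requests

API_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "prompt-engineering",
    "subcategory": "llm-evaluation-frameworks",
    "limit": 101,  # the curl example uses limit=20; raising it to cover all projects is an assumption
}

resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

# Handle either a bare list or a wrapper object with a "projects" key.
projects = data if isinstance(data, list) else data.get("projects", [])
for p in projects:
    print(f'{p.get("name")}: {p.get("score")}/100 ({p.get("tier")})')
```

Anonymous requests count against the 100/day quota, so pulling the full list in a single call like this stays well within the free limit.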

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | microsoft/promptbench | A unified evaluation framework for large language models | 70 | Verified |
| 2 | uptrain-ai/uptrain | UpTrain is an open-source unified platform to evaluate and improve... | 55 | Established |
| 3 | microsoftarchive/promptbench | A unified evaluation framework for large language models | 45 | Emerging |
| 4 | gabe-mousa/Apolien | AI Safety Evaluation Library | 45 | Emerging |
| 5 | levitation-opensource/Manipulative-Expression-Recognition | MER is a software that identifies and highlights manipulative communication... | 38 | Emerging |
| 6 | PromptMixerDev/prompt-mixer-app-ce | A desktop application for comparing outputs from different Large Language... | 34 | Emerging |
| 7 | GSA/FedRAMP-OllaLab-Lean | The OllaLab-Lean project is designed to help both novice and experienced... | 34 | Emerging |
| 8 | babelcloud/LLM-RGB | LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios... | 34 | Emerging |
| 9 | ryoungj/ToolEmu | [ICLR'24 Spotlight] A language model (LM)-based emulation framework for... | 33 | Emerging |
| 10 | kiyoshisasano/llm-failure-atlas | A graph-based failure modeling and deterministic detection system for LLM... | 32 | Emerging |
| 11 | ozturkoktay/insurance-llm-framework | An interactive framework for experimenting with and evaluating open-source... | 30 | Emerging |
| 12 | syamsasi99/prompt-evaluator | prompt-evaluator is an open-source toolkit for evaluating, testing, and... | 30 | Emerging |
| 13 | fau-masters-collected-works-cgarbin/llm-comparison-tool | A tool to compare multiple large language models (LLMs) side by side | 26 | Experimental |
| 14 | realadeel/llm-test-bench | Compare LLM providers (OpenAI, Claude, Gemini) for vision tasks - benchmark... | 26 | Experimental |
| 15 | mary-lev/llm-ocr | LLM-powered OCR evaluation and correction package that supports multiple... | 26 | Experimental |
| 16 | pablo-chacon/Spoon-Bending | Educational analysis of LLM alignment, safety behavior, and... | 25 | Experimental |
| 17 | sidoody/heart-context-pack | Compiling the HEART Score into a structured, model-facing policy artifact... | 23 | Experimental |
| 18 | joshualamerton/Modelbench | Concept: benchmarking harness for prompts, models, and agent strategies | 23 | Experimental |
| 19 | SyntagmaNull/judgment-hygiene-stack | Tri-skill framework for structure routing, evidence discipline, and judgment... | 23 | Experimental |
| 20 | jameswniu/self-hosted-llm-evals-lab | Evaluation framework for self-hosted LLMs. Systematic prompt ablation... | 23 | Experimental |
| 21 | GnomeMan4201/drift-artifact | Stylometric drift experiment — documents that demonstrate iterative... | 23 | Experimental |
| 22 | lpr021/redteam-ai-benchmark | 🧪 Evaluate uncensored LLMs for offensive security with targeted questions... | 23 | Experimental |
| 23 | reiidoda/OpenRe | Open-source AI agent evaluation workbench for benchmarking, tracing,... | 22 | Experimental |
| 24 | aaddii09/llm-eval-harness | 🔍 Run efficient evaluations for prompt and LLM regression testing with this... | 22 | Experimental |
| 25 | AspenXDev/job-evaluation-engine | Modular prompt-engineered system for deterministic job evaluation with... | 22 | Experimental |
| 26 | MarcKarbowiak/ai-evaluation-harness | Production-minded evaluation harness for LLM features with structured... | 22 | Experimental |
| 27 | kogunlowo123/ai-evaluation-prompts | Prompt evaluation framework with accuracy, coherence, safety rubrics, and... | 22 | Experimental |
| 28 | kanupriya-GuptaM/llm-agreement-bias-benchmark | Benchmark framework for detecting agreement bias and answer instability in... | 22 | Experimental |
| 29 | paradite/eval-data | Prompts and evaluation data for LLMs on real world coding and writing tasks | 22 | Experimental |
| 30 | EviAmarates/fresta-edge | Domain evaluation lens generator built on the Fresta Lens Framework | 22 | Experimental |
| 31 | adityaarunsinghal/LLM-As-A-Judge-Prompt-Improver | Scientific framework for iterative LLM prompt improvement using... | 22 | Experimental |
| 32 | mohosy/OpenEvals | Open-source eval studio for prompt comparisons, regression tracking, and... | 22 | Experimental |
| 33 | MVidicek/evalkit | Test your prompts like you test your code. Regression testing for LLM applications. | 22 | Experimental |
| 34 | Amir-ElBelawy/llm-failure-mode-taxonomy | A practitioner's taxonomy of recurring failure patterns in large language... | 22 | Experimental |
| 35 | chirindaopensource/auditable_AI_agent_loop_for_empirical_economics | End-to-End Python implementation of Shin (2026)'s evaluator-locked agentic... | 22 | Experimental |
| 36 | deadbits/trs | 🔭 Threat report analysis via LLM and Vector DB | 22 | Experimental |
| 37 | hsieh89t-cloud/legal-agent-reliability-benchmark | Reliability and hallucination mitigation research for tool-augmented legal... | 22 | Experimental |
| 38 | hideyuki001/unified-cognitive-os-v1.8 | Judgment decomposition architecture for translation QA, ASR review, AI... | 22 | Experimental |
| 39 | kustonaut/llm-eval-kit | Quality scoring, eval suites, and regression detection for LLM outputs. | 22 | Experimental |
| 40 | kepiCHelaSHen/context-hacking | Turn LLM priors into scientific rigor. Zero-drift multi-agent framework for... | 22 | Experimental |
| 41 | IgnazioDS/evalops-workbench | A local-first evaluation harness for prompts, tools, and agents with... | 22 | Experimental |
| 42 | Chunduri-Aditya/Model-Behavior-Lab | Local Ollama-based LLM evaluation platform that benchmarks reasoning,... | 20 | Experimental |
| 43 | petersimmons1972/brutal-evaluation | AI skill for brutally honest project feedback. Based on Dylan Davis's BRUTAL... | 20 | Experimental |
| 44 | maxpetrusenko/llm-eval-notes | Public LLM evaluation artifacts: hallucination, brittleness, structured... | 20 | Experimental |
| 45 | Ravevx/LLM-Spatial-Reasoning-Evaluation-2D-Physics-Puzzle | A benchmark environment for evaluating large language models’ spatial... | 20 | Experimental |
| 46 | tpertner/squeeze | Squeeze your model with pressure prompts to see if its behavior leaks. | 19 | Experimental |
| 47 | michaelflppv/prompt-llm-benchmark | Prompt LLM Bench is a platform that discovers compatible Hugging Face models... | 19 | Experimental |
| 48 | hirbis/prompt-governance | Replication package for "Prompt Governance in Financial AI" (Girolli, 2026).... | 19 | Experimental |
| 49 | gwasiakshay/llm-eval-benchmark | LLM evaluation & benchmarking framework using LLM-as-a-judge scoring,... | 19 | Experimental |
| 50 | vivek8849/llm-trust-evaluator | A production-ready framework for evaluating LLM reliability using semantic... | 19 | Experimental |
| 51 | aleremfer/prompt-eval-cases | Prompt comparison and evaluation across multiple LLMs (EN/ES) | 19 | Experimental |
| 52 | aikenkyu001/semantic_roundtrip_benchmark_2 | This repository contains the primary contributions of our research paper, "A... | 19 | Experimental |
| 53 | firechair/AI-Engineering-Critique | 🚀 An interactive platform for LLM Preference Learning and Comparative... | 19 | Experimental |
| 54 | Philipnil06/ai-output-quality-lab | A structured experiment framework for prompt variation, evaluation, and... | 19 | Experimental |
| 55 | LeNguyenAnhKhoa/Hallucination-Detection | Hallucination Detection using LLM's API | 18 | Experimental |
| 56 | thuanystuart/DD3412-chain-of-verification-reproduction | Re-implementation of the paper "Chain-of-Verification Reduces Hallucination... | 18 | Experimental |
| 57 | r4u-dev/open-r4u | Optimize AI & Maximize ROI of your LLM tasks. Evaluates current state and... | 18 | Experimental |
| 58 | GTMVP/modal-llm-evaluator | Run 1,000 LLM evaluations in 10 minutes. Test prompts across Claude, GPT-4,... | 16 | Experimental |
| 59 | vihanga/prompt-sandbox | Testing framework for LLM prompts. Started as a weekend project after... | 15 | Experimental |
| 60 | aikenkyu001/benchmarking_llm_against_prompt_formats | Official experimental environment for 'Benchmarking LLM Sensitivity to... | 15 | Experimental |
| 61 | moses-shenassa/llm-prompt-framework-and-eval-suite | Prompt engineering framework + evaluation harness for LLM workflows... | 15 | Experimental |
| 62 | flamehaven01/CRoM-EfficientLLM | A Python toolkit to optimize LLM context by intelligently selecting,... | 15 | Experimental |
| 63 | antzedek/dar-quickfix | Runtime patch that kills LLM loops, drift & hallucinations in real-time –... | 15 | Experimental |
| 64 | lkilefner/llm-quality-evaluation-examples | K–12 LLM evaluation examples using teacher-centered ground truths, rubrics,... | 15 | Experimental |
| 65 | Codegrammer999/prompt-bench | This is a benchmark suite comparing zero-shot, few-shot, Chain-of-Thought,... | 15 | Experimental |
| 66 | FlosMume/LLM-Safety-Labs-Starter | Foundation for building safer generative-AI systems — includes example... | 15 | Experimental |
| 67 | rahul-sg/HondaResearchLabs_DSC180A-Eval-Systems-Of-NextGen-LLMs | Domain-aware LLM summary evaluation and iterative refinement pipeline with... | 15 | Experimental |
| 68 | ktjkc/reflextrust | 🧠 LLMs don’t just process text — they read the room. Meaning emerges through... | 14 | Experimental |
| 69 | sportixIndia/LBOS-LCAS-LP-Contradiction-tracker | 🔍 Track contradictions in AI and human content with LBOS-LCAS, enhancing... | 14 | Experimental |
| 70 | antsuebae/TFG-LLM-RE | TFG: Comparative evaluation of local vs. cloud LLMs in Ingeniería de... | 14 | Experimental |
| 71 | bensonbabu93/llm-prompt-evaluation-framework | A prompt experimentation tool that benchmarks LLM responses across multiple... | 14 | Experimental |
| 72 | YifanHe0126/medical-mllm-evaluation | Evaluation and model selection workflow for open-source multimodal LLMs in... | 14 | Experimental |
| 73 | AW-VB/llm-mcq-benchmark | Benchmarking open-weight LLMs on multiple-choice QA with prompt comparison,... | 14 | Experimental |
| 74 | rechriti/llm-risk-analysis | LLM-based risk analysis system using prompt engineering and evaluation (NDA-safe) | 14 | Experimental |
| 75 | rahulthadhani/llm-benchmark | A benchmark suite that tests how zero-shot, few-shot, chain-of-thought, and... | 14 | Experimental |
| 76 | illogical/LMEval | Web application for systematic prompt engineering and model evaluation | 14 | Experimental |
| 77 | jharter-stack/prompt-evals | prompt-evals — Prompt testing, comparisons, refinements, and failure cases | 14 | Experimental |
| 78 | gamzeakkurt/Prompt-Evaluation-in-AWS-Bedrock | Prompt evaluation framework using AWS Bedrock to assess LLM outputs with... | 14 | Experimental |
| 79 | wzy6642/I3C-Select | Official implementation for "Instructing Large Language Models to Identify... | 13 | Experimental |
| 80 | ghazal001/LLM-C-Grading-Agent | Ongoing LLM-based grading agent for automated evaluation of C++ programming... | 13 | Experimental |
| 81 | Ziechoes/reasoning-invariance-benchmark | Experiments testing whether LLM reasoning trajectories remain invariant when... | 12 | Experimental |
| 82 | useentropy/llmkit | LLM Kit - Python Large Language Model Kit for generating data of your choice | 12 | Experimental |
| 83 | BOSSMAN-dev89/LBOS-LCAS-LP-Contradiction-tracker | A tool for auditing bias through large language models | 12 | Experimental |
| 84 | rlin25/FrizzlesRubric | A modular system for automated, multi-metric AI prompt evaluation—featuring... | 12 | Experimental |
| 85 | chirindaopensource/llm_faithfulness_hallucination_misalignment_detection | End-to-End Python implementation of Semantic Divergence Metrics (SDM) for... | 12 | Experimental |
| 86 | yuchenzhu-research/iclr2026-cao-prompt-drift-lab | A reproducible evaluation framework for studying how small prompt variations... | 11 | Experimental |
| 87 | sergeyklay/factly | CLI tool to evaluate LLM factuality on MMLU benchmark. | 11 | Experimental |
| 88 | noah-art3mis/crucible | Develop better LLM apps by testing different models and prompts in bulk. | 11 | Experimental |
| 89 | GoodCODER280722/llm-output-validator | Rule-based AI output validation CLI tool (mock mode) with structured JSON reporting. | 11 | Experimental |
| 90 | jadhav045/DeepStack-AILM-Assignment | A strict, provider-agnostic User Input Validator powered exclusively by LLMs... | 11 | Experimental |
| 91 | SiemonCha/ECM3401-LLM-Essay-Scoring | Measuring semantic robustness in LLM-based CEFR essay scoring through... | 11 | Experimental |
| 92 | mtchynkstff/llm-ed-eval | A reproducible evaluation framework analyzing how prompt strategies affect... | 11 | Experimental |
| 93 | 1rajatk/content-judgment-calibrator | A judgment calibration framework for auditing content clarity, credibility,... | 11 | Experimental |
| 94 | Laksh-star/ai-fluency-gym | Educational AI fluency self-assessment inspired by the 4D framework, with... | 11 | Experimental |
| 95 | KSVQ/openrouter-harness | Lightweight OpenRouter evaluation harness with web UI, batch runs, and a... | 11 | Experimental |
| 96 | eugeniusms/TextualVerifier | LLM-Based Textual Verifier using Chain-of-Thought, Variant Generation, and... | 11 | Experimental |
| 97 | TheSkyBiz/llm-persona-drift-evaluation | 945-generation adversarial evaluation of 3 open LLMs across 3 personas and... | 11 | Experimental |
| 98 | motasemwed/llm-judge | LLM-as-a-Judge system for rubric-based, explainable evaluation of large... | 11 | Experimental |
| 99 | YaswanthGhanta/llm-logical-integrity-benchmark | Adversarial testing of LLMs on constraint satisfaction deadlocks | 11 | Experimental |
| 100 | OptionalSoftware/concurrent | The Multi-LLM Benchmarking Tool | 10 | Experimental |
| 101 | ghazaleh-mahmoodi/Prompting_LLMs_AS_Explainable_Metrics | Eval4NLP Shared Task on Prompting Large Language Models as Explainable Metrics | 10 | Experimental |