LLM Evaluation Frameworks (Prompt Engineering Tools)
Systematic benchmarking and testing suites for evaluating LLM prompt strategies, output quality, consistency, and factuality across multiple models and tasks. Does NOT include prompt optimization tools, hallucination-reduction techniques alone, or general LLM deployment platforms.
There are 101 LLM evaluation framework tools tracked. One scores 70 or above (Verified tier). The highest-rated is microsoft/promptbench at 70/100, with 2,785 stars and 288 monthly downloads.
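Concretely, most of the frameworks listed below automate the same core loop: run a fixed set of prompt cases against one or more models, score each output against an expectation, and aggregate scores per model. The sketch below illustrates that pattern only; `EvalCase`, `run_suite`, and the exact-match scorer are hypothetical stand-ins rather than the API of any tool in this list, and real frameworks substitute richer scorers (LLM-as-a-judge, rubrics, semantic similarity) and reporting.

```python
# Minimal sketch of the loop most of these frameworks automate:
# run fixed prompts against each model, score outputs, aggregate per model.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str  # reference answer used by the scorer

def exact_match(output: str, expected: str) -> float:
    # Simplest possible scorer; real frameworks add LLM judges, rubrics,
    # semantic similarity, safety checks, etc.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_suite(
    models: list[str],
    cases: list[EvalCase],
    call_model: Callable[[str, str], str],  # (model, prompt) -> output; your provider client
) -> dict[str, float]:
    # Mean score per model across all cases.
    results: dict[str, float] = {}
    for model in models:
        scores = [exact_match(call_model(model, c.prompt), c.expected) for c in cases]
        results[model] = sum(scores) / len(scores) if scores else 0.0
    return results

if __name__ == "__main__":
    # Dummy client so the sketch runs as-is; swap in a real API or local model.
    dummy = lambda model, prompt: "paris" if "capital of France" in prompt else ""
    cases = [EvalCase("What is the capital of France?", "Paris")]
    print(run_suite(["dummy-model"], cases, dummy))
```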
Get the project list as JSON via the API (the example below requests the top 20):

```
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=prompt-engineering&subcategory=llm-evaluation-frameworks&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
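For programmatic use, the same endpoint can be fetched with any HTTP client. A minimal Python sketch using only the standard library is shown below; it assumes the endpoint returns a JSON body, and because the response schema is not documented on this page, it prints the top-level structure rather than assuming field names.

```python
# Fetch the dataset from the public endpoint shown above (stdlib only).
# NOTE: the response schema is an assumption; inspect the raw JSON before
# relying on any particular field names.
import json
import urllib.request

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=prompt-engineering&subcategory=llm-evaluation-frameworks&limit=20"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# Dump a preview of the top-level structure so the real schema is visible.
print(json.dumps(data, indent=2)[:2000])
```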
| # | Tool | Description | Tier |
|---|---|---|---|
| 1 | microsoft/promptbench | A unified evaluation framework for large language models | Verified |
| 2 | uptrain-ai/uptrain | UpTrain is an open-source unified platform to evaluate and improve... | Established |
| 3 | microsoftarchive/promptbench | A unified evaluation framework for large language models | Emerging |
| 4 | gabe-mousa/Apolien | AI Safety Evaluation Library | Emerging |
| 5 | levitation-opensource/Manipulative-Expression-Recognition | MER is a software that identifies and highlights manipulative communication... | Emerging |
| 6 | PromptMixerDev/prompt-mixer-app-ce | A desktop application for comparing outputs from different Large Language... | Emerging |
| 7 | GSA/FedRAMP-OllaLab-Lean | The OllaLab-Lean project is designed to help both novice and experienced... | Emerging |
| 8 | babelcloud/LLM-RGB | LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios... | Emerging |
| 9 | ryoungj/ToolEmu | [ICLR'24 Spotlight] A language model (LM)-based emulation framework for... | Emerging |
| 10 | kiyoshisasano/llm-failure-atlas | A graph-based failure modeling and deterministic detection system for LLM... | Emerging |
| 11 | ozturkoktay/insurance-llm-framework | An interactive framework for experimenting with and evaluating open-source... | Emerging |
| 12 | syamsasi99/prompt-evaluator | prompt-evaluator is an open-source toolkit for evaluating, testing, and... | Emerging |
| 13 | fau-masters-collected-works-cgarbin/llm-comparison-tool | A tool to compare multiple large language models (LLMs) side by side | Experimental |
| 14 | realadeel/llm-test-bench | Compare LLM providers (OpenAI, Claude, Gemini) for vision tasks - benchmark... | Experimental |
| 15 | mary-lev/llm-ocr | LLM-powered OCR evaluation and correction package that supports multiple... | Experimental |
| 16 | pablo-chacon/Spoon-Bending | Educational analysis of LLM alignment, safety behavior, and... | Experimental |
| 17 | sidoody/heart-context-pack | Compiling the HEART Score into a structured, model-facing policy artifact... | Experimental |
| 18 | joshualamerton/Modelbench | Concept: benchmarking harness for prompts, models, and agent strategies | Experimental |
| 19 | SyntagmaNull/judgment-hygiene-stack | Tri-skill framework for structure routing, evidence discipline, and judgment... | Experimental |
| 20 | jameswniu/self-hosted-llm-evals-lab | Evaluation framework for self-hosted LLMs. Systematic prompt ablation... | Experimental |
| 21 | GnomeMan4201/drift-artifact | Stylometric drift experiment — documents that demonstrate iterative... | Experimental |
| 22 | lpr021/redteam-ai-benchmark | 🧪 Evaluate uncensored LLMs for offensive security with targeted questions... | Experimental |
| 23 | reiidoda/OpenRe | Open-source AI agent evaluation workbench for benchmarking, tracing,... | Experimental |
| 24 | aaddii09/llm-eval-harness | 🔍 Run efficient evaluations for prompt and LLM regression testing with this... | Experimental |
| 25 | AspenXDev/job-evaluation-engine | Modular prompt-engineered system for deterministic job evaluation with... | Experimental |
| 26 | MarcKarbowiak/ai-evaluation-harness | Production-minded evaluation harness for LLM features with structured... | Experimental |
| 27 | kogunlowo123/ai-evaluation-prompts | Prompt evaluation framework with accuracy, coherence, safety rubrics, and... | Experimental |
| 28 | kanupriya-GuptaM/llm-agreement-bias-benchmark | Benchmark framework for detecting agreement bias and answer instability in... | Experimental |
| 29 | paradite/eval-data | Prompts and evaluation data for LLMs on real world coding and writing tasks | Experimental |
| 30 | EviAmarates/fresta-edge | Domain evaluation lens generator built on the Fresta Lens Framework | Experimental |
| 31 | adityaarunsinghal/LLM-As-A-Judge-Prompt-Improver | Scientific framework for iterative LLM prompt improvement using... | Experimental |
| 32 | mohosy/OpenEvals | Open-source eval studio for prompt comparisons, regression tracking, and... | Experimental |
| 33 | MVidicek/evalkit | Test your prompts like you test your code. Regression testing for LLM applications. | Experimental |
| 34 | Amir-ElBelawy/llm-failure-mode-taxonomy | A practitioner's taxonomy of recurring failure patterns in large language... | Experimental |
| 35 | chirindaopensource/auditable_AI_agent_loop_for_empirical_economics | End-to-End Python implementation of Shin (2026)'s evaluator-locked agentic... | Experimental |
| 36 | deadbits/trs | 🔭 Threat report analysis via LLM and Vector DB | Experimental |
| 37 | hsieh89t-cloud/legal-agent-reliability-benchmark | Reliability and hallucination mitigation research for tool-augmented legal... | Experimental |
| 38 | hideyuki001/unified-cognitive-os-v1.8 | Judgment decomposition architecture for translation QA, ASR review, AI... | Experimental |
| 39 | kustonaut/llm-eval-kit | Quality scoring, eval suites, and regression detection for LLM outputs. | Experimental |
| 40 | kepiCHelaSHen/context-hacking | Turn LLM priors into scientific rigor. Zero-drift multi-agent framework for... | Experimental |
| 41 | IgnazioDS/evalops-workbench | A local-first evaluation harness for prompts, tools, and agents with... | Experimental |
| 42 | Chunduri-Aditya/Model-Behavior-Lab | Local Ollama-based LLM evaluation platform that benchmarks reasoning,... | Experimental |
| 43 | petersimmons1972/brutal-evaluation | AI skill for brutally honest project feedback. Based on Dylan Davis's BRUTAL... | Experimental |
| 44 | maxpetrusenko/llm-eval-notes | Public LLM evaluation artifacts: hallucination, brittleness, structured... | Experimental |
| 45 | Ravevx/LLM-Spatial-Reasoning-Evaluation-2D-Physics-Puzzle | A benchmark environment for evaluating large language models’ spatial... | Experimental |
| 46 | tpertner/squeeze | Squeeze your model with pressure prompts to see if its behavior leaks. | Experimental |
| 47 | michaelflppv/prompt-llm-benchmark | Prompt LLM Bench is a platform that discovers compatible Hugging Face models... | Experimental |
| 48 | hirbis/prompt-governance | Replication package for "Prompt Governance in Financial AI" (Girolli, 2026).... | Experimental |
| 49 | gwasiakshay/llm-eval-benchmark | LLM evaluation & benchmarking framework using LLM-as-a-judge scoring,... | Experimental |
| 50 | vivek8849/llm-trust-evaluator | A production-ready framework for evaluating LLM reliability using semantic... | Experimental |
| 51 | aleremfer/prompt-eval-cases | Prompt comparison and evaluation across multiple LLMs (EN/ES) | Experimental |
| 52 | aikenkyu001/semantic_roundtrip_benchmark_2 | This repository contains the primary contributions of our research paper, "A... | Experimental |
| 53 | firechair/AI-Engineering-Critique | 🚀 An interactive platform for LLM Preference Learning and Comparative... | Experimental |
| 54 | Philipnil06/ai-output-quality-lab | A structured experiment framework for prompt variation, evaluation, and... | Experimental |
| 55 | LeNguyenAnhKhoa/Hallucination-Detection | Hallucination Detection using LLM's API | Experimental |
| 56 | thuanystuart/DD3412-chain-of-verification-reproduction | Re-implementation of the paper "Chain-of-Verification Reduces Hallucination... | Experimental |
| 57 | r4u-dev/open-r4u | Optimize AI & Maximize ROI of your LLM tasks. Evaluates current state and... | Experimental |
| 58 | GTMVP/modal-llm-evaluator | Run 1,000 LLM evaluations in 10 minutes. Test prompts across Claude, GPT-4,... | Experimental |
| 59 | vihanga/prompt-sandbox | Testing framework for LLM prompts. Started as a weekend project after... | Experimental |
| 60 | aikenkyu001/benchmarking_llm_against_prompt_formats | Official experimental environment for 'Benchmarking LLM Sensitivity to... | Experimental |
| 61 | moses-shenassa/llm-prompt-framework-and-eval-suite | Prompt engineering framework + evaluation harness for LLM workflows... | Experimental |
| 62 | flamehaven01/CRoM-EfficientLLM | A Python toolkit to optimize LLM context by intelligently selecting,... | Experimental |
| 63 | antzedek/dar-quickfix | Runtime patch that kills LLM loops, drift & hallucinations in real-time –... | Experimental |
| 64 | lkilefner/llm-quality-evaluation-examples | K–12 LLM evaluation examples using teacher-centered ground truths, rubrics,... | Experimental |
| 65 | Codegrammer999/prompt-bench | This is a benchmark suite comparing zero-shot, few-shot, Chain-of-Thought,... | Experimental |
| 66 | FlosMume/LLM-Safety-Labs-Starter | Foundation for building safer generative-AI systems — includes example... | Experimental |
| 67 | rahul-sg/HondaResearchLabs_DSC180A-Eval-Systems-Of-NextGen-LLMs | Domain-aware LLM summary evaluation and iterative refinement pipeline with... | Experimental |
| 68 | ktjkc/reflextrust | 🧠 LLMs don’t just process text — they read the room. Meaning emerges through... | Experimental |
| 69 | sportixIndia/LBOS-LCAS-LP-Contradiction-tracker | 🔍 Track contradictions in AI and human content with LBOS-LCAS, enhancing... | Experimental |
| 70 | antsuebae/TFG-LLM-RE | TFG: Comparative evaluation of local vs. cloud LLMs in the engineering of... | Experimental |
| 71 | bensonbabu93/llm-prompt-evaluation-framework | A prompt experimentation tool that benchmarks LLM responses across multiple... | Experimental |
| 72 | YifanHe0126/medical-mllm-evaluation | Evaluation and model selection workflow for open-source multimodal LLMs in... | Experimental |
| 73 | AW-VB/llm-mcq-benchmark | Benchmarking open-weight LLMs on multiple-choice QA with prompt comparison,... | Experimental |
| 74 | rechriti/llm-risk-analysis | LLM-based risk analysis system using prompt engineering and evaluation (NDA-safe) | Experimental |
| 75 | rahulthadhani/llm-benchmark | A benchmark suite that tests how zero-shot, few-shot, chain-of-thought, and... | Experimental |
| 76 | illogical/LMEval | Web application for systematic prompt engineering and model evaluation | Experimental |
| 77 | jharter-stack/prompt-evals | prompt-evals — Prompt testing, comparisons, refinements, and failure cases | Experimental |
| 78 | gamzeakkurt/Prompt-Evaluation-in-AWS-Bedrock | Prompt evaluation framework using AWS Bedrock to assess LLM outputs with... | Experimental |
| 79 | wzy6642/I3C-Select | Official implementation for "Instructing Large Language Models to Identify... | Experimental |
| 80 | ghazal001/LLM-C-Grading-Agent | Ongoing LLM-based grading agent for automated evaluation of C++ programming... | Experimental |
| 81 | Ziechoes/reasoning-invariance-benchmark | Experiments testing whether LLM reasoning trajectories remain invariant when... | Experimental |
| 82 | useentropy/llmkit | LLM Kit - Python Large Language Model Kit for generating data of your choice | Experimental |
| 83 | BOSSMAN-dev89/LBOS-LCAS-LP-Contradiction-tracker | A tool for auditing bias through large language models | Experimental |
| 84 | rlin25/FrizzlesRubric | A modular system for automated, multi-metric AI prompt evaluation—featuring... | Experimental |
| 85 | chirindaopensource/llm_faithfulness_hallucination_misalignment_detection | End-to-End Python implementation of Semantic Divergence Metrics (SDM) for... | Experimental |
| 86 | yuchenzhu-research/iclr2026-cao-prompt-drift-lab | A reproducible evaluation framework for studying how small prompt variations... | Experimental |
| 87 | sergeyklay/factly | CLI tool to evaluate LLM factuality on MMLU benchmark. | Experimental |
| 88 | noah-art3mis/crucible | Develop better LLM apps by testing different models and prompts in bulk. | Experimental |
| 89 | GoodCODER280722/llm-output-validator | Rule-based AI output validation CLI tool (mock mode) with structured JSON reporting. | Experimental |
| 90 | jadhav045/DeepStack-AILM-Assignment | A strict, provider-agnostic User Input Validator powered exclusively by LLMs... | Experimental |
| 91 | SiemonCha/ECM3401-LLM-Essay-Scoring | Measuring semantic robustness in LLM-based CEFR essay scoring through... | Experimental |
| 92 | mtchynkstff/llm-ed-eval | A reproducible evaluation framework analyzing how prompt strategies affect... | Experimental |
| 93 | 1rajatk/content-judgment-calibrator | A judgment calibration framework for auditing content clarity, credibility,... | Experimental |
| 94 | Laksh-star/ai-fluency-gym | Educational AI fluency self-assessment inspired by the 4D framework, with... | Experimental |
| 95 | KSVQ/openrouter-harness | Lightweight OpenRouter evaluation harness with web UI, batch runs, and a... | Experimental |
| 96 | eugeniusms/TextualVerifier | LLM-Based Textual Verifier using Chain-of-Thought, Variant Generation, and... | Experimental |
| 97 | TheSkyBiz/llm-persona-drift-evaluation | 945-generation adversarial evaluation of 3 open LLMs across 3 personas and... | Experimental |
| 98 | motasemwed/llm-judge | LLM-as-a-Judge system for rubric-based, explainable evaluation of large... | Experimental |
| 99 | YaswanthGhanta/llm-logical-integrity-benchmark | Adversarial testing of LLMs on constraint satisfaction deadlocks | Experimental |
| 100 | OptionalSoftware/concurrent | The Multi-LLM Benchmarking Tool | Experimental |
| 101 | ghazaleh-mahmoodi/Prompting_LLMs_AS_Explainable_Metrics | Eval4NLP Shared Task on Prompting Large Language Models as Explainable Metrics | Experimental |