LLM Evaluation Frameworks (Prompt Engineering Tools)
Systematic benchmarking and testing suites for evaluating LLM prompt strategies, output quality, consistency, and factuality across multiple models and tasks. Does NOT include prompt optimization tools, hallucination-reduction techniques alone, or general LLM deployment platforms.
There are 101 LLM evaluation framework tools tracked. One scores 70 or above (Verified tier). The highest-rated is microsoft/promptbench at 70/100, with 2,785 stars and 288 monthly downloads.
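Concretely, most of the frameworks listed below automate the same core loop: run a fixed set of prompt cases against one or more models, score each output against an expectation, and aggregate scores per model. The sketch below illustrates that pattern only; `EvalCase`, `run_suite`, and the exact-match scorer are hypothetical stand-ins rather than the API of any tool in this list, and real frameworks substitute richer scorers (LLM-as-a-judge, rubrics, semantic similarity) and reporting.

```python
# Minimal sketch of the loop most of these frameworks automate:
# run fixed prompts against each model, score outputs, aggregate per model.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str  # reference answer used by the scorer

def exact_match(output: str, expected: str) -> float:
    # Simplest possible scorer; real frameworks add LLM judges, rubrics,
    # semantic similarity, safety checks, etc.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_suite(
    models: list[str],
    cases: list[EvalCase],
    call_model: Callable[[str, str], str],  # (model, prompt) -> output; your provider client
) -> dict[str, float]:
    # Mean score per model across all cases.
    results: dict[str, float] = {}
    for model in models:
        scores = [exact_match(call_model(model, c.prompt), c.expected) for c in cases]
        results[model] = sum(scores) / len(scores) if scores else 0.0
    return results

if __name__ == "__main__":
    # Dummy client so the sketch runs as-is; swap in a real API or local model.
    dummy = lambda model, prompt: "paris" if "capital of France" in prompt else ""
    cases = [EvalCase("What is the capital of France?", "Paris")]
    print(run_suite(["dummy-model"], cases, dummy))
```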
Get the project list as JSON via the API (the example below requests the top 20):

```
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=prompt-engineering&subcategory=llm-evaluation-frameworks&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
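For programmatic use, the same endpoint can be fetched with any HTTP client. A minimal Python sketch using only the standard library is shown below; it assumes the endpoint returns a JSON body, and because the response schema is not documented on this page, it prints the top-level structure rather than assuming field names.

```python
# Fetch the dataset from the public endpoint shown above (stdlib only).
# NOTE: the response schema is an assumption; inspect the raw JSON before
# relying on any particular field names.
import json
import urllib.request

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=prompt-engineering&subcategory=llm-evaluation-frameworks&limit=20"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# Dump a preview of the top-level structure so the real schema is visible.
print(json.dumps(data, indent=2)[:2000])
```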
| # | Tool | Description | Tier |
|---|---|---|---|
| 1 | microsoft/promptbench | A unified evaluation framework for large language models | Verified |
| 2 | uptrain-ai/uptrain | UpTrain is an open-source unified platform to evaluate and improve... | Established |
| 3 | microsoftarchive/promptbench | A unified evaluation framework for large language models | Emerging |
| 4 | gabe-mousa/Apolien | AI Safety Evaluation Library | Emerging |
| 5 | levitation-opensource/Manipulative-Expression-Recognition | MER is a software that identifies and highlights manipulative communication... | Emerging |
| 6 | PromptMixerDev/prompt-mixer-app-ce | A desktop application for comparing outputs from different Large Language... | Emerging |
| 7 | GSA/FedRAMP-OllaLab-Lean | The OllaLab-Lean project is designed to help both novice and experienced... | Emerging |
| 8 | babelcloud/LLM-RGB | LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios... | Emerging |
| 9 | ryoungj/ToolEmu | [ICLR'24 Spotlight] A language model (LM)-based emulation framework for... | Emerging |
| 10 | kiyoshisasano/llm-failure-atlas | A graph-based failure modeling and deterministic detection system for LLM... | Emerging |
| 11 | ozturkoktay/insurance-llm-framework | An interactive framework for experimenting with and evaluating open-source... | Emerging |
| 12 | syamsasi99/prompt-evaluator | prompt-evaluator is an open-source toolkit for evaluating, testing, and... | Emerging |
| 13 | fau-masters-collected-works-cgarbin/llm-comparison-tool | A tool to compare multiple large language models (LLMs) side by side | Experimental |
| 14 | realadeel/llm-test-bench | Compare LLM providers (OpenAI, Claude, Gemini) for vision tasks - benchmark... | Experimental |
| 15 | mary-lev/llm-ocr | LLM-powered OCR evaluation and correction package that supports multiple... | Experimental |
| 16 | pablo-chacon/Spoon-Bending | Educational analysis of LLM alignment, safety behavior, and... | Experimental |
| 17 | sidoody/heart-context-pack | Compiling the HEART Score into a structured, model-facing policy artifact... | Experimental |
| 18 | joshualamerton/Modelbench | Concept: benchmarking harness for prompts, models, and agent strategies | Experimental |
| 19 | SyntagmaNull/judgment-hygiene-stack | Tri-skill framework for structure routing, evidence discipline, and judgment... | Experimental |
| 20 | jameswniu/self-hosted-llm-evals-lab | Evaluation framework for self-hosted LLMs. Systematic prompt ablation... | Experimental |
| 21 | GnomeMan4201/drift-artifact | Stylometric drift experiment — documents that demonstrate iterative... | Experimental |
| 22 | lpr021/redteam-ai-benchmark | 🧪 Evaluate uncensored LLMs for offensive security with targeted questions... | Experimental |
| 23 | reiidoda/OpenRe | Open-source AI agent evaluation workbench for benchmarking, tracing,... | Experimental |
| 24 | aaddii09/llm-eval-harness | 🔍 Run efficient evaluations for prompt and LLM regression testing with this... | Experimental |
| 25 | AspenXDev/job-evaluation-engine | Modular prompt-engineered system for deterministic job evaluation with... | Experimental |
| 26 | MarcKarbowiak/ai-evaluation-harness | Production-minded evaluation harness for LLM features with structured... | Experimental |
| 27 | kogunlowo123/ai-evaluation-prompts | Prompt evaluation framework with accuracy, coherence, safety rubrics, and... | Experimental |
| 28 | kanupriya-GuptaM/llm-agreement-bias-benchmark | Benchmark framework for detecting agreement bias and answer instability in... | Experimental |
| 29 | paradite/eval-data | Prompts and evaluation data for LLMs on real world coding and writing tasks | Experimental |
| 30 | EviAmarates/fresta-edge | Domain evaluation lens generator built on the Fresta Lens Framework | Experimental |
| 31 | adityaarunsinghal/LLM-As-A-Judge-Prompt-Improver | Scientific framework for iterative LLM prompt improvement using... | Experimental |
| 32 | mohosy/OpenEvals | Open-source eval studio for prompt comparisons, regression tracking, and... | Experimental |
| 33 | MVidicek/evalkit | Test your prompts like you test your code. Regression testing for LLM applications. | Experimental |
| 34 | Amir-ElBelawy/llm-failure-mode-taxonomy | A practitioner's taxonomy of recurring failure patterns in large language... | Experimental |
| 35 | chirindaopensource/auditable_AI_agent_loop_for_empirical_economics | End-to-End Python implementation of Shin (2026)'s evaluator-locked agentic... | Experimental |
| 36 | deadbits/trs | 🔭 Threat report analysis via LLM and Vector DB | Experimental |
| 37 | hsieh89t-cloud/legal-agent-reliability-benchmark | Reliability and hallucination mitigation research for tool-augmented legal... | Experimental |
| 38 | hideyuki001/unified-cognitive-os-v1.8 | Judgment decomposition architecture for translation QA, ASR review, AI... | Experimental |
| 39 | kustonaut/llm-eval-kit | Quality scoring, eval suites, and regression detection for LLM outputs. | Experimental |
| 40 | kepiCHelaSHen/context-hacking | Turn LLM priors into scientific rigor. Zero-drift multi-agent framework for... | Experimental |
| 41 | IgnazioDS/evalops-workbench | A local-first evaluation harness for prompts, tools, and agents with... | Experimental |
| 42 | Chunduri-Aditya/Model-Behavior-Lab | Local Ollama-based LLM evaluation platform that benchmarks reasoning,... | Experimental |
| 43 | petersimmons1972/brutal-evaluation | AI skill for brutally honest project feedback. Based on Dylan Davis's BRUTAL... | Experimental |
| 44 | maxpetrusenko/llm-eval-notes | Public LLM evaluation artifacts: hallucination, brittleness, structured... | Experimental |
| 45 | Ravevx/LLM-Spatial-Reasoning-Evaluation-2D-Physics-Puzzle | A benchmark environment for evaluating large language models’ spatial... | Experimental |
| 46 | tpertner/squeeze | Squeeze your model with pressure prompts to see if its behavior leaks. | Experimental |
| 47 | michaelflppv/prompt-llm-benchmark | Prompt LLM Bench is a platform that discovers compatible Hugging Face models... | Experimental |
| 48 | hirbis/prompt-governance | Replication package for "Prompt Governance in Financial AI" (Girolli, 2026).... | Experimental |
| 49 | gwasiakshay/llm-eval-benchmark | LLM evaluation & benchmarking framework using LLM-as-a-judge scoring,... | Experimental |
| 50 | vivek8849/llm-trust-evaluator | A production-ready framework for evaluating LLM reliability using semantic... | Experimental |
| 51 | aleremfer/prompt-eval-cases | Prompt comparison and evaluation across multiple LLMs (EN/ES) | Experimental |
| 52 | aikenkyu001/semantic_roundtrip_benchmark_2 | This repository contains the primary contributions of our research paper, "A... | Experimental |
| 53 | firechair/AI-Engineering-Critique | 🚀 An interactive platform for LLM Preference Learning and Comparative... | Experimental |
| 54 | Philipnil06/ai-output-quality-lab | A structured experiment framework for prompt variation, evaluation, and... | Experimental |
| 55 | LeNguyenAnhKhoa/Hallucination-Detection | Hallucination Detection using LLM's API | Experimental |
| 56 | thuanystuart/DD3412-chain-of-verification-reproduction | Re-implementation of the paper "Chain-of-Verification Reduces Hallucination... | Experimental |
| 57 | r4u-dev/open-r4u | Optimize AI & Maximize ROI of your LLM tasks. Evaluates current state and... | Experimental |
| 58 | GTMVP/modal-llm-evaluator | Run 1,000 LLM evaluations in 10 minutes. Test prompts across Claude, GPT-4,... | Experimental |
| 59 | vihanga/prompt-sandbox | Testing framework for LLM prompts. Started as a weekend project after... | Experimental |
| 60 | aikenkyu001/benchmarking_llm_against_prompt_formats | Official experimental environment for 'Benchmarking LLM Sensitivity to... | Experimental |
| 61 | moses-shenassa/llm-prompt-framework-and-eval-suite | Prompt engineering framework + evaluation harness for LLM workflows... | Experimental |
| 62 | flamehaven01/CRoM-EfficientLLM | A Python toolkit to optimize LLM context by intelligently selecting,... | Experimental |
| 63 | antzedek/dar-quickfix | Runtime patch that kills LLM loops, drift & hallucinations in real-time –... | Experimental |
| 64 | lkilefner/llm-quality-evaluation-examples | K–12 LLM evaluation examples using teacher-centered ground truths, rubrics,... | Experimental |
| 65 | Codegrammer999/prompt-bench | This is a benchmark suite comparing zero-shot, few-shot, Chain-of-Thought,... | Experimental |
| 66 | FlosMume/LLM-Safety-Labs-Starter | Foundation for building safer generative-AI systems — includes example... | Experimental |
| 67 | rahul-sg/HondaResearchLabs_DSC180A-Eval-Systems-Of-NextGen-LLMs | Domain-aware LLM summary evaluation and iterative refinement pipeline with... | Experimental |
| 68 | ktjkc/reflextrust | 🧠 LLMs don’t just process text — they read the room. Meaning emerges through... | Experimental |
| 69 | sportixIndia/LBOS-LCAS-LP-Contradiction-tracker | 🔍 Track contradictions in AI and human content with LBOS-LCAS, enhancing... | Experimental |
| 70 | antsuebae/TFG-LLM-RE | TFG: Comparative evaluation of local vs. cloud LLMs in the engineering of... | Experimental |
| 71 | bensonbabu93/llm-prompt-evaluation-framework | A prompt experimentation tool that benchmarks LLM responses across multiple... | Experimental |
| 72 | YifanHe0126/medical-mllm-evaluation | Evaluation and model selection workflow for open-source multimodal LLMs in... | Experimental |
| 73 | AW-VB/llm-mcq-benchmark | Benchmarking open-weight LLMs on multiple-choice QA with prompt comparison,... | Experimental |
| 74 | rechriti/llm-risk-analysis | LLM-based risk analysis system using prompt engineering and evaluation (NDA-safe) | Experimental |
| 75 | rahulthadhani/llm-benchmark | A benchmark suite that tests how zero-shot, few-shot, chain-of-thought, and... | Experimental |
| 76 | illogical/LMEval | Web application for systematic prompt engineering and model evaluation | Experimental |
| 77 | jharter-stack/prompt-evals | prompt-evals — Prompt testing, comparisons, refinements, and failure cases | Experimental |
| 78 | gamzeakkurt/Prompt-Evaluation-in-AWS-Bedrock | Prompt evaluation framework using AWS Bedrock to assess LLM outputs with... | Experimental |
| 79 | wzy6642/I3C-Select | Official implementation for "Instructing Large Language Models to Identify... | Experimental |
| 80 | ghazal001/LLM-C-Grading-Agent | Ongoing LLM-based grading agent for automated evaluation of C++ programming... | Experimental |
| 81 | Ziechoes/reasoning-invariance-benchmark | Experiments testing whether LLM reasoning trajectories remain invariant when... | Experimental |
| 82 | useentropy/llmkit | LLM Kit - Python Large Language Model Kit for generating data of your choice | Experimental |
| 83 | BOSSMAN-dev89/LBOS-LCAS-LP-Contradiction-tracker | A tool for auditing bias through large language models | Experimental |
| 84 | rlin25/FrizzlesRubric | A modular system for automated, multi-metric AI prompt evaluation—featuring... | Experimental |
| 85 | chirindaopensource/llm_faithfulness_hallucination_misalignment_detection | End-to-End Python implementation of Semantic Divergence Metrics (SDM) for... | Experimental |
| 86 | yuchenzhu-research/iclr2026-cao-prompt-drift-lab | A reproducible evaluation framework for studying how small prompt variations... | Experimental |
| 87 | sergeyklay/factly | CLI tool to evaluate LLM factuality on MMLU benchmark. | Experimental |
| 88 | noah-art3mis/crucible | Develop better LLM apps by testing different models and prompts in bulk. | Experimental |
| 89 | GoodCODER280722/llm-output-validator | Rule-based AI output validation CLI tool (mock mode) with structured JSON reporting. | Experimental |
| 90 | jadhav045/DeepStack-AILM-Assignment | A strict, provider-agnostic User Input Validator powered exclusively by LLMs... | Experimental |
| 91 | SiemonCha/ECM3401-LLM-Essay-Scoring | Measuring semantic robustness in LLM-based CEFR essay scoring through... | Experimental |
| 92 | mtchynkstff/llm-ed-eval | A reproducible evaluation framework analyzing how prompt strategies affect... | Experimental |
| 93 | 1rajatk/content-judgment-calibrator | A judgment calibration framework for auditing content clarity, credibility,... | Experimental |
| 94 | Laksh-star/ai-fluency-gym | Educational AI fluency self-assessment inspired by the 4D framework, with... | Experimental |
| 95 | KSVQ/openrouter-harness | Lightweight OpenRouter evaluation harness with web UI, batch runs, and a... | Experimental |
| 96 | eugeniusms/TextualVerifier | LLM-Based Textual Verifier using Chain-of-Thought, Variant Generation, and... | Experimental |
| 97 | TheSkyBiz/llm-persona-drift-evaluation | 945-generation adversarial evaluation of 3 open LLMs across 3 personas and... | Experimental |
| 98 | motasemwed/llm-judge | LLM-as-a-Judge system for rubric-based, explainable evaluation of large... | Experimental |
| 99 | YaswanthGhanta/llm-logical-integrity-benchmark | Adversarial testing of LLMs on constraint satisfaction deadlocks | Experimental |
| 100 | OptionalSoftware/concurrent | The Multi-LLM Benchmarking Tool | Experimental |
| 101 | ghazaleh-mahmoodi/Prompting_LLMs_AS_Explainable_Metrics | Eval4NLP Shared Task on Prompting Large Language Models as Explainable Metrics | Experimental |