LLM Evaluation Benchmarking NLP Tools

Tools and frameworks for evaluating, benchmarking, and scoring large language model outputs across various dimensions (accuracy, reasoning, semantic understanding, consistency). Includes automated metrics, evaluation harnesses, and comparative testing frameworks. Does NOT include model training, fine-tuning, adaptation, or general NLP task evaluation unrelated to LLM assessment.

There are 120 LLM evaluation and benchmarking tools tracked. One scores above 70 (Verified tier). The highest-rated is google/langfun at 78/100, with 900 stars and 33,444 monthly downloads. One of the top 10 is actively maintained.

Get the projects as JSON (the sample request sets limit=20; 120 projects are tracked in total)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=llm-evaluation-benchmarking&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
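
The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library. The endpoint and the `domain`, `subcategory`, and `limit` query parameters are taken from the curl example; the function names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented on this page, so the body is returned undecoded beyond `json.loads`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="llm-evaluation-benchmarking", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """GET the URL and parse the body as JSON.

    The response schema is an unknown here, so the parsed object is
    returned as-is for the caller to inspect.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the same request as the curl command. Unauthenticated calls count against the 100 requests/day allowance stated above.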

#	Tool and description	Score	Tier	Stars	Language
1	google/langfun OO for LLMs	78	Verified	900	Python
2	tanaos/artifex Small Language Model Inference, Fine-Tuning and Observability. No GPU, no...	64	Established	90	Python
3	vulnerability-lookup/VulnTrain A tool to generate datasets and models based on vulnerabilities descriptions...	56	Established	23	Python
4	DataScienceUIBK/HintEval HintEval💡: A Comprehensive Framework for Hint Generation and Evaluation for Questions	53	Established	36	Python
5	microsoft/LMChallenge A library & tools to evaluate predictive language models.	53	Established	65	Python
6	preligens-lab/textnoisr Adding random noise to a text dataset, and controlling very accurately the...	53	Established	20	Python
7	masakhane-io/masakhane-mt Machine Translation for Africa	51	Established	312	Lua
8	EleanorJiang/BlonDe Official implementations for (1) BlonDe: An Automatic Evaluation Metric for...	50	Established	83	Python
9	Maluuba/nlg-eval Evaluation code for various unsupervised automated metrics for Natural...	49	Emerging	1,391	Python
10	disi-unibo-nlp/nlg-metricverse [COLING22] An End-to-End Library for Evaluating Natural Language Generation	48	Emerging	94	Python
11	feralvam/easse Easier Automatic Sentence Simplification Evaluation	47	Emerging	166	Roff
12	wasiahmad/PLBART Official code of our work, Unified Pre-training for Program Understanding...	46	Emerging	186	Python
13	gcunhase/NLPMetrics Python code for various NLP metrics	44	Emerging	169	Jupyter Notebook
14	olivettigroup/materials-synthesis-generative-models Public release of data and code for materials synthesis generation	44	Emerging	75	HTML
15	LIAAD/tieval An Evaluation Framework for Temporal Information Extraction Systems	43	Emerging	20	Python
16	Lambda-3/DiscourseSimplification Extension of the SentenceSimplification project	42	Emerging	61	Java
17	dataset-sh/slambda We turn instruction and examples into plain python function powered by LLM.	37	Emerging	3	Python
18	microsoft/Litmus AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems	37	Emerging	48	Python
19	abasirat/llm-adapter A plug-and-play adapter architecture that efficiently adapts large language...	37	Emerging	3	Python
20	IIIIQIIII/DramaBench A six-dimensional evaluation framework for drama script continuation with...	36	Emerging	84	HTML
21	Kyle-Ross/glyphdeck The glyphdeck library is a comprehensive toolkit designed to streamline &...	35	Emerging	2	Python
22	golsun/SpaceFusion NAACL'19: "Jointly Optimizing Diversity and Relevance in Neural Response Generation"	35	Emerging	73	Python
23	zjunlp/MemBase A Comprehensive Benchmarking Framework for Long-Term Conversational Memory Layers	34	Emerging	11	Python
24	Joinn99/RocketEval-ICLR 🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist	34	Emerging	15	Python
25	4AI/langml A Keras-based and TensorFlow-backend NLP Models Toolkit.	34	Emerging	12	Python
26	Sanqiang/text_simplification Text Simplification Model based on Encoder-Decoder (includes Transformer and...	34	Emerging	68	Python
27	psunlpgroup/ReaLMistake This repository includes a benchmark and code for the paper "Evaluating LLMs...	32	Emerging	31	Python
28	explosion/prodigy-openai-recipes ✨ Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3	32	Emerging	322	Python
29	bassrehab/spark-llm-eval Spark-native LLM evaluation framework with confidence intervals,...	31	Emerging	3	Python
30	VityaVitalich/TaxoLLaMA [ACL 2024] TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks	30	Emerging	19	Python
31	namwonss/Math-Solver Classifier for math word problems using deep learning	30	Emerging	11	Python
32	sileod/Discovery Mining Discourse Markers for Unsupervised Sentence Representation Learning	30	Emerging	61	Jupyter Notebook
33	2030NLP/SpaCE2021 中文空间语义理解评测	29	Experimental	39	Python
34	BM-K/KoMiniLM Korean Light Weight Language Model	29	Experimental	31	Python
35	doheejin/HiPAMA This repository is the implementation of the HiPAMA architecture, introduced...	29	Experimental	38	Python
36	rashad101/RoMe PyTorch code for ACL 2022 paper: RoMe: A Robust Metric for Evaluating...	29	Experimental	10	Python
37	SapienzaNLP/guardians-mt-eval Official repository of the ACL 2024 paper "Guardians of the Machine...	29	Experimental	10	Python
38	USC-FORTIS/NLP-ADBench [EMNLP Findings 2025]. NLP-ADBench is a comprehensive benchmarking tool...	29	Experimental	21	Python
39	Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks Repository for code underlying the paper 'Assessing the Impact of OCR...	27	Experimental	9	Jupyter Notebook
40	ksanu1998/static_analysis_codegen_llms This repository contains code base for project titled Leveraging static...	26	Experimental	5	HTML
41	IIT-DM/BattleofLLMs Benchmarks of LLMs with Conversational QA datasets.	26	Experimental	6	Python
42	JonnoB/training_lms_with_synthetic_data A repo for training Language models to correct errors in OCR text	26	Experimental	2	Python
43	roboalchemist/dynamic-baml Python library for dynamic BAML schema generation and LLM structured data...	25	Experimental	2	Python
44	zy-liu/POSSCORE This repo is for POSSCORE, an automatic evaluation metric for the...	25	Experimental	5	Python
45	feralvam/metaeval-simplification Meta-evaluation of automatic metrics in Text Simplification	25	Experimental	4	Jupyter Notebook
46	miserytale/Little_Language_Model LittleLM: A tiny character-level n-gram language model for local corpus...	25	Experimental	4	Python
47	JINO-ROHIT/tachyon a LLM inference engine to run on consumer hardware	25	Experimental	3	Python
48	doc-analysis/ReadingBank ReadingBank: A Benchmark Dataset for Reading Order Detection	25	Experimental	117	—
49	OSU-NLP-Group/SELM Symmetric Encryption with Language Models	25	Experimental	13	Python
50	davidheineman/salsa Success and Failure Linguistic Simplification Annotation 💃	25	Experimental	5	Python
51	language-brainscore/langbrainscore [Marked for Deprecation. please visit...	25	Experimental	5	Python
52	lmvasque/ts-explore Source code for Text Simplification Evaluation papers at ACL findings and...	24	Experimental	3	Python
53	subramanya1997/Novel-T5 We propose to use a mode that favors sentiment understanding and empathetic...	24	Experimental	3	Jupyter Notebook
54	koguma100/LLM_Prompt_Injection_Capstone Capstone project for WSU's computer science major with a focus on...	24	Experimental	2	Python
55	JonnoB/scrambledtext A python library for creating synthetic corrupted OCR text using a markov process	24	Experimental	9	Python
56	Lambda-3/SentenceSimplification Tool to simplify english sentences into their core and context sentences	23	Experimental	6	Java
57	greg2451/aggregating-text-similarity-metrics This repository consists of a benchmark of various text similarity measures...	23	Experimental	2	Jupyter Notebook
58	civillibertarian-stressincontinence617/llm-autoeval 🛠️ Simplify LLM evaluation with our Colab notebook; just name your model,...	23	Experimental	1	Python
59	RAravindDS/CharLLMs Implementing easy to use "Character Level Language Models" 🕺🏽	23	Experimental	6	Python
60	licphel/LLMe LLM trainer for personal computers.	23	Experimental	1	Python
61	saarus72/text_normalization T5-based (russian) text normalization	23	Experimental	26	Jupyter Notebook
62	liamcripwell/control_simp Code and resources for controllable simplification via operation classification.	23	Experimental	2	Jupyter Notebook
63	anto18671/lumenspark Lumenspark is a lightweight Linformer-based Language Model Trained from Scratch	22	Experimental	1	Python
64	Omg1221/search_evals 🔍 Evaluate web search APIs with our framework, testing accuracy and...	22	Experimental	—	Python
65	Kaito1999-script/ULMEvalKit 🛠️ Evaluate unified models effortlessly with ULMEvalKit, your open-source...	22	Experimental	—	Python
66	balajeekalyan/figureout FigureOut is a Python package allows developers to easily integrate LLM into...	22	Experimental	—	Python
67	devxiongmao/llm-scorecaster LLM-Scorecaster is a Python-based system designed to evaluate and analyze...	22	Experimental	—	Python
68	seclab-yonsei/mia-ko-lm Performing membership inference attack (MIA) against Korean language models (LMs).	22	Experimental	7	Python
69	11NOel11/ChaosBench-Logic Benchmark dataset and tooling for evaluating LLM logical reasoning and...	22	Experimental	3	Python
70	sileod/DiscSense Automated Semantic Analysis of Discourse Markers	21	Experimental	11	—
71	megagonlabs/holobench 🫧 Code for Holistic Reasoning with Long-Context LMs: A Benchmark for...	20	Experimental	12	Python
72	doheejin/SB_loss_PA This repository is the implementation of the paper, "Score-balanced Loss for...	19	Experimental	22	Python
73	gsbm/minilm A lightweight toolkit for experimenting with compact language models	19	Experimental	—	Python
74	lancopku/meSimp Codes for "Training Simplification and Model Simplification for Deep...	19	Experimental	18	C#
75	wa3dbk/llm-batch LLM Inference CLI - Batch inference with customizable templates	19	Experimental	—	Python
76	soldni/tokreate A minimal library to create tokens using LLMs.	18	Experimental	6	Python
77	idramalab/quantify-llm-explanations Evaluating Large Language Models for Detecting Antisemitism	18	Experimental	4	Python
78	chrischenhub/OnlySportsLM SOTA Sports-domain Language Model under Billion Parameters	18	Experimental	7	Python
79	rafaelsandroni/gpt3-data-labeling Data labeling using few shot learning GPT-3.	18	Experimental	25	Jupyter Notebook
80	princeton-nlp/blindfold-textgame [NAACL 2021] Reading and Acting while Blindfolded: The Need for Semantics in...	18	Experimental	11	Python
81	kaganhitit11/mergeval mergeval is a unified tool that lets you merge and evaluate large language...	17	Experimental	2	Python
82	yancong222/LMs-discourse-connectives-Surprisals On the Influence of Discourse Connectives on the Predictions of Humans and...	16	Experimental	1	R
83	yancong222/ClinicalNLP2024 Python code for LLMs surprisals and linear machine learning models	16	Experimental	1	Python
84	dsdanielpark/all-about-llm dsdanielpark's curation and categorization of resources on large language...	16	Experimental	14	Python
85	ossirytk/llm_resources Information and resources on everything related about running large language...	16	Experimental	4	—
86	ehs9nino/traffic-ocr-llm-benchmark Benchmark dataset for OCR + LLM document understanding in traffic and...	16	Experimental	1	—
87	BramVanroy/mai-simplification-nl-2023 Sentence-Level Text Simplification for Dutch	15	Experimental	6	Python
88	D0men1c0/Benchmark-Gemma-Models Highly customizable Python suite for LLM evaluation (Gemma, LLaMA+). Full...	15	Experimental	5	Python
89	ylkhayat/cocolex [ACL 2025] Codebase for CoCoLex	15	Experimental	6	Python
90	soualahmohammedzakaria/Fuzzy-LM Minimal implementation of a language model with fuzzy word matching.	15	Experimental	1	Python
91	somsubhra04/LLM_Legal_Prompt_Generation Data and codes for the EMNLP 2023 paper 'LLMs – the Good, the Bad or the...	14	Experimental	7	Python
92	OasisSimpDataset/OasisSimpDataset.github.io OasisSimp: An Open-source Asian-English Sentence Simplification Dataset	14	Experimental	—	HTML
93	alphadl/EasyBLEU An effective and simple tool to calculate SacreBLEU, Token-BLEU, BLEU w/...	14	Experimental	7	Shell
94	zircote/oolong-pairs Benchmark harness for A/B testing Claude Code plugins against OOLONG...	14	Experimental	3	Python
95	BetterAndBetterII/effimemo A Python package for managing large language model (LLM) context windows,...	14	Experimental	3	Python
96	VoxDroid/Zylthra Zylthra: A PyQt6 app to generate synthetic datasets with DataLLM.	14	Experimental	4	Python
97	audreycs/ImpScore A repository for paper ImpScore: A Learnable Metric For Quantifying The...	14	Experimental	7	Python
98	harvey-fin/absence-bench Code implementation for paper AbsenceBench: Language Models Can't Tell What's Missing	13	Experimental	18	Python
99	YecanLee/2BeOETG [ACL 2025 Workshop] Official PyTorch Implementation of "Towards Better...	13	Experimental	5	R
100	MChatzakis/ChatMGL ChatMGL: A Large Language Model Fine-tuned for Data Science Questions.	13	Experimental	5	Jupyter Notebook
101	ebarkhordar/voter-behavior-prediction-LLM This project explores the predictive power of large language models (LLMs)...	13	Experimental	6	Jupyter Notebook
102	bionlplab/isimp A sentence simplification system	13	Experimental	8	Java
103	baojunshan/nlg-metrics Natural language generation evaluation metrics	13	Experimental	6	Python
104	orionw/LM-expansions When do Generative Query and Document Expansions Fail? A Comprehensive Study...	13	Experimental	5	Python
105	alexfdez1010/ner-llm A system for doing NER using LLMs and LRMs	13	Experimental	6	Python
106	sashsinha/nqmp-bench NQMP is a tiny, deterministic llm benchmark focused on logical sensitivity...	12	Experimental	1	HTML
107	Kseymur/eltex-sheets-addon Google Sheets add-on for domain-driven synthetic data generation using LLMs.	12	Experimental	1	HTML
108	JonnoB/scrambledtext_analysis Can synthetic corrupted data be used to train LLM's to correct OCR text?	12	Experimental	1	Python
109	DFKI-NLP/LLMCheckup Code for the NAACL 2024 HCI+NLP Workshop paper "LLMCheckup: Conversational...	12	Experimental	13	Python
110	cx0/llm-typos Impact of typos and common misspellings on LLM task performance.	12	Experimental	19	Python
111	codingClaire/Structural-Code-Understanding A Survey of Deep Learning Models for Structural Code Understanding	11	Experimental	21	Python
112	sileod/pragmeval Discourse Based Evaluation of Language Understanding	11	Experimental	21	Jupyter Notebook
113	alok/llmvision Visualize how LLMs tokenize text - see the world through the eyes of language models	11	Experimental	—	Python
114	Haiku-Legal/legaleval LegalEval, high level framework for evaluation of legal LLMs and reasoning...	11	Experimental	—	—
115	icecola12/AgenticPOIBench-A-Realistic-Benchmark-for-Agentic-Spatiotemporal-Constrained-POI-Search AgenticPOIBench: A Realistic Benchmark for Agentic...	11	Experimental	—	—
116	pthompson8594/SemanticUTF8 UTF-8 language model compression achieving ~66% token reduction while...	11	Experimental	—	C#
117	glaciapag/locallm A simple Python package that lets you interact with a large language model...	10	Experimental	1	Python
118	daskol/lsp-lm Language Model as a Language Server	10	Experimental	1	Python
119	inteldict/CatEval tool for constituency parsing evaluation	10	Experimental	1	Python
120	erayyap/lats-for-ollama A primitive and an inefficient implementation of LATS for usage alongside...	10	Experimental	1	Jupyter Notebook
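
For readers who copy the table rather than call the API, the rows can be parsed with a few lines of Python. This is a sketch under stated assumptions: columns are tab-separated, the second column holds the repository name followed by a space and its description, and a dash (`—`) marks a missing value. The `Row` type and `parse_row` helper are hypothetical names, and the sample lines are three rows copied from the table above.

```python
from collections import Counter
from typing import NamedTuple, Optional

MISSING = "\u2014"  # the dash the table uses for missing stars or language


class Row(NamedTuple):
    rank: int
    repo: str
    description: str
    score: int
    tier: str
    stars: Optional[int]
    language: Optional[str]


def parse_row(line: str) -> Row:
    """Split one tab-separated table row into typed fields."""
    rank, tool, score, tier, stars, language = line.rstrip("\n").split("\t")
    # The tool column is "owner/name description..."; split at the first space.
    repo, _, description = tool.partition(" ")
    return Row(
        rank=int(rank),
        repo=repo,
        description=description,
        score=int(score),
        tier=tier,
        # Star counts may contain thousands separators, e.g. "1,391".
        stars=None if stars == MISSING else int(stars.replace(",", "")),
        language=None if language == MISSING else language,
    )


# Three rows copied from the table (ranks 1, 9, and 114).
SAMPLE = [
    "1\tgoogle/langfun OO for LLMs\t78\tVerified\t900\tPython",
    "9\tMaluuba/nlg-eval Evaluation code for various unsupervised automated "
    "metrics for Natural...\t49\tEmerging\t1,391\tPython",
    "114\tHaiku-Legal/legaleval LegalEval, high level framework for evaluation "
    "of legal LLMs and reasoning...\t11\tExperimental\t\u2014\t\u2014",
]

rows = [parse_row(line) for line in SAMPLE]
tier_counts = Counter(row.tier for row in rows)  # rows per tier in the sample
```

Run over all 120 rows, `tier_counts` gives the number of tools in each tier, and filtering on `row.score > 70` gives the count quoted in the summary.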