LLM Evaluation Benchmarking NLP Tools
Tools and frameworks for evaluating, benchmarking, and scoring large language model outputs across various dimensions (accuracy, reasoning, semantic understanding, consistency). Includes automated metrics, evaluation harnesses, and comparative testing frameworks. Does NOT include model training, fine-tuning, adaptation, or general NLP task evaluation unrelated to LLM assessment.
There are 120 LLM evaluation benchmarking tools tracked. One scores above 70 (Verified tier). The highest-rated is google/langfun at 78/100, with 900 stars and 33,444 monthly downloads. One of the top 10 is actively maintained.
Get all 120 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=llm-evaluation-benchmarking&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
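The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_url` and `fetch_projects` are hypothetical, the query parameters are copied from the curl command above, and the shape of the returned JSON is not documented here, so the code decodes the body without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="llm-evaluation-benchmarking", limit=20):
    """Assemble the query URL from the same parameters the curl command uses."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a GET request and decode the JSON body.

    The payload structure is an unknown here, so the decoded object
    is returned as-is for the caller to inspect.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network call, so it is left commented out):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:500])
```

Whether the endpoint accepts a `limit` larger than 20 is not stated above, so treat any other value as something to verify against the API itself.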
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | google/langfun - OO for LLMs | | Verified |
| 2 | tanaos/artifex - Small Language Model Inference, Fine-Tuning and Observability. No GPU, no... | | Established |
| 3 | vulnerability-lookup/VulnTrain - A tool to generate datasets and models based on vulnerabilities descriptions... | | Established |
| 4 | DataScienceUIBK/HintEval - HintEval💡: A Comprehensive Framework for Hint Generation and Evaluation for Questions | | Established |
| 5 | microsoft/LMChallenge - A library & tools to evaluate predictive language models. | | Established |
| 6 | preligens-lab/textnoisr - Adding random noise to a text dataset, and controlling very accurately the... | | Established |
| 7 | masakhane-io/masakhane-mt - Machine Translation for Africa | | Established |
| 8 | EleanorJiang/BlonDe - Official implementations for (1) BlonDe: An Automatic Evaluation Metric for... | | Established |
| 9 | Maluuba/nlg-eval - Evaluation code for various unsupervised automated metrics for Natural... | | Emerging |
| 10 | disi-unibo-nlp/nlg-metricverse - [COLING22] An End-to-End Library for Evaluating Natural Language Generation | | Emerging |
| 11 | feralvam/easse - Easier Automatic Sentence Simplification Evaluation | | Emerging |
| 12 | wasiahmad/PLBART - Official code of our work, Unified Pre-training for Program Understanding... | | Emerging |
| 13 | gcunhase/NLPMetrics - Python code for various NLP metrics | | Emerging |
| 14 | olivettigroup/materials-synthesis-generative-models - Public release of data and code for materials synthesis generation | | Emerging |
| 15 | LIAAD/tieval - An Evaluation Framework for Temporal Information Extraction Systems | | Emerging |
| 16 | Lambda-3/DiscourseSimplification - Extension of the SentenceSimplification project | | Emerging |
| 17 | dataset-sh/slambda - We turn instruction and examples into plain python function powered by LLM. | | Emerging |
| 18 | microsoft/Litmus - AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems | | Emerging |
| 19 | abasirat/llm-adapter - A plug-and-play adapter architecture that efficiently adapts large language... | | Emerging |
| 20 | IIIIQIIII/DramaBench - A six-dimensional evaluation framework for drama script continuation with... | | Emerging |
| 21 | Kyle-Ross/glyphdeck - The glyphdeck library is a comprehensive toolkit designed to streamline &... | | Emerging |
| 22 | golsun/SpaceFusion - NAACL'19: "Jointly Optimizing Diversity and Relevance in Neural Response Generation" | | Emerging |
| 23 | zjunlp/MemBase - A Comprehensive Benchmarking Framework for Long-Term Conversational Memory Layers | | Emerging |
| 24 | Joinn99/RocketEval-ICLR - 🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist | | Emerging |
| 25 | 4AI/langml - A Keras-based and TensorFlow-backend NLP Models Toolkit. | | Emerging |
| 26 | Sanqiang/text_simplification - Text Simplification Model based on Encoder-Decoder (includes Transformer and... | | Emerging |
| 27 | psunlpgroup/ReaLMistake - This repository includes a benchmark and code for the paper "Evaluating LLMs... | | Emerging |
| 28 | explosion/prodigy-openai-recipes - ✨ Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3 | | Emerging |
| 29 | bassrehab/spark-llm-eval - Spark-native LLM evaluation framework with confidence intervals,... | | Emerging |
| 30 | VityaVitalich/TaxoLLaMA - [ACL 2024] TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks | | Emerging |
| 31 | namwonss/Math-Solver - Classifier for math word problems using deep learning | | Emerging |
| 32 | sileod/Discovery - Mining Discourse Markers for Unsupervised Sentence Representation Learning | | Emerging |
| 33 | 2030NLP/SpaCE2021 - Chinese spatial semantic understanding evaluation | | Experimental |
| 34 | BM-K/KoMiniLM - Korean Light Weight Language Model | | Experimental |
| 35 | doheejin/HiPAMA - This repository is the implementation of the HiPAMA architecture, introduced... | | Experimental |
| 36 | rashad101/RoMe - PyTorch code for ACL 2022 paper: RoMe: A Robust Metric for Evaluating... | | Experimental |
| 37 | SapienzaNLP/guardians-mt-eval - Official repository of the ACL 2024 paper "Guardians of the Machine... | | Experimental |
| 38 | USC-FORTIS/NLP-ADBench - [EMNLP Findings 2025]. NLP-ADBench is a comprehensive benchmarking tool... | | Experimental |
| 39 | Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks - Repository for code underlying the paper 'Assessing the Impact of OCR... | | Experimental |
| 40 | ksanu1998/static_analysis_codegen_llms - This repository contains code base for project titled Leveraging static... | | Experimental |
| 41 | IIT-DM/BattleofLLMs - Benchmarks of LLMs with Conversational QA datasets. | | Experimental |
| 42 | JonnoB/training_lms_with_synthetic_data - A repo for training Language models to correct errors in OCR text | | Experimental |
| 43 | roboalchemist/dynamic-baml - Python library for dynamic BAML schema generation and LLM structured data... | | Experimental |
| 44 | zy-liu/POSSCORE - This repo is for POSSCORE, an automatic evaluation metric for the... | | Experimental |
| 45 | feralvam/metaeval-simplification - Meta-evaluation of automatic metrics in Text Simplification | | Experimental |
| 46 | miserytale/Little_Language_Model - LittleLM: A tiny character-level n-gram language model for local corpus... | | Experimental |
| 47 | JINO-ROHIT/tachyon - a LLM inference engine to run on consumer hardware | | Experimental |
| 48 | doc-analysis/ReadingBank - ReadingBank: A Benchmark Dataset for Reading Order Detection | | Experimental |
| 49 | OSU-NLP-Group/SELM - Symmetric Encryption with Language Models | | Experimental |
| 50 | davidheineman/salsa - Success and Failure Linguistic Simplification Annotation 💃 | | Experimental |
| 51 | language-brainscore/langbrainscore - [Marked for Deprecation. please visit... | | Experimental |
| 52 | lmvasque/ts-explore - Source code for Text Simplification Evaluation papers at ACL findings and... | | Experimental |
| 53 | subramanya1997/Novel-T5 - We propose to use a mode that favors sentiment understanding and empathetic... | | Experimental |
| 54 | koguma100/LLM_Prompt_Injection_Capstone - Capstone project for WSU's computer science major with a focus on... | | Experimental |
| 55 | JonnoB/scrambledtext - A python library for creating synthetic corrupted OCR text using a markov process | | Experimental |
| 56 | Lambda-3/SentenceSimplification - Tool to simplify english sentences into their core and context sentences | | Experimental |
| 57 | greg2451/aggregating-text-similarity-metrics - This repository consists of a benchmark of various text similarity measures... | | Experimental |
| 58 | civillibertarian-stressincontinence617/llm-autoeval - 🛠️ Simplify LLM evaluation with our Colab notebook; just name your model,... | | Experimental |
| 59 | RAravindDS/CharLLMs - Implementing easy to use "Character Level Language Models" 🕺🏽 | | Experimental |
| 60 | licphel/LLMe - LLM trainer for personal computers. | | Experimental |
| 61 | saarus72/text_normalization - T5-based (russian) text normalization | | Experimental |
| 62 | liamcripwell/control_simp - Code and resources for controllable simplification via operation classification. | | Experimental |
| 63 | anto18671/lumenspark - Lumenspark is a lightweight Linformer-based Language Model Trained from Scratch | | Experimental |
| 64 | Omg1221/search_evals - 🔍 Evaluate web search APIs with our framework, testing accuracy and... | | Experimental |
| 65 | Kaito1999-script/ULMEvalKit - 🛠️ Evaluate unified models effortlessly with ULMEvalKit, your open-source... | | Experimental |
| 66 | balajeekalyan/figureout - FigureOut is a Python package allows developers to easily integrate LLM into... | | Experimental |
| 67 | devxiongmao/llm-scorecaster - LLM-Scorecaster is a Python-based system designed to evaluate and analyze... | | Experimental |
| 68 | seclab-yonsei/mia-ko-lm - Performing membership inference attack (MIA) against Korean language models (LMs). | | Experimental |
| 69 | 11NOel11/ChaosBench-Logic - Benchmark dataset and tooling for evaluating LLM logical reasoning and... | | Experimental |
| 70 | sileod/DiscSense - Automated Semantic Analysis of Discourse Markers | | Experimental |
| 71 | megagonlabs/holobench - 🫧 Code for Holistic Reasoning with Long-Context LMs: A Benchmark for... | | Experimental |
| 72 | doheejin/SB_loss_PA - This repository is the implementation of the paper, "Score-balanced Loss for... | | Experimental |
| 73 | gsbm/minilm - A lightweight toolkit for experimenting with compact language models | | Experimental |
| 74 | lancopku/meSimp - Codes for "Training Simplification and Model Simplification for Deep... | | Experimental |
| 75 | wa3dbk/llm-batch - LLM Inference CLI - Batch inference with customizable templates | | Experimental |
| 76 | soldni/tokreate - A minimal library to create tokens using LLMs. | | Experimental |
| 77 | idramalab/quantify-llm-explanations - Evaluating Large Language Models for Detecting Antisemitism | | Experimental |
| 78 | chrischenhub/OnlySportsLM - SOTA Sports-domain Language Model under Billion Parameters | | Experimental |
| 79 | rafaelsandroni/gpt3-data-labeling - Data labeling using few shot learning GPT-3. | | Experimental |
| 80 | princeton-nlp/blindfold-textgame - [NAACL 2021] Reading and Acting while Blindfolded: The Need for Semantics in... | | Experimental |
| 81 | kaganhitit11/mergeval - mergeval is a unified tool that lets you merge and evaluate large language... | | Experimental |
| 82 | yancong222/LMs-discourse-connectives-Surprisals - On the Influence of Discourse Connectives on the Predictions of Humans and... | | Experimental |
| 83 | yancong222/ClinicalNLP2024 - Python code for LLMs surprisals and linear machine learning models | | Experimental |
| 84 | dsdanielpark/all-about-llm - dsdanielpark's curation and categorization of resources on large language... | | Experimental |
| 85 | ossirytk/llm_resources - Information and resources on everything related about running large language... | | Experimental |
| 86 | ehs9nino/traffic-ocr-llm-benchmark - Benchmark dataset for OCR + LLM document understanding in traffic and... | | Experimental |
| 87 | BramVanroy/mai-simplification-nl-2023 - Sentence-Level Text Simplification for Dutch | | Experimental |
| 88 | D0men1c0/Benchmark-Gemma-Models - Highly customizable Python suite for LLM evaluation (Gemma, LLaMA+). Full... | | Experimental |
| 89 | ylkhayat/cocolex - [ACL 2025] Codebase for CoCoLex | | Experimental |
| 90 | soualahmohammedzakaria/Fuzzy-LM - Minimal implementation of a language model with fuzzy word matching. | | Experimental |
| 91 | somsubhra04/LLM_Legal_Prompt_Generation - Data and codes for the EMNLP 2023 paper 'LLMs – the Good, the Bad or the... | | Experimental |
| 92 | OasisSimpDataset/OasisSimpDataset.github.io - OasisSimp: An Open-source Asian-English Sentence Simplification Dataset | | Experimental |
| 93 | alphadl/EasyBLEU - An effective and simple tool to calculate SacreBLEU, Token-BLEU, BLEU w/... | | Experimental |
| 94 | zircote/oolong-pairs - Benchmark harness for A/B testing Claude Code plugins against OOLONG... | | Experimental |
| 95 | BetterAndBetterII/effimemo - A Python package for managing large language model (LLM) context windows,... | | Experimental |
| 96 | VoxDroid/Zylthra - Zylthra: A PyQt6 app to generate synthetic datasets with DataLLM. | | Experimental |
| 97 | audreycs/ImpScore - A repository for paper ImpScore: A Learnable Metric For Quantifying The... | | Experimental |
| 98 | harvey-fin/absence-bench - Code implementation for paper AbsenceBench: Language Models Can't Tell What's Missing | | Experimental |
| 99 | YecanLee/2BeOETG - [ACL 2025 Workshop] Official PyTorch Implementation of "Towards Better... | | Experimental |
| 100 | MChatzakis/ChatMGL - ChatMGL: A Large Language Model Fine-tuned for Data Science Questions. | | Experimental |
| 101 | ebarkhordar/voter-behavior-prediction-LLM - This project explores the predictive power of large language models (LLMs)... | | Experimental |
| 102 | bionlplab/isimp - A sentence simplification system | | Experimental |
| 103 | baojunshan/nlg-metrics - Natural language generation evaluation metrics | | Experimental |
| 104 | orionw/LM-expansions - When do Generative Query and Document Expansions Fail? A Comprehensive Study... | | Experimental |
| 105 | alexfdez1010/ner-llm - A system for doing NER using LLMs and LRMs | | Experimental |
| 106 | sashsinha/nqmp-bench - NQMP is a tiny, deterministic llm benchmark focused on logical sensitivity... | | Experimental |
| 107 | Kseymur/eltex-sheets-addon - Google Sheets add-on for domain-driven synthetic data generation using LLMs. | | Experimental |
| 108 | JonnoB/scrambledtext_analysis - Can synthetic corrupted data be used to train LLM's to correct OCR text? | | Experimental |
| 109 | DFKI-NLP/LLMCheckup - Code for the NAACL 2024 HCI+NLP Workshop paper "LLMCheckup: Conversational... | | Experimental |
| 110 | cx0/llm-typos - Impact of typos and common misspellings on LLM task performance. | | Experimental |
| 111 | codingClaire/Structural-Code-Understanding - A Survey of Deep Learning Models for Structural Code Understanding | | Experimental |
| 112 | sileod/pragmeval - Discourse Based Evaluation of Language Understanding | | Experimental |
| 113 | alok/llmvision - Visualize how LLMs tokenize text - see the world through the eyes of language models | | Experimental |
| 114 | Haiku-Legal/legaleval - LegalEval, high level framework for evaluation of legal LLMs and reasoning... | | Experimental |
| 115 | icecola12/AgenticPOIBench-A-Realistic-Benchmark-for-Agentic-Spatiotemporal-Constrained-POI-Search - AgenticPOIBench: A Realistic Benchmark for Agentic... | | Experimental |
| 116 | pthompson8594/SemanticUTF8 - UTF-8 language model compression achieving ~66% token reduction while... | | Experimental |
| 117 | glaciapag/locallm - A simple Python package that lets you interact with a large language model... | | Experimental |
| 118 | daskol/lsp-lm - Language Model as a Language Server | | Experimental |
| 119 | inteldict/CatEval - tool for constituency parsing evaluation | | Experimental |
| 120 | erayyap/lats-for-ollama - A primitive and an inefficient implementation of LATS for usage alongside... | | Experimental |