LLM Evaluation Benchmarking NLP Tools

Tools and frameworks for evaluating, benchmarking, and scoring large language model outputs across various dimensions (accuracy, reasoning, semantic understanding, consistency). Includes automated metrics, evaluation harnesses, and comparative testing frameworks. Does NOT include model training, fine-tuning, adaptation, or general NLP task evaluation unrelated to LLM assessment.

There are 120 LLM evaluation benchmarking tools tracked. One scores above 70 (Verified tier). The highest-rated is google/langfun at 78/100, with 900 stars and 33,444 monthly downloads. One of the top 10 is actively maintained.

Get all 120 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=llm-evaluation-benchmarking&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
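The same request can be issued from Python. The following is a minimal sketch using only the standard library. The base URL and the `domain`, `subcategory`, and `limit` parameters are copied from the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented here, so the decoded payload is returned unchanged.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken verbatim from the curl example.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="llm-evaluation-benchmarking", limit=20):
    """Assemble the request URL from the query parameters in the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and decode the JSON body without assuming its schema."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` makes a live request and counts against the daily quota. The sample URL uses `limit=20`; whether a larger value returns all 120 projects in one call is not stated here, so treat that as something to verify.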

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | google/langfun | OO for LLMs | 78 | Verified |
| 2 | tanaos/artifex | Small Language Model Inference, Fine-Tuning and Observability. No GPU, no... | 64 | Established |
| 3 | vulnerability-lookup/VulnTrain | A tool to generate datasets and models based on vulnerabilities descriptions... | 56 | Established |
| 4 | DataScienceUIBK/HintEval | HintEval💡: A Comprehensive Framework for Hint Generation and Evaluation for Questions | 53 | Established |
| 5 | microsoft/LMChallenge | A library & tools to evaluate predictive language models. | 53 | Established |
| 6 | preligens-lab/textnoisr | Adding random noise to a text dataset, and controlling very accurately the... | 53 | Established |
| 7 | masakhane-io/masakhane-mt | Machine Translation for Africa | 51 | Established |
| 8 | EleanorJiang/BlonDe | Official implementations for (1) BlonDe: An Automatic Evaluation Metric for... | 50 | Established |
| 9 | Maluuba/nlg-eval | Evaluation code for various unsupervised automated metrics for Natural... | 49 | Emerging |
| 10 | disi-unibo-nlp/nlg-metricverse | [COLING22] An End-to-End Library for Evaluating Natural Language Generation | 48 | Emerging |
| 11 | feralvam/easse | Easier Automatic Sentence Simplification Evaluation | 47 | Emerging |
| 12 | wasiahmad/PLBART | Official code of our work, Unified Pre-training for Program Understanding... | 46 | Emerging |
| 13 | gcunhase/NLPMetrics | Python code for various NLP metrics | 44 | Emerging |
| 14 | olivettigroup/materials-synthesis-generative-models | Public release of data and code for materials synthesis generation | 44 | Emerging |
| 15 | LIAAD/tieval | An Evaluation Framework for Temporal Information Extraction Systems | 43 | Emerging |
| 16 | Lambda-3/DiscourseSimplification | Extension of the SentenceSimplification project | 42 | Emerging |
| 17 | dataset-sh/slambda | We turn instruction and examples into plain python function powered by LLM. | 37 | Emerging |
| 18 | microsoft/Litmus | AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems | 37 | Emerging |
| 19 | abasirat/llm-adapter | A plug-and-play adapter architecture that efficiently adapts large language... | 37 | Emerging |
| 20 | IIIIQIIII/DramaBench | A six-dimensional evaluation framework for drama script continuation with... | 36 | Emerging |
| 21 | Kyle-Ross/glyphdeck | The glyphdeck library is a comprehensive toolkit designed to streamline &... | 35 | Emerging |
| 22 | golsun/SpaceFusion | NAACL'19: "Jointly Optimizing Diversity and Relevance in Neural Response Generation" | 35 | Emerging |
| 23 | zjunlp/MemBase | A Comprehensive Benchmarking Framework for Long-Term Conversational Memory Layers | 34 | Emerging |
| 24 | Joinn99/RocketEval-ICLR | 🚀 [ICLR '25] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist | 34 | Emerging |
| 25 | 4AI/langml | A Keras-based and TensorFlow-backend NLP Models Toolkit. | 34 | Emerging |
| 26 | Sanqiang/text_simplification | Text Simplification Model based on Encoder-Decoder (includes Transformer and... | 34 | Emerging |
| 27 | psunlpgroup/ReaLMistake | This repository includes a benchmark and code for the paper "Evaluating LLMs... | 32 | Emerging |
| 28 | explosion/prodigy-openai-recipes | ✨ Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3 | 32 | Emerging |
| 29 | bassrehab/spark-llm-eval | Spark-native LLM evaluation framework with confidence intervals,... | 31 | Emerging |
| 30 | VityaVitalich/TaxoLLaMA | [ACL 2024] TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks | 30 | Emerging |
| 31 | namwonss/Math-Solver | Classifier for math word problems using deep learning | 30 | Emerging |
| 32 | sileod/Discovery | Mining Discourse Markers for Unsupervised Sentence Representation Learning | 30 | Emerging |
| 33 | 2030NLP/SpaCE2021 | Chinese Spatial Semantic Understanding Evaluation | 29 | Experimental |
| 34 | BM-K/KoMiniLM | Korean Light Weight Language Model | 29 | Experimental |
| 35 | doheejin/HiPAMA | This repository is the implementation of the HiPAMA architecture, introduced... | 29 | Experimental |
| 36 | rashad101/RoMe | PyTorch code for ACL 2022 paper: RoMe: A Robust Metric for Evaluating... | 29 | Experimental |
| 37 | SapienzaNLP/guardians-mt-eval | Official repository of the ACL 2024 paper "Guardians of the Machine... | 29 | Experimental |
| 38 | USC-FORTIS/NLP-ADBench | [EMNLP Findings 2025]. NLP-ADBench is a comprehensive benchmarking tool... | 29 | Experimental |
| 39 | Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks | Repository for code underlying the paper 'Assessing the Impact of OCR... | 27 | Experimental |
| 40 | ksanu1998/static_analysis_codegen_llms | This repository contains code base for project titled Leveraging static... | 26 | Experimental |
| 41 | IIT-DM/BattleofLLMs | Benchmarks of LLMs with Conversational QA datasets. | 26 | Experimental |
| 42 | JonnoB/training_lms_with_synthetic_data | A repo for training Language models to correct errors in OCR text | 26 | Experimental |
| 43 | roboalchemist/dynamic-baml | Python library for dynamic BAML schema generation and LLM structured data... | 25 | Experimental |
| 44 | zy-liu/POSSCORE | This repo is for POSSCORE, an automatic evaluation metric for the... | 25 | Experimental |
| 45 | feralvam/metaeval-simplification | Meta-evaluation of automatic metrics in Text Simplification | 25 | Experimental |
| 46 | miserytale/Little_Language_Model | LittleLM: A tiny character-level n-gram language model for local corpus... | 25 | Experimental |
| 47 | JINO-ROHIT/tachyon | a LLM inference engine to run on consumer hardware | 25 | Experimental |
| 48 | doc-analysis/ReadingBank | ReadingBank: A Benchmark Dataset for Reading Order Detection | 25 | Experimental |
| 49 | OSU-NLP-Group/SELM | Symmetric Encryption with Language Models | 25 | Experimental |
| 50 | davidheineman/salsa | Success and Failure Linguistic Simplification Annotation 💃 | 25 | Experimental |
| 51 | language-brainscore/langbrainscore | [Marked for Deprecation. please visit... | 25 | Experimental |
| 52 | lmvasque/ts-explore | Source code for Text Simplification Evaluation papers at ACL findings and... | 24 | Experimental |
| 53 | subramanya1997/Novel-T5 | We propose to use a mode that favors sentiment understanding and empathetic... | 24 | Experimental |
| 54 | koguma100/LLM_Prompt_Injection_Capstone | Capstone project for WSU's computer science major with a focus on... | 24 | Experimental |
| 55 | JonnoB/scrambledtext | A python library for creating synthetic corrupted OCR text using a markov process | 24 | Experimental |
| 56 | Lambda-3/SentenceSimplification | Tool to simplify english sentences into their core and context sentences | 23 | Experimental |
| 57 | greg2451/aggregating-text-similarity-metrics | This repository consists of a benchmark of various text similarity measures... | 23 | Experimental |
| 58 | civillibertarian-stressincontinence617/llm-autoeval | 🛠️ Simplify LLM evaluation with our Colab notebook; just name your model,... | 23 | Experimental |
| 59 | RAravindDS/CharLLMs | Implementing easy to use "Character Level Language Models" 🕺🏽 | 23 | Experimental |
| 60 | licphel/LLMe | LLM trainer for personal computers. | 23 | Experimental |
| 61 | saarus72/text_normalization | T5-based (russian) text normalization | 23 | Experimental |
| 62 | liamcripwell/control_simp | Code and resources for controllable simplification via operation classification. | 23 | Experimental |
| 63 | anto18671/lumenspark | Lumenspark is a lightweight Linformer-based Language Model Trained from Scratch | 22 | Experimental |
| 64 | Omg1221/search_evals | 🔍 Evaluate web search APIs with our framework, testing accuracy and... | 22 | Experimental |
| 65 | Kaito1999-script/ULMEvalKit | 🛠️ Evaluate unified models effortlessly with ULMEvalKit, your open-source... | 22 | Experimental |
| 66 | balajeekalyan/figureout | FigureOut is a Python package allows developers to easily integrate LLM into... | 22 | Experimental |
| 67 | devxiongmao/llm-scorecaster | LLM-Scorecaster is a Python-based system designed to evaluate and analyze... | 22 | Experimental |
| 68 | seclab-yonsei/mia-ko-lm | Performing membership inference attack (MIA) against Korean language models (LMs). | 22 | Experimental |
| 69 | 11NOel11/ChaosBench-Logic | Benchmark dataset and tooling for evaluating LLM logical reasoning and... | 22 | Experimental |
| 70 | sileod/DiscSense | Automated Semantic Analysis of Discourse Markers | 21 | Experimental |
| 71 | megagonlabs/holobench | 🫧 Code for Holistic Reasoning with Long-Context LMs: A Benchmark for... | 20 | Experimental |
| 72 | doheejin/SB_loss_PA | This repository is the implementation of the paper, "Score-balanced Loss for... | 19 | Experimental |
| 73 | gsbm/minilm | A lightweight toolkit for experimenting with compact language models | 19 | Experimental |
| 74 | lancopku/meSimp | Codes for "Training Simplification and Model Simplification for Deep... | 19 | Experimental |
| 75 | wa3dbk/llm-batch | LLM Inference CLI - Batch inference with customizable templates | 19 | Experimental |
| 76 | soldni/tokreate | A minimal library to create tokens using LLMs. | 18 | Experimental |
| 77 | idramalab/quantify-llm-explanations | Evaluating Large Language Models for Detecting Antisemitism | 18 | Experimental |
| 78 | chrischenhub/OnlySportsLM | SOTA Sports-domain Language Model under Billion Parameters | 18 | Experimental |
| 79 | rafaelsandroni/gpt3-data-labeling | Data labeling using few shot learning GPT-3. | 18 | Experimental |
| 80 | princeton-nlp/blindfold-textgame | [NAACL 2021] Reading and Acting while Blindfolded: The Need for Semantics in... | 18 | Experimental |
| 81 | kaganhitit11/mergeval | mergeval is a unified tool that lets you merge and evaluate large language... | 17 | Experimental |
| 82 | yancong222/LMs-discourse-connectives-Surprisals | On the Influence of Discourse Connectives on the Predictions of Humans and... | 16 | Experimental |
| 83 | yancong222/ClinicalNLP2024 | Python code for LLMs surprisals and linear machine learning models | 16 | Experimental |
| 84 | dsdanielpark/all-about-llm | dsdanielpark's curation and categorization of resources on large language... | 16 | Experimental |
| 85 | ossirytk/llm_resources | Information and resources on everything related about running large language... | 16 | Experimental |
| 86 | ehs9nino/traffic-ocr-llm-benchmark | Benchmark dataset for OCR + LLM document understanding in traffic and... | 16 | Experimental |
| 87 | BramVanroy/mai-simplification-nl-2023 | Sentence-Level Text Simplification for Dutch | 15 | Experimental |
| 88 | D0men1c0/Benchmark-Gemma-Models | Highly customizable Python suite for LLM evaluation (Gemma, LLaMA+). Full... | 15 | Experimental |
| 89 | ylkhayat/cocolex | [ACL 2025] Codebase for CoCoLex | 15 | Experimental |
| 90 | soualahmohammedzakaria/Fuzzy-LM | Minimal implementation of a language model with fuzzy word matching. | 15 | Experimental |
| 91 | somsubhra04/LLM_Legal_Prompt_Generation | Data and codes for the EMNLP 2023 paper 'LLMs – the Good, the Bad or the... | 14 | Experimental |
| 92 | OasisSimpDataset/OasisSimpDataset.github.io | OasisSimp: An Open-source Asian-English Sentence Simplification Dataset | 14 | Experimental |
| 93 | alphadl/EasyBLEU | An effective and simple tool to calculate SacreBLEU, Token-BLEU, BLEU w/... | 14 | Experimental |
| 94 | zircote/oolong-pairs | Benchmark harness for A/B testing Claude Code plugins against OOLONG... | 14 | Experimental |
| 95 | BetterAndBetterII/effimemo | A Python package for managing large language model (LLM) context windows,... | 14 | Experimental |
| 96 | VoxDroid/Zylthra | Zylthra: A PyQt6 app to generate synthetic datasets with DataLLM. | 14 | Experimental |
| 97 | audreycs/ImpScore | A repository for paper ImpScore: A Learnable Metric For Quantifying The... | 14 | Experimental |
| 98 | harvey-fin/absence-bench | Code implementation for paper AbsenceBench: Language Models Can't Tell What's Missing | 13 | Experimental |
| 99 | YecanLee/2BeOETG | [ACL 2025 Workshop] Official PyTorch Implementation of "Towards Better... | 13 | Experimental |
| 100 | MChatzakis/ChatMGL | ChatMGL: A Large Language Model Fine-tuned for Data Science Questions. | 13 | Experimental |
| 101 | ebarkhordar/voter-behavior-prediction-LLM | This project explores the predictive power of large language models (LLMs)... | 13 | Experimental |
| 102 | bionlplab/isimp | A sentence simplification system | 13 | Experimental |
| 103 | baojunshan/nlg-metrics | Natural language generation evaluation metrics | 13 | Experimental |
| 104 | orionw/LM-expansions | When do Generative Query and Document Expansions Fail? A Comprehensive Study... | 13 | Experimental |
| 105 | alexfdez1010/ner-llm | A system for doing NER using LLMs and LRMs | 13 | Experimental |
| 106 | sashsinha/nqmp-bench | NQMP is a tiny, deterministic llm benchmark focused on logical sensitivity... | 12 | Experimental |
| 107 | Kseymur/eltex-sheets-addon | Google Sheets add-on for domain-driven synthetic data generation using LLMs. | 12 | Experimental |
| 108 | JonnoB/scrambledtext_analysis | Can synthetic corrupted data be used to train LLM's to correct OCR text? | 12 | Experimental |
| 109 | DFKI-NLP/LLMCheckup | Code for the NAACL 2024 HCI+NLP Workshop paper "LLMCheckup: Conversational... | 12 | Experimental |
| 110 | cx0/llm-typos | Impact of typos and common misspellings on LLM task performance. | 12 | Experimental |
| 111 | codingClaire/Structural-Code-Understanding | A Survey of Deep Learning Models for Structural Code Understanding | 11 | Experimental |
| 112 | sileod/pragmeval | Discourse Based Evaluation of Language Understanding | 11 | Experimental |
| 113 | alok/llmvision | Visualize how LLMs tokenize text - see the world through the eyes of language models | 11 | Experimental |
| 114 | Haiku-Legal/legaleval | LegalEval, high level framework for evaluation of legal LLMs and reasoning... | 11 | Experimental |
| 115 | icecola12/AgenticPOIBench-A-Realistic-Benchmark-for-Agentic-Spatiotemporal-Constrained-POI-Search | AgenticPOIBench: A Realistic Benchmark for Agentic... | 11 | Experimental |
| 116 | pthompson8594/SemanticUTF8 | UTF-8 language model compression achieving ~66% token reduction while... | 11 | Experimental |
| 117 | glaciapag/locallm | A simple Python package that lets you interact with a large language model... | 10 | Experimental |
| 118 | daskol/lsp-lm | Language Model as a Language Server | 10 | Experimental |
| 119 | inteldict/CatEval | tool for constituency parsing evaluation | 10 | Experimental |
| 120 | erayyap/lats-for-ollama | A primitive and an inefficient implementation of LATS for usage alongside... | 10 | Experimental |
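The tier labels in this listing appear to follow score bands. The sketch below encodes the bands observed in the rows above: Established for 50 to 64, Emerging for 30 to 49, Experimental for 10 to 29, and Verified for the single score above 70. The function name `tier_for` is hypothetical and the exact cutoffs are an inference from this data, not a documented rule. No score between 65 and 70 appears in the listing, so placing that range in Established is a guess.

```python
def tier_for(score):
    """Map a 0-100 score to a tier label using bands inferred from the listing."""
    if score > 70:
        # The summary line says "above 70 (Verified tier)"; only 78 is observed.
        return "Verified"
    if score >= 50:
        # Observed Established scores run from 50 to 64; 65-70 is an assumption.
        return "Established"
    if score >= 30:
        # Observed Emerging scores run from 30 to 49.
        return "Emerging"
    # Observed Experimental scores run from 10 to 29.
    return "Experimental"
```

For instance, `tier_for(53)` returns `"Established"`, matching the rows for microsoft/LMChallenge and DataScienceUIBK/HintEval.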