LLM Benchmark Leaderboards Transformer Models

Comprehensive evaluation frameworks, benchmarks, and leaderboards for comparing LLM performance across diverse tasks and domains. Includes standardized metrics, multi-model comparisons, and scoring systems. Does NOT include performance profiling tools, inference optimization, or model training frameworks.

39 LLM benchmark leaderboard projects are tracked. The highest-rated is TsinghuaC3I/MARTI, which scores 46/100 and has 453 stars.

Get the projects as JSON (note that the example request sets limit=20, fewer than the 39 tracked)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-benchmark-leaderboards&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
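The curl request above can also be issued from a script. The following is a minimal sketch using only the Python standard library. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented on this page, so the sketch returns the decoded body without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int) -> str:
    """Assemble the same query string as the curl example (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url: str, timeout: float = 10.0):
    """Fetch the URL and decode the JSON body.

    The response schema is an unknown here, so the decoded object is
    returned as-is rather than indexed by assumed keys.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    url = build_url("transformers", "llm-benchmark-leaderboards", 20)
    print(url)
    # Uncomment to perform the network request (counts against the daily quota):
    # print(fetch_projects(url))
```

Keeping URL construction separate from the network call makes it possible to check the request string without spending any of the 100 keyless requests per day.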

#	Project (repository and description)	Score (out of 100)	Tier	Stars	Language
1	TsinghuaC3I/MARTI A Framework for LLM-based Multi-Agent Reinforced Training and Inference	46	Emerging	453	Python
2	tanyuqian/redco NAACL '24 (Best Demo Paper RunnerUp) / MlSys @ NeurIPS '23 - RedCoast: A...	44	Emerging	69	Python
3	zjunlp/KnowLM An Open-sourced Knowledgable Large Language Model Framework.	39	Emerging	1,376	Python
4	cli99/llm-analysis Latency and Memory Analysis of Transformer Models for Training and Inference	39	Emerging	479	Python
5	ariannamethod/chuck.optimizer Adam is blind. Chuck sees. Lee 4ever.	38	Emerging	4	C
6	ykjaat6104/LLM-Cost-and-Token-Efficiency-Analysis A benchmark study analyzing cost and token efficiency across 14 LLMs from 5...	34	Emerging	4	Jupyter Notebook
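7	stanleylsx/llms_tool A HuggingFace-based tool for training and testing large language models. Supports web UI and terminal inference for each model, plus low-parameter and full-parameter training (pre-training, SFT, RM, PPO, D...	33	Emerging	223	Python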
8	slp-rl/slamkit SlamKit is an open source tool kit for efficient training of SpeechLMs. It...	32	Emerging	229	Python
9	AdamCoscia/KnowledgeVIS Visually compare fill-in-the-blank LLM prompts to uncover learned biases and...	32	Emerging	7	JavaScript
10	Saivineeth147/llm-testlab Comprehensive Testing Tool for Large Language Models	31	Emerging	6	Python
11	whunextgen/LLMindCraft Shaping Language Models with Cognitive Insights	30	Emerging	15	Python
12	ccmdi/geobench GeoGuessr benchmark for language models	30	Emerging	51	Python
13	opendatalab/UrBench [AAAI 2025]This repo contains evaluation code for the paper “UrBench: A...	30	Emerging	36	Python
14	lechmazur/writing This benchmark tests how well LLMs incorporate a set of 10 mandatory story...	29	Experimental	353	Batchfile
15	AdrianBZG/LLM-distributed-finetune Tune efficiently any LLM model from HuggingFace using distributed training...	28	Experimental	60	Python
16	fboulnois/llm-leaderboard-csv CSVs of the Huggingface and LMArena LLM leaderboards, along with the code to...	27	Experimental	30	Python
17	swainshashwat/Flock Craft custom Language Model Models (LLMs) effortlessly using Flock. Build...	26	Experimental	4	Jupyter Notebook
18	euclaise/SlimTrainer Full finetuning of large language models without large memory requirements	24	Experimental	94	Python
19	aakasharya09/llm-leaderboard 📊 Compare LLM models effortlessly with our tool, showcasing performance...	22	Experimental	—	TypeScript
20	Exahia/llm-benchmark-fr LLM benchmarks on French business tasks: Mistral vs Llama vs Qwen vs DeepSeek	22	Experimental	—	Python
21	YousfiNahed/KoValPlus 🌍 Evaluate cultural and value alignment of LLMs with Korean responses using...	22	Experimental	—	Python
22	ayinedjimi/ModelBench Automated LLM Benchmarking on GPU - tokens/sec, latency percentiles, VRAM...	21	Experimental	—	Python
23	OFA-Sys/InsTag InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning	19	Experimental	285	—
24	AdamCoscia/iScore Upload, score, and visually compare multiple LLM-graded summaries simultaneously!	18	Experimental	3	JavaScript
25	koudounasalkis/UnSLU-BENCH This repo contains the code for <<"Alexa, can you forget me?” Machine...	16	Experimental	10	Python
26	bgonzalezbustamante/TextClass-Benchmark TextClass Benchmark Leaderboards	15	Experimental	—	Jupyter Notebook
27	rishi-banerjee1/blindbench Which LLM do you actually trust? Blind-test 100+ AI models with truth...	15	Experimental	1	JavaScript
28	manncodes/rlvr-gsm8k-benchmark Comprehensive benchmarking framework for RLVR/RLHF libraries on GSM8K...	15	Experimental	—	Python
29	ni-lab/guanine GUANinE Benchmark Dataset and Tools	15	Experimental	8	—
30	Phinchanbora/llm-evaluation 🎯 Benchmark LLMs effectively with over 10 tests and 108,000 real questions...	14	Experimental	—	Python
31	lechmazur/deception Benchmark evaluating LLMs on their ability to create and resist...	14	Experimental	32	—
32	EvilFreelancer/benchmarking-llms Comprehensive benchmarks and evaluations of Large Language Models (LLMs)...	12	Experimental	12	Python
33	its-not-rocket-science/mnemosyne An autonomous, distributed knowledge discovery agent combining LLMs and...	12	Experimental	1	Python
34	lechmazur/divergent LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words...	11	Experimental	35	—
35	samidala/polyglot-llm-benchmark A production-ready system to benchmark local LLM inference performance with...	11	Experimental	—	Python
36	kldzj/vllm-transformers5 This repository provides a Docker image for vLLM with transformers>=5.0.0rc0...	11	Experimental	—	Dockerfile
37	procesaur/PaLMA Web Application for textual evaluation and generation using transformers.	10	Experimental	1	Python
38	AdiKsOnDev/PrivateFalcon Use Falcon 7B L.L.M. to privately query your documents. No data leaks	10	Experimental	1	Python
39	BjornMelin/llm-gpu-optimization 🚄 Advanced LLM optimization techniques using CUDA. Features efficient...	10	Experimental	1	—
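For readers who want to work with the table above directly, the sketch below parses one row into typed fields. It assumes the columns are tab-separated, that the repository name is the first space-delimited token of the second column, and that a dash placeholder means the value is missing. The `Row` class and `parse_row` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MISSING = "\u2014"  # the dash placeholder the table uses for absent values


@dataclass
class Row:
    rank: int
    repo: str
    description: str
    score: int
    tier: str
    stars: Optional[int]
    language: Optional[str]


def parse_row(line: str) -> Row:
    """Split one tab-separated table row into typed fields."""
    rank, project, score, tier, stars, language = line.rstrip("\n").split("\t")
    # The project column holds "owner/name description..."; split on the first space.
    repo, _, description = project.partition(" ")
    return Row(
        rank=int(rank),
        repo=repo,
        description=description,
        score=int(score),
        tier=tier,
        # Star counts may contain thousands separators, e.g. "1,376".
        stars=None if stars == MISSING else int(stars.replace(",", "")),
        language=None if language == MISSING else language,
    )


if __name__ == "__main__":
    sample = "12\tccmdi/geobench GeoGuessr benchmark for language models\t30\tEmerging\t51\tPython"
    print(parse_row(sample))
```

Parsed rows can then be filtered or grouped, for example by `tier` or by whether `stars` is present.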