LLM Benchmark Leaderboards Transformer Models
Comprehensive evaluation frameworks, benchmarks, and leaderboards for comparing LLM performance across diverse tasks and domains. Includes standardized metrics, multi-model comparisons, and scoring systems. Does NOT include performance profiling tools, inference optimization, or model training frameworks.
There are 39 LLM benchmark leaderboard projects tracked. The highest-rated is TsinghuaC3I/MARTI, which scores 46/100 and has 453 stars.
Get all 39 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-benchmark-leaderboards&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
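The request above can also be made from a script. The following is a minimal Python sketch, assuming only the endpoint and the three query parameters shown in the curl command. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented here, so the sketch returns the decoded payload without inspecting it. The sample request uses `limit=20`; whether the endpoint accepts larger values or paginates is not stated, so `limit` is left as a parameter.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain, subcategory, limit):
    """Build the query URL (hypothetical helper, not part of any official client)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10.0):
    """Fetch the URL and decode the body as JSON.

    The response schema is an unknown here, so the decoded
    object is returned as-is for the caller to inspect.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Same parameters as the curl example; no network call is made here.
    print(build_url("transformers", "llm-benchmark-leaderboards", 20))
```

Calling `fetch_projects(build_url("transformers", "llm-benchmark-leaderboards", 20))` would issue the same request as the curl command and count against the daily request quota.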
| # | Model | Score | Tier |
|---|---|---|---|
| 1 | TsinghuaC3I/MARTI: A Framework for LLM-based Multi-Agent Reinforced Training and Inference | | Emerging |
| 2 | tanyuqian/redco: NAACL '24 (Best Demo Paper RunnerUp) / MlSys @ NeurIPS '23 - RedCoast: A... | | Emerging |
| 3 | zjunlp/KnowLM: An Open-sourced Knowledgeable Large Language Model Framework. | | Emerging |
| 4 | cli99/llm-analysis: Latency and Memory Analysis of Transformer Models for Training and Inference | | Emerging |
| 5 | ariannamethod/chuck.optimizer: Adam is blind. Chuck sees. Lee 4ever. | | Emerging |
| 6 | ykjaat6104/LLM-Cost-and-Token-Efficiency-Analysis: A benchmark study analyzing cost and token efficiency across 14 LLMs from 5... | | Emerging |
| 7 | stanleylsx/llms_tool: A tool for training and testing large language models, built on HuggingFace. Supports web UI and terminal prediction for each model, plus low-parameter and full-parameter training (pre-training, SFT, RM, PPO, D... | | Emerging |
| 8 | slp-rl/slamkit: SlamKit is an open source tool kit for efficient training of SpeechLMs. It... | | Emerging |
| 9 | AdamCoscia/KnowledgeVIS: Visually compare fill-in-the-blank LLM prompts to uncover learned biases and... | | Emerging |
| 10 | Saivineeth147/llm-testlab: Comprehensive Testing Tool for Large Language Models | | Emerging |
| 11 | whunextgen/LLMindCraft: Shaping Language Models with Cognitive Insights | | Emerging |
| 12 | ccmdi/geobench: GeoGuessr benchmark for language models | | Emerging |
| 13 | opendatalab/UrBench: [AAAI 2025] This repo contains evaluation code for the paper “UrBench: A... | | Emerging |
| 14 | lechmazur/writing: This benchmark tests how well LLMs incorporate a set of 10 mandatory story... | | Experimental |
| 15 | AdrianBZG/LLM-distributed-finetune: Tune efficiently any LLM model from HuggingFace using distributed training... | | Experimental |
| 16 | fboulnois/llm-leaderboard-csv: CSVs of the Huggingface and LMArena LLM leaderboards, along with the code to... | | Experimental |
| 17 | swainshashwat/Flock: Craft custom Language Model Models (LLMs) effortlessly using Flock. Build... | | Experimental |
| 18 | euclaise/SlimTrainer: Full finetuning of large language models without large memory requirements | | Experimental |
| 19 | aakasharya09/llm-leaderboard: 📊 Compare LLM models effortlessly with our tool, showcasing performance... | | Experimental |
| 20 | Exahia/llm-benchmark-fr: LLM benchmarks on French business tasks: Mistral vs Llama vs Qwen vs DeepSeek | | Experimental |
| 21 | YousfiNahed/KoValPlus: 🌍 Evaluate cultural and value alignment of LLMs with Korean responses using... | | Experimental |
| 22 | ayinedjimi/ModelBench: Automated LLM Benchmarking on GPU - tokens/sec, latency percentiles, VRAM... | | Experimental |
| 23 | OFA-Sys/InsTag: InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning | | Experimental |
| 24 | AdamCoscia/iScore: Upload, score, and visually compare multiple LLM-graded summaries simultaneously! | | Experimental |
| 25 | koudounasalkis/UnSLU-BENCH: This repo contains the code for "Alexa, can you forget me?" Machine... | | Experimental |
| 26 | bgonzalezbustamante/TextClass-Benchmark: TextClass Benchmark Leaderboards | | Experimental |
| 27 | rishi-banerjee1/blindbench: Which LLM do you actually trust? Blind-test 100+ AI models with truth... | | Experimental |
| 28 | manncodes/rlvr-gsm8k-benchmark: Comprehensive benchmarking framework for RLVR/RLHF libraries on GSM8K... | | Experimental |
| 29 | ni-lab/guanine: GUANinE Benchmark Dataset and Tools | | Experimental |
| 30 | Phinchanbora/llm-evaluation: 🎯 Benchmark LLMs effectively with over 10 tests and 108,000 real questions... | | Experimental |
| 31 | lechmazur/deception: Benchmark evaluating LLMs on their ability to create and resist... | | Experimental |
| 32 | EvilFreelancer/benchmarking-llms: Comprehensive benchmarks and evaluations of Large Language Models (LLMs)... | | Experimental |
| 33 | its-not-rocket-science/mnemosyne: An autonomous, distributed knowledge discovery agent combining LLMs and... | | Experimental |
| 34 | lechmazur/divergent: LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words... | | Experimental |
| 35 | samidala/polyglot-llm-benchmark: A production-ready system to benchmark local LLM inference performance with... | | Experimental |
| 36 | kldzj/vllm-transformers5: This repository provides a Docker image for vLLM with transformers>=5.0.0rc0... | | Experimental |
| 37 | procesaur/PaLMA: Web Application for textual evaluation and generation using transformers. | | Experimental |
| 38 | AdiKsOnDev/PrivateFalcon: Use Falcon 7B L.L.M. to privately query your documents. No data leaks | | Experimental |
| 39 | BjornMelin/llm-gpu-optimization: 🚄 Advanced LLM optimization techniques using CUDA. Features efficient... | | Experimental |