LLM Benchmark Leaderboards: Transformer Models

Comprehensive evaluation frameworks, benchmarks, and leaderboards for comparing LLM performance across diverse tasks and domains. Includes standardized metrics, multi-model comparisons, and scoring systems. Does NOT include performance profiling tools, inference optimization, or model training frameworks.

39 LLM benchmark leaderboard projects are tracked. The highest-rated is TsinghuaC3I/MARTI at 46/100 with 453 stars.

Get the projects as JSON (the request below sets limit=20; 39 projects are tracked in total)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-benchmark-leaderboards&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
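The same request can be made from a script. The following is a minimal Python sketch using only the standard library. It assumes the endpoint returns JSON, as stated above, and makes no assumption about the shape of the response. The names `build_url` and `fetch_projects` are illustrative, and whether `limit` accepts values above 20 is not confirmed by this page.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="transformers", subcategory="llm-benchmark-leaderboards", limit=20):
    """Build the same query URL used in the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON response.

    The response schema is not documented in this listing, so the
    decoded object is returned as-is (assumption: the body is UTF-8 JSON).
    """
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Each call to `fetch_projects()` counts against the keyless quota of 100 requests/day, so it is worth caching the result locally rather than polling.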

# Model Score Tier
1 TsinghuaC3I/MARTI

A Framework for LLM-based Multi-Agent Reinforced Training and Inference

46
Emerging
2 tanyuqian/redco

NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A...

44
Emerging
3 zjunlp/KnowLM

An Open-sourced Knowledgeable Large Language Model Framework.

39
Emerging
4 cli99/llm-analysis

Latency and Memory Analysis of Transformer Models for Training and Inference

39
Emerging
5 ariannamethod/chuck.optimizer

Adam is blind. Chuck sees. Lee 4ever.

38
Emerging
6 ykjaat6104/LLM-Cost-and-Token-Efficiency-Analysis

A benchmark study analyzing cost and token efficiency across 14 LLMs from 5...

34
Emerging
7 stanleylsx/llms_tool

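A HuggingFace-based tool for training and testing large language models. Supports web UI and terminal inference for each model, plus parameter-efficient and full-parameter training (pretraining, SFT, RM, PPO, D...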

33
Emerging
8 slp-rl/slamkit

SlamKit is an open source tool kit for efficient training of SpeechLMs. It...

32
Emerging
9 AdamCoscia/KnowledgeVIS

Visually compare fill-in-the-blank LLM prompts to uncover learned biases and...

32
Emerging
10 Saivineeth147/llm-testlab

Comprehensive Testing Tool for Large Language Models

31
Emerging
11 whunextgen/LLMindCraft

Shaping Language Models with Cognitive Insights

30
Emerging
12 ccmdi/geobench

GeoGuessr benchmark for language models

30
Emerging
13 opendatalab/UrBench

[AAAI 2025] This repo contains evaluation code for the paper “UrBench: A...

30
Emerging
14 lechmazur/writing

This benchmark tests how well LLMs incorporate a set of 10 mandatory story...

29
Experimental
15 AdrianBZG/LLM-distributed-finetune

Efficiently tune any LLM from HuggingFace using distributed training...

28
Experimental
16 fboulnois/llm-leaderboard-csv

CSVs of the Huggingface and LMArena LLM leaderboards, along with the code to...

27
Experimental
17 swainshashwat/Flock

Craft custom Large Language Models (LLMs) effortlessly using Flock. Build...

26
Experimental
18 euclaise/SlimTrainer

Full finetuning of large language models without large memory requirements

24
Experimental
19 aakasharya09/llm-leaderboard

📊 Compare LLM models effortlessly with our tool, showcasing performance...

22
Experimental
20 Exahia/llm-benchmark-fr

LLM benchmarks on French business tasks: Mistral vs Llama vs Qwen vs DeepSeek

22
Experimental
21 YousfiNahed/KoValPlus

🌍 Evaluate cultural and value alignment of LLMs with Korean responses using...

22
Experimental
22 ayinedjimi/ModelBench

Automated LLM Benchmarking on GPU - tokens/sec, latency percentiles, VRAM...

21
Experimental
23 OFA-Sys/InsTag

InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning

19
Experimental
24 AdamCoscia/iScore

Upload, score, and visually compare multiple LLM-graded summaries simultaneously!

18
Experimental
25 koudounasalkis/UnSLU-BENCH

This repo contains the code for <<"Alexa, can you forget me?" Machine...

16
Experimental
26 bgonzalezbustamante/TextClass-Benchmark

TextClass Benchmark Leaderboards

15
Experimental
27 rishi-banerjee1/blindbench

Which LLM do you actually trust? Blind-test 100+ AI models with truth...

15
Experimental
28 manncodes/rlvr-gsm8k-benchmark

Comprehensive benchmarking framework for RLVR/RLHF libraries on GSM8K...

15
Experimental
29 ni-lab/guanine

GUANinE Benchmark Dataset and Tools

15
Experimental
30 Phinchanbora/llm-evaluation

🎯 Benchmark LLMs effectively with over 10 tests and 108,000 real questions...

14
Experimental
31 lechmazur/deception

Benchmark evaluating LLMs on their ability to create and resist...

14
Experimental
32 EvilFreelancer/benchmarking-llms

Comprehensive benchmarks and evaluations of Large Language Models (LLMs)...

12
Experimental
33 its-not-rocket-science/mnemosyne

An autonomous, distributed knowledge discovery agent combining LLMs and...

12
Experimental
34 lechmazur/divergent

LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words...

11
Experimental
35 samidala/polyglot-llm-benchmark

A production-ready system to benchmark local LLM inference performance with...

11
Experimental
36 kldzj/vllm-transformers5

This repository provides a Docker image for vLLM with transformers>=5.0.0rc0...

11
Experimental
37 procesaur/PaLMA

Web Application for textual evaluation and generation using transformers.

10
Experimental
38 AdiKsOnDev/PrivateFalcon

Use the Falcon 7B LLM to privately query your documents. No data leaks.

10
Experimental
39 BjornMelin/llm-gpu-optimization

🚄 Advanced LLM optimization techniques using CUDA. Features efficient...

10
Experimental
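
In the listing above, every project scoring 30 or higher is labeled Emerging and every project scoring 29 or lower is labeled Experimental. The Python sketch below encodes that observed split. The threshold of 30 is inferred from these 39 rows only; it is not a documented scoring rule and says nothing about any tiers above Emerging. `infer_tier` is a hypothetical helper, not part of any API.

```python
# Threshold inferred from this listing only (assumption, not a documented rule).
EMERGING_MIN_SCORE = 30


def infer_tier(score):
    """Return the tier label observed in this listing for a 0-100 score."""
    return "Emerging" if score >= EMERGING_MIN_SCORE else "Experimental"


# A few (project, score, tier) rows copied from the listing as a sanity check.
SAMPLE_ROWS = [
    ("TsinghuaC3I/MARTI", 46, "Emerging"),
    ("opendatalab/UrBench", 30, "Emerging"),
    ("lechmazur/writing", 29, "Experimental"),
    ("BjornMelin/llm-gpu-optimization", 10, "Experimental"),
]

# Collect any sample rows whose listed tier disagrees with the inferred one.
mismatches = [name for name, score, tier in SAMPLE_ROWS if infer_tier(score) != tier]
print(mismatches)
```

An empty list means the inferred threshold agrees with the sampled rows. If the site changes its tier boundaries, the constant would need to be re-derived from fresh data.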