THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)
LongBench v2 comprises 503 expert-curated multiple-choice questions with contexts ranging from 8k to 2M words across six task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue understanding, code repository understanding, and structured-data understanding. The evaluation harness uses vLLM for efficient inference serving with configurable tensor parallelism, and supports Chain-of-Thought reasoning and RAG-augmented testing modes. The benchmark exposes a gap between standard inference (50.1% accuracy for the best model) and reasoning-enhanced approaches such as o1-preview (57.7%), suggesting that scaling inference-time reasoning compute is critical for deep long-context understanding.
1,113 stars. No commits in the last 6 months.
Stars: 1,113
Forks: 120
Language: Python
License: MIT
Category:
Last pushed: Jan 15, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/THUDM/LongBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
LarHope/ollama-benchmark
Ollama-based benchmark reporting detailed I/O tokens per second. Written in Python, with a DeepSeek R1 example.
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
qcri/LLMeBench
Benchmarking Large Language Models
microsoft/LLF-Bench
A benchmark for evaluating learning agents based on just language feedback