THUDM/LongBench

LongBench v2 and LongBench (ACL '25 & '24)

Score: 45 / 100 (Emerging)

Comprises 503 expert-curated multiple-choice questions with contexts ranging from 8k to 2M words across six task categories: single- and multi-document QA, long in-context learning, long-dialogue understanding, code repository understanding, and structured-data understanding. The evaluation harness uses vLLM for efficient inference serving with configurable tensor parallelism, and supports Chain-of-Thought reasoning and RAG-augmented testing modes. The benchmark exposes a performance gap between standard inference (50.1% accuracy for the best model) and reasoning-enhanced approaches such as o1-preview (57.7%), suggesting that scaling inference-time reasoning is critical for deep long-context understanding.

1,113 stars. No commits in the last 6 months.

Flags: Stale (6 months), No Package, No Dependents
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 19 / 25
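The four subscores appear to sum to the headline score, as a quick check shows (this is an observation from the numbers above, not documented scoring behavior):

```python
# Subscores as listed above, each out of 25.
subscores = {
    "maintenance": 0,
    "adoption": 10,
    "maturity": 16,
    "community": 19,
}

# Summing the four 0-25 subscores reproduces the overall 0-100 score.
overall = sum(subscores.values())
print(overall)  # 45, matching the 45 / 100 headline score
```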


Stars: 1,113
Forks: 120
Language: Python
License: MIT
Last pushed: Jan 15, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/THUDM/LongBench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.