g4ix/advLab1-HITS
Project for an advanced lab investigating LLM benchmarks from an information-retrieval (IR) perspective. Instead of focusing on model performance, we evaluated benchmark robustness: which questions truly differentiate models, and whether leaderboard rankings reflect real differences or are dominated by easy, high-hubness items.
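The core idea is item-discrimination analysis: score each benchmark question by how well it separates strong from weak models. The sketch below is a hypothetical illustration of that idea, not code from this repo; it assumes a binary models-by-questions correctness matrix and uses point-biserial correlation, so questions with near-zero scores are the ones that fail to differentiate models.

```python
# Hypothetical sketch (not this repo's actual code): rate each question
# by how strongly its per-model correctness correlates with overall
# model ability (point-biserial correlation).
import numpy as np

def item_discrimination(correct: np.ndarray) -> np.ndarray:
    """correct: (n_models, n_questions) binary correctness matrix."""
    totals = correct.mean(axis=1)                 # overall score per model
    q = correct - correct.mean(axis=0)            # center each question
    t = totals - totals.mean()                    # center model abilities
    denom = np.sqrt((q ** 2).sum(axis=0)) * np.sqrt((t ** 2).sum())
    # Guard against zero variance (e.g. a question everyone gets right).
    return (q * t[:, None]).sum(axis=0) / np.where(denom == 0, 1, denom)

# Toy usage with random data; real input would be leaderboard results.
rng = np.random.default_rng(0)
scores = item_discrimination(rng.integers(0, 2, size=(10, 50)).astype(float))
print(scores.round(2))
```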
No commits in the last 6 months.
Stars: 1
Forks: —
Language: Jupyter Notebook
License: —
Category: —
Last pushed: Oct 10, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/g4ix/advLab1-HITS"
Open to everyone: 100 requests/day with no key. A free API key raises the limit to 1,000/day.
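For scripted access, the same endpoint can be queried from Python. A minimal sketch using only the standard library is below; the response schema isn't documented on this page, so the parsed JSON is simply pretty-printed.

```python
# Minimal sketch: fetch the quality data shown by the curl command above.
# Unauthenticated use is limited to 100 requests/day (1,000 with a free key).
import json
from urllib.request import urlopen

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/g4ix/advLab1-HITS"

with urlopen(URL, timeout=10) as resp:
    data = json.load(resp)          # schema undocumented here; inspect it
print(json.dumps(data, indent=2))
```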
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed I/O tokens-per-second reporting; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)