kolenaIO/autoarena

Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation

/ 100

Emerging

Ranks systems by running pairwise, head-to-head comparisons scored by multiple LLM judges (OpenAI, Anthropic, Cohere, or custom) and converting the results into Elo ratings, rather than scoring each response with isolated metrics; the approach is grounded in research showing that head-to-head judgments are more reliable. Uses a "Panel of LLMs" (PoLL) approach, in which several diverse, smaller models judge simultaneously, for better accuracy at lower cost than a single frontier model. Provides a local web UI with SQLite storage, CSV-based prompt/response ingestion, and support for custom judges that connect to internal services.
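To make the ranking idea concrete, here is a minimal sketch of how a panel of judge votes on one head-to-head comparison could feed an Elo update. It is not AutoArena's actual implementation: the function names, the majority-vote aggregation, the starting rating of 1000, and the K-factor of 32 are all assumptions chosen for illustration.

```python
from collections import Counter


def panel_vote(votes):
    """Aggregate judge votes ('A', 'B', or '-' for a tie) by simple majority.

    Majority voting is an assumption; a real panel might weight judges.
    """
    if not votes:
        raise ValueError("at least one judge vote is required")
    ranked = Counter(votes).most_common()
    # If the two most common outcomes are equally frequent, call it a tie.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "-"
    return ranked[0][0]


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, winner, k=32.0):
    """Update both ratings in place after one comparison (k is hypothetical)."""
    score_a = {"A": 1.0, "B": 0.0, "-": 0.5}[winner]
    delta = k * (score_a - expected_score(ratings[model_a], ratings[model_b]))
    ratings[model_a] += delta
    ratings[model_b] -= delta


# Hypothetical models, both starting from an assumed rating of 1000.
ratings = {"model-x": 1000.0, "model-y": 1000.0}
winner = panel_vote(["A", "A", "B"])  # three judges, two prefer response A
update_elo(ratings, "model-x", "model-y", winner)
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating across models stays constant, and repeating the update over many comparisons yields a leaderboard ordering.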

108 stars. No commits in the last 6 months.

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 9 / 25

Maturity 16 / 25

Community 12 / 25

Stars

108

Forks

Language

TypeScript

License

Apache-2.0

Higher-rated alternatives

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation...

Kareem-Rashed/rubric-eval

Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.

izam-mohammed/ragrank

🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it...

justplus/llm-eval

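A large language model evaluation platform that supports multiple evaluation benchmarks, custom datasets, and performance testing. Supports RAG evaluation based on custom datasets.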

relari-ai/continuous-eval

Data-Driven Evaluation for LLM-Powered Applications
