justplus/llm-eval
An evaluation platform for large language models, supporting multiple evaluation benchmarks, custom datasets, and performance testing. Also supports RAG evaluation on custom datasets.
Provides LLM-agnostic evaluation across multiple task formats (QA, MCQ, RAG), with built-in LLM-as-a-judge scoring via the Ragas framework for RAG pipelines and customizable Jinja2 templates for domain-specific metrics. It also includes concurrent performance stress-testing with latency/throughput analysis, multi-model management through a unified API configuration, and result export, all surfaced in a DaisyUI-based web UI with real-time task status updates.
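To illustrate the templated LLM-as-a-judge approach, here is a minimal Python sketch that renders a judge prompt from a Jinja2 template. The template text and the variable names (question, context, answer) are hypothetical and are not taken from this repository; the project's own templates and metric definitions may differ.

    from jinja2 import Template

    # Hypothetical judge-prompt template; variable names are illustrative only.
    JUDGE_TEMPLATE = Template(
        "You are grading an answer produced by a RAG pipeline.\n"
        "Question: {{ question }}\n"
        "Retrieved context: {{ context }}\n"
        "Candidate answer: {{ answer }}\n"
        "Rate faithfulness to the context on a 1-5 scale and justify briefly."
    )

    prompt = JUDGE_TEMPLATE.render(
        question="What license does llm-eval use?",
        context="The repository is published under the MIT license.",
        answer="It is MIT licensed.",
    )
    print(prompt)  # send the rendered prompt to the judge model of your choice

Because rendering is deterministic, the same template can be reused across models and datasets when comparing judge behavior.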
No commits in the last 6 months.
Stars: 82
Forks: 18
Language: Python
License: MIT
Category:
Last pushed: Aug 20, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/justplus/llm-eval"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
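If you prefer Python over curl, a minimal sketch using the requests library is shown below; the response schema is whatever the API returns, so inspect the JSON before relying on any particular field.

    import requests

    # Same endpoint as the curl example above; up to 100 requests/day without a key.
    URL = "https://pt-edge.onrender.com/api/v1/quality/rag/justplus/llm-eval"

    resp = requests.get(URL, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    print(data)  # field names depend on the API; check them before use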
Higher-rated alternatives
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation...
Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
izam-mohammed/ragrank
🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it...
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
Addepto/contextcheck
MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via YAML and integrable...