Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
Provides first-class agent evaluation that goes beyond final outputs, assessing tool calls, execution traces, latency, and task completion. It requires zero dependencies and integrates natively with pytest, positioning it as a neutral, MIT-licensed alternative to company-owned frameworks. Any LLM can serve as a callable judge (OpenAI, Anthropic, Ollama, or local models), and optional metrics cover semantic similarity, ROUGE scoring, and cost tracking, with results exportable to local HTML dashboards or to JSON for CI/CD pipelines.
Available on PyPI.
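The sketch below is a minimal illustration of the advertised pytest integration and callable-judge model; the my_agent and llm_judge functions and the scoring threshold are illustrative stand-ins, not the project's documented API.

def my_agent(prompt: str) -> str:
    # Stand-in for the agent or LLM call under evaluation.
    return "The capital of France is Paris."

def llm_judge(question: str, answer: str, criterion: str) -> float:
    # Any callable can act as the judge (OpenAI, Anthropic, Ollama, or a local
    # model); this stub returns a score in [0, 1] for demonstration only.
    return 1.0 if "Paris" in answer else 0.0

def test_capital_question():
    question = "What is the capital of France?"
    answer = my_agent(question)
    score = llm_judge(question, answer, criterion="Names the correct city.")
    # Native pytest integration: a plain assertion gates the CI pipeline.
    assert score >= 0.8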
Stars: 5
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Mar 26, 2026
Monthly downloads: 215
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/Kareem-Rashed/rubric-eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
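The same data can be pulled from a script or CI job; the sketch below assumes the endpoint returns JSON and simply prints the payload, since the exact response schema is not shown here.

import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/Kareem-Rashed/rubric-eval"

# Plain GET within the free 100 requests/day tier; no API key is attached here.
with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Dump the raw payload; adapt field access once the schema is known.
print(json.dumps(data, indent=2))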
Related tools
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation...
izam-mohammed/ragrank
🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it...
justplus/llm-eval
A large language model evaluation platform supporting multiple evaluation benchmarks, custom datasets, and performance testing. Also supports RAG evaluation on custom datasets.
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
Addepto/contextcheck
MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via YAML and integrable...