Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
Provides first-class agent evaluation that goes beyond final outputs, assessing tool calls, execution traces, latency, and task completion. It requires zero dependencies and integrates natively with pytest, positioning it as a neutral, MIT-licensed alternative to company-owned frameworks. Any LLM can serve as a callable judge (OpenAI, Anthropic, Ollama, or local models), and optional metrics cover semantic similarity, ROUGE scoring, and cost tracking, with results exportable to local HTML dashboards or to JSON for CI/CD pipelines.
Available on PyPI.
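The sketch below is a minimal illustration of the advertised pytest integration and callable-judge model; the my_agent and llm_judge functions and the scoring threshold are illustrative stand-ins, not the project's documented API.

def my_agent(prompt: str) -> str:
    # Stand-in for the agent or LLM call under evaluation.
    return "The capital of France is Paris."

def llm_judge(question: str, answer: str, criterion: str) -> float:
    # Any callable can act as the judge (OpenAI, Anthropic, Ollama, or a local
    # model); this stub returns a score in [0, 1] for demonstration only.
    return 1.0 if "Paris" in answer else 0.0

def test_capital_question():
    question = "What is the capital of France?"
    answer = my_agent(question)
    score = llm_judge(question, answer, criterion="Names the correct city.")
    # Native pytest integration: a plain assertion gates the CI pipeline.
    assert score >= 0.8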
Stars: 5
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Mar 26, 2026
Monthly downloads: 215
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/Kareem-Rashed/rubric-eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
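The same data can be pulled from a script or CI job; the sketch below assumes the endpoint returns JSON and simply prints the payload, since the exact response schema is not shown here.

import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/Kareem-Rashed/rubric-eval"

# Plain GET within the free 100 requests/day tier; no API key is attached here.
with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# Dump the raw payload; adapt field access once the schema is known.
print(json.dumps(data, indent=2))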
Related tools
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation...
izam-mohammed/ragrank
🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it...
justplus/llm-eval
A large language model evaluation platform supporting multiple evaluation benchmarks, custom datasets, and performance testing. Also supports RAG evaluation on custom datasets.
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
Addepto/contextcheck
MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via YAML and integrable...