Kareem-Rashed/rubric-eval

Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.

Score: 52 / 100 (Established)

Provides first-class agent evaluation that goes beyond final outputs, assessing tool calls, execution traces, latency, and task completion. It requires zero dependencies, integrates natively with pytest, and positions itself as a neutral, MIT-licensed alternative to company-owned frameworks. Any LLM can serve as a callable judge (OpenAI, Anthropic, Ollama, or a local model), and optional metrics cover semantic similarity, ROUGE scoring, and cost tracking. Results export to local HTML dashboards or to JSON for CI/CD pipelines.
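The "any LLM as a callable judge" design means a judge is just a function mapping an output and a rubric to a score. Below is a minimal generic sketch of that pattern; it does not use rubric-eval's actual API, and the judge, rubric, and evaluate names are illustrative assumptions only.

# Generic sketch of the LLM-as-judge pattern: a judge is any callable
# mapping (output, rubric) -> score. Names are illustrative assumptions;
# this is NOT rubric-eval's documented API.
from typing import Callable

Judge = Callable[[str, str], float]

def keyword_judge(output: str, rubric: str) -> float:
    """Toy deterministic judge: fraction of comma-separated rubric keywords present."""
    keywords = [w.strip().lower() for w in rubric.split(",") if w.strip()]
    hits = sum(1 for w in keywords if w in output.lower())
    return hits / len(keywords) if keywords else 0.0

def evaluate(output: str, rubric: str, judge: Judge, threshold: float = 0.5) -> bool:
    """Pass/fail decision suitable for a pytest assertion."""
    return judge(output, rubric) >= threshold

# An API-backed model (OpenAI, Anthropic, Ollama, local) slots in by
# wrapping its client in a callable with the same signature.
if __name__ == "__main__":
    answer = "The capital of France is Paris, on the Seine."
    assert evaluate(answer, "paris, seine, france", keyword_judge)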

Available on PyPI.

Dependents: 0
Maintenance: 13 / 25
Adoption: 9 / 25
Maturity: 18 / 25
Community: 12 / 25
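The overall score is the sum of the four 25-point subscores: 13 + 9 + 18 + 12 = 52 out of a possible 100.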


Stars: 5
Forks: 1
Language: Python
License: MIT
Last pushed: Mar 26, 2026
Monthly downloads: 215
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/Kareem-Rashed/rubric-eval"

Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.
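For programmatic access, the same endpoint can be queried from Python with only the standard library; a minimal sketch follows. It assumes the endpoint returns JSON (the exact field names are not documented here), so it simply parses and pretty-prints whatever comes back.

# Fetch the quality data from the endpoint shown in the curl example.
# Assumption: the response body is JSON; the schema is not documented
# here, so we just pretty-print the payload.
import json
import urllib.request

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/Kareem-Rashed/rubric-eval"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))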