rubric-eval and llm-eval-bench
These are competitors offering overlapping local evaluation capabilities for LLMs and RAG systems. rubric-eval provides the more mature, battle-tested framework, while llm-eval-bench appears to be an earlier-stage alternative focused specifically on RAG workflow evaluation.
About rubric-eval
Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
Provides first-class agent evaluation beyond final outputs, assessing tool calls, execution traces, latency, and task completion. It ships with zero required dependencies and native pytest integration, positioning itself as a neutral, MIT-licensed alternative to company-owned frameworks. Any LLM can serve as a callable judge (OpenAI, Anthropic, Ollama, or local models), optional metrics cover semantic similarity, ROUGE scoring, and cost tracking, and results export to local HTML dashboards or to JSON for CI/CD pipelines.
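The "any LLM as a callable judge" design can be sketched roughly as follows. This is an illustrative assumption of how such a pattern typically works, not rubric-eval's actual API; the function names, rubric format, and stub judge are all hypothetical:

```python
# Hypothetical sketch of the LLM-as-a-callable-judge pattern; names and
# prompt format are illustrative, not taken from rubric-eval itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeResult:
    score: float   # 0.0-1.0 rubric score parsed from the judge's reply
    passed: bool   # convenience flag for pytest-style assertions

def judge_output(output: str, rubric: str,
                 judge: Callable[[str], str],
                 threshold: float = 0.7) -> JudgeResult:
    """Ask any callable judge (an OpenAI, Anthropic, Ollama, or local-model
    wrapper) to score `output` against `rubric`, expecting a bare 0-1 number."""
    prompt = (
        f"Rubric: {rubric}\n"
        f"Candidate answer: {output}\n"
        "Reply with only a score between 0 and 1."
    )
    score = float(judge(prompt).strip())
    return JudgeResult(score=score, passed=score >= threshold)

# A stub judge stands in for a real model call, which keeps the sketch
# runnable offline and shows why "any callable" is a flexible interface.
def stub_judge(prompt: str) -> str:
    return "0.9" if "Paris" in prompt else "0.1"

result = judge_output(
    "The capital of France is Paris.",
    "Answer must correctly name the capital of France.",
    judge=stub_judge,
)
```

Because the judge is just a callable, the same evaluation code works inside a pytest test body, with `assert result.passed` as the test's outcome.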
About llm-eval-bench
piog/llm-eval-bench
Evaluation harness for prompts, structured outputs, and RAG workflows