rubric-eval and llm-eval-bench

rubric-eval and llm-eval-bench are competing tools with overlapping capabilities for local evaluation of LLMs and RAG systems. rubric-eval provides the more mature, battle-tested framework, while llm-eval-bench appears to be an earlier-stage alternative focused specifically on RAG workflow evaluation.

Overall scores are the sum of four 25-point sub-scores: 13 + 9 + 18 + 12 = 52 for rubric-eval ("Established") and 13 + 0 + 9 + 0 = 22 for llm-eval-bench ("Experimental").

| Metric | rubric-eval | llm-eval-bench |
| --- | --- | --- |
| Overall score | 52 (Established) | 22 (Experimental) |
| Maintenance | 13/25 | 13/25 |
| Adoption | 9/25 | 0/25 |
| Maturity | 18/25 | 9/25 |
| Community | 12/25 | 0/25 |
| Stars | 5 | |
| Forks | 1 | |
| Downloads | 215 | |
| Commits (30d) | 0 | 0 |
| Language | Python | Python |
| License | MIT | MIT |
| Dependents | none | none (no package published) |

About rubric-eval

Kareem-Rashed/rubric-eval

Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.

Provides first-class agent evaluation that goes beyond final outputs, assessing tool calls, execution traces, latency, and task completion, with zero required dependencies and native pytest integration, as a neutral, MIT-licensed alternative to company-owned frameworks. It supports any LLM as a callable judge (OpenAI, Anthropic, Ollama, or local models) and includes optional metrics for semantic similarity, ROUGE scoring, and cost tracking; results can be exported to local HTML dashboards or to JSON for CI/CD pipelines.
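
This comparison doesn't show rubric-eval's actual API, so the following is only a rough sketch of the two ideas described above, a callable judge and native pytest integration. Every identifier in it (`Judge`, `keyword_judge`, `evaluate`) is invented for illustration, not taken from the library.

```python
# Sketch only: these names are NOT rubric-eval's documented API.
from typing import Callable

# A "callable judge" maps (prompt, response, rubric) to a score in [0, 1];
# in practice this callable could wrap OpenAI, Anthropic, Ollama, or a local model.
Judge = Callable[[str, str, str], float]

def keyword_judge(prompt: str, response: str, rubric: str) -> float:
    """Toy offline judge: fraction of comma-separated rubric keywords present."""
    keywords = [w.strip().lower() for w in rubric.split(",") if w.strip()]
    hits = sum(1 for kw in keywords if kw in response.lower())
    return hits / len(keywords) if keywords else 0.0

def evaluate(judge: Judge, prompt: str, response: str, rubric: str,
             threshold: float = 0.5) -> bool:
    """Minimal harness step: run the judge and apply a pass/fail threshold."""
    return judge(prompt, response, rubric) >= threshold

# "Native pytest integration" here just means evaluations are ordinary tests.
def test_capital_question():
    assert evaluate(
        keyword_judge,
        prompt="What is the capital of France?",
        response="The capital of France is Paris.",
        rubric="paris, capital",
    )
```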

About llm-eval-bench

piog/llm-eval-bench

Evaluation harness for prompts, structured outputs, and RAG workflows.
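
llm-eval-bench's interface isn't documented in this comparison either; as a sketch of what a harness for structured outputs and RAG workflows typically checks, the snippet below validates a JSON response and applies a naive grounding test. All names are invented for illustration.

```python
# Illustration only: not llm-eval-bench's actual API.
import json

def check_structured_output(raw: str, required_keys: set[str]) -> bool:
    """Structured-output check: response must parse as JSON with required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def check_rag_grounding(answer: str, retrieved_chunks: list[str]) -> bool:
    """Naive grounding check: the answer shares words with a retrieved chunk."""
    answer_words = set(answer.lower().split())
    return any(answer_words & set(chunk.lower().split()) for chunk in retrieved_chunks)

if __name__ == "__main__":
    raw = '{"answer": "Paris", "sources": ["doc1"]}'
    chunks = ["Paris is the capital of France."]
    print(check_structured_output(raw, {"answer", "sources"}))  # True
    print(check_rag_grounding("Paris", chunks))                 # True
```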

Scores updated daily from GitHub, PyPI, and npm data.