rubric-eval and llm-eval-bench
These are competitors offering overlapping local evaluation capabilities for LLMs and RAG systems. rubric-eval provides the more mature, battle-tested framework, while llm-eval-bench appears to be an earlier-stage alternative focused specifically on RAG workflow evaluation.
About rubric-eval
Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
Provides first-class agent evaluation beyond final outputs, assessing tool calls, execution traces, latency, and task completion. It ships with zero required dependencies and native pytest integration, positioning itself as a neutral, MIT-licensed alternative to company-owned frameworks. Any LLM can serve as a callable judge (OpenAI, Anthropic, Ollama, or local models), optional metrics cover semantic similarity, ROUGE scoring, and cost tracking, and results export to local HTML dashboards or to JSON for CI/CD pipelines.
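The "any LLM as a callable judge" design can be sketched roughly as follows. This is an illustrative assumption of how such a pattern typically works, not rubric-eval's actual API; the function names, rubric format, and stub judge are all hypothetical:

```python
# Hypothetical sketch of the LLM-as-a-callable-judge pattern; names and
# prompt format are illustrative, not taken from rubric-eval itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeResult:
    score: float   # 0.0-1.0 rubric score parsed from the judge's reply
    passed: bool   # convenience flag for pytest-style assertions

def judge_output(output: str, rubric: str,
                 judge: Callable[[str], str],
                 threshold: float = 0.7) -> JudgeResult:
    """Ask any callable judge (an OpenAI, Anthropic, Ollama, or local-model
    wrapper) to score `output` against `rubric`, expecting a bare 0-1 number."""
    prompt = (
        f"Rubric: {rubric}\n"
        f"Candidate answer: {output}\n"
        "Reply with only a score between 0 and 1."
    )
    score = float(judge(prompt).strip())
    return JudgeResult(score=score, passed=score >= threshold)

# A stub judge stands in for a real model call, which keeps the sketch
# runnable offline and shows why "any callable" is a flexible interface.
def stub_judge(prompt: str) -> str:
    return "0.9" if "Paris" in prompt else "0.1"

result = judge_output(
    "The capital of France is Paris.",
    "Answer must correctly name the capital of France.",
    judge=stub_judge,
)
```

Because the judge is just a callable, the same evaluation code works inside a pytest test body, with `assert result.passed` as the test's outcome.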
About llm-eval-bench
piog/llm-eval-bench
Evaluation harness for prompts, structured outputs, and RAG workflows