ai-evaluation and agentrial

These are **competitors**: both provide evaluation frameworks for AI agents, but agentrial focuses on rigorous statistical testing of agent behavior, while ai-evaluation positions itself as a broader workflow evaluation platform. Users would likely select one based on whether they prioritize statistical rigor or evaluation breadth.

| Metric | ai-evaluation | agentrial |
|---|---|---|
| Score | 57 (Established) | 49 (Emerging) |
| Maintenance | 13/25 | 10/25 |
| Adoption | 9/25 | 11/25 |
| Maturity | 15/25 | 18/25 |
| Community | 20/25 | 10/25 |
| Stars | 84 | 15 |
| Forks | 29 | 2 |
| Downloads | n/a | 222 |
| Commits (30d) | 0 | 0 |
| Language | Python | Python |
| License | GPL-3.0 | MIT |
| Risk flags | No package, no dependents | No risk flags |

About ai-evaluation

future-agi/ai-evaluation

Evaluation Framework for all your AI related Workflows

Supports 50+ built-in metrics (faithfulness, toxicity, hallucination detection, RAG quality scoring), LLM-as-Judge augmentation via Gemini/GPT/Claude, and guardrail scanners that detect jailbreaks, prompt injection, and leaked secrets in under 10ms. It integrates with distributed task backends (Celery, Ray, Temporal, Kubernetes), feedback loops via ChromaDB, and OpenTelemetry tracing for production observability.
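
The sub-10ms guardrail figure is plausible because such scanners are typically fast pattern screens rather than model calls. As a conceptual sketch only (this is not ai-evaluation's API; the pattern set and function name are hypothetical), a minimal regex-based secrets scanner works like this:

```python
import re
import time

# Hypothetical patterns for illustration; NOT taken from ai-evaluation.
# The technique: cheap regex screening of text before it leaves the app.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"\b(?:api|secret)[_-]?key\s*[:=]\s*\S{16,}", re.IGNORECASE),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_for_secrets(text: str) -> list[str]:
    """Return the names of every secret pattern that matches `text`."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]

if __name__ == "__main__":
    prompt = "Please debug this: aws_key = AKIAABCDEFGHIJKLMNOP"
    start = time.perf_counter()
    hits = scan_for_secrets(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # A handful of compiled regexes runs comfortably under 10ms.
    print(f"flags={hits} latency={elapsed_ms:.3f}ms")
```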

About agentrial

alepot55/agentrial

Statistical evaluation framework for AI agents

Provides multi-trial statistical evaluation with Wilson confidence intervals and step-level failure attribution using Fisher exact tests to identify where agent behavior diverges. Integrates natively with LangGraph, CrewAI, Pydantic AI, and other frameworks through adapters, automatically capturing trajectories and token costs across 45+ LLM models, while supporting CI/CD regression detection and production monitoring via drift detectors.
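
Both statistical tools named here are standard, so the underlying math is easy to show. The sketch below illustrates what the description above refers to, not agentrial's own code: a hand-rolled Wilson score interval for a per-task success rate over repeated trials, plus SciPy's `scipy.stats.fisher_exact` comparing step-level failure counts between a baseline and a candidate run (the function names and example counts are ours):

```python
import math

from scipy.stats import fisher_exact  # exact test on a 2x2 contingency table

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return (center - half, center + half)

# Multi-trial evaluation: the agent solved a task in 17 of 20 trials.
low, high = wilson_interval(17, 20)
print(f"success rate {17/20:.2f}, 95% CI [{low:.3f}, {high:.3f}]")

# Step-level failure attribution: rows are baseline vs. candidate,
# columns are step-3 failures vs. passes across 20 trials each.
table = [[2, 18],
         [9, 11]]
_, p_value = fisher_exact(table)
print(f"Fisher exact p = {p_value:.4f}")  # a small p flags step 3 as the regression point
```

The Wilson interval is a sensible default at the small trial counts typical of agent evaluation: unlike the naive normal approximation, its bounds never fall outside [0, 1].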

Scores updated daily from GitHub, PyPI, and npm data.