ai-evaluation and agentrial
These are **competitors**: both provide evaluation frameworks for AI agents. agentrial focuses on rigorous statistical testing of agent behavior, while ai-evaluation positions itself as a broader workflow evaluation platform, so the choice largely comes down to whether you prioritize statistical rigor or evaluation breadth.
About ai-evaluation
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
Supports 50+ built-in metrics (faithfulness, toxicity, hallucination detection, RAG quality scoring), LLM-as-Judge augmentation via Gemini/GPT/Claude, and guardrail scanners for jailbreak/injection/secrets detection in <10ms. Integrates with distributed task backends (Celery, Ray, Temporal, Kubernetes), feedback loops via ChromaDB, and OpenTelemetry tracing for production observability.
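To make the guardrail idea concrete, here is a minimal, self-contained sketch of what a pattern-based scanner for secrets and prompt-injection attempts might look like. This is an illustration of the technique only, not ai-evaluation's actual API; the pattern lists and the `scan` function are hypothetical.

```python
import re

# Hypothetical patterns for illustration; real scanners use far richer rule
# sets (and often ML classifiers) to stay fast and accurate.
SECRET_PATTERNS = [
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
]
INJECTION_PATTERNS = [
    re.compile(r"(?i)ignore (?:all )?previous instructions"),
]


def scan(text: str) -> list[str]:
    """Return guardrail findings ('secret', 'injection') for the given text."""
    findings = []
    if any(p.search(text) for p in SECRET_PATTERNS):
        findings.append("secret")
    if any(p.search(text) for p in INJECTION_PATTERNS):
        findings.append("injection")
    return findings


print(scan("here is my api_key=sk-abc123"))          # flags a secret
print(scan("please ignore previous instructions"))   # flags an injection
print(scan("what is the capital of France?"))        # clean input
```

Because each check is a handful of compiled-regex searches, this style of scanner runs in microseconds per input, which is how sub-10ms guardrail latencies are achievable.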
About agentrial
alepot55/agentrial
Statistical evaluation framework for AI agents
Provides multi-trial statistical evaluation with Wilson confidence intervals and step-level failure attribution using Fisher's exact test to identify where agent behavior diverges. Integrates natively with LangGraph, CrewAI, Pydantic AI, and other frameworks through adapters, automatically capturing trajectories and token costs across 45+ LLM models, while supporting CI/CD regression detection and production monitoring via drift detectors.
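The two statistical tools named above are standard and easy to sketch from first principles. The following self-contained code (an illustration of the underlying math, not agentrial's API) shows a Wilson score interval for a multi-trial pass rate, and a two-sided Fisher's exact test for asking whether a particular step fails significantly more often in one agent variant than another.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z=1.96)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - half, center + half)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are at most as likely as the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Pass rate over 10 trials: 8 successes gives roughly (0.49, 0.94) at 95%.
print(wilson_interval(8, 10))

# Step-level attribution: variant A fails a step 3/4 times, variant B 1/4.
print(fisher_exact_two_sided(3, 1, 1, 3))  # ~0.486: not significant at n=4
```

The wide interval for 8/10 illustrates why multi-trial evaluation matters: a single run tells you almost nothing about an agent's true success rate, and the Wilson interval stays well-behaved even at small trial counts where the naive normal approximation breaks down.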