future-agi/ai-evaluation
An evaluation framework for all your AI-related workflows
Supports 50+ built-in metrics (faithfulness, toxicity, hallucination detection, RAG quality scoring), LLM-as-judge augmentation via Gemini/GPT/Claude, and guardrail scanners for jailbreak/injection/secrets detection in under 10 ms. Integrates with distributed task backends (Celery, Ray, Temporal, Kubernetes), feedback loops via ChromaDB, and OpenTelemetry tracing for production observability.
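To illustrate the LLM-as-judge pattern the description mentions, here is a minimal, generic sketch. This is NOT the ai-evaluation library's actual API: the `JudgeResult` type, `judge_faithfulness` function, and `call_judge` callable are all hypothetical stand-ins for a real judge-model client (Gemini/GPT/Claude).

```python
# Illustrative only: a generic LLM-as-judge scoring loop, NOT the
# ai-evaluation library's API. `call_judge` is a hypothetical stand-in
# for a Gemini/GPT/Claude client call.
from dataclasses import dataclass


@dataclass
class JudgeResult:
    metric: str
    score: float  # 0.0 (fail) .. 1.0 (pass)
    reason: str


def judge_faithfulness(answer: str, context: str, call_judge) -> JudgeResult:
    """Ask a judge model whether `answer` is grounded in `context`."""
    prompt = (
        "Rate from 0 to 1 how faithful the ANSWER is to the CONTEXT.\n"
        f"CONTEXT: {context}\nANSWER: {answer}\n"
        "Reply as '<score> <reason>'."
    )
    raw = call_judge(prompt)
    score_str, _, reason = raw.partition(" ")
    return JudgeResult("faithfulness", float(score_str), reason)


# Stub judge for demonstration; a real deployment would call an LLM API.
fake_judge = lambda prompt: "0.9 answer restates the context"
result = judge_faithfulness(
    "Paris is the capital.", "Paris is France's capital.", fake_judge
)
```

The same shape generalizes to the other listed metrics (toxicity, hallucination, RAG quality) by swapping the prompt template.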
Stars
84
Forks
29
Language
Python
License
GPL-3.0
Category
Last pushed
Mar 09, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/future-agi/ai-evaluation"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
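The curl command above can be wrapped in a few lines of Python. The URL pattern (`.../quality/agents/{owner}/{repo}`) is taken from the example; the response schema is not documented here, so the sketch simply returns the parsed JSON as a dict.

```python
# Minimal client for the quality endpoint shown above. The response
# schema is an assumption (undocumented here), so we return raw JSON.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/agents"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repo quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """GET the endpoint and parse the JSON body (requires network access)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)


url = quality_url("future-agi", "ai-evaluation")
```

With a free key, the same request reportedly allows 1,000 calls/day instead of 100; how the key is passed (header vs. query parameter) is not specified here.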
Related agents
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards
SparkBeyond/agentune
Tune your AI Agent to best meet its KPI with a cyclic process of analyze, improve and simulate