hidai25/eval-view
Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.
Eval-view captures tool-call sequences and parameter changes across multi-turn agent interactions, then compares them against golden baselines to surface silent regressions that still pass health checks. It detects coordinated drift across test runs to distinguish provider rollouts from system failures, using a four-layer scoring system that ranges from free offline tool analysis to optional LLM-as-judge evaluation. It integrates with CI/CD through GitHub Actions, posting automatic PR comments and tracking cost and latency, and auto-detects agents built on LangGraph, CrewAI, and other frameworks without requiring API keys for baseline operations. A minimal sketch of the snapshot-and-diff idea follows; the ToolCall shape, file names, and diff rules are illustrative assumptions, not eval-view's actual API.
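    # Sketch of snapshot-and-diff regression testing for agent tool calls.
    # Everything here (ToolCall, save_baseline, diff_against_baseline) is a
    # hypothetical illustration of the approach, not eval-view's real interface.
    import json
    from dataclasses import dataclass, asdict
    from pathlib import Path

    @dataclass
    class ToolCall:
        name: str
        arguments: dict

    def save_baseline(calls: list[ToolCall], path: Path) -> None:
        """Persist a golden baseline of the agent's tool-call sequence."""
        path.write_text(json.dumps([asdict(c) for c in calls], indent=2))

    def diff_against_baseline(calls: list[ToolCall], path: Path) -> list[str]:
        """Compare a fresh run against the golden baseline and report drift."""
        baseline = json.loads(path.read_text())
        issues = []
        if len(calls) != len(baseline):
            issues.append(f"call count changed: {len(baseline)} -> {len(calls)}")
        for i, (old, new) in enumerate(zip(baseline, calls)):
            if new.name != old["name"]:
                issues.append(f"step {i}: tool changed {old['name']} -> {new.name}")
            elif new.arguments != old["arguments"]:
                issues.append(f"step {i}: arguments for {new.name} changed")
        return issues

    if __name__ == "__main__":
        run = [ToolCall("search", {"query": "refund policy"}),
               ToolCall("summarize", {"max_tokens": 256})]
        baseline_path = Path("golden_baseline.json")
        if not baseline_path.exists():
            save_baseline(run, baseline_path)   # first run records the baseline
        else:
            for issue in diff_against_baseline(run, baseline_path):
                print("REGRESSION:", issue)     # later runs report any drift

In CI, a non-empty issue list would typically fail the job or be posted as a PR comment.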
Stars: 63
Forks: 16
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/hidai25/eval-view"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
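The same record can be fetched from Python. This is a small sketch using the endpoint shown above; the response schema and the API-key header name are assumptions for illustration.

    # Fetch the agent-quality record from the public API (sketch).
    import requests

    URL = "https://pt-edge.onrender.com/api/v1/quality/agents/hidai25/eval-view"

    def fetch_agent_record(api_key: str | None = None) -> dict:
        """Fetch the quality record; a free key (if supplied) raises the daily limit."""
        headers = {"X-API-Key": api_key} if api_key else {}  # header name is an assumption
        resp = requests.get(URL, headers=headers, timeout=10)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(fetch_agent_record())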
Related agents
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards