TJ-Neary/AI_Eval
Comprehensive LLM evaluation framework comparing local and cloud models with hardware-aware benchmarking. Evaluate across code generation, document analysis, and structured output using pass@k, LLM-as-Judge, and RAG metrics. Supports Ollama, Google Gemini, Anthropic, and OpenAI.
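
The description mentions pass@k among the evaluation metrics. For reference, the sketch below shows the standard unbiased pass@k estimator (as popularized by the Codex paper), not necessarily this repository's own implementation; the function name and example numbers are illustrative.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: probability that at least one of k samples drawn
    # from n generations (c of which pass the tests) is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 3 pass the unit tests
print(pass_at_k(n=10, c=3, k=1))  # 0.30
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
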
Stars: —
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Mar 06, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/TJ-Neary/AI_Eval"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
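
The same data can be fetched programmatically. A minimal Python sketch, assuming only the GET endpoint shown in the curl command above (the response schema is not documented here, so the script simply prints the returned JSON):

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/rag/TJ-Neary/AI_Eval"
resp = requests.get(url, timeout=30)  # no API key needed on the free tier
resp.raise_for_status()
print(resp.json())
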
Related tools
masaakisakamoto/memory-os
Deterministic continuity for AI systems. Detect and repair inconsistencies across sessions — not...
dahlinomine/local-llm-rag-bench
Python tool for benchmarking local LLM performance on specific RAG datasets.
VectoringAI/ai-engineering
Practical tutorials to build AI Engineering skills
priyanshus/evaliphy
E2E RAG Testing Tool
moshe19909090/llm-evaluation-pipeline
End-to-end LLM evaluation pipeline with human and automated judging for e-commerce product descriptions