DavidShableski/llm-evaluation-framework
A production-grade platform for evaluating and comparing the performance of Large Language Models (LLMs) from providers such as OpenAI, Anthropic, and Google (e.g., PaLM). It features real-time analytics, hallucination detection, and cost-performance benchmarking using standardized datasets (e.g., GSM8K).
No commits in the last 6 months.
Stars: —
Forks: —
Language: TypeScript
License: MIT
Category: —
Last pushed: Sep 11, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/DavidShableski/llm-evaluation-framework"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
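A minimal TypeScript sketch of calling the same endpoint with fetch and reading a few fields. The response field names used below (language, license, stars) are assumptions inferred from the stats shown above, not a documented schema; check the actual JSON before relying on them.

// Minimal sketch: fetch the quality data for this repo from the pt-edge API.
const url =
  "https://pt-edge.onrender.com/api/v1/quality/llm-tools/DavidShableski/llm-evaluation-framework";

async function getRepoQuality(): Promise<void> {
  const res = await fetch(url);
  if (!res.ok) {
    // The keyless tier allows 100 requests/day, so rate-limit errors are possible.
    throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  }
  const data = await res.json();
  // Assumed field names; adjust to the real response schema.
  console.log(data.language, data.license, data.stars);
}

getRepoQuality().catch(console.error);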
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents