LLM Evaluation Platforms: Generative AI Tools

Tools for testing, evaluating, and monitoring LLM applications in production—including automated evaluation frameworks, A/B testing, observability, quality control, and performance tracking. Does NOT include general ML ops platforms, code generation tools, or domain-specific AI applications.

142 LLM evaluation platform tools are tracked. 2 score above 70 (Verified tier). The highest-rated is madroidmaq/mlx-omni-server at 82/100, with 678 stars and 2,273 monthly downloads. 3 of the top 10 are actively maintained.

Get all 142 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=generative-ai&subcategory=llm-evaluation-platforms&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
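The same request can be made from a script. The following is a minimal Python sketch, not an official client: the endpoint and the `domain`, `subcategory`, and `limit` parameters are taken from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The shape of the JSON response is not documented here, so the sketch only decodes it and prints a short preview. The sample request sets `limit=20`; whether a larger value returns all 142 projects in one call is an assumption to verify against the API.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="generative-ai",
              subcategory="llm-evaluation-platforms",
              limit=20):
    """Build the request URL (hypothetical helper, not part of any SDK)."""
    query = urlencode({
        "domain": domain,
        "subcategory": subcategory,
        "limit": limit,
    })
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON payload without assuming its structure."""
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    data = fetch_projects()
    # Print only a preview, since the response schema is not described here.
    print(json.dumps(data, indent=2)[:500])
```

Unauthenticated access is limited to 100 requests/day, so a script that polls this endpoint should cache results rather than refetch on every run.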

#	Tool	Score	Tier	Stars	Language
1	madroidmaq/mlx-omni-server MLX Omni Server is a local inference server powered by Apple's MLX...	82	Verified	678	Python
2	openvinotoolkit/model_server A scalable inference server for models optimized with OpenVINO™	74	Verified	836	C++
3	rhesis-ai/rhesis Open-source platform & SDK for testing LLM and agentic apps. Define expected...	68	Established	296	Python
4	NVIDIA-NeMo/Guardrails NeMo Guardrails is an open-source toolkit for easily adding programmable...	66	Established	5,772	Python
5	taco-group/OpenEMMA OpenEMMA, a permissively licensed open source "reproduction" of Waymo’s EMMA model.	62	Established	906	Python
6	generative-computing/mellea Mellea is a library for writing generative programs.	61	Established	341	Python
7	cncf/llm-starter-pack 🤖 Get started with LLMs on your kind cluster, today!	56	Established	172	Python
8	GoogleCloudDataproc/dataproc-ml-python Library to simplify running distributed ML workloads with Apache Spark	51	Established	7	Python
9	Ultrathink-Solutions/openclaw-logfire Pydantic Logfire observability plugin for OpenClaw — OTEL GenAI semantic...	50	Established	2	TypeScript
10	modular/max-agentic-cookbook MAX Agentic Cookbook	49	Emerging	74	HTML
11	aws-samples/foundation-model-benchmarking-tool Foundation model benchmarking tool. Run any model on any AWS platform and...	48	Emerging	255	Jupyter Notebook
12	cuckoo-network/cuckoo Cuckoo is a Decentralized AI Model-Serving Platform, starting with...	48	Emerging	407	TypeScript
13	clearml/clearml-fractional-gpu ClearML Fractional GPU - Run multiple containers on the same GPU with driver...	47	Emerging	90	—
14	jordanvolz/lolpop A software engineering framework to jump start your machine learning projects	46	Emerging	37	Python
15	hichipli/vetting-python A Python implementation of the VETTING (Verification and Evaluation Tool for...	43	Emerging	10	Python
16	vienneraphael/batchling Save 50% off GenAI costs in two lines of code	42	Emerging	17	Python
17	Aaryanverma/trustifai TrustifAI: A Comprehensive Framework for AI Trustworthiness	40	Emerging	10	Python
18	autonomi-ai/nos ⚡️ A fast and flexible PyTorch inference server that runs locally, on any...	38	Emerging	147	Python
19	svilupp/Julia-LLM-Leaderboard Provides a platform for the Julia community to compare AI models' abilities...	38	Emerging	86	HTML
20	AMDResearch/intelliperf Automated bottleneck detection and solution orchestration	37	Emerging	19	Python
21	maximhq/maxim-cookbooks Maxim is an end-to-end AI evaluation and observability platform that...	36	Emerging	13	Jupyter Notebook
22	yankeexe/ollama-manager 🦙 Manage Ollama models from your CLI!	36	Emerging	16	Python
23	yonahgraphics/openevalkit Production-grade Python framework for evaluating LLM and agentic systems...	36	Emerging	3	Python
24	amazon-science/fmcore Running Foundation Models at every scale, on every modality. Includes...	36	Emerging	6	Python
25	sandner-art/ArtAgents Framework for LLM based captioning and prompt engineering	35	Emerging	14	Python
26	radlab-dev-group/llm-router LLM Router is a service that can be deployed on‑premises or in the cloud. It...	35	Emerging	5	Python
27	kstathou/llm-stack End-to-end tech stack for the LLM data flywheel	35	Emerging	3	Python
28	aimonlabs/aimon-python-sdk This repo hosts the Python SDK and related examples for AIMon, which is a...	35	Emerging	11	Python
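29	soundstarrain/LLM-Filter-Probe A precise reverse-analysis tool for LLM input-side content filtering. Automatically locates NewAPI, OneAPI, and any API that filters prompts using dictionary-based rules...	34	Emerging	3	Python
30	unit-mesh/devops-genius DevOpsGenius aims to use LLMs to reshape DevOps practices in software development. It treats the LLM as the team's junior...	33	Emerging	92	Kotlin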
31	amazon-science/concurry Easy scaling for AI research and production workloads	33	Emerging	14	Python
32	metanoia-oss/promptguard Reliable, structured, production-safe LLM outputs with schema validation and...	32	Emerging	1	Python
33	Finoptimize/agentaflow-sro-community Manage AI and Machine Learning workloads more efficiently with lower cost: ...	31	Emerging	2	Go
34	retkowsky/foundry-local Foundry Local is an on-device AI inference solution that you use to run AI...	31	Emerging	9	Jupyter Notebook
35	JuryMindAI/jurymind-ai Framework for agentic evaluation of LLMs, Prompt Optimization, Data...	28	Experimental	1	Python
36	sMiNT0S/AIBugBench From prompt to paste: evaluate AI / LLM output under a strict Python sandbox...	28	Experimental	1	Python
37	DanTheAI/LLM-Middleware-Pipeline A modular, configurable LLM middleware pipeline that transforms raw prompts...	27	Experimental	3	Python
38	Aryan-202/cookbooks An intelligent optimization engine that dynamically adjusts LLM selection,...	27	Experimental	—	Jupyter Notebook
39	Impesud/ai-mlops-project AI MLOps Project – A production-grade MLOps pipeline for scalable,...	26	Experimental	4	Python
40	llm-platform-security/gpt-data-exposure An In-Depth Investigation of Data Collection in LLM App Ecosystems	26	Experimental	3	Python
41	Generative-Engine-Marketing/GEM-Bench First comprehensive benchmark for Generative Engine Marketing (GEM), an...	26	Experimental	15	Python
42	LLMConsent/llmconsent-standards LLMConsent is an open protocol that establishes standards for managing...	25	Experimental	2	—
43	rpjayaraman/LLMxVLSI Generate, Simulate & Summarize Verilog Code with GenAI and Iverilog tool	25	Experimental	5	Python
44	paralleliq/piqc-knowledge-base Production-ready checklists and frameworks for deploying LLMs, GenAI models,...	24	Experimental	2	—
45	djokester/groqeval Use groq for evaluations	24	Experimental	3	Python
46	fmind/mlops-digester A tool equipping Pydantic AI agents with the ability to digest and summarize...	24	Experimental	4	Python
47	squishai/squish 🤖🗜️⚡️ Compress local LLMs once, run them forever at sub-second load times....	24	Experimental	2	Python
48	hiamitabha/genai-bench Code to benchmark APIs available from LLM vendors and demonstrate how they work	24	Experimental	4	Python
49	nginH/llmforge One API, every AI model, instant switching. Change from GPT-4 to Gemini to...	24	Experimental	6	TypeScript
50	Yapakayala/cloudops-ai-monitor 🔍 Monitor cloud environments with AI-driven insights, anomaly detection, and...	23	Experimental	1	Python
51	iservicebus/lmaas LMaaS (Language Model as a Service) abstracts away complexities and enables...	23	Experimental	2	Python
52	evalops/eval2otel Library to convert AI evaluation results to OpenTelemetry GenAI semantic...	23	Experimental	3	TypeScript
53	noct-ml/noesis Noesis - A lightweight toolkit for inspecting transformer internals through...	23	Experimental	1	Python
54	wesleyscholl/squish 🤖🗜️⚡️ Compress local LLMs once, run them forever at sub-second load times....	23	Experimental	1	Python
55	SangiSI/llm-model-selection-lab Decision-centric evaluation lab for intelligent LLM model selection using...	23	Experimental	1	Python
56	last9/python-ai-sdk OpenTelemetry extension for LLM observability - track conversations,...	23	Experimental	1	Python
57	Ashik245-commits/LLM-Filter-Probe 🕵️‍♂️ Analyze and reverse engineer keyword filtering in large language models...	23	Experimental	1	Python
58	SAP-samples/llm-round-trip-correctness This repo provides code for evaluation of llm round-trip-correctness on text...	23	Experimental	6	Jupyter Notebook
59	sugihAF/DomainBench LLM Benchmark and Comparison on Domain Specific Implementation	23	Experimental	1	Python
60	josephlash10-svg/Glass-Box A Python-based framework for managing LLM drift and preventing model...	23	Experimental	1	Python
61	maharshijani05/CivicMind CivicMind is an AI-powered civic policy simulator where intelligent agents...	22	Experimental	3	Python
62	verma-kunal/k8sGPT-tutorial This repo is dedicated for the K8sGPT tutorial on Kubesimplify's YT channel.	22	Experimental	1	—
63	Mrdodo446/ModelForge Build and customize machine learning models efficiently with an open-source...	22	Experimental	—	TypeScript
64	mauryasameer/llm_eval SR 11-7 & EU AI Act compliant LLM validation framework for financial...	22	Experimental	—	Python
65	Retamoso23/ollixir 🤖 Enable local large language models with Ollixir, the Elixir client...	22	Experimental	—	Elixir
66	xxxihrmn/llmops 🚀 Discover top tools and resources for Large Language Model Operations...	22	Experimental	—	Shell
67	nyno-ai/nynoflow Production grade framework for LLM application development	22	Experimental	2	Python
68	AdityaPatange1/okesa Okesa: LLM-powered Natural Language Processing! 💬	22	Experimental	1	TypeScript
69	ravikirankrishnaprasad/multi-agent-hallucination-detection-and-correction Multi-agent framework for hallucination detection and correction in LLM...	22	Experimental	—	Python
70	danilop/llm-test-mate A simple testing framework to evaluate and validate LLM-generated content...	21	Experimental	10	Python
71	Shyam-Sundar-Raju/Consensus CONSENSUS — A learning-aware generative AI system using a multi-agent LLM...	20	Experimental	1	JavaScript
72	demml/potatohead 🥔 Quality control for your potato heads (LLMs)	20	Experimental	1	Rust
73	Portkey-AI/helm-chart Kubernetes Configs for Portkey Gateway deployment	20	Experimental	3	Smarty
74	leaxer-ai/leaxer An engine for local AI inference, built on Elixir and the BEAM virtual machine.	20	Experimental	1	Elixir
75	Ratnesh-181998/Production-Ready-MLOps-Pipelines Production-grade MLOps pipelines with real-world ML and NLP projects. Covers...	20	Experimental	1	—
76	robocorp/llmfoo Code with the flow of a river, refactor with the grace of a breeze, and...	20	Experimental	14	Python
77	Tradunsky/3D-guardrails 3D content you can trust	20	Experimental	1	Python
78	valohai/valohai-llm Track and report LLM and GenAI evaluations to Valohai LLM	20	Experimental	1	Python
79	radlab-dev-group/llm-router-plugins A companion repository for llm-router containing a collection of...	20	Experimental	1	Python
80	sochaty/llm-governance-engine A robust LLM Governance & ROI Evaluation platform designed to benchmark...	19	Experimental	—	Python
81	umbertocicciaa/devopsfix Fix cicd pipeline using generative AI	19	Experimental	—	TypeScript
82	adityonugrohoid/ollama-runtime Shared Ollama LLM runtime for the GenAI Portfolio Suite. GPU-accelerated...	19	Experimental	—	Python
83	Yu-amd/Multiverse Lightweight model inference playground	19	Experimental	—	Python
84	hari7261/indus-llm-gateway Production-ready LLM gateway — unified OpenAI-compatible API for all...	19	Experimental	—	Go
85	infinitum-nihil/otel-genai-safety-semconv Proposed OpenTelemetry semantic conventions for GenAI safety system telemetry	19	Experimental	—	—
86	Lavaver/OpenVINO-GenAI-Toolkit This repository provides a post-installation utility suite for OpenVINO,...	19	Experimental	—	Vue
87	adityonugrohoid/ollama-multi-llm-server Multi-model inference API and playground powered by Ollama. Serve, switch,...	19	Experimental	—	Python
88	mkhomutskyi/illama Ollama-like LLM experience for Intel Arc GPUs (B50/A770/A750) using...	19	Experimental	—	Python
89	korkridake/GenAIOps-OSS A unified handbook for building, deploying and understanding LLM agents and...	19	Experimental	—	Python
90	hipvlady/subzero Project SubZero: Zero Trust AI Gateway (ZTAG)	18	Experimental	3	Python
91	sharonccccc/AIFE_GEN-MLOps_Platform AI capability development platform using AutoML and AutoGluon	18	Experimental	7	Jupyter Notebook
92	svilupp/Logfire.jl Observability for Julia LLM applications. Know what your AI is doing.	17	Experimental	2	Julia
93	sylym/subtext LLM-Based Steganography Framework | A covert-message transmission framework based on LLM probability distributions	17	Experimental	2	Python
94	ozanunal0/Prometheus-Gateway An open-source, security-first LLM Gateway designed to provide a unified,...	16	Experimental	10	Python
95	cwest/ai-tokentrace ai-tokentrace is a Python library for GenAI cost observability. It helps...	16	Experimental	1	Python
96	krish567366/automl_self_improvement A next-gen toolkit for autonomous machine learning that automatically...	16	Experimental	1	Python
97	abhiai-git/agent_trajectory_evaluation agent_trajectory_evaluation is a Python package designed to evaluate the...	15	Experimental	—	Python
98	rupeshtiwari/pluralsight-reliability-slos-incident-management-gen-ai-systems Source code, demos, and supporting assets for a Pluralsight course on...	15	Experimental	1	Python
99	shaharia-lab/multi-llm-discussion Multi-LLM Discussion Platform - Orchestrate discussions between multiple...	15	Experimental	—	TypeScript
100	samuli/rgltr Tool Governance for Pydantic AI Agents	15	Experimental	—	Python
101	eneagizzarelli/SYNAPSE SYNAPSE (SYNthetic AI Pot for Security Enhancement) and SYNAPSE-to-MITRE...	15	Experimental	16	Python
102	Mehul-Gupta-SMH/Silver-Bullet Silver Bullet is a Python toolkit for comparing two paragraphs or documents...	15	Experimental	1	Python
103	traversaal-ai/DSBC-Data-Science-Task-Evaluation Benchmark and evaluate LLMs on data science code generation using the DSBC dataset.	14	Experimental	3	Jupyter Notebook
104	budgetguard-ai/budgetguard-core A FinOps control plane for AI APIs - Drop-in API gateway that enforces hard...	14	Experimental	4	TypeScript
105	sanika373/llm-data-quality-monitor Automated data quality monitoring using LLM (GPT-4o) to generate SQL checks...	14	Experimental	—	Python
106	meyumer55/enterprise-foundational-model-scaler A high-level framework for fine-tuning and deploying foundational models...	14	Experimental	—	Python
107	kiquetal/course-zero-trust-fundamentals O'Reilly Live Course: Zero Trust Security Fundamentals — covering Zero Trust...	14	Experimental	—	—
108	jthiruveedula/llmops-mlflow-vertexai LLMOps platform integrating MLflow experiment tracking, Vertex AI model...	14	Experimental	—	Python
109	jthiruveedula/llmops-evaluation-framework Production LLMOps platform with automated evaluation, A/B testing, prompt...	14	Experimental	—	Python
110	jthiruveedula/real-time-llm-streaming-platform Kafka + Spark Streaming + LLM inference pipeline for real-time document...	14	Experimental	—	Python
111	Naresh1401/LLM-safety-guardrails Production-ready LLM safety layer: prompt injection detection, PII...	14	Experimental	—	Python
112	GauJosh/devops-genai Production-style GenAI platform lab for CI/CD failure analysis, including...	14	Experimental	—	Python
113	oliverweissl/SMOO A testing framework for ML systems	13	Experimental	—	Python
114	BabarAli93/GAIKube [TCCN 24] GAIKube: Generative AI-based Proactive Kubernetes Container...	13	Experimental	2	Jupyter Notebook
115	awaescher/Olmolo Ollama Model Loader: Keeping Ollama models warm	12	Experimental	1	C#
116	bignacio/llama.up Provision your own LLMA backend on a public cloud provider	12	Experimental	3	HCL
117	tmam-dev/tmam tmam is an open-source observability platform that gives you deep, real-time...	12	Experimental	1	TypeScript
118	RenaudGaudron/MMLU_benchmark An easy-to-use and standardised framework for evaluating Large Language...	12	Experimental	1	Python
119	juliensimon/radar-evaluator A professional, extensible framework for evaluating and comparing Large...	12	Experimental	1	Python
120	Dineshkumar0705/atlas-ai-observability Full-stack AI Trust & Observability Platform for LLM-based Systems (FastAPI...	12	Experimental	1	Python
121	RenaudGaudron/oeis-sequences-benchmark A Python toolkit and benchmark dataset for predicting the next term in OEIS...	12	Experimental	1	Python
122	vlimkv/ai-project-tracker Full-stack AI Project Manager with Self-Hosted LLM (llama.cpp). Generates...	12	Experimental	1	Python
123	witchnya/easykubeai easy kubeai	12	Experimental	1	Python
124	ayush585/hallucination-detector Developed as part of IEM HackOsis 2.0 under Problem Statement HOGN02. Team...	12	Experimental	1	Python
125	svilupp/Spehulak.jl GenAI observability application in Julia	12	Experimental	3	Julia
126	dileepkreddy5/secure-llm-gateway Production-grade AI security middleware with async micro-batching, prompt...	12	Experimental	1	Python
127	nehamaheshh/LLM-Drift-Monitor Production-style LLM drift monitoring: semantic, structural, safety, and...	11	Experimental	—	Python
128	Deepakkasyapa11/LLMops-Computed-Grid-Training Production-centric LLMOps framework designed to bridge the gap between AI...	11	Experimental	—	Python
129	th3w1zard1/llm_fallbacks Aggregates, sorts, and organizes various GenAI LLM providers into...	11	Experimental	—	Python
130	Tarunjit45/local-ai-safety-auditor An implementation of Asynchronous AI Oversight using local Small Language...	11	Experimental	—	Python
131	oriolrius/from-mlops-to-llmops Educational materials for understanding the evolution from MLOps to LLMOps....	11	Experimental	—	Python
132	parthamehta123/cloudops-ai-monitor AI-powered CloudOps monitoring system — anomaly detection with PyTorch,...	11	Experimental	—	Python
133	cathy841106/ai-hallucination-detect A tool for detecting hallucinations in domain-specific LLM outputs. It...	11	Experimental	—	Python
134	adrianhdezm/llm-sdk This is just another SDK for the common LLM API providers.	11	Experimental	—	TypeScript
135	alexei-led/cloud-inspector EXPERIMENT: Cloud Inspector identifies cloud resources based on user...	11	Experimental	—	Python
136	charanpool/llm-cogs-optmizer Intelligent middleware that reduces LLM COGS by routing queries between...	11	Experimental	—	Python
137	rawatshaurya/llm-drift-monitor Production-style LLM drift monitoring: semantic, structural, safety, and...	11	Experimental	—	Python
138	CodeWithPraveen/ps-genai-hallucinations Course demos for identifying, mitigating, and preventing hallucinations in...	11	Experimental	—	Python
139	glzbcrt/llm-tools-on-demand Use semantic queries to find relevant tools for LLM use.	10	Experimental	1	C#
140	sezer-muhammed/GenAIJury Framework for multi-agent LLM systems to evaluate, critique, and improve...	10	Experimental	1	Python
141	ghr8635/LLM-based-Agent-for-Driver-Sleepiness-Detection-and-Mitigation-in-Automotive-Systems An AI-driven automotive agent utilizing Large Language Models (LLMs) and...	10	Experimental	3	Python
142	devopscodegen/devopscodegen-common Common python modules for all devops code generators like pipeline code...	10	Experimental	1	Python