LLM Evaluation Platforms: Generative AI Tools
Tools for testing, evaluating, and monitoring LLM applications in production—including automated evaluation frameworks, A/B testing, observability, quality control, and performance tracking. Does NOT include general ML ops platforms, code generation tools, or domain-specific AI applications.
This page tracks 142 LLM evaluation platform tools. Two score above 70 (Verified tier). The highest-rated is madroidmaq/mlx-omni-server at 82/100, with 678 stars and 2,273 monthly downloads. Three of the top 10 are actively maintained.
Get all 142 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=generative-ai&subcategory=llm-evaluation-platforms&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
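The same request can be made from a script. The following is a minimal Python sketch using only the standard library. The endpoint and query parameters are taken from the curl command above; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented here, so the sketch decodes the response without assuming any particular fields. Whether the API accepts a `limit` larger than the 20 shown above is also an assumption to verify.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example; everything else here is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="generative-ai",
              subcategory="llm-evaluation-platforms",
              limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and decode the body as JSON (response schema not assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the same unauthenticated request as the curl command, so it counts against the 100 requests/day quota.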
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | madroidmaq/mlx-omni-server | MLX Omni Server is a local inference server powered by Apple's MLX... | | Verified |
| 2 | openvinotoolkit/model_server | A scalable inference server for models optimized with OpenVINO™ | | Verified |
| 3 | rhesis-ai/rhesis | Open-source platform & SDK for testing LLM and agentic apps. Define expected... | | Established |
| 4 | NVIDIA-NeMo/Guardrails | NeMo Guardrails is an open-source toolkit for easily adding programmable... | | Established |
| 5 | taco-group/OpenEMMA | OpenEMMA, a permissively licensed open source "reproduction" of Waymo’s EMMA model. | | Established |
| 6 | generative-computing/mellea | Mellea is a library for writing generative programs. | | Established |
| 7 | cncf/llm-starter-pack | 🤖 Get started with LLMs on your kind cluster, today! | | Established |
| 8 | GoogleCloudDataproc/dataproc-ml-python | Library to simplify running distributed ML workloads with Apache Spark | | Established |
| 9 | Ultrathink-Solutions/openclaw-logfire | Pydantic Logfire observability plugin for OpenClaw: OTEL GenAI semantic... | | Established |
| 10 | modular/max-agentic-cookbook | MAX Agentic Cookbook | | Emerging |
| 11 | aws-samples/foundation-model-benchmarking-tool | Foundation model benchmarking tool. Run any model on any AWS platform and... | | Emerging |
| 12 | cuckoo-network/cuckoo | Cuckoo is a Decentralized AI Model-Serving Platform, starting with... | | Emerging |
| 13 | clearml/clearml-fractional-gpu | ClearML Fractional GPU - Run multiple containers on the same GPU with driver... | | Emerging |
| 14 | jordanvolz/lolpop | A software engineering framework to jump start your machine learning projects | | Emerging |
| 15 | hichipli/vetting-python | A Python implementation of the VETTING (Verification and Evaluation Tool for... | | Emerging |
| 16 | vienneraphael/batchling | Save 50% off GenAI costs in two lines of code | | Emerging |
| 17 | Aaryanverma/trustifai | TrustifAI: A Comprehensive Framework for AI Trustworthiness | | Emerging |
| 18 | autonomi-ai/nos | ⚡️ A fast and flexible PyTorch inference server that runs locally, on any... | | Emerging |
| 19 | svilupp/Julia-LLM-Leaderboard | Provides a platform for the Julia community to compare AI models' abilities... | | Emerging |
| 20 | AMDResearch/intelliperf | Automated bottleneck detection and solution orchestration | | Emerging |
| 21 | maximhq/maxim-cookbooks | Maxim is an end-to-end AI evaluation and observability platform that... | | Emerging |
| 22 | yankeexe/ollama-manager | 🦙 Manage Ollama models from your CLI! | | Emerging |
| 23 | yonahgraphics/openevalkit | Production-grade Python framework for evaluating LLM and agentic systems... | | Emerging |
| 24 | amazon-science/fmcore | Running Foundation Models at every scale, on every modality. Includes... | | Emerging |
| 25 | sandner-art/ArtAgents | Framework for LLM based captioning and prompt engineering | | Emerging |
| 26 | radlab-dev-group/llm-router | LLM Router is a service that can be deployed on‑premises or in the cloud. It... | | Emerging |
| 27 | kstathou/llm-stack | End-to-end tech stack for the LLM data flywheel | | Emerging |
| 28 | aimonlabs/aimon-python-sdk | This repo hosts the Python SDK and related examples for AIMon, which is a... | | Emerging |
| 29 | soundstarrain/LLM-Filter-Probe | A precise reverse-analysis tool for LLM input-side censorship. Automatically locates NewAPI, OneAPI, and any API that applies dictionary-rule-based prompt filtering... | | Emerging |
| 30 | unit-mesh/devops-genius | DevOpsGenius aims to reshape DevOps practices in software development with LLMs. It treats the LLM as the team's junior... | | Emerging |
| 31 | amazon-science/concurry | Easy scaling for AI research and production workloads | | Emerging |
| 32 | metanoia-oss/promptguard | Reliable, structured, production-safe LLM outputs with schema validation and... | | Emerging |
| 33 | Finoptimize/agentaflow-sro-community | Manage AI and Machine Learning workloads more efficiently with lower cost: ... | | Emerging |
| 34 | retkowsky/foundry-local | Foundry Local is an on-device AI inference solution that you use to run AI... | | Emerging |
| 35 | JuryMindAI/jurymind-ai | Framework for agentic evaluation of LLMs, Prompt Optimization, Data... | | Experimental |
| 36 | sMiNT0S/AIBugBench | From prompt to paste: evaluate AI / LLM output under a strict Python sandbox... | | Experimental |
| 37 | DanTheAI/LLM-Middleware-Pipeline | A modular, configurable LLM middleware pipeline that transforms raw prompts... | | Experimental |
| 38 | Aryan-202/cookbooks | An intelligent optimization engine that dynamically adjusts LLM selection,... | | Experimental |
| 39 | Impesud/ai-mlops-project | AI MLOps Project – A production-grade MLOps pipeline for scalable,... | | Experimental |
| 40 | llm-platform-security/gpt-data-exposure | An In-Depth Investigation of Data Collection in LLM App Ecosystems | | Experimental |
| 41 | Generative-Engine-Marketing/GEM-Bench | First comprehensive benchmark for Generative Engine Marketing (GEM), an... | | Experimental |
| 42 | LLMConsent/llmconsent-standards | LLMConsent is an open protocol that establishes standards for managing... | | Experimental |
| 43 | rpjayaraman/LLMxVLSI | Generate, Simulate & Summarize Verilog Code with GenAI and Iverilog tool | | Experimental |
| 44 | paralleliq/piqc-knowledge-base | Production-ready checklists and frameworks for deploying LLMs, GenAI models,... | | Experimental |
| 45 | djokester/groqeval | Use groq for evaluations | | Experimental |
| 46 | fmind/mlops-digester | A tool equipping Pydantic AI agents with the ability to digest and summarize... | | Experimental |
| 47 | squishai/squish | 🤖🗜️⚡️ Compress local LLMs once, run them forever at sub-second load times.... | | Experimental |
| 48 | hiamitabha/genai-bench | Code to benchmark APIs available from LLM vendors and demonstrate how they work | | Experimental |
| 49 | nginH/llmforge | One API, every AI model, instant switching. Change from GPT-4 to Gemini to... | | Experimental |
| 50 | Yapakayala/cloudops-ai-monitor | 🔍 Monitor cloud environments with AI-driven insights, anomaly detection, and... | | Experimental |
| 51 | iservicebus/lmaas | LMaaS (Language Model as a Service) abstracts away complexities and enables... | | Experimental |
| 52 | evalops/eval2otel | Library to convert AI evaluation results to OpenTelemetry GenAI semantic... | | Experimental |
| 53 | noct-ml/noesis | Noesis - A lightweight toolkit for inspecting transformer internals through... | | Experimental |
| 54 | wesleyscholl/squish | 🤖🗜️⚡️ Compress local LLMs once, run them forever at sub-second load times.... | | Experimental |
| 55 | SangiSI/llm-model-selection-lab | Decision-centric evaluation lab for intelligent LLM model selection using... | | Experimental |
| 56 | last9/python-ai-sdk | OpenTelemetry extension for LLM observability - track conversations,... | | Experimental |
| 57 | Ashik245-commits/LLM-Filter-Probe | 🕵️‍♂️ Analyze and reverse engineer keyword filtering in large language models... | | Experimental |
| 58 | SAP-samples/llm-round-trip-correctness | This repo provides code for evaluation of llm round-trip-correctness on text... | | Experimental |
| 59 | sugihAF/DomainBench | LLM Benchmark and Comparison on Domain Specific Implementation | | Experimental |
| 60 | josephlash10-svg/Glass-Box | A Python-based framework for managing LLM drift and preventing model... | | Experimental |
| 61 | maharshijani05/CivicMind | CivicMind is an AI-powered civic policy simulator where intelligent agents... | | Experimental |
| 62 | verma-kunal/k8sGPT-tutorial | This repo is dedicated for the K8sGPT tutorial on Kubesimplify's YT channel. | | Experimental |
| 63 | Mrdodo446/ModelForge | Build and customize machine learning models efficiently with an open-source... | | Experimental |
| 64 | mauryasameer/llm_eval | SR 11-7 & EU AI Act compliant LLM validation framework for financial... | | Experimental |
| 65 | Retamoso23/ollixir | 🤖 Enable local large language models with Ollixir, the Elixir client... | | Experimental |
| 66 | xxxihrmn/llmops | 🚀 Discover top tools and resources for Large Language Model Operations... | | Experimental |
| 67 | nyno-ai/nynoflow | Production grade framework for LLM application development | | Experimental |
| 68 | AdityaPatange1/okesa | Okesa: LLM-powered Natural Language Processing! 💬 | | Experimental |
| 69 | ravikirankrishnaprasad/multi-agent-hallucination-detection-and-correction | Multi-agent framework for hallucination detection and correction in LLM... | | Experimental |
| 70 | danilop/llm-test-mate | A simple testing framework to evaluate and validate LLM-generated content... | | Experimental |
| 71 | Shyam-Sundar-Raju/Consensus | CONSENSUS: A learning-aware generative AI system using a multi-agent LLM... | | Experimental |
| 72 | demml/potatohead | 🥔 Quality control for your potato heads (LLMs) | | Experimental |
| 73 | Portkey-AI/helm-chart | Kubernetes Configs for Portkey Gateway deployment | | Experimental |
| 74 | leaxer-ai/leaxer | An engine for local AI inference, built on Elixir and the BEAM virtual machine. | | Experimental |
| 75 | Ratnesh-181998/Production-Ready-MLOps-Pipelines | Production-grade MLOps pipelines with real-world ML and NLP projects. Covers... | | Experimental |
| 76 | robocorp/llmfoo | Code with the flow of a river, refactor with the grace of a breeze, and... | | Experimental |
| 77 | Tradunsky/3D-guardrails | 3D content you can trust | | Experimental |
| 78 | valohai/valohai-llm | Track and report LLM and GenAI evaluations to Valohai LLM | | Experimental |
| 79 | radlab-dev-group/llm-router-plugins | A companion repository for llm-router containing a collection of... | | Experimental |
| 80 | sochaty/llm-governance-engine | A robust LLM Governance & ROI Evaluation platform designed to benchmark... | | Experimental |
| 81 | umbertocicciaa/devopsfix | Fix cicd pipeline using generative AI | | Experimental |
| 82 | adityonugrohoid/ollama-runtime | Shared Ollama LLM runtime for the GenAI Portfolio Suite. GPU-accelerated... | | Experimental |
| 83 | Yu-amd/Multiverse | Lightweight model inference playground | | Experimental |
| 84 | hari7261/indus-llm-gateway | Production-ready LLM gateway: unified OpenAI-compatible API for all... | | Experimental |
| 85 | infinitum-nihil/otel-genai-safety-semconv | Proposed OpenTelemetry semantic conventions for GenAI safety system telemetry | | Experimental |
| 86 | Lavaver/OpenVINO-GenAI-Toolkit | This repository provides a post-installation utility suite for OpenVINO,... | | Experimental |
| 87 | adityonugrohoid/ollama-multi-llm-server | Multi-model inference API and playground powered by Ollama. Serve, switch,... | | Experimental |
| 88 | mkhomutskyi/illama | Ollama-like LLM experience for Intel Arc GPUs (B50/A770/A750) using... | | Experimental |
| 89 | korkridake/GenAIOps-OSS | A unified handbook for building, deploying and understanding LLM agents and... | | Experimental |
| 90 | hipvlady/subzero | Project SubZero: Zero Trust AI Gateway (ZTAG) | | Experimental |
| 91 | sharonccccc/AIFE_GEN-MLOps_Platform | AI capability development platform using AutoML and AutoGluon | | Experimental |
| 92 | svilupp/Logfire.jl | Observability for Julia LLM applications. Know what your AI is doing. | | Experimental |
| 93 | sylym/subtext | LLM-Based Steganography Framework: covert information transmission based on large language model probability distributions | | Experimental |
| 94 | ozanunal0/Prometheus-Gateway | An open-source, security-first LLM Gateway designed to provide a unified,... | | Experimental |
| 95 | cwest/ai-tokentrace | ai-tokentrace is a Python library for GenAI cost observability. It helps... | | Experimental |
| 96 | krish567366/automl_self_improvement | A next-gen toolkit for autonomous machine learning that automatically... | | Experimental |
| 97 | abhiai-git/agent_trajectory_evaluation | agent_trajectory_evaluation is a Python package designed to evaluate the... | | Experimental |
| 98 | rupeshtiwari/pluralsight-reliability-slos-incident-management-gen-ai-systems | Source code, demos, and supporting assets for a Pluralsight course on... | | Experimental |
| 99 | shaharia-lab/multi-llm-discussion | Multi-LLM Discussion Platform - Orchestrate discussions between multiple... | | Experimental |
| 100 | samuli/rgltr | Tool Governance for Pydantic AI Agents | | Experimental |
| 101 | eneagizzarelli/SYNAPSE | SYNAPSE (SYNthetic AI Pot for Security Enhancement) and SYNAPSE-to-MITRE... | | Experimental |
| 102 | Mehul-Gupta-SMH/Silver-Bullet | Silver Bullet is a Python toolkit for comparing two paragraphs or documents... | | Experimental |
| 103 | traversaal-ai/DSBC-Data-Science-Task-Evaluation | Benchmark and evaluate LLMs on data science code generation using the DSBC dataset. | | Experimental |
| 104 | budgetguard-ai/budgetguard-core | A FinOps control plane for AI APIs - Drop-in API gateway that enforces hard... | | Experimental |
| 105 | sanika373/llm-data-quality-monitor | Automated data quality monitoring using LLM (GPT-4o) to generate SQL checks... | | Experimental |
| 106 | meyumer55/enterprise-foundational-model-scaler | A high-level framework for fine-tuning and deploying foundational models... | | Experimental |
| 107 | kiquetal/course-zero-trust-fundamentals | O'Reilly Live Course: Zero Trust Security Fundamentals, covering Zero Trust... | | Experimental |
| 108 | jthiruveedula/llmops-mlflow-vertexai | LLMOps platform integrating MLflow experiment tracking, Vertex AI model... | | Experimental |
| 109 | jthiruveedula/llmops-evaluation-framework | Production LLMOps platform with automated evaluation, A/B testing, prompt... | | Experimental |
| 110 | jthiruveedula/real-time-llm-streaming-platform | Kafka + Spark Streaming + LLM inference pipeline for real-time document... | | Experimental |
| 111 | Naresh1401/LLM-safety-guardrails | Production-ready LLM safety layer: prompt injection detection, PII... | | Experimental |
| 112 | GauJosh/devops-genai | Production-style GenAI platform lab for CI/CD failure analysis, including... | | Experimental |
| 113 | oliverweissl/SMOO | A testing framework for ML systems | | Experimental |
| 114 | BabarAli93/GAIKube | [TCCN 24] GAIKube: Generative AI-based Proactive Kubernetes Container... | | Experimental |
| 115 | awaescher/Olmolo | Ollama Model Loader: Keeping Ollama models warm | | Experimental |
| 116 | bignacio/llama.up | Provision your own LLMA backend on a public cloud provider | | Experimental |
| 117 | tmam-dev/tmam | tmam is an open-source observability platform that gives you deep, real-time... | | Experimental |
| 118 | RenaudGaudron/MMLU_benchmark | An easy-to-use and standardised framework for evaluating Large Language... | | Experimental |
| 119 | juliensimon/radar-evaluator | A professional, extensible framework for evaluating and comparing Large... | | Experimental |
| 120 | Dineshkumar0705/atlas-ai-observability | Full-stack AI Trust & Observability Platform for LLM-based Systems (FastAPI... | | Experimental |
| 121 | RenaudGaudron/oeis-sequences-benchmark | A Python toolkit and benchmark dataset for predicting the next term in OEIS... | | Experimental |
| 122 | vlimkv/ai-project-tracker | Full-stack AI Project Manager with Self-Hosted LLM (llama.cpp). Generates... | | Experimental |
| 123 | witchnya/easykubeai | easy kubeai | | Experimental |
| 124 | ayush585/hallucination-detector | Developed as part of IEM HackOsis 2.0 under Problem Statement HOGN02. Team... | | Experimental |
| 125 | svilupp/Spehulak.jl | GenAI observability application in Julia | | Experimental |
| 126 | dileepkreddy5/secure-llm-gateway | Production-grade AI security middleware with async micro-batching, prompt... | | Experimental |
| 127 | nehamaheshh/LLM-Drift-Monitor | Production-style LLM drift monitoring: semantic, structural, safety, and... | | Experimental |
| 128 | Deepakkasyapa11/LLMops-Computed-Grid-Training | Production-centric LLMOps framework designed to bridge the gap between AI... | | Experimental |
| 129 | th3w1zard1/llm_fallbacks | Aggregates, sorts, and organizes various GenAI LLM providers into... | | Experimental |
| 130 | Tarunjit45/local-ai-safety-auditor | An implementation of Asynchronous AI Oversight using local Small Language... | | Experimental |
| 131 | oriolrius/from-mlops-to-llmops | Educational materials for understanding the evolution from MLOps to LLMOps.... | | Experimental |
| 132 | parthamehta123/cloudops-ai-monitor | AI-powered CloudOps monitoring system: anomaly detection with PyTorch,... | | Experimental |
| 133 | cathy841106/ai-hallucination-detect | A tool for detecting hallucinations in domain-specific LLM outputs. It... | | Experimental |
| 134 | adrianhdezm/llm-sdk | This is just another SDK for the common LLM API providers. | | Experimental |
| 135 | alexei-led/cloud-inspector | EXPERIMENT: Cloud Inspector identifies cloud resources based on user... | | Experimental |
| 136 | charanpool/llm-cogs-optmizer | Intelligent middleware that reduces LLM COGS by routing queries between... | | Experimental |
| 137 | rawatshaurya/llm-drift-monitor | Production-style LLM drift monitoring: semantic, structural, safety, and... | | Experimental |
| 138 | CodeWithPraveen/ps-genai-hallucinations | Course demos for identifying, mitigating, and preventing hallucinations in... | | Experimental |
| 139 | glzbcrt/llm-tools-on-demand | Use semantic queries to find relevant tools for LLM use. | | Experimental |
| 140 | sezer-muhammed/GenAIJury | Framework for multi-agent LLM systems to evaluate, critique, and improve... | | Experimental |
| 141 | ghr8635/LLM-based-Agent-for-Driver-Sleepiness-Detection-and-Mitigation-in-Automotive-Systems | An AI-driven automotive agent utilizing Large Language Models (LLMs) and... | | Experimental |
| 142 | devopscodegen/devopscodegen-common | Common python modules for all devops code generators like pipeline code... | | Experimental |