LLM Evaluation Platforms Generative AI Tools

Tools for testing, evaluating, and monitoring LLM applications in production—including automated evaluation frameworks, A/B testing, observability, quality control, and performance tracking. Does NOT include general ML ops platforms, code generation tools, or domain-specific AI applications.

There are 142 LLM evaluation platform tools tracked. 2 score above 70 (Verified tier). The highest-rated is madroidmaq/mlx-omni-server at 82/100 with 678 stars and 2,273 monthly downloads. 3 of the top 10 are actively maintained.

Get all 142 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=generative-ai&subcategory=llm-evaluation-platforms&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
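The same request can be made from a script. The following is a minimal Python sketch that uses only the standard library and the endpoint and query parameters shown in the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented on this page, so the code decodes the payload without assuming any particular fields. Whether a larger `limit` returns all 142 projects is also an assumption to verify against the API.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="generative-ai",
              subcategory="llm-evaluation-platforms",
              limit=20):
    """Build the request URL with the same query parameters as the curl call."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON payload (hypothetical helper).

    The response schema is not described on this page, so the decoded
    object is returned as-is for the caller to inspect.
    """
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.load(resp)


# Example usage (performs a network request, counts against the daily quota):
#   data = fetch_projects()
#   print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to check the request before spending any of the 100 unauthenticated requests per day.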

# Tool Score Tier
1 madroidmaq/mlx-omni-server

MLX Omni Server is a local inference server powered by Apple's MLX...

82
Verified
2 openvinotoolkit/model_server

A scalable inference server for models optimized with OpenVINO™

74
Verified
3 rhesis-ai/rhesis

Open-source platform & SDK for testing LLM and agentic apps. Define expected...

68
Established
4 NVIDIA-NeMo/Guardrails

NeMo Guardrails is an open-source toolkit for easily adding programmable...

66
Established
5 taco-group/OpenEMMA

OpenEMMA, a permissively licensed open source "reproduction" of Waymo’s EMMA model.

62
Established
6 generative-computing/mellea

Mellea is a library for writing generative programs.

61
Established
7 cncf/llm-starter-pack

🤖 Get started with LLMs on your kind cluster, today!

56
Established
8 GoogleCloudDataproc/dataproc-ml-python

Library to simplify running distributed ML workloads with Apache Spark

51
Established
9 Ultrathink-Solutions/openclaw-logfire

Pydantic Logfire observability plugin for OpenClaw — OTEL GenAI semantic...

50
Established
10 modular/max-agentic-cookbook

MAX Agentic Cookbook

49
Emerging
11 aws-samples/foundation-model-benchmarking-tool

Foundation model benchmarking tool. Run any model on any AWS platform and...

48
Emerging
12 cuckoo-network/cuckoo

Cuckoo is a Decentralized AI Model-Serving Platform, starting with...

48
Emerging
13 clearml/clearml-fractional-gpu

ClearML Fractional GPU - Run multiple containers on the same GPU with driver...

47
Emerging
14 jordanvolz/lolpop

A software engineering framework to jump start your machine learning projects

46
Emerging
15 hichipli/vetting-python

A Python implementation of the VETTING (Verification and Evaluation Tool for...

43
Emerging
16 vienneraphael/batchling

Save 50% off GenAI costs in two lines of code

42
Emerging
17 Aaryanverma/trustifai

TrustifAI: A Comprehensive Framework for AI Trustworthiness

40
Emerging
18 autonomi-ai/nos

⚡️ A fast and flexible PyTorch inference server that runs locally, on any...

38
Emerging
19 svilupp/Julia-LLM-Leaderboard

Provides a platform for the Julia community to compare AI models' abilities...

38
Emerging
20 AMDResearch/intelliperf

Automated bottleneck detection and solution orchestration

37
Emerging
21 maximhq/maxim-cookbooks

Maxim is an end-to-end AI evaluation and observability platform that...

36
Emerging
22 yankeexe/ollama-manager

🦙 Manage Ollama models from your CLI!

36
Emerging
23 yonahgraphics/openevalkit

Production-grade Python framework for evaluating LLM and agentic systems...

36
Emerging
24 amazon-science/fmcore

Running Foundation Models at every scale, on every modality. Includes...

36
Emerging
25 sandner-art/ArtAgents

Framework for LLM based captioning and prompt engineering

35
Emerging
26 radlab-dev-group/llm-router

LLM Router is a service that can be deployed on‑premises or in the cloud. It...

35
Emerging
27 kstathou/llm-stack

End-to-end tech stack for the LLM data flywheel

35
Emerging
28 aimonlabs/aimon-python-sdk

This repo hosts the Python SDK and related examples for AIMon, which is a...

35
Emerging
29 soundstarrain/LLM-Filter-Probe

A precise reverse-analysis tool for LLM input-side censorship. Automatically locates NewAPI, OneAPI, and any API that applies dictionary-rule-based prompt filtering...

34
Emerging
30 unit-mesh/devops-genius

DevOpsGenius aims to reshape DevOps practices in software development by incorporating LLMs. It treats the LLM as the team's junior...

33
Emerging
31 amazon-science/concurry

Easy scaling for AI research and production workloads

33
Emerging
32 metanoia-oss/promptguard

Reliable, structured, production-safe LLM outputs with schema validation and...

32
Emerging
33 Finoptimize/agentaflow-sro-community

Manage AI and Machine Learning workloads more efficiently with lower cost: ...

31
Emerging
34 retkowsky/foundry-local

Foundry Local is an on-device AI inference solution that you use to run AI...

31
Emerging
35 JuryMindAI/jurymind-ai

Framework for agentic evaluation of LLMs, Prompt Optimization, Data...

28
Experimental
36 sMiNT0S/AIBugBench

From prompt to paste: evaluate AI / LLM output under a strict Python sandbox...

28
Experimental
37 DanTheAI/LLM-Middleware-Pipeline

A modular, configurable LLM middleware pipeline that transforms raw prompts...

27
Experimental
38 Aryan-202/cookbooks

An intelligent optimization engine that dynamically adjusts LLM selection,...

27
Experimental
39 Impesud/ai-mlops-project

AI MLOps Project – A production-grade MLOps pipeline for scalable,...

26
Experimental
40 llm-platform-security/gpt-data-exposure

An In-Depth Investigation of Data Collection in LLM App Ecosystems

26
Experimental
41 Generative-Engine-Marketing/GEM-Bench

First comprehensive benchmark for Generative Engine Marketing (GEM), an...

26
Experimental
42 LLMConsent/llmconsent-standards

LLMConsent is an open protocol that establishes standards for managing...

25
Experimental
43 rpjayaraman/LLMxVLSI

Generate, Simulate & Summarize Verilog Code with GenAI and Iverilog tool

25
Experimental
44 paralleliq/piqc-knowledge-base

Production-ready checklists and frameworks for deploying LLMs, GenAI models,...

24
Experimental
45 djokester/groqeval

Use groq for evaluations

24
Experimental
46 fmind/mlops-digester

A tool equipping Pydantic AI agents with the ability to digest and summarize...

24
Experimental
47 squishai/squish

🤖🗜️⚡️ Compress local LLMs once, run them forever at sub-second load times....

24
Experimental
48 hiamitabha/genai-bench

Code to benchmark APIs available from LLM vendors and demonstrate how they work

24
Experimental
49 nginH/llmforge

One API, every AI model, instant switching. Change from GPT-4 to Gemini to...

24
Experimental
50 Yapakayala/cloudops-ai-monitor

🔍 Monitor cloud environments with AI-driven insights, anomaly detection, and...

23
Experimental
51 iservicebus/lmaas

LMaaS (Language Model as a Service) abstracts away complexities and enables...

23
Experimental
52 evalops/eval2otel

Library to convert AI evaluation results to OpenTelemetry GenAI semantic...

23
Experimental
53 noct-ml/noesis

Noesis - A lightweight toolkit for inspecting transformer internals through...

23
Experimental
54 wesleyscholl/squish

🤖🗜️⚡️ Compress local LLMs once, run them forever at sub-second load times....

23
Experimental
55 SangiSI/llm-model-selection-lab

Decision-centric evaluation lab for intelligent LLM model selection using...

23
Experimental
56 last9/python-ai-sdk

OpenTelemetry extension for LLM observability - track conversations,...

23
Experimental
57 Ashik245-commits/LLM-Filter-Probe

🕵️‍♂️ Analyze and reverse engineer keyword filtering in large language models...

23
Experimental
58 SAP-samples/llm-round-trip-correctness

This repo provides code for evaluation of llm round-trip-correctness on text...

23
Experimental
59 sugihAF/DomainBench

LLM Benchmark and Comparison on Domain Specific Implementation

23
Experimental
60 josephlash10-svg/Glass-Box

A Python-based framework for managing LLM drift and preventing model...

23
Experimental
61 maharshijani05/CivicMind

CivicMind is an AI-powered civic policy simulator where intelligent agents...

22
Experimental
62 verma-kunal/k8sGPT-tutorial

This repo is dedicated for the K8sGPT tutorial on Kubesimplify's YT channel.

22
Experimental
63 Mrdodo446/ModelForge

Build and customize machine learning models efficiently with an open-source...

22
Experimental
64 mauryasameer/llm_eval

SR 11-7 & EU AI Act compliant LLM validation framework for financial...

22
Experimental
65 Retamoso23/ollixir

🤖 Enable local large language models with Ollixir, the Elixir client...

22
Experimental
66 xxxihrmn/llmops

🚀 Discover top tools and resources for Large Language Model Operations...

22
Experimental
67 nyno-ai/nynoflow

Production grade framework for LLM application development

22
Experimental
68 AdityaPatange1/okesa

Okesa: LLM-powered Natural Language Processing! 💬

22
Experimental
69 ravikirankrishnaprasad/multi-agent-hallucination-detection-and-correction

Multi-agent framework for hallucination detection and correction in LLM...

22
Experimental
70 danilop/llm-test-mate

A simple testing framework to evaluate and validate LLM-generated content...

21
Experimental
71 Shyam-Sundar-Raju/Consensus

CONSENSUS — A learning-aware generative AI system using a multi-agent LLM...

20
Experimental
72 demml/potatohead

🥔 Quality control for your potato heads (LLMs)

20
Experimental
73 Portkey-AI/helm-chart

Kubernetes Configs for Portkey Gateway deployment

20
Experimental
74 leaxer-ai/leaxer

An engine for local AI inference, built on Elixir and the BEAM virtual machine.

20
Experimental
75 Ratnesh-181998/Production-Ready-MLOps-Pipelines

Production-grade MLOps pipelines with real-world ML and NLP projects. Covers...

20
Experimental
76 robocorp/llmfoo

Code with the flow of a river, refactor with the grace of a breeze, and...

20
Experimental
77 Tradunsky/3D-guardrails

3D content you can trust

20
Experimental
78 valohai/valohai-llm

Track and report LLM and GenAI evaluations to Valohai LLM

20
Experimental
79 radlab-dev-group/llm-router-plugins

A companion repository for llm-router containing a collection of...

20
Experimental
80 sochaty/llm-governance-engine

A robust LLM Governance & ROI Evaluation platform designed to benchmark...

19
Experimental
81 umbertocicciaa/devopsfix

Fix cicd pipeline using generative AI

19
Experimental
82 adityonugrohoid/ollama-runtime

Shared Ollama LLM runtime for the GenAI Portfolio Suite. GPU-accelerated...

19
Experimental
83 Yu-amd/Multiverse

Lightweight model inference playground

19
Experimental
84 hari7261/indus-llm-gateway

Production-ready LLM gateway — unified OpenAI-compatible API for all...

19
Experimental
85 infinitum-nihil/otel-genai-safety-semconv

Proposed OpenTelemetry semantic conventions for GenAI safety system telemetry

19
Experimental
86 Lavaver/OpenVINO-GenAI-Toolkit

This repository provides a post-installation utility suite for OpenVINO,...

19
Experimental
87 adityonugrohoid/ollama-multi-llm-server

Multi-model inference API and playground powered by Ollama. Serve, switch,...

19
Experimental
88 mkhomutskyi/illama

Ollama-like LLM experience for Intel Arc GPUs (B50/A770/A750) using...

19
Experimental
89 korkridake/GenAIOps-OSS

A unified handbook for building, deploying and understanding LLM agents and...

19
Experimental
90 hipvlady/subzero

Project SubZero: Zero Trust AI Gateway (ZTAG)

18
Experimental
91 sharonccccc/AIFE_GEN-MLOps_Platform

AI capability development platform using AutoML and AutoGluon

18
Experimental
92 svilupp/Logfire.jl

Observability for Julia LLM applications. Know what your AI is doing.

17
Experimental
93 sylym/subtext

LLM-Based Steganography Framework | A covert message transmission framework based on the probability distributions of large language models

17
Experimental
94 ozanunal0/Prometheus-Gateway

An open-source, security-first LLM Gateway designed to provide a unified,...

16
Experimental
95 cwest/ai-tokentrace

ai-tokentrace is a Python library for GenAI cost observability. It helps...

16
Experimental
96 krish567366/automl_self_improvement

A next-gen toolkit for autonomous machine learning that automatically...

16
Experimental
97 abhiai-git/agent_trajectory_evaluation

agent_trajectory_evaluation is a Python package designed to evaluate the...

15
Experimental
98 rupeshtiwari/pluralsight-reliability-slos-incident-management-gen-ai-systems

Source code, demos, and supporting assets for a Pluralsight course on...

15
Experimental
99 shaharia-lab/multi-llm-discussion

Multi-LLM Discussion Platform - Orchestrate discussions between multiple...

15
Experimental
100 samuli/rgltr

Tool Governance for Pydantic AI Agents

15
Experimental
101 eneagizzarelli/SYNAPSE

SYNAPSE (SYNthetic AI Pot for Security Enhancement) and SYNAPSE-to-MITRE...

15
Experimental
102 Mehul-Gupta-SMH/Silver-Bullet

Silver Bullet is a Python toolkit for comparing two paragraphs or documents...

15
Experimental
103 traversaal-ai/DSBC-Data-Science-Task-Evaluation

Benchmark and evaluate LLMs on data science code generation using the DSBC dataset.

14
Experimental
104 budgetguard-ai/budgetguard-core

A FinOps control plane for AI APIs - Drop-in API gateway that enforces hard...

14
Experimental
105 sanika373/llm-data-quality-monitor

Automated data quality monitoring using LLM (GPT-4o) to generate SQL checks...

14
Experimental
106 meyumer55/enterprise-foundational-model-scaler

A high-level framework for fine-tuning and deploying foundational models...

14
Experimental
107 kiquetal/course-zero-trust-fundamentals

O'Reilly Live Course: Zero Trust Security Fundamentals — covering Zero Trust...

14
Experimental
108 jthiruveedula/llmops-mlflow-vertexai

LLMOps platform integrating MLflow experiment tracking, Vertex AI model...

14
Experimental
109 jthiruveedula/llmops-evaluation-framework

Production LLMOps platform with automated evaluation, A/B testing, prompt...

14
Experimental
110 jthiruveedula/real-time-llm-streaming-platform

Kafka + Spark Streaming + LLM inference pipeline for real-time document...

14
Experimental
111 Naresh1401/LLM-safety-guardrails

Production-ready LLM safety layer: prompt injection detection, PII...

14
Experimental
112 GauJosh/devops-genai

Production-style GenAI platform lab for CI/CD failure analysis, including...

14
Experimental
113 oliverweissl/SMOO

A testing framework for ML systems

13
Experimental
114 BabarAli93/GAIKube

[TCCN 24] GAIKube: Generative AI-based Proactive Kubernetes Container...

13
Experimental
115 awaescher/Olmolo

Ollama Model Loader: Keeping Ollama models warm

12
Experimental
116 bignacio/llama.up

Provision your own LLMA backend on a public cloud provider

12
Experimental
117 tmam-dev/tmam

tmam is an open-source observability platform that gives you deep, real-time...

12
Experimental
118 RenaudGaudron/MMLU_benchmark

An easy-to-use and standardised framework for evaluating Large Language...

12
Experimental
119 juliensimon/radar-evaluator

A professional, extensible framework for evaluating and comparing Large...

12
Experimental
120 Dineshkumar0705/atlas-ai-observability

Full-stack AI Trust & Observability Platform for LLM-based Systems (FastAPI...

12
Experimental
121 RenaudGaudron/oeis-sequences-benchmark

A Python toolkit and benchmark dataset for predicting the next term in OEIS...

12
Experimental
122 vlimkv/ai-project-tracker

Full-stack AI Project Manager with Self-Hosted LLM (llama.cpp). Generates...

12
Experimental
123 witchnya/easykubeai

easy kubeai

12
Experimental
124 ayush585/hallucination-detector

Developed as part of IEM HackOsis 2.0 under Problem Statement HOGN02. Team...

12
Experimental
125 svilupp/Spehulak.jl

GenAI observability application in Julia

12
Experimental
126 dileepkreddy5/secure-llm-gateway

Production-grade AI security middleware with async micro-batching, prompt...

12
Experimental
127 nehamaheshh/LLM-Drift-Monitor

Production-style LLM drift monitoring: semantic, structural, safety, and...

11
Experimental
128 Deepakkasyapa11/LLMops-Computed-Grid-Training

Production-centric LLMOps framework designed to bridge the gap between AI...

11
Experimental
129 th3w1zard1/llm_fallbacks

Aggregates, sorts, and organizes various GenAI LLM providers into...

11
Experimental
130 Tarunjit45/local-ai-safety-auditor

An implementation of Asynchronous AI Oversight using local Small Language...

11
Experimental
131 oriolrius/from-mlops-to-llmops

Educational materials for understanding the evolution from MLOps to LLMOps....

11
Experimental
132 parthamehta123/cloudops-ai-monitor

AI-powered CloudOps monitoring system — anomaly detection with PyTorch,...

11
Experimental
133 cathy841106/ai-hallucination-detect

A tool for detecting hallucinations in domain-specific LLM outputs. It...

11
Experimental
134 adrianhdezm/llm-sdk

This is just another SDK for the common LLM API providers.

11
Experimental
135 alexei-led/cloud-inspector

EXPERIMENT: Cloud Inspector identifies cloud resources based on user...

11
Experimental
136 charanpool/llm-cogs-optmizer

Intelligent middleware that reduces LLM COGS by routing queries between...

11
Experimental
137 rawatshaurya/llm-drift-monitor

Production-style LLM drift monitoring: semantic, structural, safety, and...

11
Experimental
138 CodeWithPraveen/ps-genai-hallucinations

Course demos for identifying, mitigating, and preventing hallucinations in...

11
Experimental
139 glzbcrt/llm-tools-on-demand

Use semantic queries to find relevant tools for LLM use.

10
Experimental
140 sezer-muhammed/GenAIJury

Framework for multi-agent LLM systems to evaluate, critique, and improve...

10
Experimental
141 ghr8635/LLM-based-Agent-for-Driver-Sleepiness-Detection-and-Mitigation-in-Automotive-Systems

An AI-driven automotive agent utilizing Large Language Models (LLMs) and...

10
Experimental
142 devopscodegen/devopscodegen-common

Common python modules for all devops code generators like pipeline code...

10
Experimental
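
The score and tier pairs in the table above suggest that each tier corresponds to a score band. The Python sketch below maps a score to a tier under inferred cutoffs. Only the "above 70" rule for Verified is stated on this page. The other two boundaries are assumptions read off the listed rows (50 is Established, 49 is Emerging, 31 is Emerging, 28 is Experimental), and the function name `tier_for` is hypothetical.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Cutoffs are inferred from the rows on this page, not from any
    published scoring rule, so treat them as assumptions.
    """
    if score > 70:       # stated on the page: "above 70 (Verified tier)"
        return "Verified"
    if score >= 50:      # assumption: 50 is listed as Established, 49 as Emerging
        return "Established"
    if score >= 30:      # assumption: 31 is Emerging, 28 is Experimental; exact cutoff unknown
        return "Emerging"
    return "Experimental"


# Example usage with scores that appear in the table:
#   tier_for(82)  gives "Verified"
#   tier_for(28)  gives "Experimental"
```

Scores of exactly 70, 30, or 29 do not appear in the table, so the behaviour at those values is a guess.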