LLM Observability Platforms (Prompt Engineering Tools)

Tools for monitoring, tracing, evaluating, and debugging LLM applications in production. Includes end-to-end observability, real-time metrics, automated evals, and prompt management dashboards. Does NOT include general application monitoring, synthetic data generation, or agent training frameworks.

There are 27 LLM observability platform tools tracked. Five score above 70 (Verified tier). The highest-rated is langfuse/langfuse at 95/100, with 23,106 stars and 3,912,905 monthly downloads. Six of the top 10 are actively maintained.

Get all 27 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=prompt-engineering&subcategory=llm-observability-platforms&limit=27"
```

Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
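The same query can be issued programmatically. The sketch below builds the request URL from the documented parameters and then filters a response for Verified-tier tools (score above 70). The field names in the sample payload (`name`, `score`, `tier`) are assumptions, not documented schema; adjust them to match the actual JSON the endpoint returns.

```python
from urllib.parse import urlencode

# Build the dataset query URL from the parameters shown in the curl example.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "prompt-engineering",
    "subcategory": "llm-observability-platforms",
    "limit": 27,
}
url = f"{BASE}?{urlencode(params)}"
print(url)

# Filter a payload for Verified-tier tools (score above 70).
# NOTE: the keys "name", "score", and "tier" are assumed field names;
# check them against the real response before relying on this.
sample = [
    {"name": "langfuse/langfuse", "score": 95, "tier": "Verified"},
    {"name": "Scale3-Labs/langtrace", "score": 44, "tier": "Emerging"},
]
verified = [t["name"] for t in sample if t["score"] > 70]
print(verified)  # ['langfuse/langfuse']
```

In a real script, replace `sample` with the parsed JSON body of a GET request to `url`.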

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | langfuse/langfuse | 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals,... | 95 | Verified |
| 2 | Arize-ai/phoenix | AI Observability & Evaluation | 94 | Verified |
| 3 | Mirascope/mirascope | The LLM Anti-Framework | 87 | Verified |
| 4 | Helicone/helicone | 🧊 Open source LLM observability platform. One line of code to monitor,... | 81 | Verified |
| 5 | Agenta-AI/agenta | The open-source LLMOps platform: prompt playground, prompt management, LLM... | 72 | Verified |
| 6 | algorithmicsuperintelligence/optillm | Optimizing inference proxy for LLMs | 62 | Established |
| 7 | TensorOpsAI/LLMstudio | Framework to bring LLM applications to production | 60 | Established |
| 8 | Scale3-Labs/langtrace | Langtrace 🔍 is an open-source, OpenTelemetry-based end-to-end... | 44 | Emerging |
| 9 | langfuse/langfuse-java | 🪢 Auto-generated Java client for the Langfuse API | 42 | Emerging |
| 10 | AnchoringAI/anchoring-ai | An open-source no-code tool for teams to collaborate on building,... | 39 | Emerging |
| 11 | tenemos/langwatch | The open LLM Ops platform - Traces, Analytics, Evaluations, Datasets and... | 36 | Emerging |
| 12 | whylabs/langkit | 🔍 LangKit: An open-source toolkit for monitoring Large Language Models... | 36 | Emerging |
| 13 | TrentPierce/PolyCouncil | PolyCouncil is an open-source multi-model deliberation engine for LM Studio.... | 36 | Emerging |
| 14 | brokle-ai/brokle | The AI engineering platform for AI teams. Observability, evaluation, and... | 35 | Emerging |
| 15 | alpha-one-index/ai-llmops-index | Comprehensive LLMOps reference index: observability platforms, inference... | 23 | Experimental |
| 16 | as32608/openinspector | A lightweight, local-first observability proxy and dashboard designed to... | 22 | Experimental |
| 17 | alebgn1/ai-llmops-index | Provide a comprehensive, regularly updated index of AI LLM providers,... | 22 | Experimental |
| 18 | chirindaopensource/multi_agent_system_architecture_for_federal_funds_target_rate_prediction | End-to-End Python implementation of "FedSight AI" multi-agent system for... | 18 | Experimental |
| 19 | ksm26/Evaluating-AI-Agents | A hands-on course repository for Evaluating AI Agents, created with Arize... | 16 | Experimental |
| 20 | Uplay111/Loki-s-Insight- | A lightweight visual dashboard to inspect and edit OpenClaw AI agent memory... | 15 | Experimental |
| 21 | vshwsh/prod-evals-cookbook | 🎯 Build effective AI evaluations through a hands-on tutorial, using a... | 14 | Experimental |
| 22 | Tarunjit45/ModelPulse | ModelPulse helps maintain model reliability and performance by providing... | 12 | Experimental |
| 23 | MagicTeaMC/dnsLM | dnsLM: Where AI meets DNS—because even domains deserve a little intelligence! | 12 | Experimental |
| 24 | VicRejkia/LLM-Sherpa | A Python GUI tool to package a codebase into a single, context-rich Markdown... | 11 | Experimental |
| 25 | alhemdrew/self-hosted-llm-infrastructure | Deployment of a self-hosted LLM infrastructure using Ollama and Open WebUI... | 11 | Experimental |
| 26 | marco-ruiz/llm-repo | Framework that translates LLM responses to structured data models | 11 | Experimental |
| 27 | rahatmoktadir03/llm-evaluation-platform | A full-stack web application for comparing and analyzing the performance of... | 10 | Experimental |
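The tier labels track the scores. Only the Verified cutoff (score above 70) is stated explicitly; the other boundaries in the sketch below are inferred from the listed rows and may not match the dataset's actual thresholds.

```python
def tier(score: int) -> str:
    """Map a quality score to its tier label.

    Only the Verified cutoff (above 70) is stated in the text; the
    Established and Emerging boundaries are guesses that happen to be
    consistent with every row in the table above.
    """
    if score > 70:
        return "Verified"
    if score >= 45:        # assumed: listed Established scores are 60-62
        return "Established"
    if score >= 24:        # assumed: listed Emerging scores are 35-44
        return "Emerging"
    return "Experimental"  # listed Experimental scores are 10-23

print(tier(95))  # Verified
print(tier(44))  # Emerging
```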