You're Shipping AI You Can't Measure

1,159 repos are building LLM evaluation infrastructure. Most teams are still eyeballing outputs. Here's the decision guide to what actually works.

Graham Rowe · April 03, 2026 · Updated daily with live data
llm-tools agents rag embeddings prompt-engineering

You shipped an AI feature. It seems to work. Users aren't complaining yet. But when someone asks "how do you know it's good?" the honest answer is: you don't. You're eyeballing outputs, running a few manual tests, and hoping.

You're not alone. PT-Edge tracks 1,159 repos across 9 subcategories building LLM evaluation infrastructure. The sheer number tells the story: everyone knows this problem needs solving, and nobody agrees on how.

Here's what the data says about what's actually production-ready and what's still research.

The eval landscape has five layers

Most developers lump "evaluation" into one bucket. It's actually five distinct problems, each at a different maturity level:

  1. Output quality — Is the LLM response correct, relevant, and safe?
  2. RAG pipeline — Is the retrieval feeding the right context?
  3. Agent behaviour — Can the agent complete multi-step tasks reliably?
  4. Code generation — Does the generated code actually run and pass tests?
  5. Model comparison — Which model is best for your specific use case and budget?

Each layer has different tools, different maturity, and different traps. Let's walk through each.

Layer 1: Output quality evaluation

The most common starting point — and the most confusing. Three approaches compete:

LLM-as-judge uses a stronger model to grade a weaker one. It's the default because it's easy to set up: send the output to GPT-4 with a rubric, get a score back. The problem is that judges have systematic biases — they prefer longer outputs, favour their own generation patterns, and can't reliably detect subtle factual errors.

Reference-based metrics compare outputs against gold-standard answers using exact match, semantic similarity, or entailment. More reliable but requires a labelled test set, which most teams don't have.

Programmatic checks use deterministic assertions: does the output contain required fields, follow the schema, stay under the length limit, avoid forbidden patterns? Less glamorous but more trustworthy for production monitoring.
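To make the programmatic layer concrete, here is a minimal sketch of deterministic output checks. The field names, length limit, and forbidden patterns are illustrative assumptions, not anything a specific framework prescribes:

```python
import json
import re

FORBIDDEN = [r"(?i)as an ai language model", r"\bTODO\b"]  # illustrative patterns

def check_output(raw: str, required_fields=("summary", "sources"),
                 max_chars=2000) -> list[str]:
    """Return the list of failed checks; an empty list means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    failures = [f"missing field: {f}" for f in required_fields if f not in data]
    if len(raw) > max_chars:
        failures.append(f"over {max_chars} chars")
    failures += [f"forbidden pattern: {p}" for p in FORBIDDEN if re.search(p, raw)]
    return failures

print(check_output('{"summary": "ok", "sources": []}'))  # []
```

Because every check is deterministic, a failure is always actionable: the same input fails the same way every time, which is exactly what you want from production monitoring.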

| Project | Score | Stars | Downloads/mo | Approach |
|---|---|---|---|---|
| giskard-oss | 70/100 | 5,158 | | Modular checks (LLM-judge + deterministic + scenario-based) |
| prometheus-eval | 37/100 | 1,051 | | Open-source LLM-as-judge |
| uptrain | 55/100 | 2,339 | 2,643 | Automated grading with dashboards |

Giskard (70/100, 5,158 stars) is the most complete general-purpose eval framework we track. It recently split into lightweight packages (giskard-checks, giskard-scan, giskard-rag) so you only install what you need. It supports both deterministic assertions and LLM-as-judge, which is the right architecture — you want both.

Prometheus (1,051 stars) takes a different approach: open-source models fine-tuned specifically for evaluation. If you're uncomfortable sending outputs to GPT-4 for judging (data privacy, cost), Prometheus lets you run the judge locally.

Our recommendation: Start with programmatic checks for your known requirements (schema validation, length limits, required fields). Add LLM-as-judge for subjective quality. Don't rely on LLM-as-judge alone — it will tell you everything looks great until it suddenly doesn't.
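The LLM-as-judge half of that combination can be sketched as two pure helpers. The rubric wording and the "Score: N" reply convention are assumptions here; the actual judge call depends on your provider and is left as a comment:

```python
import re

RUBRIC = (
    "Score the ANSWER from 1 to 5 for correctness, relevance, and safety. "
    "Reply with 'Score: N' followed by one sentence of justification."
)

def build_judge_prompt(question: str, answer: str) -> str:
    # Assemble the grading prompt that gets sent to the judge model.
    return f"{RUBRIC}\n\nQUESTION: {question}\nANSWER: {answer}"

def parse_score(judge_reply: str):
    # Judge models are chatty; extract the numeric score defensively
    # and return None rather than crash when no score is present.
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

# In production, build_judge_prompt(...) goes to a strong model via its
# chat API, and the raw reply comes back through parse_score:
print(parse_score("Score: 4. Mostly correct but verbose."))  # 4
```

Keeping prompt construction and score parsing as plain functions means the judge pipeline itself is unit-testable without spending a single API call.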

Layer 2: RAG evaluation

If you're building retrieval-augmented generation, the eval problem doubles: you need to measure both retrieval quality (did you fetch the right documents?) and generation quality (did the LLM use them correctly?).

This is the most mature eval subcategory, and there's a clear category leader:

| Project | Score | Stars | Downloads/mo | What it measures |
|---|---|---|---|---|
| ragas | 70/100 | 12,927 | | Context relevance, faithfulness, answer correctness, noise robustness |

RAGAS (70/100, 12,927 stars) defines the RAG eval space. It breaks evaluation into four dimensions: faithfulness (does the answer follow the context?), answer relevance, context precision (how much retrieved context was useful?), and context recall (did you find everything you needed?). It generates synthetic test sets from your production data so you don't need manually labelled examples.
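To show what the retrieval-side dimensions mean, here is a toy version of context precision and recall. RAGAS derives relevance judgments with an LLM; this sketch assumes ground-truth relevance labels are already available, which is a simplification:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of retrieved chunks that were actually useful.
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of the needed chunks that retrieval actually found.
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)

retrieved = ["doc1", "doc2", "doc3"]
relevant = {"doc1", "doc4"}
print(round(context_precision(retrieved, relevant), 3))  # 0.333
print(context_recall(retrieved, relevant))               # 0.5
```

Low precision means you are stuffing the prompt with noise; low recall means the answer cannot possibly be grounded, no matter how good the generator is.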

RAGAS has genuine production adoption, not just GitHub stars. It integrates with LangChain and major observability platforms.

103 other repos sit in the RAG evaluation subcategory, but none come close to RAGAS in adoption or quality. If you're building RAG, start here.

Layer 3: Agent evaluation

This is where the landscape is most chaotic. Evaluating agents is fundamentally harder than evaluating single LLM calls because agents take multi-step actions with side effects. A coding agent might write correct code that passes tests but introduces a security vulnerability. A research agent might retrieve accurate information but miss critical context.

PT-Edge tracks 150 repos in agent evaluation and benchmarking. Most are academic benchmarks, not production tools:

| Project | Score | Stars | Focus |
|---|---|---|---|
| plano | 67/100 | 5,953 | Production agent testing with LLM-judged scenarios |
| AgentBench | 55/100 | 3,234 | Multi-environment agent benchmark (OS, DB, web) |
| OSWorld | 72/100 | 2,664 | Real computer environment benchmark (screenshot grounding) |
| chatarena | 41/100 | 1,540 | Multi-agent evaluation via simulated conversations |

Plano (5,953 stars) is the closest thing to production agent testing we've found. It lets you define scenarios with expected outcomes and uses LLM judges to score agent performance across dimensions you specify.

The hard truth: agent evaluation is still mostly manual. Most teams deploying agents evaluate them by running scenarios and reviewing traces by hand. The tooling gap here is the widest in the entire eval landscape, and it's where the most building is happening — 150 repos and growing.
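Until the tooling matures, a minimal scenario harness is still worth having. This is a generic sketch (not Plano's API, and the agent here is a stub): each scenario pairs a task with a deterministic success predicate, and every run can be logged as a trace for manual review:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    task: str
    check: Callable[[str], bool]  # deterministic success predicate

def run_scenarios(agent: Callable[[str], str],
                  scenarios: list["Scenario"]) -> dict[str, bool]:
    # Run each scenario through the agent and record pass/fail.
    # In practice you would also persist the full trace for review.
    return {s.name: s.check(agent(s.task)) for s in scenarios}

def toy_agent(task: str) -> str:
    # Stub standing in for a real multi-step agent.
    return "created ticket #123" if "ticket" in task else "done"

results = run_scenarios(toy_agent, [
    Scenario("files a ticket", "open a ticket for the login bug",
             lambda out: "ticket" in out),
])
print(results)  # {'files a ticket': True}
```

Each failure mode you discover in trace review becomes a new `Scenario`, so the suite grows exactly where your agent actually breaks.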

Layer 4: Code evaluation

If you're using LLMs to generate code — and in 2026, most development teams are — the eval question is more tractable: does the code run, pass tests, and handle edge cases?

| Project | Score | Stars | What it tests |
|---|---|---|---|
| evalplus | 63/100 | 1,699 | Rigorous code evaluation with augmented test suites (NeurIPS 2023) |

EvalPlus (1,699 stars) is the standard for rigorous code eval. It augments the HumanEval and MBPP benchmarks with 80x more test cases, catching failures that the base benchmarks miss. The original HumanEval routinely passed code that looked correct but failed on edge cases; EvalPlus closes that gap.

Code evaluation is the most mature layer because the success criterion is binary: does the code pass the tests or not? No LLM-as-judge ambiguity, no subjective rubrics. If your AI feature is code generation, this is the easiest place to build reliable evals.
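The binary criterion can be sketched in a few lines: exec the generated code in a scratch namespace and run it against input/output cases. This is a toy only; `exec` on untrusted model output is unsafe, and real harnesses such as EvalPlus run candidates in a sandbox:

```python
def passes_tests(candidate_src: str, cases, func_name: str = "solve") -> bool:
    """Exec generated code and check it against (args, expected) test cases.

    Toy harness: exec() on untrusted model output is unsafe; use a
    sandboxed runner in production.
    """
    ns: dict = {}
    try:
        exec(candidate_src, ns)
        fn = ns[func_name]
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, crashes, and missing functions all count as failure.
        return False

candidate = "def solve(a, b):\n    return a + b\n"
edge_cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(passes_tests(candidate, edge_cases))  # True
```

The catch-all `except` is deliberate: for eval purposes, a candidate that crashes is indistinguishable from one that returns the wrong answer.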

Layer 5: Model comparison

Which model should you use? The "it depends" answer is unsatisfying but true — and the benchmarking landscape is built to help you decide.

| Project | Score | Stars | Downloads/mo | Scope |
|---|---|---|---|---|
| opencompass | 76/100 | 6,752 | | 100+ benchmarks across language, reasoning, code, safety |
| VLMEvalKit | 72/100 | 3,894 | | Vision-language model evaluation (GPT-4V, Gemini, etc.) |
| lmms-eval | 90/100 | 3,883 | 9,061 | Multimodal: text, image, video, audio |
| mteb | 99/100 | 3,159 | 1,555,633 | Embedding model benchmarks (the standard for vector search quality) |

OpenCompass (76/100, 6,752 stars) is the most comprehensive benchmarking platform. It covers 100+ benchmarks and can evaluate both API models and local models in a unified framework.

For embeddings specifically, MTEB (3,159 stars, 1,555,633 downloads/month) is the undisputed standard. If you're choosing an embedding model, MTEB scores are the metric to compare.

The real problem: nobody has an eval pipeline

The tools exist. The problem is that most teams never wire them together into something that runs automatically. Here's what a real eval pipeline looks like:

  1. Baseline test suite. 50-100 representative inputs with expected outputs. Not comprehensive — just enough to catch regressions. Giskard or RAGAS can generate these from your production data.
  2. CI integration. Run evals on every prompt change, model version change, or RAG pipeline change. Block merges that reduce scores below thresholds.
  3. Production monitoring. Sample live traffic, run async evals, alert on quality degradation. This is where most teams have zero coverage.
  4. Cost tracking. Eval the cost-quality tradeoff: a cheaper model might be 95% as good at 10% of the cost. Without evals, you can't make this decision with data.
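Steps 2 and 4 can be wired together in a few lines of CI glue. A minimal sketch, assuming you already have per-metric scores from an eval run; the metric names and thresholds are illustrative, not standards:

```python
THRESHOLDS = {"faithfulness": 0.85, "answer_relevance": 0.80}  # tune per product

def eval_gate(scores: dict[str, float]) -> list[str]:
    # Metrics below their floor; an empty list means the merge may proceed.
    # A metric missing from the run counts as a failure, not a pass.
    return [m for m, floor in THRESHOLDS.items() if scores.get(m, 0.0) < floor]

def cost_per_quality_point(monthly_cost: float, mean_score: float) -> float:
    # Crude ratio for comparing a cheap model against an expensive one
    # on the same eval suite (lower is better).
    return monthly_cost / mean_score

scores = {"faithfulness": 0.91, "answer_relevance": 0.84}  # from an eval run
print(eval_gate(scores))                                   # []
print(round(cost_per_quality_point(400.0, 0.90), 1))       # 444.4
```

In CI, a non-empty `eval_gate` result would fail the job and block the merge; the cost ratio makes the "95% as good at 10% of the cost" comparison a number instead of a hunch.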

The HuggingFace Evaluation Guidebook (2,075 stars) is the best free resource for understanding evaluation theory. It's not a tool — it's the conceptual foundation that helps you decide which tools to use and why.

Where this is heading

Anthropic recently acknowledged that their safety evaluation methods aren't keeping pace with capability improvements. If the frontier labs can't keep up with eval, production teams have no chance without better tooling.

Three trends to watch:

  • Eval-as-infrastructure. Evaluation moving from a researcher activity to a CI/CD primitive. Leva (133 stars) is built specifically for Rails apps with production data — evaluation as a framework feature, not an add-on.
  • Agent eval will mature fast. 150 repos in the agent evaluation subcategory with average quality still low — this is exactly where the amnesia space was 12 months ago. Expect consolidation.
  • LLM-as-judge will get better and cheaper. Prometheus proves that open-source judge models work. As these improve, the cost of running evals drops from dollars to fractions of a cent, making continuous evaluation viable.

What to do right now

If you're shipping an AI feature with no evaluation, here's the minimum viable eval pipeline:

  1. RAG? Install RAGAS. Generate a test set from your production data. Run it on every pipeline change.
  2. General LLM outputs? Start with Giskard for programmatic checks + LLM-as-judge. Define pass/fail thresholds.
  3. Code generation? Use EvalPlus for rigorous test suites.
  4. Choosing a model? Check OpenCompass scores for your task type. For embeddings, use MTEB.
  5. Agents? This is still manual. Log traces, review them regularly, build scenario tests as you find failure modes. Watch Plano for when the tooling matures.

The bar isn't perfection. The bar is knowing whether your AI is getting better or worse. Right now, most teams can't answer that question.

Go deeper

Every project mentioned here has a quality-scored page in our directory, updated daily.
