gabe-mousa/Apolien

AI Safety Evaluation Library

/ 100

Emerging

Evaluates chain-of-thought faithfulness by intervening in a model's reasoning steps to distinguish post-hoc explanations from genuine reasoning. Supports local models via Ollama and cloud APIs (Claude, OpenAI) through a unified evaluator interface. Runs safety tests across configurable datasets and logs conversation transcripts and results to files. Currently focuses on faithfulness detection, with an extensible architecture for additional safety metrics such as sycophancy and deception.
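To make the idea of intervening in reasoning steps concrete, here is a minimal sketch of one common intervention: truncating the chain of thought and checking whether the final answer changes. It is not Apolien's API. The names `truncation_sensitivity`, `AnswerFn`, and the two toy models are hypothetical, and a real run would replace the toy models with calls to an actual model.

```python
from typing import Callable, List

# Hypothetical signature (an assumption, not Apolien's interface): takes a
# question plus a possibly truncated list of reasoning steps and returns the
# model's final answer as a string.
AnswerFn = Callable[[str, List[str]], str]


def truncation_sensitivity(question: str, steps: List[str], answer_fn: AnswerFn) -> float:
    """Fraction of truncation points where the answer differs from the full-reasoning answer.

    A score near 0 suggests the answer does not depend on the stated reasoning
    (a sign of post-hoc explanation); a score near 1 suggests it does.
    """
    if not steps:
        raise ValueError("need at least one reasoning step")
    baseline = answer_fn(question, steps)
    # Re-ask with only the first k steps visible, for k = 0 .. len(steps) - 1.
    changed = sum(
        1 for k in range(len(steps)) if answer_fn(question, steps[:k]) != baseline
    )
    return changed / len(steps)


def toy_faithful_model(question: str, steps: List[str]) -> str:
    # Stand-in model: reads its answer off the last visible reasoning step.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"


def toy_post_hoc_model(question: str, steps: List[str]) -> str:
    # Stand-in model: ignores the reasoning and always gives the same answer.
    return "12"


steps = ["3 + 4 = 7", "7 + 5 = 12"]
faithful_score = truncation_sensitivity("What is 3 + 4 + 5?", steps, toy_faithful_model)
post_hoc_score = truncation_sensitivity("What is 3 + 4 + 5?", steps, toy_post_hoc_model)
print(faithful_score, post_hoc_score)
```

The score is only a rough signal: a model can legitimately reach the right answer before its reasoning ends, so in practice results are usually aggregated over a dataset and not judged per question.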

Available on PyPI.

Maintenance 6 / 25

Adoption 9 / 25

Maturity 18 / 25

Community 12 / 25

Language

Python

License

MIT

Featured in

You're Shipping AI You Can't Measure

Higher-rated alternatives

microsoft/promptbench

A unified evaluation framework for large language models

uptrain-ai/uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications....

microsoftarchive/promptbench

A unified evaluation framework for large language models

babelcloud/LLM-RGB

LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios systematically.

PromptMixerDev/prompt-mixer-app-ce

A desktop application for comparing outputs from different Large Language Models (LLMs).
