ngpepin/mostsimilar-for-RAG-normalization

Linux CLI tools that compare text files and find nearest neighbours across large directories using TF‑IDF or SimHash, with optional dedup workflows. Useful in RAG pipelines for removing duplicate documents whose MD5/SHA-256/SHA-512 hashes differ but whose contents are the same or similar. Written in C++/C for performance.
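
To illustrate why a cryptographic hash misses duplicates that SimHash catches, here is a minimal Python sketch. It is not the repository's implementation (which is C++/C); the function names `simhash` and `hamming`, the word-level tokenization, and the 64-bit BLAKE2b token hash are all assumptions chosen for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption for this sketch)


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint built from lowercase word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Stable 64-bit hash per token; Python's built-in hash() is salted per process.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


a = "Hello  World\n"
b = "hello world"
# The raw bytes differ, so the MD5 digests differ ...
md5_differs = hashlib.md5(a.encode()).digest() != hashlib.md5(b.encode()).digest()
# ... but both texts yield the same tokens, so the fingerprints match.
distance = hamming(simhash(a), simhash(b))
```

Documents that share most but not all tokens would typically land at a small, nonzero Hamming distance; the cutoff that counts as "duplicate" is a tuning choice this sketch does not prescribe.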

/ 100

Experimental

No Package · No Dependents

Maintenance 10 / 25

Adoption 1 / 25

Maturity 9 / 25

Community 0 / 25

License

MIT

Category

retrieval-ranking-fusion

Last pushed

Feb 27, 2026

Retrieval Ranking Fusion · 62 tools

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/ngpepin/mostsimilar-for-RAG-normalization"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
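
The same request can be sketched in Python using only the standard library. The helper names `quality_url` and `fetch_quality` are hypothetical. The `owner/repo` path layout is inferred from the curl example above, and the JSON response format is an assumption, since the listing does not document it.

```python
import json
import urllib.parse
import urllib.request

# Base path taken from the curl example; treating it as a reusable prefix is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the quality-score URL for one repository."""
    owner_part = urllib.parse.quote(owner, safe="")
    repo_part = urllib.parse.quote(repo, safe="")
    return f"{BASE_URL}/{owner_part}/{repo_part}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the endpoint returns JSON."""
    request = urllib.request.Request(
        quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("ngpepin", "mostsimilar-for-RAG-normalization")` would issue one request against the keyless daily quota, so a batch job may want to cache results rather than refetch them.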

Higher-rated alternatives

beir-cellar/beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across...

superlinear-ai/raglite

🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL

HKUDS/LightRAG

[EMNLP2025] "LightRAG: Simple and Fast Retrieval-Augmented Generation"

illuin-tech/vidore-benchmark

Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.

HKUDS/RAG-Anything

"RAG-Anything: All-in-One RAG Framework"
