VIGINUM-FR/D3lta

A Python implementation of the D3lta algorithm for duplicated textual content detection


Emerging

Combines semantic embeddings (Universal Sentence Encoder or custom models) with FAISS-based similarity search and grapheme-level analysis to classify duplicates into three categories: copy-pasta, rewording, and cross-lingual translation. Separate configurable thresholds for semantic, linguistic, and character-level matching let precision be tuned for each duplication type. Includes a synthetic evaluation dataset of 2,985 multilingual documents with 1.5M annotated pairs, generated via LLM transformations.
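The three-way classification described above can be illustrated with a minimal sketch. This is not the D3lta API: the threshold names and values, the `classify_pair` helper, and the toy three-dimensional vectors are all hypothetical. A brute-force cosine similarity stands in for the FAISS search, and `difflib.SequenceMatcher` stands in for the grapheme-level comparison.

```python
from difflib import SequenceMatcher
from math import sqrt

# Hypothetical thresholds, chosen only for illustration.
THRESHOLD_GRAPHEME = 0.90  # character-level match (copy-pasta)
THRESHOLD_SEMANTIC = 0.80  # same-language embedding match (rewording)
THRESHOLD_LANGUAGE = 0.75  # cross-language embedding match (translation)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_pair(a, b):
    """Return 'copy-pasta', 'rewording', 'translation', or None."""
    # Near-identical character sequences count as copy-pasta.
    char_sim = SequenceMatcher(None, a["text"], b["text"]).ratio()
    if char_sim >= THRESHOLD_GRAPHEME:
        return "copy-pasta"
    # Otherwise fall back to embedding similarity, split by language.
    sem_sim = cosine(a["vec"], b["vec"])
    if a["lang"] != b["lang"]:
        return "translation" if sem_sim >= THRESHOLD_LANGUAGE else None
    return "rewording" if sem_sim >= THRESHOLD_SEMANTIC else None


# Toy documents; the vectors are made up and stand in for real embeddings.
docs = [
    {"text": "The cat sat on the mat.", "lang": "en", "vec": [0.9, 0.1, 0.0]},
    {"text": "The cat sat on the mat!", "lang": "en", "vec": [0.9, 0.1, 0.0]},
    {"text": "A cat was sitting on the rug.", "lang": "en", "vec": [0.85, 0.2, 0.05]},
    {"text": "Le chat était assis sur le tapis.", "lang": "fr", "vec": [0.88, 0.15, 0.02]},
    {"text": "Stock markets fell sharply today.", "lang": "en", "vec": [0.0, 0.1, 0.95]},
]

# Compare the first document against every other one.
for other in docs[1:]:
    print(repr(other["text"]), "->", classify_pair(docs[0], other))
```

Checking the character-level score first means near-verbatim copies are labelled copy-pasta even when their embeddings would also pass the semantic threshold. Whether the library applies its checks in that order is an assumption of this sketch.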

No commits in the last 6 months.

Stale 6m, No Package, No Dependents

Maintenance 2 / 25

Adoption 8 / 25

Maturity 16 / 25

Community 14 / 25

Language: Jupyter Notebook

License: MIT

Higher-rated alternatives

colonelwatch/abstracts-search

Semantic search engine indexing 110 million academic publications

ahr9n/quranic-search-v2

Quranic Lexical/Semantic Search

geetanjaliapp/geetanjali

RAG-powered ethical decision guidance from Bhagavad Geeta. Analyze dilemmas, get structured...

hazemabdelkawy/SunnahGPT

SunnahGPT is a natural language processing (NLP) project aimed at scraping hadith data from the...

mufaizz/FAIZ-AI

FAIZ AI 🔍 – The search bot that finds what others miss. Searches HTTP, FTP, IPFS & Torrent with...
