VIGINUM-FR/D3lta
A Python implementation of the D3lta algorithm for duplicated textual content detection
Combines semantic embeddings (Universal Sentence Encoder or custom models) with FAISS-based similarity search and grapheme-level analysis to classify duplicates into three categories: copy-pasta, rewording, and cross-lingual translation. Separate, configurable thresholds for semantic, linguistic, and character-level matching let users tune precision for each duplication type. Includes a synthetic evaluation dataset of 2,985 multilingual documents with 1.5M annotated pairs generated via LLM transformations.
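The three-way classification described above can be illustrated with a minimal sketch. This is not the D3lta code or its API: the threshold values, the function names, and the document dictionary layout are all hypothetical. Character-trigram Jaccard overlap stands in for the grapheme-level analysis, and plain cosine similarity over toy vectors stands in for the embedding model and the FAISS search.

```python
from math import sqrt

# Hypothetical thresholds, chosen only for illustration (not D3lta's defaults).
SEMANTIC_THRESHOLD = 0.85
GRAPHEME_THRESHOLD = 0.70


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def char_trigrams(text):
    """Lowercase character trigrams, a simple stand-in for grapheme features."""
    t = " ".join(text.lower().split())
    return {t[i:i + 3] for i in range(len(t) - 2)}


def grapheme_similarity(a, b):
    """Jaccard overlap of the two texts' character trigram sets."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def classify_pair(doc_a, doc_b):
    """Return 'copy-pasta', 'rewording', 'translation', or None.

    Each document is a dict with hypothetical keys 'text', 'lang',
    and 'embedding' (a precomputed vector).
    """
    # Near-identical surface form: treat as copy-pasta.
    if grapheme_similarity(doc_a["text"], doc_b["text"]) >= GRAPHEME_THRESHOLD:
        return "copy-pasta"
    # Same meaning, different surface form: split on language.
    if cosine(doc_a["embedding"], doc_b["embedding"]) >= SEMANTIC_THRESHOLD:
        if doc_a["lang"] == doc_b["lang"]:
            return "rewording"
        return "translation"
    return None
```

Checking the character-level test first means that a near-verbatim copy is never reported as a rewording, even though its embedding similarity would also pass the semantic threshold. A real pipeline would replace the pairwise loop implied here with an approximate nearest-neighbour search over all embeddings.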
No commits in the last 6 months.
Stars: 58
Forks: 8
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Jul 31, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/VIGINUM-FR/D3lta"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
Higher-rated alternatives
colonelwatch/abstracts-search
Semantic search engine indexing 110 million academic publications
ahr9n/quranic-search-v2
Quranic Lexical/Semantic Search
geetanjaliapp/geetanjali
RAG-powered ethical decision guidance from Bhagavad Geeta. Analyze dilemmas, get structured...
hazemabdelkawy/SunnahGPT
SunnahGPT is a natural language processing (NLP) project aimed at scraping hadith data from the...
mufaizz/FAIZ-AI
FAIZ AI 🔍 – The search bot that finds what others miss. Searches HTTP, FTP, IPFS & Torrent with...