VIGINUM-FR/D3lta

A Python implementation of the D3lta algorithm for duplicated textual content detection

40 / 100 (Emerging)

Combines semantic embeddings (Universal Sentence Encoder or custom models) with FAISS-based similarity search and grapheme-level analysis to classify duplicates across three categories: copy-pasta, rewording, and cross-lingual translation. The approach uses configurable thresholds for semantic, linguistic, and character-level matching to balance precision across different duplication types. Includes a synthetic evaluation dataset of 2,985 multilingual documents with 1.5M annotated pairs generated via LLM transformations.
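As a rough illustration of how threshold-based classification into the three categories could work, here is a minimal sketch. It is not the D3lta implementation or its API: the function names, the threshold values, and the decision order are assumptions made for illustration. A brute-force NumPy cosine-similarity matrix stands in for the FAISS search, and `difflib.SequenceMatcher` stands in for the grapheme-level comparison. Embeddings are passed in as plain vectors, so any encoder could supply them.

```python
from difflib import SequenceMatcher

import numpy as np

# Illustrative thresholds only; these are NOT the project's defaults.
SEMANTIC_THRESHOLD = 0.85  # minimum cosine similarity to count as a duplicate
CHAR_THRESHOLD = 0.70      # minimum character-level ratio to count as copy-pasta


def cosine_matrix(embeddings):
    """Pairwise cosine similarity (brute-force stand-in for a FAISS search)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def classify_pair(text_a, text_b, lang_a, lang_b, semantic_sim):
    """Return a duplicate category for one pair, or None if not a duplicate."""
    if semantic_sim < SEMANTIC_THRESHOLD:
        return None
    if lang_a != lang_b:
        return "translation"
    # Same language: near-identical characters means copy-pasta, else rewording.
    char_sim = SequenceMatcher(None, text_a, text_b).ratio()
    return "copy-pasta" if char_sim >= CHAR_THRESHOLD else "rewording"


def find_duplicates(texts, langs, embeddings):
    """Classify every pair of documents; returns (i, j, category) tuples."""
    sims = cosine_matrix(np.asarray(embeddings, dtype=float))
    results = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            label = classify_pair(texts[i], texts[j], langs[i], langs[j], sims[i, j])
            if label is not None:
                results.append((i, j, label))
    return results


if __name__ == "__main__":
    texts = [
        "The cat sat on the mat.",
        "The cat sat on the mat!",
        "A feline rested upon the rug.",
        "Le chat était assis sur le tapis.",
        "Stock prices fell sharply today.",
    ]
    langs = ["en", "en", "en", "fr", "en"]
    # Hand-made toy vectors standing in for real sentence embeddings.
    embeddings = [
        [1.0, 0.0, 0.0],
        [1.0, 0.01, 0.0],
        [0.95, 0.2, 0.0],
        [0.96, 0.1, 0.1],
        [0.0, 0.0, 1.0],
    ]
    for i, j, label in find_duplicates(texts, langs, embeddings):
        print(i, j, label)
```

The all-pairs loop is quadratic in the number of documents, which is the cost an approximate nearest-neighbour index such as FAISS is meant to avoid on larger corpora.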

No commits in the last 6 months.

Stale 6m, No Package, No Dependents
Maintenance 2 / 25
Adoption 8 / 25
Maturity 16 / 25
Community 14 / 25


Stars: 58
Forks: 8
Language: Jupyter Notebook
License: MIT
Last pushed: Jul 31, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/VIGINUM-FR/D3lta"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
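The same request can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions: the source does not document the response format, so the code assumes the endpoint returns JSON, and it assumes the path segments are a category (`embeddings`), an owner, and a repository name, as in the curl example. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL; the segment meanings are assumed from the curl example."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the quality data. Assumes a JSON response (not confirmed by the source)."""
    with urlopen(quality_url(category, owner, repo), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Performs a real network request, which counts against the daily limit.
    print(fetch_quality("embeddings", "VIGINUM-FR", "D3lta"))
```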