ChenghaoMou/text-dedup
All-in-one text de-duplication
Provides four deduplication algorithms, all configured via TOML: MinHash with LSH for near-duplicates, SimHash for near-duplicates via bitwise fingerprints (lexical rather than semantic similarity), Bloom filters for exact matches, and suffix arrays for exact-substring deduplication. Uses a modular architecture that supports multiple input formats (Parquet, local files) with configurable thresholds, hash parameters, and merge strategies. Includes benchmarks on public datasets (CORE, NEWS-COPY) that report competitive performance against embedding-based and hybrid approaches.
746 stars and 695 monthly downloads. Actively maintained with 2 commits in the last 30 days. Available on PyPI.
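To illustrate the MinHash-with-LSH approach named in the description, here is a minimal standard-library sketch. It is not text-dedup's API or implementation: the function names (`shingles`, `minhash_signature`, `candidate_pairs`, `find_near_duplicates`) and the parameter values (word 3-grams, 64 hash functions, 16 bands, a 0.7 threshold) are assumptions chosen for illustration only.

```python
import hashlib
from itertools import combinations


def shingles(text, n=3):
    """Split text into a set of overlapping word n-grams."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents sharing any identical band become candidates."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def find_near_duplicates(docs, threshold=0.7, num_perm=64, bands=16):
    """Return sorted (id_a, id_b, score) triples whose estimated similarity meets the threshold."""
    signatures = {
        doc_id: minhash_signature(shingles(text), num_perm)
        for doc_id, text in docs.items()
    }
    results = []
    for a, b in sorted(candidate_pairs(signatures, bands)):
        score = estimated_jaccard(signatures[a], signatures[b])
        if score >= threshold:
            results.append((a, b, score))
    return results
```

Calling `find_near_duplicates` with a dict that maps document IDs to text returns the pairs that pass the threshold. Banding avoids comparing every pair of documents: only documents that collide in at least one band are scored, so the band and row counts trade recall against the number of comparisons.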
Stars: 746
Forks: 75
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 09, 2026
Monthly downloads: 695
Commits (30d): 2
Dependencies: 15
Related tools
loretoparisi/fasttext.js
FastText for Node.js
messense/fasttext-serving
fastText model serving service
vrasneur/pyfasttext
Yet another Python binding for fastText
olegtarasov/FastText.NetWrapper
.NET Standard wrapper for the fastText library. Now works on Windows, Linux, and macOS!
winkjs/wink-jaro-distance
An Implementation of Jaro Distance Algorithm by Matthew A. Jaro