ChenghaoMou/text-dedup

All-in-one text de-duplication

Quality score: 76 / 100 (Verified)

Provides four deduplication algorithms, all configured via TOML: MinHash with LSH for near-duplicates, SimHash for fingerprint-based near-duplicate detection, Bloom filters for exact matches, and suffix arrays for substring deduplication. Uses a modular architecture that supports multiple input formats (parquet, local files) with configurable thresholds, hash parameters, and merge strategies. Includes benchmarks on public datasets (CORE, NEWS-COPY) that report competitive results against embedding-based and hybrid approaches.
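To make the MinHash-with-LSH idea concrete, here is a minimal standard-library sketch. It is not text-dedup's implementation or API: the function names (`shingles`, `minhash_signature`, `estimate_jaccard`, `candidate_pairs`) and the parameters (`num_perm=64`, `bands=16`, word 3-grams) are assumptions chosen for illustration only.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value under each of `num_perm` salted hashes."""
    if not items:
        raise ValueError("cannot build a signature from an empty set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents that agree on any whole band become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

In this sketch, each document is turned into a signature with `minhash_signature(shingles(text))`, and `candidate_pairs` narrows the comparison to documents that share at least one band. A real pipeline would then verify candidates against a similarity threshold and merge them into clusters.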

746 stars and 695 monthly downloads. Actively maintained with 2 commits in the last 30 days. Available on PyPI.

Maintenance 16 / 25
Adoption 17 / 25
Maturity 25 / 25
Community 18 / 25


Stars: 746
Forks: 75
Language: Python
License: Apache-2.0
Last pushed: Mar 09, 2026
Monthly downloads: 695
Commits (30d): 2
Dependencies: 15

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ChenghaoMou/text-dedup"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
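
The curl request above could be issued from Python roughly as follows. This is a sketch under assumptions: the page shows only one example URL, so the `category/owner/repo` path pattern is inferred from it, the response is assumed to be JSON (the page does not document the schema), and the helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Base path taken from the example URL on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, repo):
    """Build the endpoint URL; the path layout is inferred from one example."""
    safe_category = urllib.parse.quote(category, safe="")
    safe_repo = urllib.parse.quote(repo, safe="/")  # keep the owner/repo slash
    return f"{BASE_URL}/{safe_category}/{safe_repo}"


def fetch_quality(category, repo, timeout=10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "ChenghaoMou/text-dedup")` requests the same URL as the curl command; with no key, the 100 requests/day limit stated above applies.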