derrickburns/generalized-kmeans-clustering
Production-ready K-Means clustering for Apache Spark with pluggable Bregman divergences (KL, Itakura-Saito, L1, etc.). 6 algorithms, 740 tests, cross-version persistence. Drop-in replacement for MLlib with mathematically correct distance functions for probability distributions, spectral data, and count data.
Implements 6 clustering variants (Bisecting, X-Means, Soft/Fuzzy, Streaming, K-Medians, K-Medoids) with pluggable divergence kernels via a pure DataFrame/Spark ML API following the Estimator/Model pattern, enabling seamless pipeline integration. Supports cross-version persistence across Scala 2.12↔2.13 and Spark 3.4↔4.0, with comprehensive test coverage including kernel accuracy validation and determinism checks. Available on Maven Central, PyPI, and Databricks, supporting Scala/SBT, spark-submit, and PySpark workflows.
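To make "pluggable divergence" concrete, the following is a minimal pure-Python sketch of the idea, not this library's API. The names `bregman`, `assign`, and `update` are hypothetical and chosen for illustration only. It assumes strictly positive input vectors for the KL and Itakura-Saito cases, and it builds each divergence from a convex generator F using the standard definition D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>.

```python
import math

def bregman(F, grad_F):
    """Build D_F(x, y) = F(x) - F(y) - <grad F(y), x - y> from a convex F."""
    def divergence(x, y):
        g = grad_F(y)
        inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
        return F(x) - F(y) - inner
    return divergence

# Squared Euclidean distance: F(x) = sum(x_i^2)
squared_euclidean = bregman(
    lambda x: sum(v * v for v in x),
    lambda x: [2.0 * v for v in x],
)

# Generalized KL divergence: F(x) = sum(x_i * log x_i), requires x_i > 0
kl = bregman(
    lambda x: sum(v * math.log(v) for v in x),
    lambda x: [math.log(v) + 1.0 for v in x],
)

# Itakura-Saito divergence: F(x) = -sum(log x_i), requires x_i > 0
itakura_saito = bregman(
    lambda x: -sum(math.log(v) for v in x),
    lambda x: [-1.0 / v for v in x],
)

def assign(points, centers, divergence):
    """Label each point with the index of the center of least divergence."""
    return [
        min(range(len(centers)), key=lambda j: divergence(p, centers[j]))
        for p in points
    ]

def update(points, labels, centers):
    """Recompute each center as the arithmetic mean of its members.

    An empty cluster keeps its previous center (a simplifying assumption).
    """
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(list(old))
            continue
        new_centers.append(
            [sum(p[d] for p in members) / len(members) for d in range(len(old))]
        )
    return new_centers

# One Lloyd-style iteration on four probability vectors, using KL.
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
centers = [[0.7, 0.3], [0.3, 0.7]]
labels = assign(points, centers, kl)
centers = update(points, labels, centers)
```

Only the divergence passed to `assign` changes between variants. The update step stays the same because, for any Bregman divergence with the center in the second argument, the arithmetic mean of a cluster minimizes the total divergence to its members.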
Stars: 342
Forks: 53
Language: Scala
License: Apache-2.0
Last pushed: Feb 14, 2026
Commits (30d): 0
Get this data via API:
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/derrickburns/generalized-kmeans-clustering"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
Related tools
TorchDR/TorchDR
TorchDR - PyTorch Dimensionality Reduction
abhilash1910/ClusterTransformer
Topic clustering library built on Transformer embeddings and cosine similarity...
md-experiments/picture_text
Interactive tree-maps with SBERT & Hierarchical Clustering (HAC)
nlpub/watset-java
An implementation of the Watset clustering algorithm in Java.
mainlp/semantic_components
Finding semantic components in your neural representations.