derrickburns/generalized-kmeans-clustering

Production-ready K-Means clustering for Apache Spark with pluggable Bregman divergences (KL, Itakura-Saito, L1, etc.). Includes 6 algorithms, 740 tests, and cross-version persistence. A drop-in replacement for MLlib with mathematically correct distance functions for probability distributions, spectral data, and count data.


Established

Implements 6 clustering variants (Bisecting, X-Means, Soft/Fuzzy, Streaming, K-Medians, K-Medoids) with pluggable divergence kernels via a pure DataFrame/Spark ML API following the Estimator/Model pattern, enabling seamless pipeline integration. Supports cross-version persistence across Scala 2.12↔2.13 and Spark 3.4↔4.0, with comprehensive test coverage including kernel accuracy validation and determinism checks. Available on Maven Central, PyPI, and Databricks, supporting Scala/SBT, spark-submit, and PySpark workflows.
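To illustrate what "pluggable divergence" means in practice, here is a minimal sketch in plain Python. It is not this library's API: the function names (`squared_euclidean`, `kl_divergence`, `itakura_saito`, `bregman_kmeans`) are hypothetical and chosen for illustration only. It relies on one well-known property of Bregman divergences: the point that minimizes the total divergence from a cluster's members is their arithmetic mean, so the usual Lloyd iteration works unchanged when only the distance function is swapped.

```python
import math
import random


def squared_euclidean(x, c):
    """Squared Euclidean distance, the divergence used by standard K-Means."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kl_divergence(x, c):
    """Generalized KL divergence; assumes strictly positive entries."""
    return sum(xi * math.log(xi / ci) - xi + ci for xi, ci in zip(x, c))


def itakura_saito(x, c):
    """Itakura-Saito divergence; assumes strictly positive entries."""
    return sum(xi / ci - math.log(xi / ci) - 1 for xi, ci in zip(x, c))


def _mean(points):
    """Component-wise arithmetic mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def _assign(points, centers, divergence):
    """Label each point with the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: divergence(x, centers[j]))
        for x in points
    ]


def bregman_kmeans(points, k, divergence, max_iters=50, seed=0):
    """Lloyd-style clustering with a caller-supplied Bregman divergence."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # hypothetical init: k random data points
    labels = None
    for _ in range(max_iters):
        new_labels = _assign(points, centers, divergence)
        if new_labels == labels:  # assignments stable, so centers are too
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = _mean(members)
    return centers, _assign(points, centers, divergence)


# Usage: cluster four probability vectors with KL instead of Euclidean distance.
data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
centers, labels = bregman_kmeans(data, k=2, divergence=kl_divergence)
```

Switching to another divergence is a one-argument change (`divergence=itakura_saito`), which is the design idea the description above attributes to the library's kernels. A distributed implementation would differ substantially in how it initializes, assigns, and aggregates.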

342 stars.


Maintenance 10 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 20 / 25


Language

Scala

License

Apache-2.0

Related tools

TorchDR/TorchDR

TorchDR - PyTorch Dimensionality Reduction

abhilash1910/ClusterTransformer

Topic clustering library built on Transformer embeddings and cosine similarity...

md-experiments/picture_text

Interactive tree-maps with SBERT & Hierarchical Clustering (HAC)

nlpub/watset-java

An implementation of the Watset clustering algorithm in Java.

mainlp/semantic_components

Finding semantic components in your neural representations.
