ParaMonte Fortran 2.0.0
Parallel Monte Carlo and Machine Learning Library
pm_clusKmeans Module Reference

This module contains procedures and routines for computing the k-means clustering of a given set of data. More...

Data Types

interface  setCenter
 Compute and return the centers of the clusters corresponding to the input sample, cluster membership IDs, and sample distances-squared from their corresponding cluster centers.
More...
 
interface  setKmeans
 Compute and return an iteratively-refined set of cluster centers given the input sample using the k-means approach.
More...
 
interface  setKmeansPP
 Compute and return an asymptotically optimal set of cluster centers for the input sample, cluster membership IDs, and sample distances-squared from their corresponding cluster centers.
More...
 
interface  setMember
Compute and return the memberships and minimum distances of a set of input points with respect to an input set of cluster centers.
More...
 

Variables

character(*, SK), parameter MODULE_NAME = "@pm_clusKmeans"
 

Detailed Description

This module contains procedures and routines for computing the k-means clustering of a given set of data.

The k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), serving as a prototype of the cluster.
This results in a partitioning of the data space into Voronoi cells.
The k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem:
the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances.
For instance, better Euclidean solutions can be found using k-medians and k-medoids.
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum.
These heuristics are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, as both k-means and Gaussian mixture modeling employ an iterative refinement approach.
They both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the Gaussian mixture model allows clusters to have different shapes.

Kmeans Algorithm

Given a set of observations \((x_1, x_2, \ldots, x_n)\), where each observation is a \(d\)-dimensional real vector, the k-means clustering aims to partition the \(n\) observations into \(k\) (\(\leq n\)) sets \(S = \{S_1, S_2, \ldots, S_k\}\) so as to minimize the within-cluster sum of squares (WCSS) (i.e., the variance).
Formally, the objective is to find:

\begin{equation} \underset{\mathbf{S}}{\up{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}} \left\|\mathbf{x} -{\boldsymbol{\mu}}_{i}\right\|^{2} = {\underset{\mathbf{S}}{\up{arg\,min}}}\sum_{i=1}^{k}|S_{i}|\up{Var}S_{i} ~, \end{equation}

where \(\mu_i\) is the mean (also called centroid) of points in \(S_{i}\), i.e.

\begin{equation} {\boldsymbol {\mu_{i}}} = {\frac{1}{|S_{i}|}} \sum_{\mathbf{x} \in S_{i}} \mathbf{x} ~, \end{equation}

where \(|S_{i}|\) is the size of \(S_{i}\), and \(\|\cdot\|\) is the \(L^2\)-norm.
This is equivalent to minimizing the pairwise squared deviations of points in the same cluster:

\begin{equation} \underset{\mathbf{S}}{\up{arg\,min}} \sum_{i=1}^{k}\,{\frac {1}{|S_{i}|}}\,\sum_{\mathbf{x}, \mathbf{y} \in S_{i}}\left\|\mathbf{x} - \mathbf{y} \right\|^{2} ~, \end{equation}

The equivalence can be deduced from the identity

\begin{equation} |S_{i}|\sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x} -{\boldsymbol{\mu}}_{i}\right\|^{2} = {\frac{1}{2}}\sum _{\mathbf {x} ,\mathbf {y} \in S_{i}}\left\|\mathbf {x} -\mathbf {y} \right\|^{2} ~. \end{equation}
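This identity follows by expanding each pairwise difference about the centroid and noting that the deviations from the centroid sum to zero:

\begin{equation} \sum_{\mathbf{x}, \mathbf{y} \in S_{i}} \left\|\mathbf{x} - \mathbf{y}\right\|^{2} = \sum_{\mathbf{x}, \mathbf{y} \in S_{i}} \left\|(\mathbf{x} - {\boldsymbol{\mu}}_{i}) - (\mathbf{y} - {\boldsymbol{\mu}}_{i})\right\|^{2} = 2|S_{i}| \sum_{\mathbf{x} \in S_{i}} \left\|\mathbf{x} - {\boldsymbol{\mu}}_{i}\right\|^{2} - 2\left\|\sum_{\mathbf{x} \in S_{i}} (\mathbf{x} - {\boldsymbol{\mu}}_{i})\right\|^{2} ~, \end{equation}

where the last term vanishes because \(\sum_{\mathbf{x} \in S_{i}} (\mathbf{x} - {\boldsymbol{\mu}}_{i}) = \mathbf{0}\) by the definition of the centroid.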

Since the total variance is constant, this is equivalent to maximizing the sum of squared deviations between points in different clusters.
This deterministic relationship is also related to the law of total variance in probability theory.
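The standard k-means refinement alternates between an assignment step (each point joins its nearest center) and an update step (each center moves to the centroid of its members), repeated until the memberships stabilize. The following minimal standalone Fortran sketch of a single such pass accumulates the WCSS objective defined above; it is only an illustration of the arithmetic, not the interfaces of this module, and all names and the toy data are hypothetical.

program kmeans_step_demo
    implicit none
    integer , parameter :: ndim = 2, nsam = 6, ncls = 2
    real                :: sample(ndim, nsam), center(ndim, ncls), disq, disqmin, wcss
    integer             :: membership(nsam), nmem(ncls), isam, icls
    call random_number(sample)                  ! toy data in [0, 1)^ndim.
    center = sample(:, 1 : ncls)                ! naive seeding; see k-means++ below.
    ! Assignment step: attach each point to the nearest center in squared Euclidean distance.
    wcss = 0.
    do isam = 1, nsam
        disqmin = huge(0.)
        do icls = 1, ncls
            disq = sum((sample(:, isam) - center(:, icls))**2)
            if (disq < disqmin) then
                disqmin = disq
                membership(isam) = icls
            end if
        end do
        wcss = wcss + disqmin                   ! within-cluster sum of squares.
    end do
    ! Update step: move each center to the centroid of its current members.
    center = 0.
    nmem = 0
    do isam = 1, nsam
        icls = membership(isam)
        center(:, icls) = center(:, icls) + sample(:, isam)
        nmem(icls) = nmem(icls) + 1
    end do
    do icls = 1, ncls
        if (nmem(icls) > 0) center(:, icls) = center(:, icls) / nmem(icls)
    end do
    print "(a, g0)", "WCSS before the update: ", wcss
end program kmeans_step_demo

Within this module, the setKmeans interface performs this iterative refinement on an input sample; see its documentation for the actual calling convention.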

Kmeans Performance Improvements

The k-means++ seeding method yields a considerable improvement in the final error of the k-means algorithm.
For more information, see the documentation of setKmeansPP.

In data mining, k-means++ is an algorithm for choosing the initial values (or seeds) for the k-means clustering algorithm.
It was proposed in 2007 by David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the NP-hard k-means problem.
It offers a way of avoiding the sometimes poor clustering found by the standard k-means algorithm.

Kmeans++ Intuition

The intuition behind k-means++ is that spreading out the \(k\) initial cluster centers is beneficial:
The first cluster center is chosen uniformly at random from the data points being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from its closest existing cluster center.

Kmeans++ Algorithm

The exact algorithm is as follows (a minimal standalone sketch is given after the list):

  1. Choose one center uniformly at random among the data points.
  2. For each data point \(x\) not chosen yet, compute \(D(x)\), the distance between \(x\) and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point \(x\) is chosen with probability proportional to \(D(x)^2\).
  4. Repeat Steps 2 and 3 until \(k\) centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
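The following minimal standalone Fortran sketch implements steps 1 through 4 above by drawing each new center from the cumulative distribution of the squared distances \(D(x)^2\); it is only an illustration, not the setKmeansPP interface of this module, and all names and the toy data are hypothetical.

program kmeanspp_seed_demo
    implicit none
    integer , parameter :: ndim = 2, nsam = 100, ncls = 3
    real                :: sample(ndim, nsam), center(ndim, ncls), disq(nsam), cumw(nsam), urand
    integer             :: icls, isam
    call random_number(sample)                          ! toy data in [0, 1)^ndim.
    ! Step 1: choose the first center uniformly at random among the points.
    call random_number(urand)
    center(:, 1) = sample(:, 1 + int(urand * nsam))
    ! Step 2: D(x)**2 = squared distance of each point from its nearest chosen center.
    disq = sum((sample - spread(center(:, 1), 2, nsam))**2, dim = 1)
    do icls = 2, ncls
        ! Step 3: draw a point with probability proportional to D(x)**2 via the cumulative sum.
        cumw(1) = disq(1)
        do isam = 2, nsam
            cumw(isam) = cumw(isam - 1) + disq(isam)
        end do
        call random_number(urand)
        isam = minloc(cumw, dim = 1, mask = cumw >= urand * cumw(nsam))
        center(:, icls) = sample(:, isam)
        ! Step 4 (step 2 for the next round): refresh D(x)**2 against the newest center.
        disq = min(disq, sum((sample - spread(center(:, icls), 2, nsam))**2, dim = 1))
    end do
    print "(a, *(g0, :, ', '))", "seeded centers: ", center
end program kmeanspp_seed_demo

Within this module, the setKmeansPP interface provides this seeding (step 5 then proceeds with the standard k-means refinement); see its documentation for the actual calling convention.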

Kmeans++ Performance Improvements

The k-means++ seeding method yields a considerable improvement in the final error of the k-means algorithm.
Although the initial selection in the algorithm takes extra time, the k-means part itself converges very quickly after this seeding, and thus the algorithm actually lowers the computation time.
Based on the original paper, the method typically yields 2-fold improvements in speed and, for certain datasets, close to 1000-fold improvements in error.
In these simulations, the new method almost always performed at least as well as vanilla k-means in both speed and error.

Test:
test_pm_clusKmeans


Final Remarks


If you believe this algorithm or its documentation can be improved, we appreciate your contribution and help to edit this page's documentation and source file on GitHub.
For details on the naming abbreviations, see this page.
For details on the naming conventions, see this page.
This software is distributed under the MIT license with additional terms outlined below.

  1. If you use any parts or concepts from this library to any extent, please acknowledge the usage by citing the relevant publications of the ParaMonte library.
  2. If you regenerate any parts/ideas from this library in a programming environment other than those currently supported by this ParaMonte library (i.e., other than C, C++, Fortran, MATLAB, Python, R), please also ask the end users to cite this original ParaMonte library.

This software is available to the public under a highly permissive license.
Help us justify its continued development and maintenance by acknowledging its benefit to society, distributing it, and contributing to it.

Author:
Amir Shahmoradi, April 03, 2017, 2:16 PM, Institute for Computational Engineering and Sciences (ICES), University of Texas at Austin

Variable Documentation

◆ MODULE_NAME

character(*, SK), parameter pm_clusKmeans::MODULE_NAME = "@pm_clusKmeans"

Definition at line 120 of file pm_clusKmeans.F90.