Statistics Toolbox    
kmeans

K-means clustering

Syntax

IDX = kmeans(X,k)
[IDX,C] = kmeans(X,k)
[IDX,C,sumd] = kmeans(X,k)
[IDX,C,sumd,D] = kmeans(X,k)
[...] = kmeans(...,'param1',val1,'param2',val2,...)

Description

IDX = kmeans(X, k) partitions the points in the n-by-p data matrix X into k clusters. This iterative partitioning minimizes the sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. Rows of X correspond to points, columns correspond to variables. kmeans returns an n-by-1 vector IDX containing the cluster indices of each point. By default, kmeans uses squared Euclidean distances.

[IDX,C] = kmeans(X,k) returns the k cluster centroid locations in the k-by-p matrix C.

[IDX,C,sumd] = kmeans(X,k) returns the within-cluster sums of point-to-centroid distances in the 1-by-k vector sumd.

[IDX,C,sumd,D] = kmeans(X,k) returns distances from each point to every centroid in the n-by-k matrix D.
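As an illustration of the calling forms above, the following sketch requests all four outputs. The data matrix here is invented for demonstration and is not part of the reference text:

```
% Two loose groups of 2-D points (illustrative data only)
X = [randn(50,2) + 2; randn(50,2) - 2];

% Partition into k = 2 clusters and request all four outputs
[IDX, C, sumd, D] = kmeans(X, 2);

% IDX  : 100-by-1 cluster index for each point (1 or 2)
% C    : 2-by-2 matrix of centroid locations
% sumd : 1-by-2 within-cluster sums of point-to-centroid distances
% D    : 100-by-2 distances from every point to every centroid
```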

[...] = kmeans(...,'param1',val1,'param2',val2,...) enables you to specify optional parameter name-value pairs to control the iterative algorithm used by kmeans. Valid parameters are the following.

'distance'
Distance measure, in p-dimensional space, that kmeans minimizes with respect to. kmeans computes cluster centroids differently for each supported distance measure:

'sqEuclidean'
Squared Euclidean distance (default). Each centroid is the mean of the points in that cluster.
'cityblock'
Sum of absolute differences, i.e., L1. Each centroid is the component-wise median of the points in that cluster.

'cosine'
One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length.

'correlation'
One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation.

'Hamming'
Percentage of bits that differ (suitable only for binary data). Each centroid is the component-wise median of the points in that cluster.

'start'
Method used to choose the initial cluster centroid positions, sometimes known as "seeds". Valid starting values are:

'sample'
Select k observations from X at random (default).

'uniform'
Select k points uniformly at random from the range of X. Not valid with Hamming distance.

'cluster'
Perform a preliminary clustering phase on a random 10% subsample of X. This preliminary phase is itself initialized using 'sample'.

Matrix
k-by-p matrix of centroid starting locations. In this case, you can pass in [] for k, and kmeans infers k from the first dimension of the matrix. You can also supply a 3-dimensional array, implying a value for the 'replicates' parameter from the array's third dimension.

'replicates'
Number of times to repeat the clustering, each with a new set of initial cluster centroid positions. kmeans returns the solution with the lowest value for sumd. You can supply 'replicates' implicitly by supplying a 3-dimensional array as the value for the 'start' parameter.

'maxiter'
Maximum number of iterations. Default is 100.

'emptyaction'
Action to take if a cluster loses all its member observations. Valid values are:

'error'
Treat an empty cluster as an error (default).

'drop'
Remove any clusters that become empty. kmeans sets the corresponding return values in C and D to NaN.

'singleton'
Create a new cluster consisting of the one point furthest from its centroid.

'display'
Controls display of output. Valid values are:

'off'
Display no output.

'iter'
Display information about each iteration during minimization, including the iteration number, the optimization phase (see Algorithm), the number of points moved, and the total sum of distances.

'final'
Display a summary of each replication.

'notify'
Display only warning and error messages. (default)
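To make the parameter table concrete, here is a sketch combining several of the options above. The data matrix and the specific values chosen are illustrative, not prescribed by the text:

```
X = rand(200, 3);    % illustrative 200-by-3 data matrix

% City-block distance, 5 restarts keeping the best solution, singleton
% repair for empty clusters, and a per-replication summary printed out
IDX = kmeans(X, 4, ...
             'distance',    'cityblock', ...
             'replicates',  5, ...
             'emptyaction', 'singleton', ...
             'maxiter',     200, ...
             'display',     'final');
```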

Algorithm

kmeans uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all k clusters:

1. The first phase uses "batch" updates: in each iteration, every point is reassigned to its nearest cluster centroid all at once, and then the cluster centroids are recomputed. This phase converges quickly but may stop short of a local minimum of the total sum of distances.

2. The second phase uses "online" updates: points are reassigned individually, and only if doing so reduces the total sum of distances; cluster centroids are recomputed after each reassignment. This phase converges to a local minimum, although other local minima with a lower total sum of distances may exist. Using several replicates with random starting points helps find a solution that is closer to the global minimum.
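The batch portion of this minimization can be sketched, in simplified form, as Lloyd's algorithm. This is an illustrative reimplementation under stated assumptions (squared Euclidean distance, 'sample' start), not the toolbox code, and it omits the online phase, empty-cluster handling, and replicates:

```
function [idx, C] = kmeans_batch_sketch(X, k, maxiter)
% Simplified batch-phase k-means (squared Euclidean distance only).
% Illustrative only -- not the Statistics Toolbox implementation.
[n, p] = size(X);
r = randperm(n);
C = X(r(1:k), :);                    % 'sample' start: k random rows of X
for iter = 1:maxiter
    % Squared distance from every point to every centroid (n-by-k)
    D = zeros(n, k);
    for j = 1:k
        D(:, j) = sum((X - repmat(C(j,:), n, 1)).^2, 2);
    end
    [dmin, idx] = min(D, [], 2);     % assign each point to nearest centroid
    Cold = C;
    for j = 1:k                      % recompute each centroid as the mean
        C(j, :) = mean(X(idx == j, :), 1);
    end
    if isequal(C, Cold), break; end  % converged: no centroid moved
end
```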

See Also

clusterdata, linkage, silhouette

References

[1]  Seber, G.A.F., Multivariate Observations, Wiley, New York, 1984.

[2]  Späth, H., Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples, translated by J. Goldschmidt, Halsted Press, New York, 1985.

