Statistics Toolbox

kmeans
Syntax
IDX = kmeans(X,k)
[IDX,C] = kmeans(X,k)
[IDX,C,sumd] = kmeans(X,k)
[IDX,C,sumd,D] = kmeans(X,k)
[...] = kmeans(...,'param1',val1,'param2',val2,...)
Description
IDX = kmeans(X,k) partitions the points in the n-by-p data matrix X into k clusters. This iterative partitioning minimizes the sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. Rows of X correspond to points; columns correspond to variables. kmeans returns an n-by-1 vector IDX containing the cluster index of each point. By default, kmeans uses squared Euclidean distances.
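For example, the following call clusters a small synthetic data set; the data here are only an illustrative assumption, not part of the function's interface:

```matlab
% Illustrative data: three loosely separated 2-D groups (150 points total)
X = [randn(50,2); randn(50,2) + 3; randn(50,2) - 3];

% Partition the 150-by-2 matrix X into k = 3 clusters
IDX = kmeans(X, 3);   % IDX is a 150-by-1 vector of cluster indices (values 1 to 3)
```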
[IDX,C] = kmeans(X,k) returns the k cluster centroid locations in the k-by-p matrix C.
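For instance, with the illustrative X above, you can plot the clustered points together with the returned centroids (the plot styling chosen here is arbitrary):

```matlab
[IDX, C] = kmeans(X, 3);

% Scatter the points colored by cluster index, then overlay the centroids
gscatter(X(:,1), X(:,2), IDX);
hold on
plot(C(:,1), C(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
hold off
```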
[IDX,C,sumd] = kmeans(X,k) returns the within-cluster sums of point-to-centroid distances in the 1-by-k vector sumd.
[IDX,C,sumd,D] = kmeans(X,k) returns distances from each point to every centroid in the n-by-k matrix D.
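Continuing with the illustrative X above, requesting all four outputs returns the indices, the centroids, the within-cluster totals, and the full point-to-centroid distance matrix:

```matlab
[IDX, C, sumd, D] = kmeans(X, 3);

C         % 3-by-2 matrix of cluster centroid locations
sumd      % within-cluster sums of point-to-centroid distances, one entry per cluster
D(1,:)    % distances from the first point to each of the three centroids
```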
[...] = kmeans(...,'param1',val1,'param2',val2,...) enables you to specify optional parameter name-value pairs to control the iterative algorithm used by kmeans. Valid parameters are listed in the following table; an example combining several of them appears after the table.
| Parameter | Value | Description |
| --- | --- | --- |
| 'distance' | | Distance measure, in p-dimensional space, that kmeans minimizes with respect to. kmeans computes cluster centroids differently for the supported distance measures: |
| | 'sqEuclidean' | Squared Euclidean distance (default). |
| | 'cityblock' | Sum of absolute differences, i.e., the L1 distance. Each centroid is the component-wise median of the points in that cluster. |
| | 'cosine' | One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length. |
| | 'correlation' | One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation. |
| | 'Hamming' | Percentage of bits that differ (suitable only for binary data). Each centroid is the component-wise median of the points in that cluster. |
| 'start' | | Method used to choose the initial cluster centroid positions, sometimes known as "seeds". Valid starting values are: |
| | 'sample' | Select k observations from X at random (default). |
| | 'uniform' | Select k points uniformly at random from the range of X. Not valid with Hamming distance. |
| | 'cluster' | Perform a preliminary clustering phase on a random 10% subsample of X. This preliminary phase is itself initialized using 'sample'. |
| | Matrix | A k-by-p matrix of centroid starting locations. In this case, you can pass in [] for k, and kmeans infers k from the first dimension of the matrix. You can also supply a 3-dimensional array, implying a value for the 'replicates' parameter from the array's third dimension. |
| 'replicates' | | Number of times to repeat the clustering, each with a new set of initial cluster centroid positions. kmeans returns the solution with the lowest value for sumd. You can supply 'replicates' implicitly by supplying a 3-dimensional array as the value for the 'start' parameter. |
| 'maxiter' | | Maximum number of iterations. Default is 100. |
| 'emptyaction' | | Action to take if a cluster loses all its member observations. Valid values are: |
| | 'error' | Treat an empty cluster as an error (default). |
| | 'drop' | Remove any clusters that become empty. kmeans sets the corresponding return values in C and D to NaN. |
| | 'singleton' | Create a new cluster consisting of the one point furthest from its centroid. |
| 'display' | | Controls display of output. Valid values are: |
| | 'off' | Display no output. |
| | 'iter' | Display information about each iteration during minimization, including the iteration number, the optimization phase (see Algorithm), the number of points moved, and the total sum of distances. |
| | 'final' | Display a summary of each replication. |
| | 'notify' | Display only warning and error messages (default). |
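
As a rough sketch of how these options combine (X is the illustrative data from the earlier examples, and the specific option values chosen here are assumptions for illustration only):

```matlab
% City block distance, 5 replicates, and a summary line per replicate
[IDX, C] = kmeans(X, 3, 'distance', 'cityblock', ...
                  'replicates', 5, 'display', 'final');

% Explicit k-by-p starting centroids; k can then be passed as []
seeds = [0 0; 3 3; -3 -3];
IDX2 = kmeans(X, [], 'start', seeds);

% Raise the iteration limit and create a singleton cluster instead of
% erroring if a cluster loses all of its members
IDX3 = kmeans(X, 3, 'maxiter', 200, 'emptyaction', 'singleton');
```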
Algorithm
kmeans uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all k clusters:

1. In the first phase, each iteration reassigns all points to their nearest cluster centroid at once, and then recomputes the cluster centroids.
2. In the second phase, points are reassigned individually only if doing so reduces the total sum of distances, and cluster centroids are recomputed after each reassignment.
kmeans can converge to a local optimum, that is, a partition of points in which moving any single point to a different cluster increases the total sum of distances. This problem can be solved only by a clever (or lucky, or exhaustive) choice of starting points.
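One common mitigation, sketched below with the illustrative X from the earlier examples, is to run several replicates from different starting points and keep the best partition:

```matlab
% Run 10 replicates from different random starting centroids and keep the
% partition with the smallest total within-cluster sum of distances
[IDX, C, sumd] = kmeans(X, 3, 'replicates', 10);
totalDist = sum(sumd);   % objective value of the best replicate found
```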
See Also
clusterdata, linkage, silhouette