silhouette (Statistics Toolbox)

Silhouette plot for clustered data

Syntax

silhouette(X,clust)
s = silhouette(X,clust)
[s,h] = silhouette(X,clust)
[...] = silhouette(X,clust,distance)
[...] = silhouette(X,clust,distfun,p1,p2,...)

Description

silhouette(X,clust) plots cluster silhouettes for the n-by-p data matrix X, with clusters defined by clust. Rows of X correspond to points, and columns correspond to coordinates. clust can be a numeric vector containing a cluster index for each point, or a character matrix or cell array of strings containing a cluster name for each point. silhouette treats NaNs or empty strings in clust as missing values and ignores the corresponding rows of X. By default, silhouette uses the squared Euclidean distance between points in X.
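
As a brief illustration of the two forms of clust, the following sketch computes silhouette values for the same partition, given once as numeric indices and once as a cell array of names. The data and the names 'a' and 'b' are made up for this example.

```matlab
% Six 2-D points forming two well-separated groups (illustrative data)
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];

% The same partition expressed as cluster indices and as cluster names
idx   = [1; 1; 1; 2; 2; 2];
names = {'a'; 'a'; 'a'; 'b'; 'b'; 'b'};

% Requesting only the output s returns the values without plotting
sIdx   = silhouette(X,idx);
sNames = silhouette(X,names);
```

Both calls should return the same n-by-1 vector, since only the grouping of the points matters, not the labels used to describe it.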

s = silhouette(X,clust) returns the silhouette values in the n-by-1 vector s, but does not plot the cluster silhouettes.

[s,h] = silhouette(X,clust) plots the silhouettes and returns the silhouette values in the n-by-1 vector s and the figure handle in h.

[...] = silhouette(X,clust,distance) plots the silhouettes using the inter-point distance measure specified in distance. Choices for distance are:

'Euclidean'
Euclidean distance

'sqEuclidean'
Squared Euclidean distance (default)

'cityblock'
Sum of absolute differences, i.e., the L1 distance

'cosine'
One minus the cosine of the included angle between points (treated as vectors)

'correlation'
One minus the sample correlation between points (treated as sequences of values)

'Hamming'
Percentage of coordinates that differ

'Jaccard'
Percentage of non-zero coordinates that differ

Vector
A numeric distance matrix in upper triangular vector form, such as is created by pdist. X is not used in this case, and can safely be set to [].
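
The vector form can be sketched as follows, assuming the distances come from pdist with the 'cityblock' metric. The data and variable names are illustrative only.

```matlab
% Illustrative data: two groups of three 2-D points
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
clust = [1; 1; 1; 2; 2; 2];

% Pairwise city block distances in upper triangular vector form
D = pdist(X,'cityblock');

% X is not used when distances are supplied, so pass []
sVec = silhouette([],clust,D);

% For comparison: let silhouette compute the same distances from X
sName = silhouette(X,clust,'cityblock');
```

Under these assumptions, sVec and sName should agree up to rounding. Supplying the vector is mainly useful when the distances have already been computed, or when only the distances are available and not the original coordinates.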

[...] = silhouette(X,clust,distfun,p1,p2, ...) accepts a distance function of the form

```
d = distfun(X0,X,p1,p2,...)
```

where X0 is a 1-by-p point, X is an n-by-p matrix of points, and p1,p2,... are optional additional arguments. The function distfun returns an n-by-1 vector d of distances between X0 and each point (row) in X. The arguments p1, p2,... are passed directly to the function distfun.
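
A minimal sketch of such a distance function is shown here, assuming that distfun can be supplied as an anonymous function handle. The name wcity and the weight vector w are hypothetical, introduced only for this illustration.

```matlab
% Illustrative data: two groups of three 2-D points
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
clust = [1; 1; 1; 2; 2; 2];

% Weighted city block distance from the 1-by-p point X0 to each row of X.
% w is a 1-by-p vector of nonnegative weights, passed through as p1.
% The result is an n-by-1 vector, as the distfun form requires.
wcity = @(X0,X,w) abs(X - repmat(X0,size(X,1),1)) * w(:);

% Weight the second coordinate twice as heavily as the first
sW = silhouette(X,clust,wcity,[1 2]);

% With unit weights the function reduces to the plain city block distance
sUnit = silhouette(X,clust,wcity,[1 1]);
sCity = silhouette(X,clust,'cityblock');
```

Comparing sUnit with sCity is a quick sanity check that a custom function follows the expected calling convention.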

Remarks

The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1. It is defined as

S(i) = (min(b(i,:),[],2) - a(i)) ./ max(a(i),min(b(i,:),[],2))

where a(i) is the average distance from the ith point to the other points in its cluster, and b(i,k) is the average distance from the ith point to the points in another cluster k. The minimum is taken over all clusters k other than the point's own.
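
To make the definition concrete, the following sketch evaluates it directly with a loop, using the default squared Euclidean distance, and keeps the result from silhouette alongside for comparison. It is an illustration of the formula under these assumptions, not the toolbox implementation, and it assumes every cluster has at least two points.

```matlab
% Illustrative data: two groups of three 2-D points
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
clust = [1; 1; 1; 2; 2; 2];

n = size(X,1);
D = squareform(pdist(X)).^2;      % n-by-n squared Euclidean distances
k = unique(clust);                % cluster labels

S = zeros(n,1);
for i = 1:n
    % a: average distance to the other points in the same cluster
    own = (clust == clust(i));
    own(i) = false;
    a = mean(D(i,own));

    % b: average distance to the points of each other cluster
    others = k(k ~= clust(i));
    b = zeros(1,numel(others));
    for j = 1:numel(others)
        b(j) = mean(D(i,clust == others(j)));
    end

    S(i) = (min(b) - a) / max(a,min(b));
end

% Reference values from the toolbox function
sRef = silhouette(X,clust);
```

For the first point, a is 1 and the only b value is 172/3, so its silhouette value is close to +1, reflecting a point that sits well inside its own cluster.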

Examples

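% Two groups of 10 random points each, centered at (1,1) and (-1,-1)
X = [randn(10,2)+ones(10,2);
     randn(10,2)-ones(10,2)];
% Partition the points into two clusters using squared Euclidean distance
cidx = kmeans(X,2,'distance','sqEuclidean');
% Compute the silhouette values with the same distance measure
s = silhouette(X,cidx,'sqEuclidean');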

See Also

dendrogram, kmeans, linkage, pdist

References

[1] Kaufman L. and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 1990
