Statistics Toolbox    

Overview

One of the most important goals in visualizing data is to get a sense of how near or far points are from each other. Often, you can do this with a scatter plot. However, for some analyses, the data that you have may not be in the form of "points" at all, but rather in the form of pairwise similarities or dissimilarities between cases, observations, or subjects. Without any points, you cannot make a scatter plot.

Even if your data are in the form of points rather than pairwise distances, a scatter plot of those data may not be useful. For some kinds of data, the relevant way to measure how "near" two points are may not be their Euclidean distance. While scatter plots of the raw data make it easy to compare Euclidean distances, they are not always useful when comparing other kinds of interpoint distances, city block distance for example, or even more general dissimilarities. Also, with a large number of variables, it is very difficult to visualize distances unless the data can be represented in a small number of dimensions. Some sort of dimension reduction is usually necessary.

Multidimensional Scaling (MDS) is a set of methods that address all of these problems. MDS allows you to visualize how "near" points are to each other for many kinds of distance or dissimilarity measures, and can produce a representation of your data in a small number of dimensions. MDS does not require raw data, but only a matrix of pairwise distances or dissimilarities.

The function cmdscale performs classical (metric) multidimensional scaling, also known as Principal Coordinates Analysis. cmdscale takes a matrix of interpoint distances as input and creates a configuration of points. Ideally, those points are in two or three dimensions, and the Euclidean distances between them reproduce the original distance matrix. Thus, a scatter plot of the points created by cmdscale provides a visual representation of the original distances.

A Simple Example

As a very simple example, you can reconstruct a set of points from only their interpoint distances. First, create some four-dimensional points with a small component in their fourth coordinate, and reduce them to distances.
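The step just described might be sketched as follows. The variable name X, the choice of 10 points, the 0.1 scale on the fourth coordinate, and the fixed random seed are all assumptions made for illustration; only the use of pdist to produce distances comes from the text.

```matlab
% Sketch: 10 points in four dimensions whose fourth coordinate has a much
% smaller spread than the first three (sizes and scales are assumptions).
rng(0);                                % fixed seed so the sketch is repeatable
X = [randn(10,3), 0.1*randn(10,1)];    % 10-by-4 matrix of points

% Reduce the points to their pairwise Euclidean distances.
% pdist returns them as a row vector with n*(n-1)/2 = 45 entries.
D = pdist(X, 'euclidean');
```

From this point on, only D is needed; X is kept solely so that the reconstruction can be compared against it.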

Next, use cmdscale to find a configuration with those interpoint distances. cmdscale accepts distances as either a square matrix, or, as in this example, in the vector upper-triangular form produced by pdist.
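A minimal sketch of that call is shown below, assuming the same illustrative data as above (the names X and Ysq, the data sizes, and the seed are assumptions). It passes the distances in both accepted forms, using squareform to convert the pdist vector to a square matrix.

```matlab
% Illustrative data (assumed): 10 points in 4-D, reduced to distances.
rng(0);
X = [randn(10,3), 0.1*randn(10,1)];
D = pdist(X);

% Vector (upper-triangular) form, as produced by pdist.
[Y, eigvals] = cmdscale(D);

% Equivalent call with the full, symmetric 10-by-10 distance matrix.
[Ysq, eigvalsSq] = cmdscale(squareform(D));
```

Both calls describe the same distances, so the two sets of eigenvalues should agree to within roundoff.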

cmdscale produces two outputs. The first output, Y, is a matrix containing the reconstructed points. The second output, eigvals, is a vector containing the sorted eigenvalues of what is often referred to as the "scalar product matrix", which, in the simplest case, is equal to Y*Y'. The relative magnitudes of those eigenvalues indicate the relative contribution of the corresponding columns of Y in reproducing the original distance matrix D with the reconstructed points.
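One way to inspect the eigenvalues is to display them next to their magnitudes relative to the largest one. The sketch below does that and, because the illustrative distances here are Euclidean, also compares eigvals with the eigenvalues of Y*Y'. The data, the seed, and the names X, relmag, B, eigB, and maxdiff are assumptions for illustration.

```matlab
% Illustrative data (assumed): 10 points in 4-D, reduced to distances.
rng(0);
X = [randn(10,3), 0.1*randn(10,1)];
D = pdist(X);
[Y, eigvals] = cmdscale(D);

% Eigenvalues alongside their size relative to the largest one.
relmag = eigvals / max(abs(eigvals));
disp([eigvals, relmag]);

% For Euclidean input, the eigenvalues of Y*Y' should match eigvals
% to within roundoff.
B = Y*Y';
B = (B + B')/2;                        % symmetrize to guard against roundoff
eigB = sort(eig(B), 'descend');
maxdiff = max(abs(eigB - eigvals));
```

With data built this way, one would expect the leading eigenvalues to dominate and the trailing ones to be zero to within roundoff, but the exact values depend on the random draw.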

If eigvals contains only positive and zero (within roundoff error) eigenvalues, then the columns of Y corresponding to the positive eigenvalues provide an exact reconstruction of D, in the sense that their interpoint Euclidean distances, computed using pdist for example, are identical (within roundoff) to the values in D.

If two or three of the eigenvalues in eigvals are much larger than the rest, then the distance matrix based on the corresponding columns of Y nearly reproduces the original distance matrix D. In this sense, those columns form a lower-dimensional representation that adequately describes the data. However, it is not always possible to find a good low-dimensional reconstruction.

The reconstruction in three dimensions reproduces D very well, but the reconstruction in two dimensions has errors that are the same order of magnitude as the largest values in D.
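The quality of a reconstruction can be checked by recomputing distances from the leading columns of Y and comparing them with D. The following is a sketch under the same assumed data as before; the names X, maxerr4, maxerr3, and maxerr2 are illustrative.

```matlab
% Illustrative data (assumed): 10 points in 4-D, reduced to distances.
rng(0);
X = [randn(10,3), 0.1*randn(10,1)];
D = pdist(X);
[Y, eigvals] = cmdscale(D);

% Largest absolute error in the reproduced distances when using
% all columns of Y, the first three, and the first two.
maxerr4 = max(abs(D - pdist(Y)));          % all columns: exact up to roundoff
maxerr3 = max(abs(D - pdist(Y(:,1:3))));   % three dimensions
maxerr2 = max(abs(D - pdist(Y(:,1:2))));   % two dimensions
```

Dropping a column of Y can only shrink the reconstructed distances, so the error cannot decrease as columns are removed.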

Often, eigvals contains some negative eigenvalues, indicating that the distances in D cannot be reproduced exactly. That is, there may not be any configuration of points whose interpoint Euclidean distances are given by D. If the largest negative eigenvalue is small in magnitude relative to the largest positive eigenvalues, then the configuration returned by cmdscale may still reproduce D well. The following example demonstrates this.

