CPDC-TR-9906-010

MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets

Sanjay Goil, Harsha Nagesh, Alok Choudhary
Abstract

Clustering techniques are used in database mining for finding interesting patterns in high dimensional data. These are useful in various applications of knowledge discovery in databases. Some challenges in clustering for large data sets in terms of scalability, data distribution, understanding end-results, and sensitivity to input order, have received attention in the recent past. Recent approaches attempt to find clusters embedded in subspaces of high dimensional data. In this paper we propose the use of adaptive grids for efficient and scalable computation of clusters in subspaces for large data sets and large number of dimensions. The bottom-up algorithm for subspace clustering computes the dense units in all dimensions and combines these to generate the dense units in higher dimensions. Computation is heavily dependent on the choice of the partitioning parameter chosen to partition each dimension into intervals (bins) to be tested for density. The number of bins determines the computation requirements and the quality of the clustering results. Hence, it is important to determine the appropriate size and number of the bins. We present MAFIA, which 1) proposes adaptive grids for fast subspace clustering and 2) introduces a scalable parallel framework on a shared-nothing architecture to handle massive data sets. Performance results on very large data sets and a large number of dimensions show very good results, making an order of magnitude improvement in the computation time over current methods and providing much better quality of clustering.

Gzipped Postscript version of the paper