MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets
Sanjay Goil, Harsha Nagesh, Alok Choudhary
Abstract
Clustering techniques are used in database mining for finding
interesting patterns in high dimensional data. These are useful in
various applications of knowledge discovery in databases. Some
challenges in clustering for large data sets in terms of scalability,
data distribution, understanding end-results, and sensitivity to input
order, have received attention in the recent past. Recent approaches
attempt to find clusters embedded in subspaces of high dimensional
data. In this paper we propose the use of adaptive grids for efficient
and scalable computation of clusters in subspaces for large data sets
and large number of dimensions. The bottom-up algorithm for subspace
clustering computes the dense units in all dimensions and combines
these to generate the dense units in higher dimensions. Computation is
heavily dependent on the choice of the partitioning parameter chosen
to partition each dimension into intervals (bins) to be tested for
density. The number of bins determines the computation requirements
and the quality of the clustering results. Hence, it is important to
determine the appropriate size and number of the bins. We present
MAFIA, which 1) proposes adaptive grids for fast subspace clustering
and 2) introduces a scalable parallel framework on a shared-nothing
architecture to handle massive data sets. Performance results on very
large data sets and a large number of dimensions show very good
results, making an order of magnitude improvement in the computation
time over current methods and providing much better quality of
clustering.