MSCI11, Project 2
Clustering in Data Mining

Project Outline

Cluster Analysis, also called Clustering or Segmentation, is an important method used in a number of computing areas, such as Computational Statistics, Machine Learning, and Data Mining, and is applied to various problems specific or related to those areas. The underlying idea is that similar objects should be grouped together in groups called clusters, while dissimilar objects should be placed in distinct clusters. Clustering has a very wide range of applications, from market research (including the problem of automatically grouping similar current or potential customers of a business in order to better tailor the services offered to them) to medical imaging and cancer research, computational biology and bioinformatics, educational research, multimedia information categorisation, and so forth.
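The grouping idea described above can be illustrated with a minimal Java sketch of the classical k-means procedure (Lloyd's iterations), one of the standard centroid-based techniques. It is offered only as an illustration: the class name KMeansSketch, the toy data, and the choice of squared Euclidean distance as the dissimilarity measure are assumptions made for this example and are not taken from the project material.

```java
import java.util.Arrays;

// Illustrative sketch only: plain k-means (Lloyd's iterations) on small in-memory data.
// The class name, data, and distance measure are assumptions made for this example.
public class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Alternates assignment and centroid update until no label changes or
    // maxIterations is reached. The centroids array is updated in place.
    static int[] cluster(double[][] data, double[][] centroids, int maxIterations) {
        int[] labels = new int[data.length];
        Arrays.fill(labels, -1);
        int dim = data[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: each object joins the cluster of its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < data.length; i++) {
                int c = nearest(data[i], centroids);
                if (c != labels[i]) {
                    labels[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: each centroid becomes the mean of its assigned objects.
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < data.length; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < dim; j++) {
                    sums[labels[i]][j] += data[i][j];
                }
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its previous centroid
                    for (int j = 0; j < dim; j++) {
                        centroids[c][j] = sums[c][j] / counts[c];
                    }
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 1.0}, {1.5, 2.0}, {1.0, 1.5},
            {8.0, 8.0}, {9.0, 8.5}, {8.5, 9.0}
        };
        // Hypothetical initial centroids, chosen by hand for the example.
        double[][] centroids = {{1.0, 1.0}, {9.0, 8.5}};
        int[] labels = cluster(data, centroids, 100);
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

In this toy run the three points near the origin end up in one cluster and the three points near (8.5, 8.5) in the other. Note that every iteration rescans the whole dataset, which is the cost that scalable algorithms try to reduce.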

Clustering in Data Mining has benefitted from results obtained through long-standing research efforts in Statistics and Machine Learning. The techniques from these two areas are employed as they are, and have recently been complemented with new Data Mining algorithms that meet one of the most challenging requirements of this relatively new field of Computer Science, namely scalability. For clustering to remain tractable when processing large volumes of data, clustering algorithms must scale well.
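To make the scalability requirement concrete, the following Java sketch shows one well-known way of avoiding repeated scans of the data: sequential (online) k-means, in which each object is read once and only the centroids and their counts are kept in memory. This is a generic textbook technique chosen for illustration; it is not the framework of [Hore 2009] nor the algorithm studied in Theme 2. The class name OnlineKMeansSketch, the method names, and the seed centroids are hypothetical.

```java
// Illustrative sketch only: sequential (online) k-means. Each point is processed
// once; memory use depends on the number of clusters, not on the dataset size.
// All names and values here are assumptions made for the example.
public class OnlineKMeansSketch {
    private final double[][] centroids;
    private final long[] counts;

    // The initial centroids act only as seeds: counts start at zero, so the first
    // point assigned to a cluster replaces that cluster's seed entirely.
    OnlineKMeansSketch(double[][] initialCentroids) {
        centroids = new double[initialCentroids.length][];
        for (int c = 0; c < initialCentroids.length; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        counts = new long[initialCentroids.length];
    }

    // Assigns the point to its nearest centroid, then moves that centroid
    // towards the point so that it stays the running mean of its points.
    int update(double[] point) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0.0;
            for (int j = 0; j < point.length; j++) {
                double d = point[j] - centroids[c][j];
                distance += d * d;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        counts[best]++;
        for (int j = 0; j < point.length; j++) {
            centroids[best][j] += (point[j] - centroids[best][j]) / counts[best];
        }
        return best;
    }

    double[][] getCentroids() {
        return centroids;
    }

    public static void main(String[] args) {
        OnlineKMeansSketch model =
            new OnlineKMeansSketch(new double[][] {{0.0, 0.0}, {10.0, 10.0}});
        // In practice the points would be streamed from disk or a database.
        double[][] stream = {{1.0, 1.0}, {9.0, 9.0}, {3.0, 3.0}, {11.0, 11.0}};
        for (double[] point : stream) {
            model.update(point);
        }
        for (double[] centroid : model.getCentroids()) {
            System.out.println(java.util.Arrays.toString(centroid));
        }
    }
}
```

The design trade-off is that a single pass is cheap but order-dependent: the result can vary with the order in which the objects arrive and with the choice of seeds, so the quality of the clustering would need to be evaluated against a multi-pass method.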

The following themes related to Cluster Analysis are proposed; choose one of them for your mini-project.

Theme 1: An In-depth Review of an Efficient Scalable Segmentation Technique

The project is based on assimilating very recent research results in clustering that describe a scalable framework for cluster ensembles, one of the efficient techniques capable of making the problem of clustering very large datasets tractable. The project consists of writing a report describing the methods of the framework introduced in [Hore 2009], and placing these methods in the context of the clustering techniques you are familiar with (see the Data Mining course website [CIS338] and [Han 2006]). Your report should also contain a section setting out the main ideas that this work suggested to you for a possible continuation of the research presented in [Hore 2009].

Theme 2: Research and Implementation of a New Scalable Clustering Algorithm

This project is based on research work consisting of the study of a new centroid-based scalable clustering algorithm, together with its Java implementation and testing on a battery of datasets. The material reflecting this research in progress is available from, and will be discussed in detail with, the project supervisor. The report to submit will consist of a description of the algorithm, its Java implementation, and its evaluation on various datasets accessible from links provided in [CIS338].

References

[CIS338] Data Mining, Daniel Stamate, Department of Computing, Goldsmiths College
[Han 2006] Data Mining, Han et al., 2006 [Chapter 6: Cluster Analysis]
[Hore 2009] A scalable framework for cluster ensembles, Hore et al., Pattern Recognition 42, 2009.