Consensus clustering on heterogeneous distributed data

Number of pages: 120 File Format: word File Code: 31063
Year: 2016 University Degree: Master's degree Category: Computer Engineering
  • Summary of Consensus clustering on heterogeneous distributed data

    Computer Engineering master's thesis - Software orientation

    Abstract

    Clustering can be considered one of the most important steps in data analysis, and many clustering methods have been developed and presented so far. One approach that has received attention in recent studies is consensus clustering. The purpose of consensus clustering is to combine several initial clusterings into a final clustering whose clusters are of higher quality than those of the initial clusterings.

    In this thesis, we present a process for performing consensus clustering on heterogeneous distributed data, which consists of three stages. In the first stage, corresponding clusters are identified across the initial clusterings. In the second stage, each of the initial clusters is assigned a weight. In the third stage, consensus clustering is performed according to the weights assigned to the clusters.
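
    Below is a minimal, runnable sketch of the first stage, assuming clusters correspond when their object overlap is largest and that each initial clustering uses the same number of clusters; the use of a contingency matrix and the Hungarian algorithm (SciPy's linear_sum_assignment) is an illustrative assumption, not necessarily the matching rule used in the thesis.

        # Stage 1 sketch: relabel one clustering so that its cluster ids match
        # the best-overlapping clusters of a reference clustering.
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def align_clusters(reference_labels, other_labels, n_clusters):
            """Return other_labels relabelled to agree with the reference ids."""
            # contingency[i, j] counts objects placed in reference cluster i
            # and in cluster j of the other clustering.
            contingency = np.zeros((n_clusters, n_clusters), dtype=int)
            for r, o in zip(reference_labels, other_labels):
                contingency[r, o] += 1
            # linear_sum_assignment minimises cost, so negate to maximise overlap.
            ref_ids, other_ids = linear_sum_assignment(-contingency)
            mapping = {int(o): int(r) for r, o in zip(ref_ids, other_ids)}
            return np.array([mapping[int(o)] for o in other_labels])

        # Example: the second labelling names the same three groups differently.
        a = np.array([0, 0, 1, 1, 2, 2])
        b = np.array([2, 2, 0, 0, 1, 1])
        print(align_clusters(a, b, 3))   # -> [0 0 1 1 2 2]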

    In this thesis, the proposed process has been evaluated on heterogeneously distributed data, and the results of the evaluations have been compared with four other consensus clustering algorithms. These comparisons show that the proposed process is more efficient than the other algorithms in most cases.

    This chapter first discusses basic concepts such as data mining and clustering. The problem of consensus clustering is then introduced through an example, after which an outline of the research conducted in this thesis and the results obtained in the evaluations is presented.

    Advanced computer systems face a huge amount of data collected from various sources such as point-of-sale terminals [1], bank transactions, electronic commerce, credit cards and satellites. Given this ever-increasing volume of data, the need for a process to analyze it and extract the knowledge hidden within it is felt more than ever.

    Data mining [2], in a simple definition, is the process of discovering knowledge from large data sets. The term data mining is often used synonymously with knowledge discovery from data [3], but data mining is in fact only one of the main stages of knowledge discovery. Figure 1-1 shows the process of discovering knowledge from data; as can be seen, it consists of an iterative sequence of the following steps [37] (a small end-to-end code illustration follows the list):

    1. Data cleaning[4]

    2. Data integration[5]

    3. Data selection[6]

    4. Data transformation[7]

    5. Data mining

    6. Pattern evaluation [10]

    7. Presentation of knowledge
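
    As a purely illustrative, toy walk-through of these steps (the file names, column names and the choice of pandas/scikit-learn are assumptions made for the example, not part of the thesis), the whole sequence could look like this:

        # Toy knowledge-discovery pipeline covering the seven steps above.
        import pandas as pd
        from sklearn.preprocessing import StandardScaler
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        sales = pd.read_csv("sales.csv")            # hypothetical source tables
        customers = pd.read_csv("customers.csv")

        clean = sales.dropna()                               # 1. data cleaning
        merged = clean.merge(customers, on="customer_id")    # 2. data integration
        selected = merged[["age", "income", "n_purchases"]]  # 3. data selection
        scaled = StandardScaler().fit_transform(selected)    # 4. data transformation
        labels = KMeans(n_clusters=4, n_init=10).fit_predict(scaled)  # 5. data mining
        quality = silhouette_score(scaled, labels)           # 6. pattern evaluation
        print(f"silhouette score = {quality:.2f}")           # 7. presentation of knowledge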

    Data mining methods

    The tasks performed in data mining fall into two groups, descriptive [11] and predictive [12]. Descriptive tasks reveal the general characteristics of the data in the database, while predictive tasks perform inference on the available data in order to make predictions. Some of the most important data mining methods are classification [13], clustering [14], association rule discovery [15] and outlier detection [16]. Among these, clustering and association rule discovery are descriptive tasks, while classification and outlier detection are considered predictive tasks.

    Clustering

    Cluster analysis is one of the most fundamental human activities. As children, we learn to differentiate between dogs and cats, or between animals and plants, by continually refining an implicit clustering in our minds [30]. Clustering is often regarded as the first step and one of the most important methods of data analysis. It is a process in which objects are grouped into sets of similar objects: each group, or cluster, contains objects that are similar to one another and dissimilar to the objects of other groups. Clustering is a form of data modeling rooted in mathematics and statistics [8]. Unlike classification, which is a supervised learning method [17], clustering is considered an unsupervised learning method [18], because in classification the data carry a class label [19], whereas in clustering the class labels of the data are not known. The goal in clustering is to minimize the distances between data points within a cluster and to maximize the distances between different clusters, so clustering can be viewed as a kind of optimization problem. In the literature, the terms segmentation [20] and partitioning [21] are sometimes used as synonyms for clustering.
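
    A small, runnable illustration of this optimization view (assuming scikit-learn and a synthetic data set, which are choices made for the example): k-means directly minimizes the total within-cluster distance, and a good result also shows well-separated cluster centers.

        # Compare within-cluster compactness and between-cluster separation.
        import numpy as np
        from sklearn.datasets import make_blobs
        from sklearn.cluster import KMeans

        X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

        within = km.inertia_            # sum of squared distances to the nearest centroid
        centers = km.cluster_centers_
        between = np.mean([np.linalg.norm(c1 - c2)   # average distance between centroids
                           for i, c1 in enumerate(centers)
                           for c2 in centers[i + 1:]])
        print(f"within-cluster SSE = {within:.1f}, "
              f"mean between-centroid distance = {between:.1f}")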

    Humans cannot discover knowledge in the mass of information stored in databases without the help of summarization methods. Basic statistics (such as the mean and variance) or frequency charts [22] provide only elementary information about the data, whereas cluster analysis can uncover more complex relationships between data objects, between particular attributes of the data, or between the two [61]. Clustering has wide applications in artificial intelligence, biology, customer relationship management [23], data mining, machine learning, marketing, medicine, pattern recognition, information retrieval, and image processing. In biology, for example, clustering can produce a taxonomy of species based on the characteristics of organisms, and it can help in understanding the function of genes in the biological processes of a cell [61]. In business, clustering helps marketers discover distinct groups of customers based on their buying patterns. It can be used to identify groups of houses in a city according to house type, value and geographic location, or to identify groups of car insurance policy holders with a high average claim cost. Clustering can also be used to group search engine results on the web. Figure 1-2 shows a two-dimensional plot of customer locations in a city, obtained by clustering information about the customers of a store [1].

    Many clustering algorithms have been developed so far. One of the more recently proposed and widely adopted approaches is consensus clustering, whose goal is to combine several input clusterings into a single clustering of higher quality than the inputs.

    We propose a consensus clustering process that works on heterogeneous distributed data and consists of three phases. In the first phase, we identify the corresponding clusters among the input clusterings. In the second phase, we assign a weight to each cluster using the Davies-Bouldin index. Finally, consensus clustering is performed according to the assigned weights.
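
    A minimal sketch of the second and third phases follows. It assumes the aligned label vectors from the first phase are available, and it simplifies the weighting to one weight per input clustering, taken as the inverse of that clustering's Davies-Bouldin index (a lower index means better clusters, hence a larger weight); the thesis itself weights individual clusters, so this is only an illustrative approximation, as is the weighted-voting rule.

        # Phases 2 and 3 sketch: Davies-Bouldin-based weights plus weighted voting.
        import numpy as np
        from sklearn.metrics import davies_bouldin_score

        def weighted_vote_consensus(X, aligned_labelings, n_clusters):
            """Combine aligned label vectors into one consensus labelling."""
            # Phase 2: a quality weight per clustering (inverse Davies-Bouldin index).
            weights = [1.0 / davies_bouldin_score(X, labels)
                       for labels in aligned_labelings]
            # Phase 3: every clustering votes for each object's cluster,
            # with its vote scaled by its weight.
            votes = np.zeros((len(X), n_clusters))
            for w, labels in zip(weights, aligned_labelings):
                votes[np.arange(len(X)), labels] += w
            return votes.argmax(axis=1)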

    We also evaluate the proposed method and compare the results with four other prominent consensus clustering algorithms. The comparison shows that our method achieves better results in most cases.

  • Contents & References of Consensus clustering on heterogeneous distributed data

    List:

    Abstract

    Chapter 1 Introduction

    1-1- Introduction

    1-2- Data mining

    1-3- Data mining methods

    1-4- Clustering

    1-5- Consensus clustering

    1-6- The research conducted in the thesis

    1-7- Results obtained

    1-8- The structure of the thesis

    Chapter 2 An overview of related work

    2-1- Introduction

    2-2- Clustering methods

    2-2-1- Partitioning methods

    2-2-2- Hierarchical methods

    2-2-3- The K-Means clustering algorithm

    2-3- Consensus clustering

    2-3-1- Motivations for using consensus clustering

    2-3-2- The consensus clustering problem: an example

    2-3-3- An overview of consensus clustering methods

    2-3-4- Grouping of consensus clustering methods

    2-3-5- Similarity-based methods

    Pairwise similarity (correlation matrix)

    Graph-based methods

    2-3-6- Consensus methods based on mutual information

    2-3-7- Consensus methods based on a hybrid model

    2-3-8- Vote-based consensus methods

    2-4- Methods for generating the clustering ensemble

    2-5- Chapter summary

    Chapter 3 Presentation of the proposed solution: Consensus clustering on heterogeneous distributed data

    3-1- Introduction

    3-2- The proposed solution

    3-2-1- Identifying corresponding clusters

    3-2-2- Weighting the clusters

    3-2-3- Consensus clustering on heterogeneous distributed data

    3-3- Generating the clustering ensemble

    3-4- Chapter summary

    Chapter 4 Implementation of the proposed solution and its evaluation results

    4-1- Introduction

    4-2- Evaluation criteria

    4-2-1- Accuracy criterion

    4-2-2- Davies-Bouldin index

    4-2-3- Rand index

    4-2-4- Average normalized mutual information (ANMI)

    4-3- Implementation

    4-4- Data sets

    4-5- Evaluation results

    4-5-1- Accuracy criterion

    4-5-2- Davies-Bouldin index

    4-5-3- Rand index

    4-5-4- Average normalized mutual information (ANMI)

    4-6- Chapter summary

    Chapter 5 Conclusion and future work

    5-1- Introduction

    5-2- Conclusion

    5-3- Future work

    References

    Appendix A: List of abbreviations

    Appendix B: English to Persian dictionary

    Appendix C: Persian to English dictionary

    References:

    [1]

    Agarwal, P. K., Har-Peled, S., & Yu, H.  2013. Embeddings of surfaces, curves, and moving points in Euclidean space. SIAM Journal on Computing, 42(2), 442-458.

    [2]

    Alam, S., Dobbie, G., Koh, Y. S., & Riddle, P. 2013, April, Clustering heterogeneous web usage data using hierarchical particle swarm optimization, In Swarm Intelligence (SIS), 2013 IEEE Symposium on (pp. 147-154). IEEE.

    [3]

    Al-Zoubi, M. B., Hudaib A., Huneiti A. and Hammo B. 2008. New Efficient Strategy to Accelerate k-Means Clustering Algorithm. American Journal of Applied Sciences. 5:1247-1250

    [4]

    Amigó, E., Gonzalo, J., Artiles, J. and Verdejo, F. 2008. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Journal of Information Retrieval. Springer.

    [5]

    Arthur, D. and Vassilvitskii, S. 2007. k-means++: the advantages of careful seeding. Proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms. p:1027-1035.

    [6]

    Ayad, H. G. 2008. Voting-Based Consensus of Data Partitions. PhD Thesis (In University of Waterloo).

    [7]

    Ayad, H. G. and Kamel, M. S. 2005. Cluster-based cumulative ensembles. In Multiple Classifier Systems: Sixth International Workshop, MCS 2005. Seaside, CA, USA. p:236–245.

    [8]

    Belghini, N., Zarghili, A., Kharroubi, J., & Majda, A. 2011, January. Sparse Random Projection and Dimensionality Reduction Applied on Face Recognition. In The Proceedings of International Conference on Intelligent Systems & Data Processing (pp. 78-82).

    [9]

    Berkhin, P. 2006. Survey on Clustering Data Mining Techniques. Grouping Multidimensional Data. Springer. p:25-71.

    [10]

    Boulis, C. and Ostendorf, M. 2004. Combining multiple clustering systems. In The 8th European conference on Principles and Practice of Knowledge Discovery in Databases(PKDD), LNAI 3202. p:63–74.

    [11]

    Chunsheng, H., Qian, C., Haiyuan, W. and Wada, T. 2008. RK-Means Clustering: K-Means with Reliability. IEICE transactions on information and systems. 91(1):96-104.

    [12]

    David, G. and Thomas, H. 2005. Non-redundant clustering with conditional ensembles. The 11th ACM SIGKDD international conference on knowledge discovery in data mining. p:70-77.

    [13]

    Dimitriadou, E., Weingessel, A. and Hornik, K. 2002. A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence. 16:901–912.

    [14]

    Domeniconi, C. and Al-Razgan, M. 2007. Weighted Cluster Ensembles: Methods and Analysis. Technical Report ISE-TR-07-06.

    [15]

    Domininique, V., Abdi, H., Williams, L. J., Bennani-Dosse, M. 2012. Statis and distatis: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics, 4(2), 124-167.

    [16]

    Duda, R. O., Hart, P. E., & Stork, D. G. 2012. Pattern classification. John Wiley & Sons.

    [17]

    Dudoit, S. and Fridlyand, J. 2003. Bagging to improve the accuracy of a clustering procedure. Bioinformatics. 19(9):1090-1099

    [18]

    Elkan, C. 2003. Using the triangle inequality to accelerate k-means. Proceedings of the 20th International Conference on Machine Learning (ICML-2003).

    [19]

    Fischer, B. and Buhmann, J. M. 2003. Bagging for path-based clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 25:1411–1415.

    [20]

    Fred, A. 2001. Finding consistent clusters in data partitions. The Second International Workshop on Multiple Classifier Systems. Springer-Verlag. p:309-318.

    [21]

    Fred, A. and Jain, K. A. 2002. Evidence Accumulation Clustering Based on the K-Means Algorithm. The Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition. Springer-Verlag. p:442-451.

    [22]

    Gasieniec, L., Jansson, J. and Lingas, A. 2004. Approximation algorithms for Hamming clustering problems. Journal of Discrete Algorithms. Elsevier. 2:289-301

    [23]

    Gionis, A., Mannila, H. and Tsaparas, P. 2005. Clustering Aggregation. In Proceedings of Twenty-first International Conference on Data Engineering (ICDE). p:341-352.

    [24]

    Guillaume, R., & Mouaddib, N. 2002. SAINTETIQ: a fuzzy set-based approach to database summarization. Fuzzy sets and systems, 129(2), 137-162.

    [25]

    Gondek, D. and Hofmann, T. 2004. Non-redundant data clustering. In Proceedings of the Fourth IEEE International Conference on Data Mining. p:75–82.

    [26]

    Gordon, A. D. and Vichi, M. 2001. Fuzzy partition models for fitting a set of partitions. Psychometrika. 66:229–248.

    [27]

    Greene, D., Tsymbal A., Bolshakova, N. and Cunningham P. 2004. Ensemble Clustering in Medical Diagnostics. Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems. p:576-581.

    [28]

    Gupta, M., & Han, J. 2011, Heterogeneous network-based trust analysis: a survey, ACM SIGKDD Explorations Newsletter, 13(1), 54-71.

    [29]

    Halkidi, M., Batistakis, Y. and Vazirgiannis, M. 2002. Clustering validity checking methods: part II. ACM SIGMOD Record. 31:19-27.

    [30]

    Han, J.
