Consensus clustering on heterogeneous distributed data

Number of pages: 120 File Format: word File Code: 31063
Year: 2016 University Degree: Master's degree Category: Computer Engineering
  • Summary of Consensus clustering on heterogeneous distributed data

    Computer Engineering master's thesis - Software orientation

    Abstract

    Clustering can be considered one of the most important steps in data analysis, and many clustering methods have been developed and presented so far. One approach that has received attention in recent studies is consensus clustering. The purpose of consensus clustering is to combine several initial clusterings into a final clustering whose clusters are of higher quality than those of the initial clusterings.

    In this thesis, we present a process for performing consensus clustering on heterogeneous distributed data, which consists of three stages. In the first stage, corresponding clusters are identified across the initial clusterings. In the second stage, each of the initial clusters is assigned a weight. In the third stage, consensus clustering is performed according to the weights assigned to the clusters.
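
    Below is a minimal, runnable sketch of the first stage, assuming clusters correspond when their object overlap is largest and that each initial clustering uses the same number of clusters; the use of a contingency matrix and the Hungarian algorithm (SciPy's linear_sum_assignment) is an illustrative assumption, not necessarily the matching rule used in the thesis.

        # Stage 1 sketch: relabel one clustering so that its cluster ids match
        # the best-overlapping clusters of a reference clustering.
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def align_clusters(reference_labels, other_labels, n_clusters):
            """Return other_labels relabelled to agree with the reference ids."""
            # contingency[i, j] counts objects placed in reference cluster i
            # and in cluster j of the other clustering.
            contingency = np.zeros((n_clusters, n_clusters), dtype=int)
            for r, o in zip(reference_labels, other_labels):
                contingency[r, o] += 1
            # linear_sum_assignment minimises cost, so negate to maximise overlap.
            ref_ids, other_ids = linear_sum_assignment(-contingency)
            mapping = {int(o): int(r) for r, o in zip(ref_ids, other_ids)}
            return np.array([mapping[int(o)] for o in other_labels])

        # Example: the second labelling names the same three groups differently.
        a = np.array([0, 0, 1, 1, 2, 2])
        b = np.array([2, 2, 0, 0, 1, 1])
        print(align_clusters(a, b, 3))   # -> [0 0 1 1 2 2]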

    In this thesis, the proposed process has been evaluated on heterogeneously distributed data, and the results of the evaluations have been compared with four other consensus clustering algorithms. These comparisons show that the proposed process is more efficient than the other algorithms in most cases.

    This chapter first discusses basic concepts such as data mining and clustering. The problem of consensus clustering is then introduced through an example, after which an outline of the research conducted in this thesis and the results obtained in the evaluations is presented.

    Advanced computer systems face a huge amount of data collected from various sources such as point-of-sale terminals [1], bank transactions, electronic commerce, credit cards and satellites. Given this ever-increasing volume of data, the need for a process to analyze it and extract the knowledge hidden within it is felt more than ever.

    Data mining [2], in a simple definition, is the process of discovering knowledge from large data sets. The term data mining is often used synonymously with knowledge discovery from data [3], but data mining is in fact only one of the main stages of knowledge discovery. Figure 1-1 shows the process of discovering knowledge from data; as can be seen, it consists of an iterative sequence of the following steps [37] (a small end-to-end code illustration follows the list):

    1. Data cleaning[4]

    2. Data integration[5]

    3. Data selection[6]

    4. Data transformation[7]

    5. Data mining

    6. Pattern evaluation [10]

    7. Presentation of knowledge
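
    As a purely illustrative, toy walk-through of these steps (the file names, column names and the choice of pandas/scikit-learn are assumptions made for the example, not part of the thesis), the whole sequence could look like this:

        # Toy knowledge-discovery pipeline covering the seven steps above.
        import pandas as pd
        from sklearn.preprocessing import StandardScaler
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        sales = pd.read_csv("sales.csv")            # hypothetical source tables
        customers = pd.read_csv("customers.csv")

        clean = sales.dropna()                               # 1. data cleaning
        merged = clean.merge(customers, on="customer_id")    # 2. data integration
        selected = merged[["age", "income", "n_purchases"]]  # 3. data selection
        scaled = StandardScaler().fit_transform(selected)    # 4. data transformation
        labels = KMeans(n_clusters=4, n_init=10).fit_predict(scaled)  # 5. data mining
        quality = silhouette_score(scaled, labels)           # 6. pattern evaluation
        print(f"silhouette score = {quality:.2f}")           # 7. presentation of knowledge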

    Data mining methods

    The tasks performed in data mining fall into two groups, descriptive [11] and predictive [12]. Descriptive tasks reveal the general characteristics of the data in the database, while predictive tasks perform inference on the available data in order to make predictions. Some of the most important data mining methods are classification [13], clustering [14], association rule discovery [15] and outlier detection [16]. Among these, clustering and association rule discovery are descriptive tasks, while classification and outlier detection are considered predictive tasks.

    Clustering

    Cluster analysis is one of the most fundamental human activities. As children, we learn to differentiate between dogs and cats, or between animals and plants, by continually refining an implicit clustering in our minds [30]. Clustering is often regarded as the first step and one of the most important methods of data analysis. It is a process in which objects are grouped into sets of similar objects: each group, or cluster, contains objects that are similar to one another and dissimilar to the objects of other groups. Clustering is a form of data modeling rooted in mathematics and statistics [8]. Unlike classification, which is a supervised learning method [17], clustering is considered an unsupervised learning method [18], because in classification the data carry a class label [19], whereas in clustering the class labels of the data are not known. The goal in clustering is to minimize the distances between data points within a cluster and to maximize the distances between different clusters, so clustering can be viewed as a kind of optimization problem. In the literature, the terms segmentation [20] and partitioning [21] are sometimes used as synonyms for clustering.
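
    A small, runnable illustration of this optimization view (assuming scikit-learn and a synthetic data set, which are choices made for the example): k-means directly minimizes the total within-cluster distance, and a good result also shows well-separated cluster centers.

        # Compare within-cluster compactness and between-cluster separation.
        import numpy as np
        from sklearn.datasets import make_blobs
        from sklearn.cluster import KMeans

        X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

        within = km.inertia_            # sum of squared distances to the nearest centroid
        centers = km.cluster_centers_
        between = np.mean([np.linalg.norm(c1 - c2)   # average distance between centroids
                           for i, c1 in enumerate(centers)
                           for c2 in centers[i + 1:]])
        print(f"within-cluster SSE = {within:.1f}, "
              f"mean between-centroid distance = {between:.1f}")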

    Humans cannot discover knowledge in the mass of information stored in databases without the help of summarization methods. Basic statistics (such as the mean and variance) or frequency charts [22] provide only elementary information about the data, whereas cluster analysis can uncover more complex relationships between data objects, between particular attributes of the data, or between the two [61]. Clustering has wide applications in artificial intelligence, biology, customer relationship management [23], data mining, machine learning, marketing, medicine, pattern recognition, information retrieval, and image processing. In biology, for example, clustering can produce a taxonomy of species based on the characteristics of organisms, and it can help in understanding the function of genes in the biological processes of a cell [61]. In business, clustering helps marketers discover distinct groups of customers based on their buying patterns. It can be used to identify groups of houses in a city according to house type, value and geographic location, or to identify groups of car insurance policy holders with a high average claim cost. Clustering can also be used to group search engine results on the web. Figure 1-2 shows a two-dimensional plot of customer locations in a city, obtained by clustering information about the customers of a store [1].

    Many clustering algorithms have been developed so far. One of the more recently proposed and widely adopted approaches is consensus clustering, whose goal is to combine several input clusterings into a single clustering of higher quality than the inputs.

    We propose a consensus clustering process that works on heterogeneous distributed data and consists of three phases. In the first phase, we identify the corresponding clusters among the input clusterings. In the second phase, we assign a weight to each cluster using the Davies-Bouldin index. Finally, consensus clustering is performed according to the assigned weights.
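
    A minimal sketch of the second and third phases follows. It assumes the aligned label vectors from the first phase are available, and it simplifies the weighting to one weight per input clustering, taken as the inverse of that clustering's Davies-Bouldin index (a lower index means better clusters, hence a larger weight); the thesis itself weights individual clusters, so this is only an illustrative approximation, as is the weighted-voting rule.

        # Phases 2 and 3 sketch: Davies-Bouldin-based weights plus weighted voting.
        import numpy as np
        from sklearn.metrics import davies_bouldin_score

        def weighted_vote_consensus(X, aligned_labelings, n_clusters):
            """Combine aligned label vectors into one consensus labelling."""
            # Phase 2: a quality weight per clustering (inverse Davies-Bouldin index).
            weights = [1.0 / davies_bouldin_score(X, labels)
                       for labels in aligned_labelings]
            # Phase 3: every clustering votes for each object's cluster,
            # with its vote scaled by its weight.
            votes = np.zeros((len(X), n_clusters))
            for w, labels in zip(weights, aligned_labelings):
                votes[np.arange(len(X)), labels] += w
            return votes.argmax(axis=1)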

    We also evaluate the proposed method and compare the results with four other prominent consensus clustering algorithms. The comparison shows that our method achieves better results in most cases.

  • Contents & References of Consensus clustering on heterogeneous distributed data

    List:

    Abstract

    Chapter 1 Introduction

    1-1- Introduction

    1-2- Data mining

    1-3- Data mining methods

    1-4- Clustering

    1-5- Consensus clustering

    1-6- The research conducted in the thesis

    1-7- Results obtained

    1-8- The structure of the thesis

    Chapter 2 An overview of related work

    2-1- Introduction

    2-2- Clustering methods

    2-2-1- Partitioning methods

    2-2-2- Hierarchical methods

    2-2-3- The K-Means clustering algorithm

    2-3- Consensus clustering

    2-3-1- Motivations for using consensus clustering

    2-3-2- The consensus clustering problem: an example

    2-3-3- An overview of consensus clustering methods

    2-3-4- Grouping of consensus clustering methods

    2-3-5- Similarity-based methods

    Pairwise similarity (correlation matrix)

    Graph-based methods

    2-3-6- Consensus methods based on mutual information

    2-3-7- Consensus methods based on a hybrid model

    2-3-8- Vote-based consensus methods

    2-4- Methods for generating the clustering ensemble

    2-5- Chapter summary

    Chapter 3 Presentation of the proposed solution: Consensus clustering on heterogeneous distributed data

    3-1- Introduction

    3-2- The proposed solution

    3-2-1- Identifying corresponding clusters

    3-2-2- Weighting the clusters

    3-2-3- Consensus clustering on heterogeneous distributed data

    3-3- Generating the clustering ensemble

    3-4- Chapter summary

    Chapter 4 Implementation of the proposed solution and its evaluation results

    4-1- Introduction

    4-2- Evaluation criteria

    4-2-1- Accuracy criterion

    4-2-2- Davies-Bouldin index

    4-2-3- Rand index

    4-2-4- Average normalized mutual information (ANMI)

    4-3- Implementation

    4-4- Data sets

    4-5- Evaluation results

    4-5-1- Accuracy criterion

    4-5-2- Davies-Bouldin index

    4-5-3- Rand index

    4-5-4- Average normalized mutual information (ANMI)

    4-6- Chapter summary

    Chapter 5 Conclusion and future work

    5-1- Introduction

    5-2- Conclusion

    5-3- Future work

    References

    Appendix A: List of abbreviations

    Appendix B: English to Persian dictionary

    Appendix C: Persian to English dictionary

    References:

    [1]

    Agarwal, P. K., Har-Peled, S., & Yu, H.  2013. Embeddings of surfaces, curves, and moving points in Euclidean space. SIAM Journal on Computing, 42(2), 442-458.

    [2]

    Alam, S., Dobbie, G., Koh, Y. S., & Riddle, P. 2013, April, Clustering heterogeneous web usage data using hierarchical particle swarm optimization, In Swarm Intelligence (SIS), 2013 IEEE Symposium on (pp. 147-154). IEEE.

    [3]

    Al-Zoubi, M. B., Hudaib A., Huneiti A. and Hammo B. 2008. New Efficient Strategy to Accelerate k-Means Clustering Algorithm. American Journal of Applied Sciences. 5:1247-1250

    [4]

    Amigó, E., Gonzalo, J., Artiles, J. and Verdejo, F. 2008. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Journal of Information Retrieval. Springer.

    [5]

    Arthur, D. and Vassilvitskii, S. 2007. k-means++: the advantages of careful seeding. Proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms. p:1027-1035.

    [6]

    Ayad, H. G. 2008. Voting-Based Consensus of Data Partitions. PhD Thesis (In University of Waterloo).

    [7]

    Ayad, H. G. and Kamel, M. S. 2005. Cluster-based cumulative ensembles. In Multiple Classifier Systems: Sixth International Workshop, MCS 2005. Seaside, CA, USA. p:236–245.

    [8]

    Belghini, N., Zarghili, A., Kharroubi, J., & Majda, A. 2011, January. Sparse Random Projection and Dimensionality Reduction Applied on Face Recognition. In The Proceedings of International Conference on Intelligent Systems & Data Processing (pp. 78-82).

    [9]

    Berkhin, P. 2006. Survey on Clustering Data Mining Techniques. Grouping Multidimensional Data. Springer. p:25-71.

    [10]

    Boulis, C. and Ostendorf, M. 2004. Combining multiple clustering systems. In The 8th European conference on Principles and Practice of Knowledge Discovery in Databases(PKDD), LNAI 3202. p:63–74.

    [11]

    Chunsheng, H., Qian, C., Haiyuan, W. and Wada, T. 2008. RK-Means Clustering: K-Means with Reliability. IEICE transactions on information and systems. 91(1):96-104.

    [12]

    David, G. and Thomas, H. 2005. Non-redundant clustering with conditional ensembles. The 11th ACM SIGKDD international conference on knowledge discovery in data mining. p:70-77.

    [13]

    Dimitriadou, E., Weingessel, A. and Hornik, K. 2002. A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence. 16:901–912.

    [14]

    Domeniconi, C. and Al-Razgan, M. 2007. Weighted Cluster Ensembles: Methods and Analysis. Technical Report ISE-TR-07-06.

    [15]

    Domininique, V., Abdi, H., Williams, L. J., Bennani-Dosse, M. 2012. Statis and distatis: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics, 4(2), 124-167.

    [16]

    Duda, R. O., Hart, P. E., & Stork, D. G. 2012. Pattern classification. John Wiley & Sons.

    [17]

    Dudoit, S. and Fridlyand, J. 2003. Bagging to improve the accuracy of a clustering procedure. Bioinformatics. 19(9):1090-1099

    [18]

    Elkan, C. 2003. Using the triangle inequality to accelerate k-means. Proceedings of the 20th International Conference on Machine Learning (ICML-2003).

    [19]

    Fischer, B. and Buhmann, J. M. 2003. Bagging for path-based clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 25:1411–1415.

    [20]

    Fred, A. 2001. Finding consistent clusters in data partitions. The Second International Workshop on Multiple Classifier Systems. Springer-Verlag. p:309-318.

    [21]

    Fred, A. and Jain, K. A. 2002. Evidence Accumulation Clustering Based on the K-Means Algorithm. The Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition. Springer-Verlag. p:442-451.

    [22]

    Gasieniec, L., Jansson, J. and Lingas, A. 2004. Approximation algorithms for Hamming clustering problems. Journal of Discrete Algorithms. Elsevier. 2:289-301

    [23]

    Gionis, A., Mannila, H. and Tsaparas, P. 2005. Clustering Aggregation. In Proceedings of Twenty-first International Conference on Data Engineering (ICDE). p:341-352.

    [24]

    Guillaume, R., & Mouaddib, N. 2002. SAINTETIQ: a fuzzy set-based approach to database summarization. Fuzzy sets and systems, 129(2), 137-162.

    [25]

    Gondek, D. and Hofmann, T. 2004. Non-redundant data clustering. In Proceedings of the Fourth IEEE International Conference on Data Mining. p:75–82.

    [26]

    Gordon, A. D. and Vichi, M. 2001. Fuzzy partition models for fitting a set of partitions. Psychometrika. 66:229–248.

    [27]

    Greene, D., Tsymbal A., Bolshakova, N. and Cunningham P. 2004. Ensemble Clustering in Medical Diagnostics. Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems. p:576-581.

    [28]

    Gupta, M., & Han, J. 2011, Heterogeneous network-based trust analysis: a survey, ACM SIGKDD Explorations Newsletter, 13(1), 54-71.

    [29]

    Halkidi, M., Batistakis, Y. and Vazirgiannis, M. 2002. Clustering validity checking methods: part II. ACM SIGMOD Record. 31:19-27.

    [30]

    Han, J.
