Presenting a new method in information clustering using a combination of bat algorithm and Fuzzy c-means

Number of pages: 102 File Format: word File Code: 30894
Year: 2012 University Degree: Master's degree Category: Industrial Engineering
  • Summary of Presenting a new method in information clustering using a combination of bat algorithm and Fuzzy c-means

    Master's thesis in the field of automation and instrumentation engineering

    Abstract

    Clustering places data into groups whose members are similar in some respect. Similarity between data within the same cluster is maximal, and similarity between data in different clusters is minimal.

    Fuzzy c-means is a fuzzy clustering technique which, despite being sensitive to initialization and prone to converging to local optima, is one of the most common methods because of its efficiency and ease of implementation. In this thesis, a hybrid method based on the bat algorithm and Fuzzy c-means is used to address these problems. To validate it, the proposed method is applied to several well-known data sets and the results are compared with those of tabu search, ant colony optimization, particle swarm optimization, simulated annealing and k-means. The results demonstrate the capability and robustness of the method.
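    As a rough illustration of how a bat algorithm can be pointed at the Fuzzy c-means objective, the sketch below searches over candidate cluster centers with a generic bat algorithm (frequency, velocity, loudness and pulse-rate updates in their commonly published form). It is not the thesis's FCM-BA method, whose details are not given in this summary. The function names (`fcm_cost`, `bat_search`), all parameter values, the local step size and the toy data are assumptions made for the example.

```python
import math
import random

def fcm_cost(centers, points, m=2.0):
    """FCM objective for fixed centers, with memberships set to their optimal values."""
    p = 1.0 / (m - 1.0)
    total = 0.0
    for x in points:
        d2 = [max(sum((a - b) ** 2 for a, b in zip(x, ck)), 1e-12) for ck in centers]
        # With optimal memberships, one point contributes (sum_k d2_k**(-p)) ** (1 - m).
        total += sum(d ** (-p) for d in d2) ** (1.0 - m)
    return total

def bat_search(points, c, n_bats=20, n_iter=200, f_min=0.0, f_max=2.0,
               alpha=0.9, gamma=0.9, step=0.01, seed=0):
    """Generic bat algorithm over flattened center vectors (illustrative only)."""
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    size = c * dim  # each bat encodes all c centers as one flat vector

    def decode(vec):
        return [vec[k * dim:(k + 1) * dim] for k in range(c)]

    def cost(vec):
        return fcm_cost(decode(vec), points)

    def clip(vec):
        return [min(max(v, lo[j % dim]), hi[j % dim]) for j, v in enumerate(vec)]

    bats = [[rng.uniform(lo[j % dim], hi[j % dim]) for j in range(size)]
            for _ in range(n_bats)]
    vel = [[0.0] * size for _ in range(n_bats)]
    fit = [cost(b) for b in bats]
    loud = [1.0] * n_bats                          # loudness A_i
    rate0 = [rng.random() for _ in range(n_bats)]  # limiting pulse rate (assumed random)
    rate = [0.0] * n_bats                          # current pulse rate r_i
    i_best = min(range(n_bats), key=lambda i: fit[i])
    best, best_fit = bats[i_best][:], fit[i_best]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-driven velocity and position update.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], bats[i], best)]
            cand = [x + v for x, v in zip(bats[i], vel[i])]
            if rng.random() > rate[i]:
                # Local random walk around the best solution found so far.
                cand = [b + step * (hi[j % dim] - lo[j % dim]) * mean_loud
                        * rng.uniform(-1.0, 1.0) for j, b in enumerate(best)]
            cand = clip(cand)
            f_cand = cost(cand)
            if f_cand <= fit[i] and rng.random() < loud[i]:
                # Accept: the bat gets quieter and emits pulses more often.
                bats[i], fit[i] = cand, f_cand
                loud[i] *= alpha
                rate[i] = rate0[i] * (1.0 - math.exp(-gamma * t))
            if f_cand < best_fit:
                # The global best is tracked even when the bat rejects the move.
                best, best_fit = cand[:], f_cand
    return decode(best), best_fit

# Toy data: two well-separated groups of three points each (assumed example).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, cost = bat_search(data, c=2)
print(sorted(centers), cost)
```

    Because the memberships are eliminated analytically inside `fcm_cost`, each bat only has to carry the cluster centers, which keeps the search space small. A hybrid could also refine the best bat with a few ordinary Fuzzy c-means iterations; that step is left out here.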

    Introduction

    Data and patterns are among the most important elements of the world of information, and clustering is one of the best methods available for working with data. Its ability to explore the data space and recognize its structure has made clustering one of the most suitable mechanisms for working with very large volumes of data.

    In clustering, samples are divided into categories that are not known in advance. Clustering is therefore a learning method that categorizes data on its own, without prior knowledge and without observing predefined examples.

    Clustering amounts to finding structure in unclassified data sets. In other words, clustering places data into groups whose members are similar in some respect, so that similarity between data within a cluster is maximal and similarity between data in different clusters is minimal. The criterion of similarity here is distance: samples that are closer to each other are placed in the same cluster. Calculating the distance between two data points is therefore very important in clustering, because it affects the quality of the final results.

    Distance, which represents dissimilarity, makes it possible to move through the data space and form clusters. By calculating the distance between two data points, one can tell how close they are to each other and whether they belong in the same cluster. Various mathematical functions can be used to calculate distance, such as Euclidean distance, Hamming distance and so on.
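    The two distance functions named above can be sketched in a few lines. This is a minimal illustration, not code from the thesis; the helper `nearest_center` and the sample inputs are assumptions added to show how a distance decides which cluster a point falls in.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def nearest_center(point, centers):
    """Index of the center closest to `point` under Euclidean distance."""
    return min(range(len(centers)), key=lambda k: euclidean(point, centers[k]))

print(euclidean((0.0, 0.0), (3.0, 4.0)))                       # 5.0
print(hamming("karolin", "kathrin"))                            # 3
print(nearest_center((1.0, 1.0), [(0.0, 0.0), (5.0, 5.0)]))     # 0
```

    Euclidean distance suits continuous features, while Hamming distance suits categorical or binary codes; the choice changes which points count as "close" and therefore which clusters form.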

    1-1-Problem statement

    Clustering means finding structure in unlabeled data sets, and it can be regarded as the most important problem in unsupervised learning. The idea of clustering was first proposed in the 1930s; following the major advances made since then, it is now used in a wide range of applications. A simple search on the web or in a library database shows how widely it is used. Clustering algorithms are applied in many fields, including the following:

    Data mining: discovering new information and structure in existing data

    Speech recognition: building a codebook from feature vectors, separating speech by speaker, or compressing speech

    Image segmentation: segmenting medical or satellite images

    Web (WWW): classifying documents or sites, and so on

    Biology: classifying animals and plants based on their characteristics

    Urban planning: classifying houses based on their type and geographical location

    Seismology: identifying hazard-prone areas based on previous observations

    Library: classifying books

    Insurance: identifying fraudulent individuals

    Marketing: grouping customers according to their needs, based on their most recent purchases

    With the growing use of clustering, new and more efficient methods continue to be proposed, each designed for a specific application. Despite these efforts, clustering is still underused in many sciences, and there is considerable room to extend it.

    1-2-Research Background

    We live in a world full of data, and every day we have to store or display large amounts of information. Clustering is one of the essential methods for controlling and managing such data: data with similar properties are placed in the same category, or cluster. The idea of clustering was first presented in the 1930s, and the major advances made since then have attracted the attention of many researchers. As a result, it is used in a wide range of applications, and many methods have been proposed for it [1]. Clustering algorithms can be divided into two general categories: hard clustering and fuzzy clustering. In hard clustering, each data point belongs to exactly one cluster, whereas in fuzzy clustering a data point may belong to two or more clusters at the same time [2], [3], [4]. The Fuzzy c-means (FCM) algorithm is one of the best-known fuzzy clustering methods and is easy to implement. Unfortunately, its original version has limitations such as dependence on the initial values and convergence to local optima [5], [6].

    Genetic algorithms do not suffer from these limitations, and combining the two algorithms has produced significant results, with much faster convergence than earlier approaches [7]. Kao and colleagues combined a genetic algorithm with PSO, using crossover and mutation operators for the genetic part. Their method was able to solve a variety of continuous-function problems and achieved significant improvements in finding the global optimum and in the convergence rate [8]. Asgarian and colleagues proposed a method combining a genetic algorithm with the fuzzy method in 1386 SH (2007/08). It addressed the strong dependence on the initial number of clusters and the initial positions of their centers, as well as the inability to cluster data that are equidistant from several cluster centers. A further advantage of this combination is lower computational complexity [9]. Another hybrid used in data mining problems combines Fuzzy c-means with PSO, which mitigated convergence to local optima and improved convergence speed [10], [11]. A more recent hybrid combines the FCM algorithm with a fuzzy memetic algorithm to improve clustering performance; the reported results show better solutions and higher stability [12]. The combination of FCM and simulated annealing (SA) is another hybrid method, used in cancer diagnosis [13], [14], [15], [16]. In line with these efforts, this thesis combines the FCM algorithm with the bat algorithm in order to exploit the advantages of both in solving clustering problems.
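    To make the Fuzzy c-means behaviour discussed above concrete, here is a minimal sketch of the standard FCM iteration: centers are recomputed as membership-weighted means, and memberships are recomputed from the relative distances to all centers. It is an illustration under stated assumptions, not the thesis's implementation; the function name `fuzzy_c_means`, the fuzzifier `m = 2`, the stopping tolerance, the random initialization and the toy data are all choices made for the example.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Basic fuzzy c-means. Returns (centers, memberships, objective)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial membership matrix; each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    exponent = 2.0 / (m - 1.0)
    centers = []
    for _ in range(max_iter):
        # Center update: mean of the points weighted by u_ik ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / w_sum
                            for d in range(dim)])
        # Membership update from the ratios of distances to all centers.
        change = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], ck), 1e-12) for ck in centers]
            row = [1.0 / sum((dist[k] / dist[j]) ** exponent for j in range(c))
                   for k in range(c)]
            change = max(change, max(abs(a - b) for a, b in zip(row, u[i])))
            u[i] = row
        if change < tol:
            break
    # Objective: sum of u_ik ** m times squared distance to each center.
    objective = sum(u[i][k] ** m * math.dist(points[i], centers[k]) ** 2
                    for i in range(n) for k in range(c))
    return centers, u, objective

# Toy data: two well-separated groups of three points each (assumed example).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, memberships, cost = fuzzy_c_means(data, c=2)
for center in sorted(centers):
    print([round(v, 2) for v in center])
print(round(cost, 3))
```

    Unlike hard clustering, every point receives a membership in every cluster, and each row of `memberships` sums to 1. The result depends on the random initial memberships (here controlled by `seed`), which is the sensitivity to initialization that the hybrid methods surveyed above try to reduce.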

    1-3-Research Objective

    The purpose of this research is to examine the existing clustering algorithms and to present an algorithm that overcomes their limitations to an acceptable extent. These limitations concern the following requirements:

    × Performance on high-volume databases

    × Discovery of clusters with different shapes

    × Insensitivity to the order of the input data

    × Interpretability and usability

    1-4-Importance of research

    As database systems and tools for storing large volumes of data multiplied, the need was felt for automatic methods that could discover knowledge from the data. In addition, because performing operations on massive amounts of data is costly in human and material resources, methods requiring minimal user intervention were needed. Extracting useful information from masses of data and turning it into the knowledge that organizations need, especially for decision-making, called for new methods in this field. Data mining is one such tool for discovering knowledge in databases. Data mining can be described as extracting valid, understandable and reliable information from very large databases, which helps uncover hidden patterns and reliable relationships in the data for use in decision-making. In fact, understanding and handling data is one of the important goals of data mining.

    This process was introduced in the late 1980s and has been treated seriously in statistics since 1995. It is now one of the most important tools for making effective use of large amounts of data, and its importance grows every day.

  • Contents & References of Presenting a new method in information clustering using a combination of bat algorithm and Fuzzy c-means

    List:

    1- Chapter One: Introduction
        1-1- Problem statement
        1-2- Research background
        1-3- Research objective
        1-4- Importance of the research
        1-5- Thesis outline

    2- Chapter Two: Clustering based on the Fuzzy c-means algorithm
        2-1- Introduction
        2-2- Information clustering
            2-2-2- Clustering applications
            2-2-3- Types of clusters
            2-2-4- Clustering steps
            2-2-5- Types of clustering methods
            2-2-6- Hierarchical clustering
                2-2-6-1- Divisive hierarchical clustering
                2-2-6-2- Agglomerative hierarchical clustering
            2-2-7- Partitional clustering
                2-2-7-1- k-means algorithm
            2-2-8- Overlapping clustering
                2-2-8-1- Fuzzy clustering

    3- Chapter Three: Optimization based on the bat algorithm
        3-1- Introduction
        3-2- Description of the optimization problem
        3-3- Methods of solving optimization problems
            3-3-1- Particle swarm optimization algorithm
            3-3-2- Honey bee mating algorithm
            3-3-3- Ant colony algorithm
            3-3-4- Tabu search algorithm
            3-3-5- Simulated annealing algorithm
            3-3-6- Bat algorithm
            3-3-7- Suggested solutions to improve the performance of the bat algorithm
                3-3-7-1- Selection of the initial population based on the opposite-number rule
                3-3-7-2- Self-adaptive mutation strategy
        3-4- Comparison criteria for optimization algorithms
            3-4-1- Efficiency
            3-4-2- Standard deviation
            3-4-3- Reliability
            3-4-4- Convergence speed
        3-5- Definition of various numerical problems
            3-5-1- Rosenbrock function
            3-5-2- Schwefel function
            3-5-3- Rastrigin function
            3-5-4- Ackley function
            3-5-5- Griewank function

    4- Chapter Four: Proposed algorithm
        4-1- Introduction
        4-2- Information clustering by the proposed hybrid method
        4-3- Setting the parameters of the proposed algorithm
        4-4- Examining the results of the proposed algorithm and comparing it with other algorithms
            4-4-1- Introducing the data used and the related simulation results
                4-4-1-1- Iris data set
                4-4-1-2- Wine data set
                4-4-1-3- CMC data set
                4-4-1-4- Vowel data set

    5- Chapter Five: Conclusion and suggestions
        5-1- Conclusion
        5-2- Suggestions for future work

    Table List

    Table 2-1 Advantages and disadvantages of the k-means algorithm
    Table 2-2 Advantages and disadvantages of the Fuzzy c-means algorithm
    Table 2-3 Similarity criteria based on different distance functions
    Table 3-1 Numerical functions used to test the algorithms
    Table 4-1 Parameters of the proposed algorithm
    Table 4-2 Cluster centers obtained by running the FCM-BA algorithm on the Iris data set
    Table 4-3 Results of existing algorithms on the Iris data set
    Table 4-4 Results of the FCM-BA algorithm for different parameter values on the Iris data set
    Table 4-5 Results of existing algorithms on the Wine data set
    Table 4-6 Cluster centers obtained by running the FCM-BA algorithm on the Wine data set
    Table 4-7 Results of the FCM-BA algorithm for different parameter values on the Wine data set
    Table 4-8 Cluster centers obtained by running the proposed algorithm on the CMC data set
    Table 4-9 Results of existing algorithms on the CMC data set
    Table 4-10 Results of the FCM-BA algorithm for different parameter values on the CMC data set
    Table 4-11 Cluster centers obtained by running the proposed algorithm on the Vowel data set
    Table 4-12 Results of existing algorithms on the Vowel data set
    Table 4-13 Results of the FCM-BA algorithm for different parameter values on the Vowel data set

    Source:

    References

    [1] M.R. Anderberg, "Cluster Analysis for Applications", Academic Press, New York, 1973.

    [2] J.A. Hartigan, "Statistical theory in clustering", Journal of Classification, 1985, Vol. 2, pp. 63-76.

    [3] J.R. Kettenring, "The Practice of Cluster Analysis", Journal of Classification, 2006, Vol. 23, pp. 3-30.

    [4] J.H. Ward Jr., "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association, 1963, Vol. 58, pp. 236-244.

    [5] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Fifth Berkeley Symp. Math. Statistics and Probability, 1967, Vol. 1, pp. 281-297.

    [6] J. Bezdek, "Fuzzy mathematics in pattern classification", Ph.D. thesis, Cornell University, Ithaca, NY, 1973.

    [7] I. Karen, A.R. Yildiz, N. Kaya, N. Ozturk, F. Ozturk, "Hybrid approach for genetic algorithm and Taguchi's method based design optimization in the automotive industry", International Journal of Production Research, 2006, Vol. 44, pp. 4897-4914.

    [8] Yi-Tung Kao, Erwie Zahara, I-Wei Kao, "A hybridized approach to data clustering", Expert Systems with Applications, 2008, Vol. 34, pp. 1754-1762.

    [9] Ehsan Asgarian, Hossein Moinzadeh, Mohsen Siriani, Jafar Habibi, "A new approach for fuzzy clustering by genetic algorithm", 13th annual conference of the Iranian Computer Association, 1386 SH.

    [10] Hesam Izakian, Ajith Abraham, "Fuzzy C-means and fuzzy swarm for fuzzy clustering problem", Expert Systems with Applications, 2011, Vol. 38, pp. 1835-1838.

    [11] S.-K.S. Fan, E. Zahara, "A hybrid simplex search and particle swarm optimization for unconstrained optimization", European Journal of Operational Research, 2007, Vol. 181, pp. 527-548.

    [12] Fatemeh Golichenari, Mohammad Saniee Abadeh, "A new Method For Fuzzy Clustering Based on Fuzzy C-means Algorithm and Memetic Algorithm", 2007.

    [13] S. Kirkpatrick, C.D. Gelatt Jr., M.P. Vecchi, "Optimization by Simulated Annealing", Science, 1983, Vol. 220, No. 4598, pp. 671-680.

    [14] Saeed Parsa, Hamid Saadi, Hamid Mohamadi, "Scheduling jobs on computational grid using simulated annealing", 2007.

    [15] B. Suman, "Study of simulated annealing based algorithms for multi objective optimization of a constrained problem", Computers and Chemical Engineering, 2004, Vol. 28, Issue 9, pp. 1849-1871.

    [16] R. Zhang, C. Wu, "A hybrid immune simulated annealing algorithm for the job shop scheduling problem", Applied Soft Computing, 2010, Vol. 10, pp. 79-89.

    [17] Aida Khayabani, Jamal Shahrabi, Rasool Aliannejad, Arash Sabbaghi, "The use of data mining in the diagnosis of tuberculosis", 3rd Iran data mining conference, 1388 SH (2009).

    [19] J. C.
