Searching for emerging patterns with streaming features

Number of pages: 132 File Format: word File Code: 31061
Year: 2013 University Degree: Master's degree Category: Computer Engineering
  • Part of the Content
  • Contents & Resources
  • Summary of Searching for emerging patterns with streaming features

    Master's Thesis in Computer Engineering (Artificial Intelligence)

    Abstract

     

    Extracting useful patterns from data sets is one of the challenging topics in data mining. On the other hand, in high-dimensional data, extracting a small set of emerging patterns with strong predictability is one of the important issues in creating a classifier based on emerging patterns. In the real world, features are not always fully available; Accordingly, the problem becomes more difficult when the feature set is unknown before the learning process begins. Streaming features are features that are generated online and processed at the same time as production. In this design, the features emerge one by one over time instead of all the features being ready before the learning process.

    In this study, we propose a dynamic structure of the frequent pattern tree so that the tree is built as soon as new features arrive and the emerging patterns can be extracted online. DFP-SEPSF provides an efficient bottom-up method to construct an unordered dynamic recurrent pattern tree UDFP-tree and an ordered dynamic recurrent pattern tree ODFP-tree. The first method does not consider the order of items, while the second method applies the order of items.

    Furthermore, the proposed framework extracts strong emerging patterns to create a robust and fast classifier that can deal with noise.

    The proposed method significantly reduces the search space of emerging patterns and extracts emerging patterns with strong discriminating power by removing useless patterns.

    The proposed method of patterns It simultaneously discovers emergents for each class and, in addition, guides the process of generating recurrent pattern trees efficiently in order to reduce calculations.

    Evaluation of our experiences on a wide range of data shows the effectiveness of the proposed method compared to other known methods in terms of prediction accuracy, number of extracted patterns and execution time.

    Keywords:

    Emergent patterns, dynamic recurrent pattern, order of item tree. , flow features

    Chapter One

    Introduction

    Classification[1] is one of the basic tasks in data mining[2], which has been widely studied in the field of machine learning[3], neural networks[4] and pattern recognition[5]. The input is a collection of training examples[6] that includes several features[7]. Features can be separated into two categories, discrete features[8] and continuous features[9], according to their range of values. In general, a clause class [10] produces a concise and meaningful description (model [11]) for each class label [12] in relation to the properties. Then, the model is used to predict the class label of unknown samples [13]. Classification is also known as supervised learning [14] where each training instance has a class label. While, unsupervised learning[15] or clustering[16] searches and categorizes homogeneous groups of objects based on their attribute values; In fact, instances do not have class tags. Classification is successfully used in a wide range of applications, including scientific experiments [17], drug detection [18], weather forecasting [19], authentication [20], customer segmentation [21], target marketing [22] and fraud detection [23].

    Classification based on patterns [24] is considered a new methodology. Discovering patterns that indicate the distinction between different classes is considered one of the important topics in data mining. In this research, we extract the classification based on patterns called emerging patterns [25] (Emerging Patterns) that clearly show the distinction between classes, from the data set [26] and then, based on them, we perform the classification. to depict significant differences between classes [1]. An emerging pattern is an inflectional combination of features whose probability of presence in one class changes significantly compared to other classes [1,2]. These patterns are useful because they are able to express the distinction between classes.If the frequency [27] of any pattern in a class is significant compared to other classes, it indicates that this pattern is specifically assigned to this class, and on the other hand, these types of patterns are of particular importance for databases where there is a time limit for extracting knowledge from them. can be) greater than a threshold value. This threshold value should be chosen in such a way that the extracted patterns show the difference and distinction between different classes. These patterns are actually a set of items that express the inflectional combination between feature values ??[2]. Typically, the number of extraction patterns is very large, but only a small number of these patterns are desirable and useful for data analysis and classification. Since a large amount of these patterns are irrelevant [29] and repetitive [30], they do not provide new knowledge and therefore have an adverse effect on the accuracy of the classifier, which causes a decrease in the prediction accuracy [31]. To increase the efficiency [32] and accuracy, a procedure should be developed to remove dependent and unhelpful patterns to reduce the number of these patterns. An emerging pattern with a high probability in its own class and a low probability in its opposite class can be used to determine a test sample. The strength of this pattern is expressed by measures such as relative frequency [33] and its growth rate (probability ratio of the pattern in one class compared to other classes).

    In many applied fields such as knowledge discovery from genetic data [34], image processing [35], intrusion detection [36], outlier detection [37], fraud detection [38], unbalanced data [39], data flow [40], bioinformatics [41], system proposed [42], it is necessary to detect a sudden change in the data. Emerging patterns extract sudden changes and significant differences from the data. Emergent patterns, in the field of image processing for segmentation, work in such a way that they try to introduce a new segment in the pixels that have a sudden change in intensity [43]. In the field of intrusion and fraud detection, the behavior of the data is tracked, when the behavior of the data changes suddenly, it is recognized as an intrusion. In recommender systems, the system looks for the specific behaviors of each user in order to discover the specific characteristics of each user and suggest products according to their interests and talents. Therefore, emerging patterns play a significant role in this regard.

    The concept of streaming features[44]

    In streaming data[45], samples are received over time while the number of features is fixed. But in streaming features, the number of learning data is fixed, but the features are generated dynamically, and the learning algorithm receives the features over time [31, 32]. In the flow features of the routine, the features are produced by feature generation methods such as statistical relational learning methods [46] and interactions between features [47]. The problems that arise after the production of features by these methods are as follows: 1) Millions or even billions of features are produced, which due to memory limitations, it is possible to store this amount of features, and on the other hand, a lot of time must be spent to start the learning process. 2) The features are produced by the queries in SQL, the execution of these queries is limited to the time of the processor [48], almost the processor executes every hundred thousand queries in 24 hours. On the other hand, many features are irrelevant and repetitive production[49]. This shows that a small number of these production features are effective in the learning process and therefore the production of features is costly [32]. Based on this, in order to overcome these problems, the concept of flow characteristics was formed and an effort was made to guide the production process of characteristics by producing the dynamics of characteristics and examining these characteristics during production and its impact on the learning process.

    To deal with the challenges raised, the learning process must be able to respond to flow characteristics. In fact, the learning routine should be incrementally updateable upon receiving each feature without returning to the first learning step. Therefore, in order to extract strong patterns, the features should be checked first and the features that are irrelevant should be removed, then the patterns should be extracted from the useful and strong features.

  • Contents & References of Searching for emerging patterns with streaming features

    List:

    Chapter 1 .. 1

    1- Introduction .. 2

    1-1 Introduction .. 2

    1-2 The concept of emerging patterns.  3

    1-3 concept of flow characteristics.  5

    1-4 challenges in extracting emerging patterns.  6

    1-5 algorithms for extracting emerging patterns.  8

    1-6 main research idea.  11

    1-7 An overview of the treatise chapters.  13. Chapter Two.  15

    2-2-1 Classification Based on Association (CBA) method.  15

    2-2-2 Classification method based on Multiple-class Association Rule (CMAR) 16

    2-2-3 Classification method based on Prediction Association Rule (CPAR) 16

    2-3 Pattern extraction methods.  17

    2-3-1 border-based method.  17

    2-3-2 constraint-based method.  17

    2-3-3 Algorithm for extracting CP-tree conflict pattern tree.  18

    2-3-4 mining method with the help of zero binary diagram ZBDD Miner.  18

    2-3-5 method of extracting distinct emerging patterns DP-Miner.  18

    2-4 classification methods based on emerging patterns.  20

    2-4-1 classification method based on the total of CAEP emerging patterns.  20

    2-4-2 classification algorithm based on iCAEP information theory.  20

    2-4-3 classification method based on emerging mutation patterns JEPs-classifier.  21

    2-4-4 classification method based on emerging strong mutational patterns.  21

    2-4-5 decision-making method based on DeEPs sample.  21

    2-4-6 classification method by PCL right-hand set.  22

    The third chapter... 23

    3- Basic knowledge... 24

    3-1 Emerging patterns. 24

    3-2 DFP-tree dynamic recurrent pattern tree.  30

    Chapter 4 .. 33

    4- The proposed solutions for extracting strong emerging patterns based on current features.  34

    4-1 Introduction .. 34

    4-2- Unordered Dynamic Unordered Dynamic FP-tree.  35

    4-3 Ordered Dynamic FP-tree.  44

    4-4 SEP-Miner pattern extraction method.  56

    Chapter 5 .. 62

    5- Experimental tests.  63

    5-1 Introduction .. 63

    5-2 Classes of clauses.  63

    5-2-1 decision tree clause class C4.5.  63

    5-2-2 SVM class.  64

    5-2-3 simple Bayesian band class.  65

    5-2-4 nearest neighbor class.  66

    5-2-5 AdaBoost algorithm. 66

    5-3 Statistical tests.  68

    5-3-1 paired t-tets statistical test.  68

    5-3-2 Wilcoxon statistical test.  68

    5-3-3 Fredman's statistical test.  69

    5-4 experimental settings.  71

    5-5 Comparison of prediction accuracy.  73

    5-6 Comparison of the number of patterns.  81

    5-7 comparison of execution time.  83

    5-8 Analysis of the order effect in the construction of the dynamic recurrent pattern tree.  86

    5-9 How to determine the minimum threshold of relative frequency.  88

    5-10 Sensitivity analysis on minimum growth rate thresholds.  89

    5-11 Comparison of DFP-SEPSF performance without knowing the entire feature space.  90

    5-12 Summary of experimental results.  94

    Sixth chapter... 96

    6- Conclusion and future works.  97

    Abbreviations .. 99

    Persian to English dictionary.  100

    English to Persian dictionary.  108

    List of references          

     

    Source:

    Dong, Guozhu, and Jinyan Li. "Efficient mining of emerging patterns: Discovering trends and differences." In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp. 43-52. ACM, 1999.

    Zhang, Xiuzhen, Guozu Dong, and Ramamohanarao Kotagiri. "Exploring constraints to efficiently mine emerging patterns from large high-dimensional datasets." In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp. 310-314. ACM, 2000.

    Li, Jinyan, Guozhu Dong, and Kotagiri Ramamohanarao. "Making use of the most expressive jumping emerging patterns for classification." Knowledge and Information systems 3, no. 2 (2001):2 (2001): 131-145.

    Lo, David, Hong Cheng, Jiawei Han, Siau-Cheng Khoo, and Chengnian Sun. "Classification of software behaviors for failure detection: a discriminative pattern mining approach." In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 557-566. ACM, 2009.

    Li, Jinyan, Huiqing Liu, James R. Downing, Allen Eng-Juh Yeoh, and Limsoon Wong. "Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients." Bioinformatics 19, no. 1 (2003): 71-78.

    Fang, Gang, Gaurav Pandey, Wen Wang, Manish Gupta, Michael Steinbach, and Vipin Kumar. "Mining low-support discriminative patterns from dense and high-dimensional data." Knowledge and Data Engineering, IEEE Transactions on 24, no. 2 (2012): 279-294.

    Mao, Shihong, and Guozhu Dong. "Discovery of highly differential gene groups from microarray gene expression data using the gene club approach." Journal of Bioinformatics and Computational Biology 3, no. 06 (2005): 1263-1280.

    Boulesteix, Anne-Laure, Gerhard Tutz, and Korbinian Strimmer. "A CART-based approach to discover emerging patterns in microarray data." Bioinformatics 19, no. 18 (2003): 2465-2472.

    Li, Jinyan, and Limsoon Wong. "Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns." Bioinformatics 18, no. 5 (2002): 725-734.

    Wu, Xindong, Kui Yu, Hao Wang, and Wei Ding. "Online streaming feature selection." In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 1159-1166. 2010.

    Zhou, Jing, Dean P. Foster, Robert A. Stine, and Lyle H. Ungar. "Streamwise feature selection." The Journal of Machine Learning Research 7 (2006): 1885-1861.

    Dong, Guozhu, and Jinyan Li. "Mining border descriptions of emerging patterns from dataset pairs." Knowledge and Information Systems 8, no. 2 (2005): 178-202.

    Li, Jinyan, Kotagiri Ramamohanarao, and Guozhu Dong. "The Space of Jumping Emerging Patterns and Its Incremental Maintenance Algorithms." In ICML, pp. 551-558. 2000.

    Bayardo Jr, Roberto J. "Efficiently mining long patterns from databases." In ACM Sigmod Record, vol. 27, no. 2, pp. 85-93. ACM, 1998.

    Han, Jiawei, Jian Pei, and Yiwen Yin. "Mining frequent patterns without candidate generation." In ACM SIGMOD Record, vol. 29, no. 2, pp. 1-12. ACM, 2000.

    Han, Jiawei, Jian Pei, Yiwen Yin, and Runying Mao. "Mining frequent patterns without candidate generation: A frequent-pattern tree approach." Data mining and knowledge discovery 8, no. 1 (2004): 53-87.

    Fan, Hongjian, and Ramamohanarao Kotagiri. "An efficient single-scan algorithm for mining essential jumping emerging patterns for classification." In Advances in Knowledge Discovery and Data Mining, pp. 456-462. Springer Berlin Heidelberg, 2002.

    Loekito, Elsa, and James Bailey. "Fast mining of high dimensional expressive contrast patterns using zero-suppressed binary decision diagrams." In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 307-316. ACM, 2006.

    Yu, Kui, Wei Ding, Dan A. Simovici, and Xindong Wu. "Mining emerging patterns by streaming feature selection." In Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 60-68. ACM, 2012.

    Li, Jinyan, Guozhu Dong, and Kotagiri Ramamohanarao. "Instance-based classification by emerging patterns." In Principles of Data Mining and Knowledge Discovery, pp. 191-200. Springer Berlin Heidelberg, 2000.

    Dong, Guozhu, Xiuzhen Zhang, Limsoon Wong, and Jinyan Li. "CAEP: Classification by aggregating emerging patterns." In Discovery Science, pp. 30-42. Springer Berlin Heidelberg, 1999.

    Quinlan, John Ross. C4. 5: programs for machine learning. Vol. 1. Morgan Kaufmann, 1993

Searching for emerging patterns with streaming features