Detection of web spam using data mining techniques

Number of pages: 95 | File format: Word | File code: 31058
Year: 2014 | Degree: Master's | Category: Computer Engineering
  • Summary of Detection of web spam using data mining techniques

    Master's Thesis (M.Sc.)

    Abstract:

    Nowadays, web spam [1] is one of the main problems facing search engines, because it degrades the quality of search results. In recent years there have been many advances in detecting fake pages, but new spamming techniques have emerged in response, so anti-spam techniques must be improved continually to counter these attacks.

    A common problem in this field is that many documents are ranked high by the search engine even though they do not deserve it. Given the ever-increasing expansion of the web and the emergence of new spam techniques, the purpose of this thesis is to investigate data mining methods that better separate spam pages from non-spam ones.

    Data mining algorithms and software are among the tools used in this research. The standard UK2007 data set and the Weka software have been used to build optimal models, and the aim is to provide models that reduce the number of features needed to separate spam pages from non-spam ones while maintaining good performance. Data mining is a process for extracting useful patterns from databases [1]; it can extract the patterns its users need from different types of databases. Most researchers consider data mining synonymous with knowledge discovery in databases. Knowledge discovery comprises the following steps, performed sequentially:

    Data refinement: removes noisy and inconsistent data.

    Data integration: combines data sources when necessary.

    Data transformation: transforms data into a suitable form for data mining.

    Data mining: the essential step, in which intelligent methods are applied to extract patterns from the data.

    Pattern evaluation: evaluates the extracted patterns.

    Knowledge display: In this stage, different knowledge display techniques are used to show the discovered and explored knowledge to the user.
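The sequential steps above can be sketched, purely for illustration, in a few lines of Python. The records, feature, and threshold below are invented for demonstration and are not taken from the thesis or the UK2007 data set.

```python
# A minimal, purely illustrative sketch of the sequential KDD steps.
# The records and the "many outlinks" threshold are invented examples.

# Two hypothetical data sources to be integrated.
source_a = [{"host": "a.example", "links": 120,  "spam": 1},
            {"host": "b.example", "links": 8,    "spam": 0},
            {"host": "c.example", "links": None, "spam": 0}]  # inconsistent row
source_b = [{"host": "d.example", "links": 310,  "spam": 1}]

# 1. Data refinement: remove noisy/inconsistent records.
cleaned = [r for r in source_a if r["links"] is not None]

# 2. Data integration: combine the data sources.
integrated = cleaned + source_b

# 3. Data transformation: derive a form suitable for mining
#    (here, a single boolean feature: unusually many outlinks).
transformed = [{"many_links": r["links"] > 100, "spam": r["spam"]}
               for r in integrated]

# 4. Data mining: extract a trivial pattern (an association between
#    the derived feature and the spam label).
matches = sum(1 for r in transformed if r["many_links"] == bool(r["spam"]))

# 5. Pattern evaluation: measure how well the pattern holds.
accuracy = matches / len(transformed)

# 6. Knowledge display: present the discovered pattern to the user.
print(f"'many_links predicts spam' holds on {accuracy:.0%} of records")
```

Real systems replace each step with far heavier machinery (ETL tools, feature extractors, learning algorithms), but the sequential structure is the same.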

    With the growing capability of techniques and tools for creating and collecting data, the importance of databases in industry and research owing to their availability and power, and the role of the World Wide Web as a major source of information, we are faced with an enormous volume of data and databases.

    Although search engines have developed many techniques to identify web spam, spammers have devised new tactics to influence the results of search engine ranking algorithms in order to achieve higher ranks.

    Data mining, as an important and relatively new tool, is widely used to separate spam pages from non-spam ones.

    Statement of the problem:

    Search engines have become a place to search for information on the web. Considering the phenomenon of spam, the search results are not always favorable.

    For more than two decades, research on adversarial information retrieval has attracted many people in academia and industry. Spam has cast a shadow over every information system: e-mail, the web, blogs, and social networks. The concept was first proposed in 1996 and soon presented a challenge for search engines. Recently, all major search engine companies have made adversarial information retrieval a high priority because of the many negative effects caused by spam [3, 2]. First, spam degrades the quality of search results and reduces the visibility that legitimate sites would have in its absence.

    Second, it erodes the user's confidence in the search engine and ultimately drives the user to a competing engine, a switch that costs the user nothing.

    The goal is to determine various characteristics of web pages for ranking search engine results and, based on these, to perform classification that separates spam sites from legitimate ones. Beyond ranking manipulation, spam also raises problems related to adult content, malware, and attacks. For example, a ranking of 100 million pages by the PageRank algorithm showed that 11 of the top 20 results were pornographic sites that reached those positions by manipulating content and links [5, 4].

    In the past, this wasted a significant amount of the computing and storage resources of search engine companies. In 2005 the damage caused by spam was estimated at 50 billion dollars; by 2009 the estimate had risen to 130 billion dollars [6].

    Among the newer challenges are the rapid growth and heterogeneity of the web, the simplification of content creation tools (for example, wikis and blogging platforms), and falling website maintenance costs (such as domain registration and web hosting), which together have driven the evolution of spam and the emergence of new strains of web spam that previously successful methods cannot detect, yet which achieve high-ranking results.

    For 85% of queries, users looked only at the first page of results and clicked at most three links [7]. The effort to appear on the first page of search results therefore has a clear economic incentive, since it increases website traffic; to achieve this, website owners try to manipulate search engine rankings. According to published studies, the proportion of spam varies from 6 to 22 percent, which indicates the scale of the problem [9, 8].

    Structure of the thesis: In the second chapter, the structure of the web, the concept and types of spam, and some of the most important machine learning methods are reviewed.
In the third chapter, the available data sets are introduced and techniques for dealing with link spam, content spam, and combined link-content spam are examined. In the fourth chapter, the selected data set is introduced and the results of the optimal data mining models are reported. The fifth and final chapter summarizes the results of the work and outlines issues that could be investigated as the subject of future master's theses.
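As a language-neutral illustration of the classification task described above (the thesis itself uses Weka on the UK2007 data set), a tiny hand-rolled Naïve Bayes classifier, one of the learning methods reviewed in Chapter Two, might look like this. The binary features and labels are invented for the sketch:

```python
# Illustrative sketch: Bernoulli Naive Bayes separating spam from non-spam
# hosts using binary features. Features and labels below are invented.
from collections import defaultdict

def train_nb(samples, labels, smoothing=1.0):
    """Estimate P(label) and P(feature=1 | label) with Laplace smoothing."""
    counts = defaultdict(int)                     # label -> sample count
    ones = defaultdict(lambda: defaultdict(int))  # label -> feature -> #ones
    for x, y in zip(samples, labels):
        counts[y] += 1
        for i, v in enumerate(x):
            if v:
                ones[y][i] += 1
    n_features = len(samples[0])
    priors = {y: c / len(samples) for y, c in counts.items()}
    likelihood = {y: [(ones[y][i] + smoothing) / (counts[y] + 2 * smoothing)
                      for i in range(n_features)]
                  for y in counts}
    return priors, likelihood

def classify(x, priors, likelihood):
    """Return the label with the highest posterior for feature vector x."""
    def score(y):
        p = priors[y]
        for i, v in enumerate(x):
            q = likelihood[y][i]
            p *= q if v else (1 - q)
        return p
    return max(priors, key=score)

# Invented binary features: [keyword stuffing, hidden text, many outlinks]
X = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
y = ["spam", "spam", "ham", "ham", "ham", "spam"]

priors, likelihood = train_nb(X, y)
print(classify([1, 1, 1], priors, likelihood))  # a stuffing-heavy page
```

In practice the feature vectors come from content and link statistics of each host, and the model is trained and cross-validated on labeled corpora such as UK2007 rather than on toy data.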

    Chapter Two:

    Web and web spam

    Given the topic of the thesis, it is necessary to examine the structure of the web and the types of spam, as well as the most important machine learning algorithms. This chapter therefore covers web concepts, then the types of spam, and finally machine learning algorithms.

    2-1- The World Wide Web:

    The World Wide Web can be viewed as a database maintained by humanity for storing and sharing various documents; however, the web differs from conventional databases in its size, very fast dynamics, and heterogeneity. First, the web is very large, and its size cannot be measured precisely: the number of pages is effectively unlimited, and content can depend on information entered by users. Gulli and Signorini reported that the web contained 11 billion pages in January 2005 [10]. In 2008, Google claimed that their system had processed one trillion URLs on the web [11].

    Another challenge is that web content changes rapidly. Cho and Garcia-Molina evaluated the rate of change by downloading 720,000 pages over a four-month period in 1999 [12]. They concluded that the contents of 23% of the collection were modified, and that within 50 days 50% of the collection was edited or removed [13].

    Web documents are heterogeneous for several reasons and from several perspectives. In addition to text, websites can contain images, videos, and audio files in various formats, and the size of these documents can range from one byte to thousands of megabytes. Even among the most common HTML files, one finds different versions and pages with incorrect syntax that do not follow W3C standards but are still rendered by web browsers. Web content is often unstructured, written in many languages and styles, and varies widely in quality. Although HTML pages contain some metadata, it is generally unreliable. The main goal of the semantic web is to put the data into a machine-friendly structured form, making cooperation between humans and machines possible [14].

  • Contents & References of Detection of web spam using data mining techniques

    List:

    Abstract

    Chapter One: Introduction
    1-1 Foreword
    1-2 Statement of the problem
    1-3 Importance and necessity of the research
    1-4 Structure of the thesis

    Chapter Two: Web and web spam
    2-1 The World Wide Web
    2-1-1 The web as a graph
    2-1-2 The web graph at page and host level
    2-1-3 Connectivity
    2-2 Search engines
    2-2-1 Architecture of web search engines
    2-2-2 The search engine query server
    2-3 Ranking
    2-3-1 Content-based ranking
    2-3-2 Link-based algorithms
    2-4 Web spam
    2-4-1 Content spam
    2-4-2 Link spam
    2-4-3 Hiding techniques
    2-5 Machine learning
    2-5-1 Naïve Bayes
    2-5-2 Decision tree
    2-5-3 Support vector machine
    2-6 Combining classifiers
    2-6-1 Bagging
    2-6-2 Boosting
    2-7 Evaluation methods
    2-7-1 Cross-validation
    2-7-2 Precision and recall
    2-7-3 ROC curve
    2-8 Summary

    Chapter Three: Research background
    3-1 Data sets used by researchers
    3-1-1 UK2006
    3-1-2 UK2007
    3-1-3 Data set collected using MSN Search
    3-1-4 DC2010
    3-2 Content-based studies
    3-3 Link-based methods
    3-3-1 Algorithms based on label propagation
    3-3-2 Functional ranking
    3-3-3 Link pruning and reweighting algorithms
    3-3-4 Algorithms based on label filtering
    3-4 Combined link- and content-based methods
    3-4-1 Studies based on feature reduction
    3-4-2 Studies based on combining classifiers
    3-4-3 Studies testing the importance of different features in spam detection
    3-4-4 Studies based on web topology
    3-4-5 Spam detection through analysis of linguistic models
    3-4-6 The impact of page language on web spam detection features
    3-4-7 Combining content-based and link-based features for Arabic pages
    3-5 Summary

    Chapter Four: Implementation of the proposed idea
    4-1 Introduction
    4-2 Features of the selected data set
    4-3 Pre-processing
    4-3-1 Pre-processing of the UK2007 data set
    4-3-2 Feature reduction by applying data mining algorithms
    4-4 Data mining and evaluation of models
    4-4-1 Results of the algorithms with feature reduction methods applied
    4-4-2 Comparison of F-measure values obtained from the algorithms on the features selected by feature reduction algorithms
    4-5 Interpretation of results
    4-6 Summary

    Chapter Five: Conclusions and future work
    5-1 Conclusion
    5-2 Future work

    Resources
    Appendices 1-14
    English abstract

    Resources:

    [1] Han, J., Kamber, M., 2001, “Data Mining: Concepts and Techniques”, Morgan Kaufman, San Francisco.

    [2] Abernethy, J., Chapelle, O., Castillo, C., Nov. 2010, "Graph regularization methods for web spam detection", Mach. Learn., Vol. 81.

    [3] http://searchengineland.com/businessweek-dives-deep-into-googles-search-quality-27317, 2011.

    [4] Eiron, N., McCurley, K. S., Tomlin, J. A., 2004, "Ranking the web frontier", In Proceedings of the 13th International Conference on World Wide Web, WWW'04, New York.

    [5] Page, L., Brin, S., Motwani, R., Winograd, T., 1998, "The PageRank citation ranking: Bringing order to the web", Technical Report 1999-66, Stanford University.

    [6] Jennings, R., 2005, “The global economic impact of spam”, Ferris Research.

    [7] Silverstein, C., Marais, H., Henzinger, M., Moricz, M., Sept. 1999, “Analysis of a very large web search engine query log”, SIGIR Forum, 33.

    [8] Benczúr, A. A., Csalogány, K., Sarlós, T., Uher, M., May 2005, "SpamRank: Fully automatic link spam detection work in progress", In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb'05.

    [9] Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F., 2007, "Know your neighbors: web spam detection using the web topology", In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07, Amsterdam, The Netherlands.

    [10] Gulli, A., Signorini, A., 2005, "The indexable web is more than 11.5 billion pages", In Proceedings of the 14th World Wide Web Conference (WWW), Special interest tracks and posters, pages 902-903.

    [11] The Official Google Blog, 2008.

    [12] Cho, J., Garcia-Molina, H., 2000, "The evolution of the web and implications for an incremental crawler", In The VLDB Journal, pages 200-209.

    [13] Bar-Yossef, Z., Broder, A. Z., Kumar, R., Tomkins, A., 2004, "Sic transit gloria telae: Towards an understanding of the web's decay", In Proceedings of the 13th World Wide Web Conference (WWW), pages 328-337, ACM Press.

    [14] Berners-Lee, T., Hendler, J., Lassila, O., 2001, "The semantic web", Scientific American.

     

    [15] Davison, B. D., 2000, "Recognizing nepotistic links on the web", In AAAI-2000 Workshop on Artificial Intelligence for Web Search, pages 23-28, Austin, TX.

     

    [16] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J., 2000, "Graph structure in the web", In Proceedings of the 9th World Wide Web Conference (WWW), pages 309-320, North-Holland Publishing Co.

    [17] Silverstein, C., Marais, H., Henzinger, M., Moricz, M., 1999, "Analysis of a very large web search engine query log", SIGIR Forum, 33(1):6-12.

    [18] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., Raghavan, S., August 2001, "Searching the web", ACM Transactions on Internet Technology (TOIT), 1(1):2-43.

    [19] Brin, S., Page, L., 1998, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 30(1-7):107-117.

    [20] Risvik, K. M., Michelsen, R., 2002," Search engines and web dynamics", Computer Networks, 39(3):289-302.

    [21] Gyongyi, Z., Garcia-Molina, H., 2004, "Web Spam Taxonomy", Technical Report, Stanford University.

    [22] Baeza-Yates, R., Ribeiro-Neto, B., 1999, "Modern Information Retrieval", Addison-Wesley, Boston.

    [23] Taylor Graham Publishing, London, UK.

    [24] Csalogány, K., 2009, "Methods for Web Spam Filtering", Technical Report, Eötvös Loránd University.

     

    [25] Salton, G., Buckley, C., 1988, “Term-weighting approaches in automatic text retrieval”, Information Processing & Management, 24(5):513-523.

    [26] Page, L., Brin, S., Motwani, R., Winograd, T., 1998, “The PageRank citation ranking: Bringing order to the web”, Technical Report 1999-66, Stanford University.

    [27] Motwani, R., Raghavan, P., 1995, "Randomized Algorithms", Cambridge University Press.

    [28] Brin, S., Page, L., Apr.
