Detection of web spam using data mining techniques

Number of pages: 95 | File format: Word | File code: 31058
Year: 2014 | Degree: Master's | Category: Computer Engineering
  • Summary of Detection of web spam using data mining techniques

    Master's Thesis (M.Sc.)

    Abstract:

    Nowadays, web spam [1] is one of the main problems facing search engines, because it degrades the quality of search results. In recent years there have been many advances in detecting fake pages, but new spamming techniques have emerged in response, so anti-spam techniques must be improved continually to counter these attacks.

    A common problem in this field is that many documents are ranked high by the search engine even though they do not deserve it. Given the ever-increasing expansion of the web and the emergence of new spam techniques, the purpose of this thesis is to investigate data mining methods that better separate spam pages from non-spam ones.

    Data mining algorithms and software are among the tools used in this research. The standard UK2007 data set and the Weka software have been used to build optimal models, and the aim is to provide models that reduce the number of features needed to separate spam pages from non-spam ones while maintaining good performance. Data mining is a process for extracting useful patterns from databases [1]; it can extract the patterns its users need from different types of databases. Most researchers consider data mining synonymous with knowledge discovery in databases. Knowledge discovery comprises the following steps, performed sequentially:

    Data refinement: removes noisy and inconsistent data.

    Data integration: combines data sources when necessary.

    Data transformation: transforms data into a suitable form for data mining.

    Data mining: the essential step, in which intelligent methods are applied to extract patterns from the data.

    Pattern evaluation: evaluates the extracted patterns.

    Knowledge display: In this stage, different knowledge display techniques are used to show the discovered and explored knowledge to the user.
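The sequential steps above can be sketched, purely for illustration, in a few lines of Python. The records, feature, and threshold below are invented for demonstration and are not taken from the thesis or the UK2007 data set.

```python
# A minimal, purely illustrative sketch of the sequential KDD steps.
# The records and the "many outlinks" threshold are invented examples.

# Two hypothetical data sources to be integrated.
source_a = [{"host": "a.example", "links": 120,  "spam": 1},
            {"host": "b.example", "links": 8,    "spam": 0},
            {"host": "c.example", "links": None, "spam": 0}]  # inconsistent row
source_b = [{"host": "d.example", "links": 310,  "spam": 1}]

# 1. Data refinement: remove noisy/inconsistent records.
cleaned = [r for r in source_a if r["links"] is not None]

# 2. Data integration: combine the data sources.
integrated = cleaned + source_b

# 3. Data transformation: derive a form suitable for mining
#    (here, a single boolean feature: unusually many outlinks).
transformed = [{"many_links": r["links"] > 100, "spam": r["spam"]}
               for r in integrated]

# 4. Data mining: extract a trivial pattern (an association between
#    the derived feature and the spam label).
matches = sum(1 for r in transformed if r["many_links"] == bool(r["spam"]))

# 5. Pattern evaluation: measure how well the pattern holds.
accuracy = matches / len(transformed)

# 6. Knowledge display: present the discovered pattern to the user.
print(f"'many_links predicts spam' holds on {accuracy:.0%} of records")
```

Real systems replace each step with far heavier machinery (ETL tools, feature extractors, learning algorithms), but the sequential structure is the same.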

    With the growing capability of techniques and tools for creating and collecting data, the importance of databases in industry and research owing to their availability and power, and the role of the World Wide Web as a major source of information, we are faced with an enormous volume of data and databases.

    Although search engines have developed many techniques to identify web spam, spammers have devised new tactics to influence the results of search engine ranking algorithms in order to achieve higher ranks.

    Data mining, as an important and relatively new tool, is widely used to separate spam pages from non-spam ones.

    Statement of the problem:

    Search engines have become a place to search for information on the web. Considering the phenomenon of spam, the search results are not always favorable.

    For more than two decades, research on adversarial information retrieval has attracted many people in academia and industry. Spam has cast a shadow over every information system: e-mail, the web, blogs, and social networks. The concept was first proposed in 1996 and soon presented a challenge for search engines. Recently, all major search engine companies have made adversarial information retrieval a high priority because of the many negative effects caused by spam [3, 2]. First, spam degrades the quality of search results and reduces the visibility that legitimate sites would have in its absence.

    Second, it erodes the user's confidence in the search engine and ultimately drives the user to a competing engine, a switch that costs the user nothing.

    The goal is to determine various characteristics of web pages for ranking search engine results and, based on these, to perform classification that separates spam sites from legitimate ones. Beyond ranking manipulation, spam also raises problems related to adult content, malware, and attacks. For example, a ranking of 100 million pages by the PageRank algorithm showed that 11 of the top 20 results were pornographic sites that reached those positions by manipulating content and links [5, 4].

    In the past, this wasted a significant amount of the computing and storage resources of search engine companies. In 2005 the damage caused by spam was estimated at 50 billion dollars; by 2009 the estimate had risen to 130 billion dollars [6].

    Among the newer challenges are the rapid growth and heterogeneity of the web, the simplification of content creation tools (for example, wikis and blogging platforms), and falling website maintenance costs (such as domain registration and web hosting), which together have driven the evolution of spam and the emergence of new strains of web spam that previously successful methods cannot detect, yet which achieve high-ranking results.

    For 85% of queries, users looked only at the first page of results and clicked at most three links [7]. The effort to appear on the first page of search results therefore has a clear economic incentive, since it increases website traffic; to achieve this, website owners try to manipulate search engine rankings. According to published studies, the proportion of spam varies from 6 to 22 percent, which indicates the scale of the problem [9, 8].

    Structure of the thesis: In the second chapter, the structure of the web, the concept and types of spam, and some of the most important machine learning methods are reviewed.
In the third chapter, the available data sets are introduced and techniques for dealing with link spam, content spam, and combined link-content spam are examined. In the fourth chapter, the selected data set is introduced and the results of the optimal data mining models are reported. The fifth and final chapter summarizes the results of the work and outlines issues that could be investigated as the subject of future master's theses.
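As a language-neutral illustration of the classification task described above (the thesis itself uses Weka on the UK2007 data set), a tiny hand-rolled Naïve Bayes classifier, one of the learning methods reviewed in Chapter Two, might look like this. The binary features and labels are invented for the sketch:

```python
# Illustrative sketch: Bernoulli Naive Bayes separating spam from non-spam
# hosts using binary features. Features and labels below are invented.
from collections import defaultdict

def train_nb(samples, labels, smoothing=1.0):
    """Estimate P(label) and P(feature=1 | label) with Laplace smoothing."""
    counts = defaultdict(int)                     # label -> sample count
    ones = defaultdict(lambda: defaultdict(int))  # label -> feature -> #ones
    for x, y in zip(samples, labels):
        counts[y] += 1
        for i, v in enumerate(x):
            if v:
                ones[y][i] += 1
    n_features = len(samples[0])
    priors = {y: c / len(samples) for y, c in counts.items()}
    likelihood = {y: [(ones[y][i] + smoothing) / (counts[y] + 2 * smoothing)
                      for i in range(n_features)]
                  for y in counts}
    return priors, likelihood

def classify(x, priors, likelihood):
    """Return the label with the highest posterior for feature vector x."""
    def score(y):
        p = priors[y]
        for i, v in enumerate(x):
            q = likelihood[y][i]
            p *= q if v else (1 - q)
        return p
    return max(priors, key=score)

# Invented binary features: [keyword stuffing, hidden text, many outlinks]
X = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
y = ["spam", "spam", "ham", "ham", "ham", "spam"]

priors, likelihood = train_nb(X, y)
print(classify([1, 1, 1], priors, likelihood))  # a stuffing-heavy page
```

In practice the feature vectors come from content and link statistics of each host, and the model is trained and cross-validated on labeled corpora such as UK2007 rather than on toy data.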

    Chapter Two:

    Web and web spam

    Given the topic of the thesis, it is necessary to examine the structure of the web and the types of spam, as well as the most important machine learning algorithms. This chapter therefore covers web concepts, then the types of spam, and finally machine learning algorithms.

    2-1- The World Wide Web:

    The World Wide Web can be viewed as a database maintained by humanity for storing and sharing various documents; however, the web differs from conventional databases in its size, very fast dynamics, and heterogeneity. First, the web is very large, and its size cannot be measured precisely: the number of pages is effectively unlimited, and content can depend on information entered by users. Gulli and Signorini reported that the web contained 11 billion pages in January 2005 [10]. In 2008, Google claimed that their system had processed one trillion URLs on the web [11].

    Another challenge is that web content changes rapidly. Cho and Garcia-Molina evaluated the rate of change by downloading 720,000 pages over a four-month period in 1999 [12]. They concluded that the contents of 23% of the collection were modified, and that within 50 days 50% of the collection was edited or removed [13].

    Web documents are heterogeneous for several reasons and from several perspectives. In addition to text, websites can contain images, videos, and audio files in various formats, and the size of these documents can range from one byte to thousands of megabytes. Even among the most common HTML files, one finds different versions and pages with incorrect syntax that do not follow W3C standards but are still rendered by web browsers. Web content is often unstructured, written in many languages and styles, and varies widely in quality. Although HTML pages contain some metadata, it is generally unreliable. The main goal of the semantic web is to put the data into a machine-friendly structured form, making cooperation between humans and machines possible [14].

  • Contents & References of Detection of web spam using data mining techniques

    List:

    Abstract

    Chapter One: Introduction
    1-1 Foreword
    1-2 Statement of the problem
    1-3 Importance and necessity of the research
    1-4 Structure of the thesis

    Chapter Two: Web and web spam
    2-1 The World Wide Web
    2-1-1 The web as a graph
    2-1-2 The web graph at page and host level
    2-1-3 Connectivity
    2-2 Search engines
    2-2-1 Architecture of web search engines
    2-2-2 The search engine query server
    2-3 Ranking
    2-3-1 Content-based ranking
    2-3-2 Link-based algorithms
    2-4 Web spam
    2-4-1 Content spam
    2-4-2 Link spam
    2-4-3 Hiding techniques
    2-5 Machine learning
    2-5-1 Naïve Bayes
    2-5-2 Decision tree
    2-5-3 Support vector machine
    2-6 Combining classifiers
    2-6-1 Bagging
    2-6-2 Boosting
    2-7 Evaluation methods
    2-7-1 Cross-validation
    2-7-2 Precision and recall
    2-7-3 ROC curve
    2-8 Summary

    Chapter Three: Research background
    3-1 Data sets used by researchers
    3-1-1 UK2006
    3-1-2 UK2007
    3-1-3 Data set collected using MSN Search
    3-1-4 DC2010
    3-2 Content-based studies
    3-3 Link-based methods
    3-3-1 Algorithms based on label propagation
    3-3-2 Functional ranking
    3-3-3 Link pruning and reweighting algorithms
    3-3-4 Algorithms based on label filtering
    3-4 Combined link- and content-based methods
    3-4-1 Studies based on feature reduction
    3-4-2 Studies based on combining classifiers
    3-4-3 Studies testing the importance of different features in spam detection
    3-4-4 Studies based on web topology
    3-4-5 Spam detection through analysis of linguistic models
    3-4-6 The impact of page language on web spam detection features
    3-4-7 Combining content-based and link-based features for Arabic pages
    3-5 Summary

    Chapter Four: Implementation of the proposed idea
    4-1 Introduction
    4-2 Features of the selected data set
    4-3 Pre-processing
    4-3-1 Pre-processing of the UK2007 data set
    4-3-2 Feature reduction by applying data mining algorithms
    4-4 Data mining and evaluation of models
    4-4-1 Results of the algorithms with feature reduction methods applied
    4-4-2 Comparison of F-measure values obtained from the algorithms on the features selected by feature reduction algorithms
    4-5 Interpretation of results
    4-6 Summary

    Chapter Five: Conclusions and future work
    5-1 Conclusion
    5-2 Future work

    Resources
    Appendices 1-14
    English abstract

    Resources:

    [1] Han, J., Kamber, M., 2001, “Data Mining: Concepts and Techniques”, Morgan Kaufman, San Francisco.

    [2] Abernethy, J., Chapelle, O., Castillo, C., Nov. 2010, "Graph regularization methods for web spam detection", Mach. Learn., Vol. 81.

    [3] http://searchengineland.com/businessweek-dives-deep-into-googles-search-quality-27317, 2011.

    [4] Eiron, N., McCurley, K. S., Tomlin, J. A., 2004, "Ranking the web frontier", In Proceedings of the 13th International Conference on World Wide Web, WWW'04, New York.

    [5] Page, L., Brin, S., Motwani, R., Winograd, T., 1998, "The PageRank citation ranking: Bringing order to the web", Technical Report 1999-66, Stanford University.

    [6] Jennings, R., 2005, “The global economic impact of spam”, Ferris Research.

    [7] Silverstein, C., Marais, H., Henzinger, M., Moricz, M., Sept. 1999, “Analysis of a very large web search engine query log”, SIGIR Forum, 33.

    [8] Benczúr, A. A., Csalogány, K., Sarlós, T., Uher, M., May 2005, "SpamRank: Fully automatic link spam detection work in progress", In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb'05.

    [9] Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F., 2007, "Know your neighbors: web spam detection using the web topology", In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07, Amsterdam, The Netherlands.

    [10] Gulli, A., Signorini, A., 2005, "The indexable web is more than 11.5 billion pages", In Proceedings of the 14th World Wide Web Conference (WWW), Special interest tracks and posters, pages 902-903.

    [11] The Official Google Blog, 2008.

    [12] Cho, J., Garcia-Molina, H., 2000, "The evolution of the web and implications for an incremental crawler", In The VLDB Journal, pages 200-209.

    [13] Bar-Yossef, Z., Broder, A. Z., Kumar, R., Tomkins, A., 2004, "Sic transit gloria telae: Towards an understanding of the web's decay", In Proceedings of the 13th World Wide Web Conference (WWW), pages 328-337, ACM Press.

    [14] Berners-Lee, T., Hendler, J., Lassila, O., 2001, "The semantic web", Scientific American.

     

    [15] Davison, B. D., 2000, "Recognizing nepotistic links on the web", In AAAI-2000 Workshop on Artificial Intelligence for Web Search, pages 23-28, Austin, TX.

     

    [16] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J., 2000, "Graph structure in the web", In Proceedings of the 9th World Wide Web Conference (WWW), pages 309-320, North-Holland Publishing Co.

    [17] Silverstein, C., Marais, H., Henzinger, M., Moricz, M., 1999, "Analysis of a very large web search engine query log", SIGIR Forum, 33(1):6-12.

    [18] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., Raghavan, S., August 2001, "Searching the web", ACM Transactions on Internet Technology (TOIT), 1(1):2-43.

    [19] Brin, S., Page, L., 1998, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 30(1-7):107-117.

    [20] Risvik, K. M., Michelsen, R., 2002," Search engines and web dynamics", Computer Networks, 39(3):289-302.

    [21] Gyongyi, Z., Garcia-Molina, H., 2004, "Web Spam Taxonomy", Technical Report, Stanford University.

    [22] Baeza-Yates, R., Ribeiro-Neto, B., 1999, "Modern Information Retrieval", Addison-Wesley, Boston.

    [23] Taylor Graham Publishing, London, UK.

    [24] Csalogány, K., 2009, "Methods for Web Spam Filtering", Technical Report, Eötvös Loránd University.

     

    [25] Salton, G., Buckley, C., 1988, “Term-weighting approaches in automatic text retrieval”, Information Processing & Management, 24(5):513-523.

    [26] Page, L., Brin, S., Motwani, R., Winograd, T., 1998, “The PageRank citation ranking: Bringing order to the web”, Technical Report 1999-66, Stanford University.

    [27] Motwani, R., Raghavan, P., 1995, "Randomized Algorithms", Cambridge University Press.

    [28] Brin, S., Page, L., Apr.
