Contents & References of Detection of web spam using data mining techniques
Contents:
Abstract-1
Chapter One: Introduction-2
1-1 Foreword-3
1-2 Statement of the problem-3
1-3 Importance and necessity of doing research-4
1-4 Structure of the thesis-5
Chapter Two: Web and web spam-6
2-1 World Wide Web-7
2-1-1 Web as graph-8
2-1-2 Web graph on page and host level-8
2-1-3 Connection-9
2-2 Search engines-10
2-2-1 Architecture of web search engines-11
2-2-2 Search engine query server-13
2-3 Ranking-13
2-3-1 Content-based ranking-13
2-3-2 Link-based algorithms-15
2-4 Web spam-19
2-4-1 Content spam-20
2-4-2 Link spam-22
2-4-3 Hiding techniques-27
2-5 Machine Learning-29
2-5-1 Naïve Bayes-30
2-5-2 Decision Tree-31
2-5-3 Support Vector Machine-33
2-6 Combining Classifiers-35
2-6-1 Bagging-35
2-6-2 Boosting-36
2-7 Evaluation methods-37
2-7-1 Cross-validation-38
2-7-2 Precision and recall-38
2-7-3 ROC curve-39
2-8 Summary-40
Chapter Three: Research background-41
3-1 Data sets used by researchers-42
3-1-1 UK2006-42
3-1-2 UK2007-43
3-1-3 Data set collected using MSN search-44
3-1-4 DC2010-44
3-2 Content-based studies-47
3-3 Link-based methods-51
3-3-1 Algorithms based on label propagation-51
3-3-2 Functional ranking-55
3-3-3 Link pruning and reweighting algorithms-56
3-3-4 Algorithms based on label refinement-57
3-4 Link and content-based methods-58
3-4-1 Studies based on feature reduction-57
3-4-2 Studies based on combination of classifiers-59
3-4-3 Studies based on testing the importance of different features in spam detection-63
3-4-4 Studies based on web topology-71
3-4-5 Spam detection through analysis of language models-76
3-4-6 The impact of page language on web spam detection features-79
3-4-7 The approach of combining content-based and link-based features for Arabic pages-82
3-5 Summary-83
Chapter Four: Implementation of the proposed idea-85
4-1 Introduction-86
4-2 Features of the selected data set-87
4-3 Preprocessing-92
4-3-1 Preprocessing of the UK2007 data set-93
4-3-2 Feature reduction by applying data mining algorithms-93
4-4 Data mining and evaluation of models-96
4-4-1 Results of the algorithms after applying feature reduction methods-102
4-4-2 Comparison of the F-measure values obtained by the algorithms on the features selected by the feature reduction algorithms-109
4-5 Interpretation of results-110
4-6 Summary-114
Chapter Five: Conclusions and future work-115
5-1 Conclusion-116
5-2 Future work-117
References-118
Appendix 1-125
Appendix 2-126
Appendix 3-126
Appendix 4-127
Appendix 5-127
Appendix 6-128
Appendix 7-129
Appendix 8-129
Appendix 9-129
Appendix 10-130
Appendix 11-130
Appendix 12-131
Appendix 13-132
Appendix 14-133
English abstract-134
References:
[1] Han, J., Kamber, M., 2001, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, San Francisco.
[2] Abernethy, J., Chapelle, O., Castillo, C., Nov. 2010, “Graph regularization methods for web spam detection”, Mach. Learn., Vol. 81.
[3] http://searchengineland.com/businessweek-dives-deep-into-googles-search-quality-27317, 2011.
[4] Eiron, N., McCurley, K. S., Tomlin, J. A., 2004, “Ranking the web frontier”, In Proceedings of the 13th International Conference on World Wide Web, WWW'04, New York.
[5] Page, L., Brin, S., Motwani, R., Winograd, T., 1998, "The PageRank citation ranking: Bringing order to the web".
[6] Jennings, R., 2005, “The global economic impact of spam”, Ferris Research.
[7] Silverstein, C., Marais, H., Henzinger, M., Moricz, M., Sept. 1999, “Analysis of a very large web search engine query log”, SIGIR Forum, 33.
[8] Benczúr, A. A., Csalogány, K., Sarlós, T., Uher, M., May 2005, “Spamrank: Fully automatic link spam detection work in progress”, In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb'05.
[9] Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F., 2007, "Know your neighbors: web spam detection using the web topology", In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07, Amsterdam, The Netherlands.
[10] Gulli, A., Signorini, A., 2005, "The indexable web is more than 11.5 billion pages", In Proceedings of the 14th World Wide Web Conference (WWW), Special interest tracks and posters, pages 902-903.
[11] The Official Google Blog, 2008.
[12] Cho, J., Garcia-Molina, H., 2000, "The evolution of the web and implications for an incremental crawler", In The VLDB Journal, pages 200-209.
[13] Bar-Yossef, Z., Broder, A. Z., Kumar, R., Tomkins, A., 2004, "Sic transit gloria telae: Towards an understanding of the web's decay", In Proceedings of the 13th World Wide Web Conference (WWW), pages 328-337. ACM Press.
[14] Berners-Lee, T., Hendler, J., Lassila, O., 2001, "The semantic web", Scientific American.
[15] Davison, B. D., 2000, "Recognizing nepotistic links on the web", In AAAI-2000 Workshop on Artificial Intelligence for Web Search, pages 23-28, Austin, TX.
[16] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J., 2000, "Graph structure in the web", In Proceedings of the 9th World Wide Web Conference (WWW), pages 309-320. North-Holland Publishing Co.
[17] Silverstein, C., Marais, H., Henzinger, M., Moricz, M., 1999, “Analysis of a very large web search engine query log”, SIGIR Forum, 33(1):6–12.
[18] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., Raghavan, S., August 2001, "Searching the web", ACM Transactions on Internet Technology (TOIT), 1(1):2-43.
[19] Brin, S., Page, L., 1998, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 30(1-7):107-117.
[20] Risvik, K. M., Michelsen, R., 2002, "Search engines and web dynamics", Computer Networks, 39(3):289-302.
[21] Gyongyi, Z., Garcia-Molina, H., 2004, "Web Spam Taxonomy", Technical Report, Stanford University.
[22] Baeza-Yates, R., Ribeiro-Neto, B., 1999, “Modern Information Retrieval”, Addison-Wesley, Boston.
Taylor Graham Publishing, London, UK.
[24] Csalogány, K., 2009, “Methods for Web Spam Filtering”, Technical Report, Eötvös Loránd University.
[25] Salton, G., Buckley, C., 1988, “Term-weighting approaches in automatic text retrieval”, Information Processing & Management, 24(5):513-523.
[26] Page, L., Brin, S., Motwani, R., Winograd, T., 1998, “The PageRank citation ranking: Bringing order to the web”, Technical Report 1999-66, Stanford University.
[27] Motwani, R., Raghavan, P., 1995, "Randomized Algorithms", Cambridge University Press.
[28] Brin, S., Page, L., Apr.