Optimizing the link importance detection method in the link database and its application in the architecture of search engines

Number of pages: 119 File Format: word File Code: 31052
Year: 2014 University Degree: Master's degree Category: Computer Engineering
  • Summary of Optimizing the link importance detection method in the link database and its application in the architecture of search engines

    Computer-Software Engineering Master Thesis (M.Sc)

    Abstract

    In the information age, the web has become one of the most powerful and fastest means of communication and interaction between people. Search engines, as web applications, automatically navigate the web and retrieve the set of available documents. The process of fetching, storing, classifying and indexing is carried out automatically by semi-intelligent algorithms. Although many facts about the structure of these applications remain hidden as trade secrets, the research literature on search engines and information retrieval tools seeks the best solutions for the optimal performance of each module in the search engine architecture. Given the limited time of today's web users, providing them with the most relevant and up-to-date documents is often the most important challenge for search engines. To achieve this, every module in the search engine architecture should be designed intelligently so that it not only provides relevant documents but also responds as quickly as possible. Among these modules there is a sensitive and vital component called the crawler. One of the debatable issues in search engine performance optimization is reconfiguring the crawling policy so that relevant external links, i.e. those pointing to content related to the source pages, are followed. The crawler module is responsible for fetching pages for the ranking module; if the crawler indexes higher-quality pages with less topic deviation, ranking can be done faster.

    Considering the structure of the web as a graph, navigating the web takes the form of graph search. In this research, by experimentally applying different graph search methods and various combinations of them, by issuing queries to the Google search engine to measure the quality of the retrieved pages, and by keeping the navigation depth constant, the best method with reasonable time and space complexity is identified for use in the crawler section of the search engine architecture.
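
    As a concrete illustration of treating the web as a graph, the sketch below crawls a small in-memory link graph with either a breadth-first or a depth-first ordering while keeping the navigation depth fixed. The WEB dictionary and the crawl function are illustrative assumptions for this sketch, not code from the thesis.

```python
from collections import deque

# Toy web graph used only for illustration: URL -> outgoing links.
WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["a.com"],
    "d.com": [],
}

def crawl(seeds, max_depth=2, breadth_first=True):
    """Traverse the link graph up to a fixed depth, breadth-first or depth-first."""
    frontier = deque((url, 0) for url in seeds)   # (url, depth) pairs
    visited = set(seeds)                          # never enqueue the same URL twice
    order = []                                    # order in which pages are fetched
    while frontier:
        # BFS takes the oldest frontier entry, DFS the newest one.
        url, depth = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        if depth == max_depth:
            continue                              # keep the navigation depth constant
        for link in WEB.get(url, []):
            if link not in visited:
                visited.add(link)
                frontier.append((link, depth + 1))
    return order

print(crawl(["a.com"], breadth_first=True))       # breadth-first visit order
print(crawl(["a.com"], breadth_first=False))      # depth-first visit order
```

    In a real crawler the WEB lookup would be replaced by downloading the page over the network and extracting its out-links.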

    Keywords: web crawler, graph navigation, search engines, topic deviation.

    Introduction

    Without search engines, the World Wide Web would be almost useless. But how do search engines find the information we need among all these websites? The internet is very vast: the number of web users is estimated at around two billion, there are at least 250 million websites, and together they contain about 30 billion web pages. Searching the web[1] when it was very small and there were very few websites was usually reserved for researchers and university professors, and it could be considered a difficult task [9]. It was in the early nineties that the first search engine, called Archie[2], appeared. As a first step, a search engine must collect and classify information before displaying results to the user. Therefore, search engines should crawl as many websites as possible and store and classify page addresses along with a summary of each page's contents. This task is very heavy and is performed by web crawlers[3] [53].

    These programs automatically traverse the web and save the contents of web pages for later analysis. Since the number of pages and their volume are very high, this work is done on a very large scale and requires high bandwidth and a lot of time. Popular search engines have already built huge repositories of web pages, but a new crawler has to start from scratch. To begin, crawlers usually visit well-known directories, because through them they can access a large list of related sites; by browsing these websites, the web crawler dives deeper into the internal space of the websites and obtains more information. All of this information is stored in the repository to be analyzed later [44].

    A well-designed crawler can fetch the content of web pages at high speed, while all crawlers search the web under the control of a coordinating program so that the same pages are not crawled repeatedly. This coordinator also maintains the freshness of the pages, so that their latest versions are included in the search engine database [46].

    After the crawlers collect the information on web pages, this information must be stored on the search engine's servers. Storing and indexing the countless pages on the web is a big challenge, but it is even more important that the search engine understands what its users are looking for. The more closely the information displayed by a search engine matches the phrase searched by the user, the better the performance and popularity of the search engine.
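
    To make the storing-and-matching step more concrete, the following minimal sketch builds an inverted index over two toy documents and answers a query with simple AND semantics. Real indexers add stemming, stop-word removal and ranking; the document texts and names here are assumptions made only for illustration.

```python
from collections import defaultdict

# Toy document collection: page id -> page text.
documents = {
    "page1": "web crawlers download web pages",
    "page2": "search engines rank web pages by relevance",
}

inverted_index = defaultdict(set)          # term -> set of page ids containing it
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def search(query):
    """Return pages containing every term of the query (simple AND semantics)."""
    results = set(documents)
    for term in query.lower().split():
        results &= inverted_index.get(term, set())
    return results

print(search("web pages"))                 # {'page1', 'page2'}
print(search("search engines"))            # {'page2'}
```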

    What makes a website rank higher in the results of a search engine is, in fact, the search engine's page-ranking algorithm. This algorithm is a complex set of rules and considerations which, of course, is constantly being optimized to show better results to users. The better a search engine's algorithm works, the better the results it provides to users, and therefore the success of a search engine is determined by its architecture and the type of search algorithm it uses. Search engines evaluate entire pages based on the words they contain. The importance of a website has a strong effect on its ranking: if many sites link to a certain page, the search engine concludes that the page is important and pays more attention to it. The more links from other sites point to a site, the more important and authoritative that website is.

    Moreover, if a website that itself has a high ranking links to another website, that single link will be more valuable than many ordinary links [35].
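
    The intuition that a link from a highly ranked page is worth more than many ordinary links is the basis of link-analysis scores such as PageRank, which is cited in this thesis's references. The sketch below runs a plain power iteration on a toy link graph; the graph, the damping factor and the iteration count are illustrative choices, not values taken from the thesis.

```python
# Toy link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Compute a PageRank-style importance score by power iteration."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing) if outgoing else 0.0
            for target in outgoing:
                new_rank[target] += damping * share   # each out-link passes on rank
        rank = new_rank
    return rank

# 'c' ends up with the highest score: it is linked from 'a', 'b' and 'd',
# and the single link it receives from the highly ranked 'a' counts for more
# than several links from low-ranked pages would.
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))
```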

    1-2 Statement of the problem

    A web crawler is a program that downloads web pages, generally for a web search engine. The crawlers of major search engines such as Google and AltaVista download a significant portion of the textual web pages in order to build content indexes. Other crawlers may visit many pages as well but use only certain kinds of information, such as email addresses. At the other end of the spectrum are personalized crawlers that crawl a particular user's favorite pages to build a quick-access cache. Designing a good crawler presents many challenges because the web is vast and constantly changing. According to various studies, more than a million new pages appear on the web, this growth rate is expected to continue, and existing pages are constantly being updated [5]. Implementing an efficient web crawler is difficult precisely because the bandwidth available for crawling is neither infinite nor free. It is therefore necessary to crawl the web not only at scale but also efficiently, so that an acceptable level of quality and freshness of the web pages is maintained. The operator of a web crawler must accordingly define its behavior: which algorithm it uses to download higher-quality pages, how pages are selected for re-crawling, and how to avoid creating overhead on the websites it visits.
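
    One concrete way to avoid creating overhead on the websites being crawled is to enforce a minimum delay between successive requests to the same host. The sketch below is a minimal illustration of such a politeness rule; the two-second delay, the example URLs and the print call standing in for the actual download are all assumptions made for this sketch.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

MIN_DELAY = 2.0                                  # seconds between hits to one host
last_hit = defaultdict(lambda: float("-inf"))    # host -> time of the last request

def wait_for_host(url):
    """Sleep just long enough to respect the per-host politeness delay."""
    host = urlparse(url).netloc
    pause = MIN_DELAY - (time.monotonic() - last_hit[host])
    if pause > 0:
        time.sleep(pause)                        # back off before hitting the host again
    last_hit[host] = time.monotonic()

for url in ["http://a.com/1", "http://a.com/2", "http://b.com/1"]:
    wait_for_host(url)
    print("fetch", url)                          # the real crawler would download here
```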

    Given the current size of the web, the crawler has to restrict itself to a fraction of the web, and that fraction should have higher content quality. Even today's major search engines crawl only a fraction of the pages on the web; however, the crawler must crawl the fraction of pages that is relevant to the topic rather than just random pages, i.e. pages must be selected based on their importance. The importance of a web page depends on the number of links pointing to it or the number of visits it receives [23]. In order to visit pages according to their importance, the web crawler must use a strong strategy for detecting page quality. In this research, to choose a suitable strategy, all graph navigation and crawling strategies were tested. While examining the various methods available for recognizing the importance of links, this research proposes a solution and an algorithm for optimizing link-importance detection.
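
    As an illustration of importance-based page selection, the sketch below orders the crawl frontier with a priority queue keyed by the number of in-links discovered so far, which is one simple importance signal. The toy graph and the scoring choice are assumptions for illustration; the thesis compares several such strategies rather than prescribing this one.

```python
import heapq

# Toy link graph: page -> pages it links to.
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}

def best_first_crawl(seed):
    """Visit pages in order of how many known pages link to them (best-first)."""
    in_links = {seed: 0}                     # URL -> number of discovered in-links
    frontier = [(0, seed)]                   # min-heap of (-importance, url)
    visited = set()
    order = []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue                         # stale heap entry with an outdated score
        visited.add(url)
        order.append(url)
        for link in WEB.get(url, []):
            in_links[link] = in_links.get(link, 0) + 1
            if link not in visited:
                heapq.heappush(frontier, (-in_links[link], link))
    return order

print(best_first_crawl("a"))                 # pages with more in-links are fetched earlier
```

    The score used here could equally be a PageRank estimate or a topical-relevance measure; the priority-queue structure of the frontier stays the same.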

  • Contents & References of Optimizing the link importance detection method in the link database and its application in the architecture of search engines

    List:

    Abstract
    Chapter One: Generalities
    1-1 Introduction
    1-2 Statement of the problem
    1-3 The importance and necessity of conducting the research
    1-4 Thesis structure
    Chapter Two: Fundamentals and Basic Concepts
    2-1 Introduction
    2-2 Types of search engines
        2-2-1 Keyword search engines
        2-2-2 Subject-directory search engines
        2-2-3 Crawler-based search engines
            2-2-3-1 Difference between directory engines and crawler-based engines
        2-2-4 Hybrid search engines
        2-2-5 Meta search engines
            2-2-5-1 List of search engines
            2-2-5-2 Sequential search
            2-2-5-3 Simultaneous search
        2-2-6 Smart search engines
        2-2-7 Cost-based search engines
    2-3 Architecture of search engines
    2-4 Architectural components of search engines
    2-5 Repository update strategies
        2-5-1 Batch or permanent crawler
        2-5-2 Partial or full crawls
    2-6 The two main files of the indexing unit
    2-7 An example of how a search engine works
    2-8 Search engine steps
        2-8-1 Data preprocessing
        2-8-2 Prioritization of results
    2-9 Tags
        2-9-1 Descriptive text tags
        2-9-2 The alt tag
    2-10 The robots.txt file
    2-11 Position and distance
    2-12 Crawler problems
    2-13 Search engine optimization methods
        2-13-1 Indexing
        2-13-2 Preventing crawling and the robots exclusion standard
        2-13-3 Increasing importance
    2-14 Ranking algorithms
        2-14-1 Ranking parameters
        2-14-2 Giving weight to words
        2-14-3 Evaluation of keywords
        2-14-4 Weighting parameters
        2-14-5 Tolerant retrieval
        2-14-6 A general algorithm for finding spelling errors in search engines
        2-14-7 Spelling mistakes
        2-14-8 Edit distance algorithm
        2-14-9 K-gram proximity algorithm
        2-14-10 Context-sensitive error detection
        2-14-11 The concept of relevance
            2-14-11-1 Relevance from the user's point of view
            2-14-11-2 Relevance from the retrieval system's point of view
        2-14-12 Asking for the user's opinion in ranking
        2-14-13 Major search engines
            2-14-13-1 Google
            2-14-13-2 Excite
            2-14-13-3 AltaVista
            2-14-13-4 Yahoo
            2-14-13-5 Fast
            2-14-13-6 Lycos
        2-14-14 News search engines
        2-14-15 Metacrawler
        2-14-16 Profitable search engines
        2-14-17 Paid-listing search engines
        2-14-18 Proprietary search engines
        2-14-19 Answer search engines
        2-14-20 Children's search engines
        2-14-21 Regional search engines
    2-15 Conclusion
    Chapter Three: Web crawler architecture and crawling strategies
    3-1 Introduction
    3-2 Architecture of web crawlers
    3-3 Page selection
    3-4 Page importance
    3-5 Challenges of running a crawler
        3-5-1 Selecting pages to download
    3-6 Complexities of the crawling process
        3-6-1 Measurement strategies for choosing pages
            3-6-1-1 Criteria based on users' interests
            3-6-1-2 Criteria based on page reputation
            3-6-1-3 Criteria based on the location of pages
    3-7 How to start and end the process of extracting and storing web pages
        3-7-1 Crawl and stop
        3-7-2 Crawl and stop based on a threshold value
    3-8 Strategies for updating pages
        3-8-1 Uniform update policy
        3-8-2 Relative update policy
    3-9 Minimizing the load on visited websites
    3-10 Parallelizing the crawling process
    3-11 Web structure
    3-12 Crawling strategies
        3-12-1 Uninformed search
            3-12-1-1 Depth-first search
            3-12-1-2 Breadth-first search
            3-12-1-3 Uniform-cost search
        3-12-2 Informed (heuristic) search
            3-12-2-1 Best-first search
            3-12-2-2 A* search
        3-12-3 Local search
            3-12-3-1 Hill-climbing search
            3-12-3-2 Local beam search
            3-12-3-3 Simulated annealing search
            3-12-3-4 Threshold acceptance algorithm
    3-13 Conclusion
    Chapter Four: Analysis of research results
    4-1 Introduction
    4-2 Step one: Examining the breadth-first method
    4-3 Step two: Examining the depth-first method
    4-4 Step three: Examining the combined methods
        4-4-1 First combination: first-level navigation as BFS
        4-4-2 Second combination: first- and second-level navigation as BFS
        4-4-3 Third combination: first-, second- and third-level navigation as BFS
    4-5 Step four: Examining the best-first method
    4-6 Step five: Examining the hill-climbing method
    4-7 Experimental results
    4-8 The number of pages downloaded for each query
    4-9 Conclusion
    Chapter Five: Conclusion and suggestions
    5-1 Conclusion and final summary
    5-2 Suggestions and future work
    Resources

     

    Sources:

    Persian sources

    Aristopour, S., 1385, "Crawlers and Web Structure", Library and Information Journal, Vol. 9, No. 2, pp: 4-15.

    Ismaili, M., Tawakli and Hashemi Majed, S., 2013, "Web Crawlers", APA Specialized Laboratory in the Field of Information and Communication Technology Security, Document No. APA_FUM_W_WEB_0111, pp: 5-28.

    Anuri, S., 2015, "Investigation of Search Engines and Comparison of the PageRank Algorithm with the HITS Algorithm", The First Conference on Intelligent Computer Systems and Their Applications, pp: 2-7.

    Latin sources

    Ahmadi-Abkenari, F and Selamat, A, 2012, “An Architecture for a Focused Trend Parallel Web Crawler with the Application of Clickstream Analysis”, Information Sciences, Vol. 184, pp: 266-281.

    Ahmadi-Abkenari, F and Selamat, A, 2013, “Advantages of Employing LogRank Web Page Importance Metric in Domain Specific Web Search Engines”, JDCTA: International Journal of Digital Content Technology and its Applications, Vol. 7, No. 9, pp: 425-432.

    Arasu, A, Cho, J, Garcia-Molina, H, Paepcke, A and Raghavan, S, 2001, “Searching the Web”, ACM Transactions on Internet Technology, Vol. 1, No. 1, pp: 2–43.

    Baeza-Yates, R, Castillo, C, Marin, M and Rodriguez, A, 2005, “Crawling a country: Better strategies than breadth-first for Web page ordering”, In Proceedings of the 14th international conference on World Wide Web/ Industrial and Practical Experience Track, Chiba, Japan, ACM Press, pp: 864– 872.

    Baeza-Yates, R, Castillo, C and Saint-Jean, F, 2004, "Web Dynamics, Structure, and Page Quality", in M. Levene and A. Poulovassilis (eds.), Web Dynamics, Springer Verlag, pp: 93-109.

    Brin, S and Page, L, 1998, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Computer Networks and ISDN Systems, Vol. 30, Issue 1-7, pp: 107-117.

    Brandman, O, Cho, J and Garcia-Molina, H, 2000, “Crawler Friendly Web Servers”, In Proceedings of the Workshop on Performance and Architecture of Web Servers (PAWS), Santa Clara, California, Vol. 28, Issue 2, pp: 9-14.
