Investigating dynamic data replication algorithms in grid networks and presenting a new algorithm based on parameters of file size, available bandwidth and geographical distance.

Number of pages: 81 File Format: word File Code: 31037
Year: 2014 University Degree: Master's degree Category: Computer Engineering
  • Summary

    Master's Thesis in Computer Engineering

    Software Track

    Abstract

    The growing need for distributed data in computer networks is widely recognized. A grid is formed by pooling a large number of computing and storage resources. Grid technology has grown significantly in recent years and is now used in much scientific research and experimentation. The major challenges in a data grid are high availability, efficiency, and low bandwidth consumption. Data replication is one way to address problems such as efficient data access and high availability. In an environment that uses replication, increasing the number of replicas of a file improves data locality, which in turn improves system efficiency.

    This thesis surveys methods for dynamic data replication in data grids and proposes a dynamic replication algorithm that exploits the factors affecting replication to reduce job execution time, bandwidth consumption, and the cost of maintaining replicas. The algorithm was implemented in the OptorSim simulator, and the simulation results show improvements in average execution time, number of replicas, and productivity.

    Keywords: data grid, data replication, replacement, access pattern, geographical distance, access cost

    Chapter One

    Introduction

    1-1. Introduction

    Over time, various types of distributed systems have been designed and implemented; grid systems are one of them. Grid technology is characterized by its focus on large-scale resource sharing. Data replication is a data grid service created to facilitate and speed up data access.

    1-2. Statement of the problem

    Today, large data sets are becoming an important part of shared resources. In fields such as high-energy physics, bioinformatics, earth observation, global climate change, image processing, and data mining, data volumes are measured in terabytes and in some cases in petabytes. Researchers and scientists access this information using sophisticated computing devices, and both the researchers and the computing and storage devices are distributed all over the world.

    This volume of data and computation creates new problems in data access, processing, and distribution. Large data volumes, geographically dispersed locations, and complex computations together make managing the infrastructure a challenge. The data grid is a suitable solution to these problems: it is an architecture for the distributed management and analysis of scientific data sets.

    A grid is formed by pooling a large number of computing and storage resources. The main motivation behind grid technology was coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The goal of sharing was not merely to exchange files but to give direct access to computers, software, data, and other available resources. The grid provides easy access to all of these resources.

     

    1-3. The importance of the data grid

    The main motivation for designing the data grid was to serve users with large amounts of data, to support distributed users and resources, and to handle computationally intensive analyses [1].

    Access to such a large amount of widely distributed data is slowed by network delays and bandwidth limitations. As a grid grows, the complexity of the system increases. The major challenges in a data grid are high availability, efficiency, and reduced network traffic. The data grid is designed to meet the needs of large data sets, geographically distributed users and resources, and computationally intensive analysis. The architecture is also intended for complex operations over wide areas and in heterogeneous environments.

    Managing such a large amount of distributed data centrally is inefficient because it places a heavy load on the central server. Storing everything on a central server also creates a single point of failure and a bottleneck. To avoid these problems, the data must be replicated and distributed across different locations in the distributed system. The grid retrieves data from the nearest site and replicates it to the requesting sites.

    With a data grid, large amounts of data can be stored and later retrieved at different points throughout the grid. The efficiency of the grid then depends on the available bandwidth and the network delay: low bandwidth between the location where data is stored and the location where it is processed makes the grid inefficient.
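    The dependence on bandwidth and delay can be made concrete with a rough estimate. The sketch below assumes a simple model in which fetching a file costs a fixed network delay plus the file size divided by the available bandwidth; the function name, the units, and the model itself are illustrative assumptions, not taken from the thesis.

```python
def estimated_transfer_time(file_size_mb: float,
                            bandwidth_mbps: float,
                            latency_s: float) -> float:
    """Estimate the seconds needed to fetch a file from a remote site.

    Assumed model (illustrative only): fixed latency plus size / bandwidth.
    file_size_mb is in megabytes, bandwidth_mbps in megabits per second.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    size_megabits = file_size_mb * 8  # convert megabytes to megabits
    return latency_s + size_megabits / bandwidth_mbps


# The same 1000 MB file over a slow link and over a fast link.
slow_link = estimated_transfer_time(1000, 100, 0.05)   # about 80 seconds
fast_link = estimated_transfer_time(1000, 1000, 0.05)  # about 8 seconds
```

    Under this model, a tenfold drop in bandwidth between the storage site and the processing site makes the transfer of a large file roughly ten times slower, which is the inefficiency described above.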

    1-4. Possible solutions

    Data access time in a data grid depends on the communication bandwidth, and low latency is the main requirement for fast data access. Several solutions are used to reduce access time. One is job scheduling: a good scheduler runs each job at a site where data transfer costs are as low as possible. Another is replication, which speeds up access by creating replicas of a file; storing multiple copies of files across the grid increases efficiency [2].
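    The scheduling idea can be illustrated with a minimal sketch. It assumes the scheduler knows which files each site already holds and simply picks the site where the fewest megabytes would have to be fetched remotely. The names `transfer_cost_mb` and `pick_site` and the sample data are hypothetical, and real grid schedulers also weigh factors such as queue length and CPU load.

```python
def transfer_cost_mb(job_files, site_files, file_sizes_mb):
    """Megabytes that must be fetched remotely if the job runs at this site."""
    return sum(file_sizes_mb[f] for f in job_files if f not in site_files)


def pick_site(job_files, sites, file_sizes_mb):
    """Return the name of the site with the lowest remote-transfer cost.

    `sites` maps a site name to the set of files stored there.
    Ties go to the site listed first.
    """
    if not sites:
        raise ValueError("no candidate sites")
    return min(
        sites,
        key=lambda name: transfer_cost_mb(job_files, sites[name], file_sizes_mb),
    )


# Hypothetical example: the job needs files "a" and "b".
file_sizes_mb = {"a": 100, "b": 500, "c": 50}
sites = {"s1": {"a"}, "s2": {"b", "c"}}
best = pick_site(["a", "b"], sites, file_sizes_mb)
# Running at s1 would fetch 500 MB; running at s2 would fetch only 100 MB.
```

    Replication complements this: once a copy of "a" also exists at s2, the same job runs there with no remote transfer at all.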

    1-5. Proposed solution

    The complexity of the structure increases as the grid grows. High data availability is a major challenge in the grid. Users' computing applications involve enormous amounts of data, so storing a full local copy of the data is very expensive and impractical. Providing high availability despite network delays and limited storage capacity at the different sites is difficult. Data replication is one of the main methods for meeting this challenge: it promotes high availability, lower bandwidth consumption, greater fault tolerance, and better scalability and response time [3-9]. When data is replicated, copies of the data files are placed at different locations in the data grid, which can save a large amount of bandwidth compared with keeping the data at a single site. Data replication is therefore a good trade-off between available storage and available bandwidth for ensuring constant and fast access to data [10]. It is a common way to improve the efficiency of data access in distributed systems: creating replicas reduces both bandwidth consumption and access latency. In other words, the main goal of data replication algorithms is to make reading data from appropriate nodes more efficient.
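    The trade-off between storage and bandwidth can be shown with simple arithmetic. The sketch below assumes that a file read repeatedly from a remote site crosses the network once per access, whereas a local replica is transferred once and then served from local storage. The function name and the figures are illustrative assumptions.

```python
def wan_traffic_mb(file_size_mb: float, accesses: int,
                   replicate_locally: bool) -> float:
    """Megabytes sent over the network for repeated reads of one file."""
    if accesses <= 0:
        return 0.0
    if replicate_locally:
        return float(file_size_mb)         # transferred once, then read locally
    return float(file_size_mb * accesses)  # transferred on every access


# A 500 MB file read 10 times from the same site.
without_replica = wan_traffic_mb(500, 10, replicate_locally=False)  # 5000.0
with_replica = wan_traffic_mb(500, 10, replicate_locally=True)      # 500.0
saved = without_replica - with_replica                               # 4500.0
```

    In this example the site gives up 500 MB of local storage and avoids 4500 MB of network traffic, which is the kind of exchange a replication algorithm has to evaluate for each file.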

    In addition, creating replicas and distributing them among different sites can improve data access, reliability, system scalability, and load balancing [11].

    The main benefits of replication are [12]:

    1. Better availability: When one node fails, the system can access the data from another node.

    2. Better performance: Because the data is replicated among multiple nodes, the user can obtain the data from the nearest node or from the node with the lightest workload.

    Data replication techniques fall into two main classes: static replication and dynamic replication. In static replication, the number of replicas and their host nodes are chosen at the outset, and no further replicas are created afterwards; a replica stays on its site until the user deletes it or its lifetime expires. The weakness of static replication is that it cannot adapt when the access pattern of the nodes changes frequently. A dynamic strategy, by contrast, can create a replica on a new node according to storage capacity and bandwidth, adapt to changes, and delete replicas that are no longer needed according to the requests.
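    The behaviour of a dynamic strategy can be sketched as follows. This is a minimal illustration, not the algorithm proposed in the thesis: it assumes each site has a fixed storage capacity, replicates every file it has to fetch remotely, and makes room by deleting the least frequently requested replica. The class name and the eviction rule are assumptions chosen for brevity.

```python
class DynamicReplicaStore:
    """Replica storage for one grid site with a fixed capacity (in MB)."""

    def __init__(self, capacity_mb: float):
        self.capacity_mb = capacity_mb
        self.replicas = {}       # file name -> size in MB
        self.access_counts = {}  # file name -> number of requests seen here

    def used_mb(self) -> float:
        return sum(self.replicas.values())

    def request(self, name: str, size_mb: float) -> bool:
        """Serve a request; return True on a local hit, False on a remote fetch."""
        self.access_counts[name] = self.access_counts.get(name, 0) + 1
        if name in self.replicas:
            return True
        self._replicate(name, size_mb)
        return False

    def _replicate(self, name: str, size_mb: float) -> None:
        """Store a new replica, evicting the least requested ones if needed."""
        if size_mb > self.capacity_mb:
            return  # the file can never fit at this site
        while self.used_mb() + size_mb > self.capacity_mb:
            victim = min(self.replicas, key=lambda f: self.access_counts[f])
            del self.replicas[victim]
        self.replicas[name] = size_mb


# Hypothetical request sequence at a site with 100 MB of storage.
store = DynamicReplicaStore(capacity_mb=100)
hits = [
    store.request("a", 60),  # remote fetch, replica of "a" created
    store.request("a", 60),  # local hit
    store.request("b", 30),  # remote fetch, replica of "b" created
    store.request("c", 40),  # remote fetch, "b" is evicted to make room
]
```

    A static strategy would keep the initial replica set unchanged throughout this sequence. Here the set follows the requests, so the frequently used file "a" stays while the less used "b" is replaced by "c".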

  • Contents & References

    Table of contents:

    Chapter 1. Introduction

    1-1. Introduction

    1-2. Statement of the problem

    1-3. Importance of the data grid

    1-4. Possible solutions

    1-5. Proposed solution

    1-6. Thesis questions

    1-6. Objectives of the thesis

    1-7. Thesis structure

    Chapter 2. A review of previous work

    2-1. Introduction

    2-2. Data replication techniques

    2-3. A framework for data replication

    Chapter 3. Dynamic replication algorithm in data grids using prefetching

    3-1. Introduction

    3-2. PDDRA architecture

    3-3. Steps of the PDDRA algorithm

    3-3-1. Phase 1: Storing the file access pattern

    3-4. Phase 2 of the prefetching algorithm

    3-4-1. Manager's responsibility for updating replicas

    3-4-2. Local server structure and grid sites

    3-5. Phase 3: Replacement

    3-5-1. PDDRA replacement algorithm

    3-6. Conclusion

    Chapter 4. The proposed algorithm

    4-1. Introduction

    4-2. The proposed data replication algorithm

    4-3. Algorithm description

    4-3-1. First phase: file request and replication

    4-3-2. Second phase: replacement

    Chapter 5. Algorithm simulation

    5-1. Introduction

    5-2. Algorithm simulation

    5-2-1. Access patterns

    5-2-2. Configuration files for OptorSim settings

    5-3. Simulation results

    5-3-1. Fuzzy system implementation

    5-4. Performance evaluation

    6-4. Network efficiency

    Chapter 6. Conclusions and suggestions

    6-1. Introduction

    6-2. Proposed solution

    6-3. Conclusion

    5-2. Future work

    References

     

     

    References:

    [1] Ghilavizadeh Z., Mirabedini S. J., Harounabadi A., A New Fuzzy Optimal Data Replication Method for Data Grid, Management Science Letters Journal, (2013) 927-936.

    [2] Chang R. S., Chen P. H., Complete and fragmented selection and retrieval in data grids, Future Generation Computer Systems, 23 (2007) 536-546.

    [3] Foster I., The grid: A new infrastructure for 21st century science, (2002).

    [4] Ranganathan K., Foster I., Design and evaluation of dynamic replication strategies for a high performance data grid, in: International Conference on Computing in High Energy and Nuclear Physics, vol. 2001, (2001).

    [5] Lamehamedi H., Szymanski B., Shentu Z., Deelman E., Data replication strategies in grid environments, 5th International Conference on Algorithms and Architecture for Parallel Processing, (2002) 0378.

    [6] Ranganathan K., Iamnitchi A., Foster I., Improving data availability through dynamic model-driven replication in large peer-to-peer communities, Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, (2002) 376.

    [7] Rahman R. M., Barker K., Alhajj R., Replica placement in data grid: Considering utility and risk, Information Technology: Coding and Computing, 1 (2005) 354-359.

    [8] Vazhkudai S., Tuecke S., Foster I., Replica selection in the Globus data grid, First IEEE International Symposium on Cluster Computing and the Grid, (2001) 106.

    [9] Stockinger H., Samar A., Holtman K., Allcock B., Foster I., Tierney B., File and object replication in data grids, Cluster Computing, 5 (3) (2002) 305-314.

    [10] Yuan Y., Wu Y., Yang, Yu F., Dynamic data replication based on local optimization principle in data grid, Sixth International Conference on Grid and Cooperative Computing, (2007) 815-822.

    [11] Foster I., Ranganathan K., Design and evaluation of dynamic replication strategies for a high performance Data Grid, in: Proceedings of International Conference on Computing in High Energy and Nuclear Physics, (2001) 20.

    [12] Meroufel B., Belalem G., Dynamic Replication Based on Availability and Popularity in the Presence of Failures, Journal of Information Processing Systems (JIPS), (2012) 263-278.

    [13] Cibej U., Slivnik B., Robic B., The complexity of static data replication in data grids, Parallel Computing, 31 (8) (2005) 900-912.

    [14] Dong X., Li J., Wu Z., Zhang D., Xu J., On Dynamic Replication Strategies in Data Service Grids, 11th IEEE International Symposium on Object Oriented Real-Time Distributed Computing (ISORC), Orlando, (2008) 155-161.

    [15] Amjad T., Sher M., Daud A., A survey of dynamic replication strategies for improving data availability in data grids, Future Generation Computer Systems, (2012) 337-349.

    [16] Bsoul M., A Framework for Replication in Data Grid, International Conference on Networking, Sensing and Control, Delft, (2011) 978-981.

    [17] Sashi K., Antony S. T., Dynamic Replica Management for Data Grid, IACSIT International Journal of Engineering and Technology, (2010) 2-4.

    [18] Park S. M., Kim J. H., Ko W. B., Yoon W. S., Dynamic Data Replication Strategy Based on Internet Hierarchy BHR, in: Lecture Notes in Computer Science, (2004) 838-846.

    [19] Loukopoulos T., Ahmad I., Static and Adaptive Distributed Data Replication Using Genetic Algorithms, Journal of Parallel Distributed Computing, (2004) 1270-1285.

    [20] Zhongping Z., Zhang C., Mengfei Z., Wang Z., Dynamic Data Grid Replication Algorithm Based on Weight and Cost of Replica, Telkomnika Indonesian Journal of Electrical Engineering, (2014) 2860-2867.

    [21] Chang R. S., Chang H. P., Wang Y. T., A dynamic weighted data replication strategy in data grids, The Journal of Supercomputing, 45 (3) (2008) 277-295.

    [22] Gu Q., Chen B., Zhang Y., Dynamic Replica Placement and Location Strategies for Data Grid, International Conference on Computer Science and Software Engineering, Wuhan-Hubei, (2008) 35-40.

    [23] Stoica I., Morris R., Karger D., Kaashoek M. F., Balakrishnan H., Chord: A Scalable Peer to Peer Lookup Service for Internet Applications, Proceedings of ACM SIGCOMM, (2001) 160-177.

    [24] Tang M., Lee B. S., Yao C. K., Tang X. Y., Dynamic replication algorithm for the multi-tier data grid, Future Generation Computer Systems, 21 (5) (2005) 775-790.

    [25] Shorfuzzaman M., Graham P., Eskicioglu R., Popularity-driven dynamic replica placement in hierarchical data grids, in: Proceedings of Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, (2008) 524-531.

    [26] Slota R., Skital L., Nikolow D., Kitowski J., Algorithms for automatic data replication in grid environment, in: Roman Wyrzykowski, Jack Dongarra, Norbert Meyer, Jerzy Wasniewski (Eds.), Parallel Processing and Applied Mathematics: 6th International Conference, PPAM 2005, Poznan, Poland, September 11-14, 2005, Revised Selected Papers, in: Lecture Notes in Computer Science, vol. 3911, Springer, 2006, pp. 707-714.

    [27] Abdurrab A. R., Xie T., Fire: a file reunion data replication strategy for data grids, in: 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, (2010) 215-223.

    [28] Chang R. S., Chang H. P., Wang W. T., A dynamic weighted data replication strategy in data grids, IEEE/ACS International Conference on Computer Systems and Applications, (2008) 414-421.

    [29] Ghilavizadeh Z., Mirabedini S. J., Harounabadi A., A New Fuzzy Optimal Data Replication Method for Data Grid, Management Science Letters Journal, (2013) 927-936.

    [30] Sashi K., Santhanam T., Replica Replacement Algorithm for Data Grid Environment, ARPN Journal of Engineering and Applied Sciences, (2013) 86-90.

    [31] Lei M., Vrbsky S. V., A Data Replication Strategy to Increase Data Availability in Data Grids.

    [33] Kroegar T. M., Long Darrell D. E.
