Providing a suitable data model to discover the transmission of genetic diseases

Number of pages: 95 File Format: word File Code: 31010
Year: 2014 University Degree: Master's degree Category: Computer Engineering
  • Summary of Providing a suitable data model to discover the transmission of genetic diseases

    Master's Thesis in Computer Engineering

    Software Specialization

     

    Abstract

    Advances in medical science are rapidly increasing the volume of medical data, and analyzing these data quickly and efficiently requires storing them electronically. Data on genetic diseases belong to this category, so suitable databases must be designed to store and retrieve them. Because of the nature of genetic data and the problem of genetic disease transmission, the relationships between people and the analysis of those relationships are central. In this thesis we therefore use the graph data model, one of the unstructured (NoSQL) data models, to store and retrieve these data. We first determine the requirements and queries related to the problem and design the graph data model based on them. To evaluate the model, a team of genetics experts reviewed it and expressed a favorable opinion of its use for genetic diseases. We also stored data on the genetic disease thalassemia in Neo4j and examined the model in terms of the efficiency of storing and retrieving information and the query times. Given the query times and the lack of support in other data models for relationships between people, we consider this data model suitable.

    Keywords: genes, genetic diseases, graph databases, Neo4j, data model

    Chapter One: Introduction

    1-1 Preface

    In medicine, data are being produced and disseminated rapidly. These data take different forms from those of the past, and as the field advances, the need for new approaches to managing them is felt far more than before. Storing them requires databases that can support diverse data types and large volumes of data and that can manage the data correctly and completely [14].

    The data to be stored for genetic diseases are diverse. Because of the nature of these diseases, understanding how they are transmitted also requires recording the health status of patients' ancestors, and each investigation may add a new person to the family tree. The relationships between people in the database are also essential for discovering the path along which a disease is transmitted. Structured databases are not a suitable option for meeting these needs and managing relationships between people and disease transmission, because they cannot support different types of data.

    Unstructured (NoSQL) databases are better suited to supporting different types of data. There are several kinds of NoSQL database, but because relationships between people are so important for this type of disease and we must be able to add entities at any time, graph databases are the right choice.

    The nucleus of a human cell contains 46 chromosomes, or 23 pairs. Chromosomes are made of tangled strands of DNA, which contain the genes. Each cell of the human body contains 25,000 to 35,000 genes [1]. Genes carry the information that determines human characteristics. Genes are composed of pairs of bases called nucleotides, of which there are four: adenine, guanine, cytosine, and thymine. Each gene is therefore written with the four letters A, T, C, and G, and this string is called its nucleotide sequence. Different diseases have different nucleotide sequences. For example, the insulin gene sequence is 333 characters long. The longest nucleotide sequence known so far is associated with Duchenne muscular dystrophy; it is 2.3 megabytes long. Examples of genetic diseases include leprosy, skin cancers, intellectual disability, sickle cell anemia, phenylketonuria, and thalassemia [2].

    Some of the cases in which genetic tests are performed are as follows:

    • A couple planning to start a family, when one of them or a close relative has a hereditary disease.

    • A person who has a child with a severe birth defect.

    • A child with a physical problem that may be genetic.

    To perform these genetic tests, we first need the family tree: that of the couple when they are starting a family, that of the parents for tests during pregnancy, and that of the patient when examining a person with a genetic disease. Once the family tree is known, we need to store information about the patient. Storing data related to genetic diseases requires a database that supports all types of data well, and a data model that can analyze the data in addition to storing them.

    Three issues guide the choice of data model. First, with medical data we may record characteristics for one patient that we do not need for others. For example, we may need to save blood test results for one patient while another patient never takes that test, or we may come across things during an examination that were not foreseen. It is therefore better not to fix a general schema for the database from the beginning, so that any characteristic we need or encounter during the work can be added for a patient. For this reason, SQL databases are not suitable and NoSQL databases are the better fit.

    Second, because we need genetic information about the patient's family members and previous generations, we must be able to add new entities (earlier and later generations) to the database while the research is in progress.

    Third, determining the route of transmission is very important: we must determine whether the disease was inherited from the father or the mother and, in the next step, from which ancestors, and also which of the male or female children may inherit it. We must therefore design databases that can extract relationships between entities. Relationships between entities can also be extracted in relational databases, but doing so is very complicated and time-consuming because it requires writing nested procedures. Considering these three issues, we conclude that a graph database is the best choice for this type of disease.
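    The first and third requirements above can be illustrated with a minimal Python sketch. It uses a plain in-memory structure rather than any particular database, so it is not the thesis's implementation. Each person is a node with a free-form property dictionary, so one patient can carry a blood test result that other people lack, and ancestors are found by following parent links instead of writing nested procedures. The names `Person`, `add_person`, and `ancestors`, the sample family, and the blood test value are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """A node: an identifier, free-form properties, and links to parents."""
    name: str
    properties: dict = field(default_factory=dict)
    parents: list = field(default_factory=list)  # names of known parents


def add_person(graph, name, parents=(), **properties):
    """Add a node at any time; no schema has to be declared beforehand."""
    graph[name] = Person(name, dict(properties), list(parents))
    return graph[name]


def ancestors(graph, name):
    """Return all known ancestors of `name`, nearest generation first."""
    found, queue = [], list(graph[name].parents)
    while queue:
        current = queue.pop(0)
        if current in found:
            continue
        found.append(current)
        # A parent may be referenced before being added; skip unknown nodes.
        if current in graph:
            queue.extend(graph[current].parents)
    return found


# Hypothetical family, invented for illustration only.
family = {}
add_person(family, "grandmother", status="carrier")
add_person(family, "father", parents=["grandmother"], status="carrier")
add_person(family, "mother", status="healthy")
# Only this patient has a blood test result; other nodes need no such field.
add_person(family, "patient", parents=["father", "mother"],
           status="sick", blood_test={"hemoglobin": 9.1})

print(ancestors(family, "patient"))
```

    In a relational schema the same ancestor lookup would typically need a self-join per generation or a recursive query, which is the complexity the paragraph above refers to.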

    In this thesis, we design a database using the graph data model that can store data of different types and volumes. The database must support operations on the stored data and must be able to extract the results needed to investigate the transmission of genetic diseases, such as the path of disease transmission, the possibility of transmission to the next generation or to a particular sex in the next generation, and the percentage of disease transmission. In this database, the entities, which are the individuals, are stored in nodes. In addition to the general characteristics of the patients, all information about their diseases, conditions, and symptoms is also stored in the nodes. The next levels of the graph store earlier generations of the patients, along with information about the specific disease under study. We use edges to represent the relationships between people in this data model: if a disease is transmitted from one person to another, a directed edge shows this transmission. We can also attach annotations to the edges, such as the percentage probability that a specific disease is transmitted from one person to another.

    1-3 Importance and necessity of conducting research

    So far, several data models have been used to store medical information, but each of them has disadvantages that keep it from being ideal. One of these data models is the relational data model.
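    The graph model outlined in this chapter has people as nodes, a directed edge wherever a disease passes from one person to another, and annotations such as a transmission probability on each edge. The following Python sketch shows how a transmission path could be extracted from such a structure. It is a sketch under stated assumptions, not the thesis's Neo4j implementation: the function names `transmission_path` and `edge_annotations`, the people, and the probability values are invented for illustration.

```python
from collections import deque

# Directed edges: (source, target) -> annotation stored on the edge.
# The people and probabilities are invented for illustration only.
transmission_edges = {
    ("grandmother", "father"): {"disease": "thalassemia", "probability": 0.5},
    ("father", "patient"): {"disease": "thalassemia", "probability": 0.5},
    ("father", "sibling"): {"disease": "thalassemia", "probability": 0.5},
}


def transmission_path(edges, source, target):
    """Breadth-first search along directed edges; returns a node list or None."""
    successors = {}
    for (start, end) in edges:
        successors.setdefault(start, []).append(end)
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in successors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def edge_annotations(edges, path):
    """Collect the annotation stored on each edge along a path."""
    return [edges[(a, b)] for a, b in zip(path, path[1:])]


path = transmission_path(transmission_edges, "grandmother", "patient")
print(path)
for annotation in edge_annotations(transmission_edges, path):
    print(annotation["disease"], annotation["probability"])
```

    Because the edges are directed, searching from "patient" back to "grandmother" finds no path, which mirrors the idea that a directed edge records the direction of transmission.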

  • Contents & References of Providing a suitable data model to discover the transmission of genetic diseases

    Table of contents:

    Abstract

    Chapter One: Introduction

    1-1 Preface

    1-2 Statement of the problem

    1-3 Importance and necessity of conducting research

    1-4 Novelty and innovation of the research

    1-5 Specific research objectives

    1-6 Overview of the thesis structure

    Chapter Two: Concepts

    2-1 Introduction

    2-2 What is a data model?

    2-2-1 Structured data models

    2-2-1-1 Relational data model

    2-2-1-2 Object-oriented data model

    2-2-1-3 Object-relational data model

    2-2-2 Unstructured data models

    2-2-2-1 Key/value data model

    2-2-2-2 Document-oriented data model

    2-2-2-3 Columnar data model

    2-2-2-4 Graph databases

    2-3 Data management

    2-4 Medical data

    2-5 Applications of medical data management

    2-6 Genetic diseases

    2-7 Transmission of genetic diseases

    2-8 Genetic tests

    Chapter Three: Background of the research

    3-1 Introduction

    3-2 Relational data model for epidemic diseases

    3-3 Object-relational data model for hospitals

    3-4 Graph data model for epidemic diseases

    Chapter Four: Proposed method

    4-1 Introduction

    4-2 Entities

    4-3 Attributes of each entity

    4-3-1 Healthy person

    4-3-2 Carrier

    4-3-3 Treated person

    4-3-4 Sick person

    4-3-5 Doctor

    4-3-6 Disease

    4-3-7 Symptoms

    4-3-8 Treatment methods

    4-3-9 Medicine

    4-4 Values stored on the edges

    4-5 Determining the capabilities of the data model

    4-5-1 Create

    4-5-2 Add

    4-5-3 Update

    4-5-4 Delete

    4-5-5 Queries

    4-5-5-1 Queries related to one node

    4-5-5-2 Queries related to two nodes

    4-5-5-3 Queries related to more than two nodes

    4-6 Data model design

    4-6-1 ER design

    4-6-2 Graph model design

    Chapter Five: Evaluation

    5-1 Introduction

    5-2 First method: focus group

    5-2-1 Introducing the focus group

    5-2-3 Focus group methodology

    5-2-4 Evaluation by the focus group

    5-3 Second method: practical implementation of the database

    5-3-1 Neo4j software

    5-3-2 Required data

    5-3-3 Storing data in the Neo4j database

    5-4 Results

    Chapter Six: Summary and future work

    6-1 Summary of future work

    References

     

    References:

    Persian sources

    1- Asad, Mohammad Taghi. Basics of Genetics. Dunya, 1380.

    2- Jabarpour Fanadi, Hosseinpour Faizi. Human Genetics and Human Genetic Diseases. Art and Culture, 1366.

    3- Haqjo, Mustafa. Scientific-Applied Information Bank. Volume I, Fundamental Concepts, third edition. Iran University of Science and Technology, 1385.

    4- Haqjo, Mustafa; Safai, Ali Asghar. Scientific-Applied Information Bank. Volume II, Advanced Concepts, third edition. Iran University of Science and Technology, 1385.

    5- Davrpanah, Ahmed; Mehdi Qolikhan Ramin. Management of Medical Documents. Ministry of Health Research Deputy, 1372.

    6- Sharifi Bidgoli, Mina, et al. "A graph storage system for network epidemic data storage". 19th Annual National Conference of the Iranian Computer Association, Shahid Beheshti University, Tehran, 2013.

    English sources

    7- Bryan, T., 2013, Literature Survey of Graph Databases, SYSTAP, LLC.

    8- Canada Health Infoway. The emerging benefits of electronic medical record use in community-based care. PwC.

    9- Date, C.J., 2003, An Introduction to Database Systems.

    10- El-Sappagh, Sh., 2012, Electronic Health Record Data Model Optimized for Knowledge Discovery. IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No 1.

    11- http://www.neo4j.org/learn/neo4j

    12- Ken Ka-Yin, L., Wai-Choi, T., Kup-Sze, C., 2012, Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage, Computer Methods and Programs in Biomedicine 110 (2013) 99-109.

    13- Lahr, G., et al., 2007, A Dominant B0-Thalassemia-Like Phenotype In A German Caucasian Family Is Associated With Mild Chronic Hemolytic Anemia But Influenced In Severity By Co-Inherited Genetic Factors, Haematologica, September, 92: 1264-1265; doi:10.3324/haematol.11383

    14- Robinson, I., Webber, J., Eifrem, E., 2013, Graph Databases, O'Reilly Media.

    15- Tayie, S., 2005, Research Methods and Writing Research Proposals, Cairo University.

    16- Vaish, G., 2013, Getting Started with NoSQL, Packt Publishing Ltd.
