Diagnosing pelagism using graphs in Persian texts

Number of pages: 79 File Format: word File Code: 31056
Year: 2014 University Degree: Master's degree Category: Computer Engineering
  • Part of the Content
  • Contents & Resources
  • Summary of Diagnosing pelagism using graphs in Persian texts

    Master's Thesis in Computer Engineering

    Trend: Software

    Abstract

    This thesis focuses on graph-based similarity search in natural language texts. The need for a strong method to present texts is an important issue in the field of plagarism detection. In this project, according to this need, we have introduced a powerful method to present natural language and have used it to detect plagarism. For this purpose, we have defined the concept of "graph modification distance" and used it to calculate the distance between two graphs. Sentences are represented by dependency graphs in which words are connected by their dependencies. Dependency graph extracts the grammatical structure of sentences. The graph-based similarity method has been used in the problem of pelargism detection. The main advantage of graph-based presentation is related to the ability of this method to detect similarities between words. The evaluations showed that the results obtained from the dependence graph have better results than the methods of direct comparison of the graphs. Using the graph modification distance to compare two graphs leads to the improvement of the dependency graph results and increases its efficiency. Keywords: graph modification distance, natural language processing, dependency graphs, detection of plagarism. Chapter 1. Introduction. Some people may intentionally or unintentionally use the works of researchers while not even mentioning the name of the main author of the work, this act is called plagarism. can be Plagarism is the intentional or unintentional act of copying or using the ideas and works of others without crediting the original source. Based on the investigations conducted on the issue of plagarism and the number of articles published on the web and the reflection of concern about its occurrence, it is clear that most educational and research centers in their electronic pages in all parts of the world, including in the developing countries of Asia and Africa, under the influence of the publishers with a history of research journals and also to preserve the scientific dignity of their centers, have started to comprehensively introduce this anti-ethical and anti-social phenomenon as a crime. Plagarism is a problem in the scientific community and is growing rapidly, because data and information from electronic documents and the Internet are quickly and easily obtained through copying and pasting from those sources. This problem occurs when the content of illegal documents is found without permission and without citation. This problem is known as plagarism, and plagarism can include a wide range of deliberate manipulations to accidental copying of other people's content. The main goal of this thesis is to express the graph-based method for text presentation and its use in pelarism detection. The following sections of this chapter state the reasons for using this method and describe the problem of pelargism diagnosis. In addition, they provide a solution for this problem. Finally, the structure of the introductory thesis and the topics that will be discussed in the next chapters will be reviewed.

    The problem of finding similarities between two texts is a common problem in the field of natural language processing. In order to evaluate the similarities between two texts, each text needs a method of presentation. One method is simple text, where a list of words forms a sentence. Plain text is usually used because of its simplicity, but it lacks clear information about the grammatical structure.

    Some aspects of language are better presented using structural presentation methods such as dependency graphs that include words connected together.

    Dependency graphs extract the grammatical structure of a sentence, and are limited to the scope of the same sentence. One of the main advantages of graph-based representation is that dependency graphs are in most cases insensitive to word order. This advantage makes it possible to find similarities between sentences in which the word order is scrambled.

    A stronger representation provides a better foundation for identifying similarities in complex situations. Consider the two sentences presented in the dependence graph of Figures 1-1 and 1-2. Humans must be able to determine that these sentences have the same meaning. However, the automatic recognition of the similarity of these two sentences will have problems due to the substitution of words. If plain text presentation is used, the only common words will be "to", "out", "from", "shot" and "ball". The words "shot" and "ball" are the only words that convey the meaning of the sentence.By looking closely at the dependency graphs, it is clear that there is structural similarity between the sentences.

    With the increasing availability of texts on the web, plagarism has become simpler and easier. A large amount of plagarism texts in the field of teaching and learning are increasing year by year. As a result, there is a strong need for automatic detection of plagarism. 1-1 Problem description The most important part of this thesis is the implementation of an algorithm to calculate the graph modification distance, which calculates the similarity between two graphs. The algorithm is based on the calculation of the number of editing operations required to transform one graph into another [1]. Each editing action has an editing cost, which determines how much an action costs.

    Automatic plagarism detection is a research field that is basically based on textual similarity. The problem of using graph-based text similarity to detect plagarism is stated in research question 1. Research question 1: Is graph-based similarity, within a certain graph modification distance, applicable and computationally feasible for plagarism detection? Experimental plagarism detection systems are often based on simpler text representations, such as ngram matching and vector space model [2 and 3]. As a result, the method can be relatively unique. Due to the uniqueness of the method, some implementation details are not defined. Research question 2 states a problem that specifies the details of the algorithm. Research question 2: What is the best method for calculating the graph editing distance between sentences, especially in terms of the cost of editing and presenting the graph in the field of plagarism detection? In order to evaluate the efficiency of the algorithm for calculating the graph editing distance, its efficiency is compared to existing test systems. Research question 3 raises the problem of comparing the algorithm with existing methods. Research question 3: How does graph-based similarity compare with other methods in the detection of plagarism, such as index-based recovery and ngram matching? 1-2 solutions The following objectives guarantee the answers to the questions raised in the previous section. It will be implemented on the test graph editing interval. The system should be able to evaluate many texts and recognize plagiarism at the sentence level.

    Research objective 2: Some functions will be presented to calculate the cost of implementation editing. Different methods are compared with each other and the resulting systems are evaluated.

    Research objective 3: The output of the experimental system is compared with existing pelargism detection systems. This comparison is considered an experimental evaluation of the system. 1-3 problems in the implementation of the algorithm One of the most important problems in the implementation of semantic plagarism detection software in the Persian language is related to the Persian language itself. As we know, the Persian language is a very complex language and has a grammar with many exceptions, and due to its similarities with the Arabic language, it has problems with the characters that are common in Arabic and Persian, which causes some cases to be misdiagnosed. The Persian language, which is almost similar to the Arabic language, has various forms and changes depending on the third person. These exceptions and thousands of similar problems due to the rich and complex grammar of the Persian language cause the task of detecting plagarism to face many problems and the need to consider many aspects in the implementation of the algorithm.

    Another problem that may arise in the field of semantic plagarism detection is that it is not enough to only use the dictionary of synonyms of words, because it is possible that users who engage in the inhuman act of plagarism and copying other people's works, in some cases instead of replacing the synonyms instead of their originals, with a slight change in the structure and appearance of the sentences, use the opposites of the words instead of their originals. Sometimes even without replacing the words, it is possible to create similar sentences with a slight change in the structure of the sentences.

  • Contents & References of Diagnosing pelagism using graphs in Persian texts

    List:

    Introduction. 2

    1-1 Explanation of the problem. 5

    1-2 solutions 6

    1-3 problems in algorithm implementation. 6

    1-4 thesis structure. 7

    Research background. 9

    2-1 Diagnosis of pelarism. 9

    2-2 dimensions of pelarism diagnosis. 12

    2-2-1 Grammar-based method. 12

    2-2-2 Meaning-based methods 13

    2-2-3 Combined methods. 14

    2-2-4 The method of detecting external pelargism. 14

    2-3 methods for calculating the degree of similarity of graphs 15

    2-3-1 The method of the largest common subgraph - the smallest common supergraph. 15

    2-3-2 Method based on state space search. 17

    2-3-3 Possible methods. 18

    3-1 Diagnosis of pelargism. 23

    3-1-1 matching n grams. 23

    3-1-2 Expression weighting. 23

    3-1-3 Generalization of the expression. 24 3-2 Dependency graphs. 25

    3-2-1 dependencies 26

    3-3 graph editing interval. 26

    3-3-1 Editing operation. 26

    3-3-2 The issue of attribution. 27

    3-3-3 Cost matrix. 28

    3-3-4 Assignment algorithms. 29

    4-1 Architecture. 32

    4-2 Text preprocessing. 32

    4-2-1 Finding sentences. 33

    4-2-2 rooting words. 34

    4-2-3 Formation of dependency graph. 40

    4-3 Candidate extraction 44

    4-3-1 Sentence indexing. 45

    4-3-2 Extraction of candidate sentences 45

    4-4 Analysis of details. 45

    4-4-1 Distance algorithm for editing two graphs. 48

    4-4-2 Detection of plagarism based on GED provided in this project 49

    5-1 Detection of plagarism of word displacement and sentence structure change. 55

    5-1-1 10% structural changes. 56

    5-1-2 50% structural changes. 57

    5-2-2 100% structural changes. 59

    5-2 Recognizing semantic pelagism. 60

    5-2-1 semantic changes of 10 percent. 60

    Conclusions and suggestions. 64

    References. 67

     

    Source:

    Fankhauser, S., K. Riesen, and H. Bunke. Speeding up graph edit distance computation through fast bipartite matching. Graph-Based Representations in Pattern Recognition, (2011)

     

    Suchomel, S., J. Kasprzak, and M. Brandejs (2012). Three way search engine queries with multi-feature document comparison for plagiarism detection. See Forner et al. (2012).

     

    Grman, J. and R. Ravas Improved implementation for _nding text similarities in large sets of data - notebook for PAN at clef 2011. See Petras et al. (2011).

    Asim M. El Tahir Ali, Hussam M. Dahwa Abdulla, and V´aclav Sn´a?sel Overview and Comparison of Plagiarism Detection Tools, Dateso 2011, pp. 161{172, ISBN 978-80-248-2391-1.

    A. S. Bin-Habtoor and M. A. Zaher "A Survey on Plagiarism Detection Systems", International Journal of Computer Theory and Engineering Vol. 4, No. 2, April 2012

     

    Sindhu.L, Bindu Baby Thomas, Sumam Mary Idicula A Study of Plagiarism Detection Tools and Technologies,  IJART, Vol. 1 Issue 1, 2011,64-70.

     

    Schleimer, S., Wilkerson, D. and Aiken, A. (2003) Winnowing: Local Algorithms for Document Fingerprinting. SIGMOD 2003, San Diego, 9-12 June 2003, 76-85.

     

    J.A. Malcolm and P.C.R. Lane, Tackling the PAN'09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector, 3rd PANWORKS-HOP. UNCOVERING PLAGIARISM, AUTHORSHIP AND SOCIAL SOFTWARE MISUSE, 2009, p. 29. C. Basile, G. Cristadoro, D. Benedetto, E. Caglioti, and M. Degli Es-posti, A plagiarism detection procedure in three steps: selection, matches and "squares", 3rd pan workshop. Uncovering plagiarism, authorship and social software misuse, 2009, p. 19.

     

    Adam Shenker Horste Bunke, Mark Last and Abraham Kandle Graph Theoretic Techniques For Web Content Mining, Published by World Scientific Publishing, USA 2005

     

    Ahmed Hamza Osman, Naomie Salim and Mohammed Salem Binwahlan, Plagiarism Detection Using Graph-Based Representation, Journal Of Computing,

     

    Adam Shenker Horste Bunke, Mark Last and Abraham Kandle Graph Theoretic Techniques For Web Content Mining, Published by World Scientific Publishing, USA 2005

     

    Ahmed Hamza Osman, Naomie Salim and Mohammed Salem Binwahlan, Plagiarism Detection Using Graph-Based Representation, Journal Of Computing, Volume 2, Issue 4, Issn 2151-9617 , April 2010.

     

    H. Bunke, On a relation between graph edit distance and maximum common subgraph, Pattern Recognition Letters (1997) H. Bunke and K. Shearer, A graph distance metric based on the maximal common subgraph, Pattern Recognition Letters, Vol. 19, 1998 J. T. L. Wang, K. Zhang, and G.-W. Chirn, Algorithms for Approximate Graph Matching, Information Sciences, Vol. 82, 1995

    R. C. Wilson and E. R. Hancock, Structural Matching by Discrete Relaxation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 6, June 1997

    R. Myers, R. C. Wilson, and E. R. Hancock, Bayesian Graph Edit Distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vo.!22, No. 6, June 2000. Papineni, K., S. Roukos, T. Ward, and W. Zhu (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. Association for Computational Linguistics. Stamatatos, E. Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology (2011) Jones, K. A statistical interpretation of the term specificity and its application in retrieval. Journal of documentation (1972) Marcus, M., M. Marcinkiewicz, and B. Santorini Building a large annotated corpus of English: The penn treebank. Computational linguistics(1993).

    Riesen, K. and H. Bunke Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision Computing (2009).

     

    Porter, M. F. An algorithm for suffix stripping. Program, pp. 137-130. (1980).

    Megerdoomian, K. (2004). Finite-state morphological analysis of Persian. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, University of Geneva, Iran.

     

    Sheykhzadegan, J. and M. Bijankhan (2006).  The speech databases of Persian language. In Proceedings of the 2nd Workshop on Persian Language and Computing, the University of Tehran, Tehran, Iran, pp. 247-261.

    Taghva, Beckley and Sadeh. A stemming algorithm for the Farsi language. IEEE ITCC, pp. 158 - 162. 2005.

     

    Anvari, H. & Ahmadi Givi, H. (2006).  Persian Language Grammar (2nd Ed.). Tehran: Fatemi Publication. A. A. Sharifloo, and M. Shamsfard, "A Bottom up Approach to Persian Stemming", Proceedings of the Third International Joint Conference on Natural Language Processing, 2008.

Diagnosing pelagism using graphs in Persian texts