An efficient model for creating a parallel text corpus from a comparative text corpus

Number of pages: 94 File Format: word File Code: 31087
Year: 2013 University Degree: Master's degree Category: Computer Engineering
  • Summary of An efficient model for creating a parallel text corpus from a comparative text corpus

    Master's Thesis in Computer Engineering (Software)

    Abstract

    Most modern approaches to machine translation, including statistical, example-based, and hybrid machine translation, rely on collections of mutually translated texts, known as parallel corpora, as their main training data. For most languages, however, parallel corpora are available only in very small quantities or are restricted to a specific domain. Comparable corpora, on the other hand, are built from raw material that is easy to obtain. A comparable corpus does not consist of texts that are translations of each other; instead, it pairs texts in two different languages that match on criteria such as content, publication date, and title.

    Comparable corpora nevertheless contain sentences that are good translations of each other. The purpose of this thesis is to build a parallel corpus automatically by extracting such sentences from a comparable corpus. The proposed model consists of three main steps: (1) selecting candidate parallel sentence pairs using a sentence-length-ratio filter and a filter on the number of common words; (2) selecting parallel sentence pairs from the candidates using a maximum entropy classifier whose features are based on the lengths of the two sentences, their common words, and the word-level alignment between them; and (3) increasing the accuracy of the extracted pairs by keeping only one of the sentences paired with each sentence. Step 3 is carried out by translating the sentences paired with a given sentence from the other language, measuring how close each translation is to that sentence with the TER metric, and keeping the closest one. Finally, the effectiveness of the model is examined in two parts: (1) evaluation of the designed maximum entropy classifier, and (2) evaluation of how much the extracted parallel sentence pairs improve machine translation quality.
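
    Steps 1 and 3 of the model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the thresholds `max_ratio=2.0` and `min_common=2`, and the `lexicon` format (source word mapped to a set of target words) are hypothetical, and `edit_rate` is a simplified stand-in for TER that counts word insertions, deletions, and substitutions but omits the block shifts of the full metric.

```python
def length_ratio_ok(src_tokens, tgt_tokens, max_ratio=2.0):
    """Step 1a: reject pairs whose lengths differ by more than max_ratio (assumed threshold)."""
    if not src_tokens or not tgt_tokens:
        return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return longer / shorter <= max_ratio


def common_word_count(src_tokens, tgt_tokens, lexicon):
    """Step 1b: count source words that have a dictionary translation in the target sentence."""
    tgt_set = set(tgt_tokens)
    return sum(1 for word in src_tokens if lexicon.get(word, set()) & tgt_set)


def candidate_pairs(src_sents, tgt_sents, lexicon, max_ratio=2.0, min_common=2):
    """Yield (i, j) index pairs of sentences that pass both filters."""
    for i, src in enumerate(src_sents):
        for j, tgt in enumerate(tgt_sents):
            if (length_ratio_ok(src, tgt, max_ratio)
                    and common_word_count(src, tgt, lexicon) >= min_common):
                yield i, j


def edit_rate(hyp, ref):
    """Word-level edit distance divided by reference length (TER without shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,              # delete h
                           cur[j - 1] + 1,           # insert r
                           prev[j - 1] + (h != r)))  # match or substitute
        prev = cur
    return prev[-1] / max(len(ref), 1)


def keep_closest(sentence, translated_partners):
    """Step 3: index of the partner whose translation is closest to `sentence`."""
    return min(range(len(translated_partners)),
               key=lambda k: edit_rate(translated_partners[k], sentence))
```

    Here `translated_partners` is assumed to hold the paired sentences after they have already been machine-translated into the language of `sentence`; the translation step itself and the maximum entropy classifier of step 2 are outside this sketch.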

    Chapter One

    Introduction

    Due to growing interregional communication and the need to exchange information, demand for translation has increased greatly. Many documents need to be translated, including scientific and technical documents, manuals, legal documents, textbooks, advertising brochures, and news articles. Some of them are difficult and challenging to translate, but most are tedious and repetitive while still requiring consistency and precision. Professional translators find it hard to keep up with the ever-increasing demand. In such a situation, machine translation can be used as an alternative.

    With a history of some 65 years, machine translation is one of the oldest applications of computers. Over the years it has been a focus of research for linguists, psychologists, philosophers, computer scientists, and engineers. It is no exaggeration to say that work on machine translation has contributed significantly to the development of fields such as computational linguistics, artificial intelligence, and application-oriented natural language processing.

    (Images are available in the main file)

    Machine translation can be defined as "translation from one natural language (the source language) into another (the target language) using computerized systems, with or without human assistance". Research in machine translation is not limited to the grand goal of fully automatic, high-quality (publishable) translation. Rough translations are often sufficient for reviewing foreign-language material. Recent efforts aim at limited applications combined with speech recognition, especially for handheld devices. Machine translation can also serve as a basis for further editing: translators commonly use tools such as translation memories, which draw on machine translation technology but remain under the translator's control.

    Machine translation is a research area within computational linguistics. Many methods have been devised to automate translation, and they are categorized in various ways. Figure 1-1 shows the categories of machine translation methods given in [1].

    1-1-1. Dictionary-based machine translation

    This type of machine translation relies on dictionary entries and produces a translation from word-level equivalents. The first generation of machine translation (from the late 1940s to the mid-1960s) was based entirely on electronic dictionaries. The method is still useful to some extent for translating phrases, though not full sentences. Most later methods also make some use of bilingual dictionaries [1].
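
    As a minimal sketch of word-for-word translation, the Python snippet below looks up each source word in a toy bilingual dictionary. The function name `translate_word_for_word` and the three-entry English-French dictionary are illustrative assumptions, not part of the thesis.

```python
def translate_word_for_word(sentence, bilingual_dict):
    """Replace each source word with its first listed equivalent; keep unknown words."""
    output = []
    for word in sentence.lower().split():
        equivalents = bilingual_dict.get(word)
        output.append(equivalents[0] if equivalents else word)
    return " ".join(output)


# Hypothetical toy English-French dictionary.
toy_dict = {"the": ["le"], "black": ["noir"], "cat": ["chat"]}

print(translate_word_for_word("the black cat", toy_dict))  # prints "le noir chat"
```

    The output keeps the English word order, whereas correct French would be "le chat noir". This is the kind of limitation that makes the method more suitable for short phrases than for whole sentences.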

    1-1-2. Rule-based machine translation


    Rule-based machine translation works with morphological, syntactic, and semantic information about the source and target languages, from which linguistic rules are built. The method can handle a wide range of linguistic phenomena and is extensible and maintainable, but exceptions in the grammar complicate such systems, and developing them requires substantial investment. The goal of rule-based machine translation is to convert source-language structures into target-language structures. It has several approaches.

    Direct approach: source-language words are translated without passing through an intermediate representation. The context, meaning, and domain of the text are not taken into account.

    Transfer approach: the transfer model belongs to the second generation of machine translation (from the mid-1960s to the 1980s). In this model, the source language is converted into an abstract, less language-specific representation. An equivalent representation for the target language (at the same level of abstraction) is then produced using a bilingual dictionary and grammar rules.

    Interlingual approach: this method belongs to the third generation of machine translation. The source language is transformed into an intermediate language (representation) that is independent of both languages involved in the translation (source and target). The translation into the target language is then derived from this intermediate representation. As a result, such a system needs only two kinds of modules: analysis and synthesis. Because the method is independent of the source and target languages, it is used mostly in multilingual translation systems. It emphasizes a single representation shared by different languages.
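
    The advantage for multilingual systems can be made concrete with a small count. The sketch below assumes one transfer component per ordered language pair for the transfer approach, and one analysis module plus one synthesis module per language for the interlingual approach; both function names are illustrative, and the transfer count leaves out the per-language analysis and synthesis steps that transfer systems also need.

```python
def transfer_modules(n_languages):
    """Transfer approach: one transfer component per ordered language pair."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """Interlingual approach: one analyzer and one synthesizer per language."""
    return 2 * n_languages


for n in (2, 5, 10):
    print(n, transfer_modules(n), interlingua_modules(n))
# prints:
# 2 2 4
# 5 20 10
# 10 90 20
```

    Under these assumptions the interlingual design grows linearly with the number of languages while the transfer design grows quadratically, so the interlingua pays off once more than three languages are involved.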

    1-1-3. Knowledge-based machine translation

    This method works with a conceptual lexicon that represents a domain. It has two stages: analysis and generation. The basic components of a knowledge-based translation system are an ontology of concepts, a lexicon and grammar of the source language for the analysis process, a lexicon and grammar of the target language, and mapping rules between the syntax of the intermediate language and the source and target languages.

    1-1-4. Corpus-based machine translation

    The corpus-based approach to machine translation emerged in 1989 and has been widely discussed in the field; because of its high translation accuracy, it came to prevail over the other methods. In this approach, the translation knowledge or model is obtained automatically from a bilingual text corpus (a collection of texts). Because it works with large amounts of data, it is called corpus-based machine translation. Some corpus-based methods are described below.

    Statistical machine translation

    Although the idea of statistical machine translation was first put forward by Warren Weaver in 1949, it has been widely used only since 1993, when IBM researchers gave the method a formal model; today it is the most common approach to machine translation. Statistical machine translation uses statistical models whose parameters are estimated from bilingual texts, or "parallel corpora". In other words, the system learns translation probabilities from the parallel corpus and uses them to produce a suitable translation for input sentences that were not seen during training. Two main kinds of models are used: word-based models and phrase-based models.
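
    To illustrate how translation probabilities can be learned from a parallel corpus, the sketch below runs a few EM iterations in the style of IBM Model 1, one of the word-based models. It is a simplified illustration rather than the system used in the thesis: there is no NULL word, the function name `train_model1` is hypothetical, and the three-sentence German-English corpus is a made-up toy example.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) with EM."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Uniform start for every word pair that co-occurs in some sentence pair.
    t = {(w, s): 1.0 / len(tgt_vocab)
         for src, tgt in pairs for s in src for w in tgt}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: spread each target word over the source words of its sentence.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts per source word.
        t = {(w, s): c / total[s] for (w, s), c in count.items()}
    return t


# Toy parallel corpus (source: German, target: English); illustrative only.
toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_model1(toy_corpus)
for (w, s), p in sorted(t.items(), key=lambda kv: -kv[1])[:4]:
    print(f"P({w} | {s}) = {p:.2f}")
```

    Even though no word alignments are given, the co-occurrence pattern across sentence pairs pushes the probability mass toward the intuitively correct pairs, for example "Haus" toward "house" rather than "the".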

    Example-based machine translation

    Example-based machine translation methods are also called memory-based methods. The idea originated in Japan in 1980. Such systems look for a sentence in the parallel corpus that is similar to the input sentence and then produce a translation of the input by modifying the stored translation of that sentence.

    The basic idea is to reuse existing human translations to translate new texts: a new text is broken into small fragments, the translation of each fragment is looked up in a database of translated fragments, and the desired translation is assembled from them. The method is limited by its data: even a very large collection of examples cannot cover an entire language.
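
    The retrieval half of this idea can be sketched as a nearest-example lookup. The snippet below is a hedged illustration: the word-overlap (Jaccard) similarity, the function names, and the two-entry example base are assumptions chosen for brevity, and the adaptation step that edits the retrieved translation is not shown.

```python
def jaccard(a, b):
    """Word-set overlap between two token lists, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_example(input_sentence, examples):
    """Return the stored (source, translation) pair most similar to the input."""
    tokens = input_sentence.lower().split()
    return max(examples, key=lambda ex: jaccard(tokens, ex[0].lower().split()))
```

    With a hypothetical example base such as `[("he buys a book", "er kauft ein Buch"), ("she reads the newspaper", "sie liest die Zeitung")]`, the input "he buys a newspaper" retrieves the first pair; a full system would then replace the translation of "book" with that of "newspaper". If no stored example is sufficiently similar, there is nothing to adapt, which is the coverage limitation described above.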

  • Contents & References of An efficient model for creating a parallel text corpus from a comparative text corpus

    Table of contents:

    1. Introduction

    1-1. Introduction

    1-1-1. Dictionary-based machine translation

    1-1-2. Rule-based machine translation

    1-1-3. Knowledge-based machine translation

    1-1-4. Corpus-based machine translation

    Statistical machine translation

    Example-based machine translation

    Text-based machine translation

    1-2. The need to build parallel corpora

    1-3. Research problem: building parallel corpora

    1-4. Research goal: building a parallel corpus from a comparable corpus

    1-5. Thesis outline

    1-5-1. Chapter 2: Theoretical foundations

    1-5-2. Chapter 3: Review of related work

    1-5-3. Chapter 4: Proposed model

    1-5-4. Chapter 5: Evaluation and conclusion

    2. Theoretical foundations

    2-1. Corpus

    2-1-1. Parallel corpus

    2-1-2. Comparable corpus

    2-2. Alignment

    2-2-1. Document-level alignment

    2-2-2. Sentence-level alignment

    2-2-3. Word-level alignment (lexical alignment)

    Lexical alignment using IBM models

    2-3. Evaluation of machine translation

    2-3-1. BLEU

    2-3-2. NIST metric

    2-3-3. Word error rate

    2-3-4. Translation error rate (TER)

    3. Review of related work

    3-1. Introduction

    3-2. Building a parallel corpus from translated texts

    3-3. Extracting parallel sentences from the web

    3-4. Extracting parallel sentences from comparable corpora

    3-5. Identifying parallel sentences using a maximum entropy classifier

    3-6. Building English-Persian parallel corpora

    4. The proposed model

    4-1. Introduction

    4-2. Selecting candidate parallel sentence pairs

    4-2-1. Common-word filter

    Converting the character encoding

    Determining sentence and word boundaries

    Stemming

    Removing stop words

    Disambiguation

    Looking up meanings in the dictionary

    Grouping the repeated words of a sentence with their number of occurrences

    Algorithm for finding the rate of common words (from the source)

    4-3. Selecting parallel sentence pairs from the candidate pairs

    4-3-1. Maximum entropy classifier

    4-3-2. General features

    Features based on the lengths of the two sentences

    Rate of common words

    4-3-3. Features based on the word-level alignment of a sentence pair

    Unaligned words

    Fertility

    Contiguous span

    Alignment score

    4-4. Increasing the accuracy of the extracted parallel sentence pairs

    4-5. Model evaluation method

    5. Evaluation and conclusion

    5-1. Evaluation of the maximum entropy classifier

    5-1-1. Evaluation of features

    5-1-2. Domain sensitivity

    5-2. Configurations and experiments for building a parallel corpus from a comparable corpus

    5-2-1. Comparable corpora used

    University of Tehran Persian-English Comparable Corpus (UTPECC)

    Comparable corpus extracted from Wikipedia articles

    5-2-2. Parameter settings and tools used

    Selecting candidate sentence pairs

    Selecting parallel sentence pairs

    Increasing the accuracy of the extracted sentence pairs

    5-2-3. Evaluating the extracted parallel sentences using a machine translation system

    5-3. Conclusion

    5-4. Future work

     

    References:

    [1] S. Tripathi and J. K. Sarkhel, "Approaches to machine translation", Annals of Library and Information Studies, vol. 57, pp. 388-393, December 2010.

    [2] A. Lopez, "Statistical machine translation", ACM Computing Surveys, vol. 40, no. 3, pp. 1-49, 2008.

    [3] P. F. Brown, J. Cocke, S. A. Della-Pietra, V. J. Della-Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer and P. S. Roossin, "A statistical approach to machine translation", Comput Linguist, vol. 16, no. 2, pp. 79-85, 1990.

    [4] F. J. Och and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation", in 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pp. 295-302, 2002.

    [5] P. Koehn, "Europarl: a parallel corpus for statistical machine translation", in MT Summit X: the tenth machine translation summit, Phuket, Thailand, pp. 79-86, 2005.

    [6] M. Mohaghegh, A. Sarrafzadeh and T. Moir, "Improved Language Modeling for English-Persian Statistical Machine Translation", Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation (COLING 2010), Beijing, pp. 75-82, August 2010.

    [7] Supreme Council of Information and Communication Technology. (2013). Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

    [8] A. Mansouri and H. Faili, "State-of-the-art English to Persian Statistical Machine Translation System", in 16th CSI International Symposium on Artificial Intelligence and Signal Processing, pp. 174-179. IEEE, Fars, 2012.

    [9] T. Ishisaka, K. Yamamoto, M. Utiyama and E. Sumita, "Development of a Japanese-English software manual parallel corpus", MT Summit XII: proceedings of the twelfth machine translation summit, Ottawa, ON, Canada, pp. 254-259, 2009.

    [10] M. T. Pilevar, A. H. Pilevar and H. Faili, "TEP: Tehran English-Persian Parallel Corpus", in: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. LNCS, vol. 6609, pp. 68-79. Springer, Heidelberg, 2011.

    [11] F. Jabbari, S. Bakhshaei, S. M. Mohammadzadeh Ziabary and S. Khadivi, "Developing an Open-domain English-Farsi Translation System Using AFEC: Amirkabir Bilingual Farsi-English Corpus", Fourth Workshop on Computational Approaches to Arabic-Script-based Languages (AMTA 2012), San Diego, CA, USA, November 2012.

    [12] J. Nie, M. Simard, P. Isabelle and R. Durand, "Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web", Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR '99), Berkeley, CA, pp. 74-81, 1999.

    [13] P. Resnik and N. A. Smith, "The web as a parallel corpus", Comput Linguist, vol. 29, no. 3, pp. 349-380, 2003.

    [14] Y. Zhang, K. Wu, J. Gao, and P. Vines, "Automatic acquisition of Chinese-English parallel corpus from the Web", Proceedings of 28th European Conference on Information Retrieval, pp. 420-431. Lecture Notes in Computer Science, vol. 3936, Springer, January 2006.

    [15] D. W. Oard, "Alternative approaches for cross-language text retrieval", in AAAI symposium on cross-language text and speech retrieval, Stanford, CA, USA, pp. 154-162, 1997.

    [16] J. Tiedemann, "Parallel Data, Tools and Interfaces in OPUS", in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012), 2012.

    [17] R. Zajac, S. Helmreich and K. Megerdoomian, "Black-Box/Glass-Box Evaluation in Shiraz", Workshop on Machine Translation Evaluation at LREC-2000, Athens, Greece, 2000.

    [18] R. S. Belvin, W. May, S. Narayanan, P. Georgiou and S. Ganjavi, "Creation of a Doctor-Patient Dialogue Corpus Using Standardized Patients", International Conference on Language Resources and Evaluation (LREC), 2004.

    [19] B. Qasemizadeh and S. Rahimi, "The First Parallel Multilingual Corpus of Persian: Toward a Persian BLARK", the second workshop on Computational Approaches to Arabic Script-based Languages (CAASL-2), California, USA, 2007.

    [20] M. Mohaghegh and A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian Statistical Machine translation", 10th International Conference on the Statistical Analysis of Textual Data (JADT2010), Rome, Italy, 2010.

    [21] M. A. Farajian, "PEN: Parallel English-Persian News Corpus", Proceedings of the 2011 World Congress in Computer Science, Computer Engineering and Applied Computing, 2011.

    [22] F. Jabbari, S. Bakhshaei, S. M.
