An efficient model for creating a parallel text corpus from a comparative text corpus

Number of pages: 94
Year: 2013
University degree: Master's degree
Category: Computer Engineering

Keywords: Comparative text corpus - Machine translation - Parallel text corpus - Statistical machine translation

Summary of An efficient model for creating a parallel text corpus from a comparative text corpus

Master's Thesis in Computer Engineering (Software)

Abstract

Most modern approaches to machine translation, including statistical, example-based, and hybrid machine translation, rely on a collection of mutually translated texts, known as a parallel corpus, as their main training data. For most languages, however, parallel corpora are available only in very small quantities or are restricted to a specific text domain. Comparative corpora, on the other hand, are built from raw material that is easy to obtain. A comparative corpus does not consist of texts that are translations of each other; instead, its texts in the two languages correspond to each other in terms of criteria such as content, publication date, and title.

Comparative corpora nevertheless contain sentences that are good translations of each other. The aim of this thesis is to build a parallel corpus automatically by extracting such sentences from a comparative corpus. The proposed model consists of three main steps: (1) selecting candidate parallel sentence pairs using a sentence-length-ratio filter and a common-word-count filter; (2) selecting parallel sentence pairs from the candidates using a maximum entropy classifier whose features are based on the lengths of the two sentences, their common words, and the word-level alignment between them; and (3) increasing the precision of the extracted pairs by keeping only one of the sentences paired with each sentence. This last step is done by translating the sentences paired with a given sentence from the other side, measuring how close each translation is to that sentence with the TER metric, and keeping the closest one. Finally, the effectiveness of the model is examined in two parts: (1) an evaluation of the maximum entropy classifier and (2) an evaluation of how useful the extracted parallel sentence pairs are for improving machine translation quality.
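Steps (1) and (3) of the model can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the threshold values, the function names, and the toy bilingual lexicon are assumptions made for illustration, and the closeness score is a plain word-level edit rate, that is, TER without its shift operation. Step (2), the maximum entropy classifier, is omitted.

```python
from typing import Dict, List, Set

# Hypothetical thresholds; this summary does not state the values used in the thesis.
MAX_LENGTH_RATIO = 2.0
MIN_COMMON_WORD_RATE = 0.3


def length_ratio_ok(src: List[str], tgt: List[str]) -> bool:
    """Step 1a: reject pairs whose token counts differ too much."""
    if not src or not tgt:
        return False
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / shorter <= MAX_LENGTH_RATIO


def common_word_rate(src: List[str], tgt: List[str],
                     lexicon: Dict[str, Set[str]]) -> float:
    """Step 1b: share of source tokens with a dictionary translation in the target."""
    if not src:
        return 0.0
    tgt_set = set(tgt)
    hits = sum(1 for word in src if lexicon.get(word, set()) & tgt_set)
    return hits / len(src)


def is_candidate(src: List[str], tgt: List[str],
                 lexicon: Dict[str, Set[str]]) -> bool:
    """A pair is a candidate only if it passes both filters."""
    return (length_ratio_ok(src, tgt)
            and common_word_rate(src, tgt, lexicon) >= MIN_COMMON_WORD_RATE)


def word_edit_rate(hyp: List[str], ref: List[str]) -> float:
    """Word-level edit distance divided by reference length (TER without shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def pick_closest(sentence: List[str], translated_partners: List[List[str]]) -> int:
    """Step 3: index of the partner whose translation is closest to the sentence."""
    scores = [word_edit_rate(t, sentence) for t in translated_partners]
    return min(range(len(scores)), key=scores.__getitem__)
```

In this sketch, `translated_partners` is assumed to hold machine translations of every sentence that the classifier paired with `sentence`; only the partner at the returned index would be kept.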

Chapter One

Introduction

Growing regional interconnection and the need to exchange information have greatly increased the demand for translation. Many kinds of documents need to be translated, including scientific and technical documents, manuals, legal documents, textbooks, advertising brochures, and news reports. Some of them are difficult and challenging to translate, but most are tedious and repetitive while still demanding consistency and precision. Professional translators find it hard to keep up with this ever-growing demand. In such a situation, machine translation can be used as an alternative.

With a history of about 65 years, machine translation is one of the oldest applications of computers. Over the years it has been a focus of research for linguists, psychologists, philosophers, computer scientists, and engineers. It is no exaggeration to say that work on machine translation has contributed significantly to the development of fields such as computational linguistics, artificial intelligence, and application-oriented natural language processing.

(Images are available in the main file)

Machine translation can be defined as "translation from one natural language (the source language) to another (the target language) using computerized systems, with or without human assistance". Research in machine translation is not limited to the grand goal of fully automatic, high-quality (publishable) translation. Rough translations are often sufficient for reviewing foreign-language material. Recent efforts aim at limited applications combined with speech recognition, especially for handheld devices. Machine translation output can also serve as a basis for further editing; translators commonly use tools such as translation memories, which draw on machine translation technology while leaving the translator in control.

Machine translation is a research area within computational linguistics. Various methods have been devised to automate translation, and they are categorized in different ways. Figure 1-1 shows the available machine translation methods, grouped into the categories given in [1].

1-1-1. Dictionary-based machine translation

This type of machine translation is based on dictionary entries: the translation is produced from word equivalents. The first generation of machine translation (from the late 1940s to the mid-1960s) was based entirely on electronic dictionaries. The method is still useful to some extent for translating phrases rather than full sentences. Most of the methods developed later also rely, to a greater or lesser degree, on bilingual dictionaries [1].
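As a rough illustration of word-equivalent translation, the following Python sketch replaces each word with its dictionary entry and leaves unknown words untouched. The function name and the tiny English-German dictionary are hypothetical and serve only as an example.

```python
from typing import Dict


def translate_word_for_word(sentence: str, dictionary: Dict[str, str]) -> str:
    """Replace every word by its dictionary equivalent; keep unknown words as they are."""
    return " ".join(dictionary.get(word.lower(), word) for word in sentence.split())


# Toy dictionary, assumed for illustration only.
toy_dictionary = {"good": "gut", "morning": "morgen"}
translated = translate_word_for_word("Good morning friend", toy_dictionary)
```

Because no word order, inflection, or context is considered, the approach works better for short phrases than for full sentences, which matches the limitation noted above.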

1-1-2. Rule-based machine translation

Rule-based machine translation works with morphological, syntactic, and semantic information about the source and target languages, from which linguistic rules are built. The method can handle a wide range of linguistic phenomena and is extensible and maintainable, but exceptions in the grammar cause problems for such systems, and developing them requires a large investment. The goal of rule-based machine translation is to convert source-language structures into target-language structures. It comes in several approaches.

Direct approach: source-language words are translated without going through an intermediate representation. The context, meaning, and domain of the text are not taken into account.

Transfer approach: the transfer model belongs to the second generation of machine translation (from the mid-1960s to the 1980s). The source-language text is mapped to an abstract, less language-specific representation. An equivalent representation for the target language, at the same level of abstraction, is then produced using a bilingual dictionary and grammar rules.

Interlingual approach: this method belongs to the third generation of machine translation. The source-language text is converted into an intermediate representation (an interlingua) that is independent of both languages involved in the translation, and the target-language translation is then derived from this representation. Such a system therefore needs only two modules, analysis and generation. Because the representation is independent of the source and target languages, the method is used mostly in multilingual translation systems. It emphasizes a single representation shared by different languages.

1-1-3. Knowledge-based machine translation

This method works with a conceptual lexicon that represents a domain, and it consists of two stages, analysis and generation. The basic components of a knowledge-based machine translation system are an ontology of concepts, a lexicon and grammar of the source language for the analysis stage, a lexicon and grammar of the target language for the generation stage, and mapping rules between the syntax of the intermediate language and the source and target languages.

1-1-4. Corpus-based machine translation

The corpus-based approach to machine translation emerged in 1989 and has been widely discussed in the field; because of its high translation accuracy, it overtook other methods. In this approach, the translation knowledge or model is obtained automatically from a bilingual text corpus (a collection of texts). Since the approach works with large amounts of data, it is called corpus-based machine translation. Some types of corpus-based methods are described below.

Statistical machine translation

Although the initial idea of statistical machine translation was put forward by Warren Weaver in 1949, the method has been widely used only since 1993, when it was modeled by IBM researchers; statistical machine translation is now the most common approach in machine translation. It uses statistical models whose parameters are estimated from bilingual texts, or parallel corpora. In other words, a statistical machine translation system learns translation probabilities from the parallel corpus and uses them to produce a suitable translation for input sentences that it did not see during training. Two main kinds of models are used: word-based models and phrase-based models.
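How translation probabilities can be learned from a parallel corpus may be sketched with a simplified version of IBM Model 1, the simplest word-based model, trained with expectation-maximization. This is a teaching sketch under stated assumptions, not the system used in the thesis: the NULL source word is left out, the function name is hypothetical, and the three-sentence English-German corpus is a toy example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def train_ibm_model1(corpus: List[SentencePair],
                     iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate P(target_word | source_word) with EM (IBM Model 1, no NULL word)."""
    target_vocab = {word for _, tgt in corpus for word in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (target, source) links
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in corpus:
            for t_word in tgt:
                norm = sum(prob[(t_word, s_word)] for s_word in src)
                for s_word in src:
                    fraction = prob[(t_word, s_word)] / norm
                    count[(t_word, s_word)] += fraction
                    total[s_word] += fraction
        # M-step: renormalise the expected counts into probabilities.
        prob = defaultdict(float, {pair: c / total[pair[1]]
                                   for pair, c in count.items()})
    return dict(prob)


toy_corpus = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
table = train_ibm_model1(toy_corpus)
```

After a few iterations the table favours pairs that co-occur consistently, for example `("das", "the")` over `("haus", "the")`, even though no word-level links were given in the training data.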

Example-based machine translation

Example-based machine translation methods are also called memory-based methods. The idea of this method started in 1980 in Japan. Systems of this type try to find a sentence in the parallel corpus that is similar to the input sentence, and then produce a translation of the input by modifying the stored translation of that sentence.

The basic idea is to reuse existing human translations to translate new texts. It is therefore enough to break the new text into small fragments, look up the equivalent translation of each fragment in a database of translated fragments, and assemble the desired translation. The method is limited by its data: even a very large collection of examples cannot cover the entire language.
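The retrieval step of an example-based system can be sketched in Python as a nearest-example lookup. The similarity measure (the standard library's `difflib.SequenceMatcher` applied to word sequences), the function names, and the toy translation memory are assumptions made for illustration; the subsequent step of adapting the retrieved translation to the new input is not shown.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Example = Tuple[str, str]  # (source sentence, stored human translation)


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1] between two sentences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def closest_example(query: str, memory: List[Example]) -> Example:
    """Return the stored example whose source side is most similar to the query."""
    return max(memory, key=lambda example: similarity(query, example[0]))


# Toy translation memory with placeholder translations.
memory = [
    ("I like green tea", "<translation of: I like green tea>"),
    ("he reads a book", "<translation of: he reads a book>"),
]
best_source, best_translation = closest_example("I like black tea", memory)
```

A query with no close neighbour in the memory still returns some example, which illustrates the data limitation mentioned above: the quality of the result depends on how well the stored examples cover the input.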
Contents & References of An efficient model for creating a parallel text corpus from a comparative text corpus

Table of contents:

1. Introduction
1-1. Introduction
1-1-1. Dictionary-based machine translation
1-1-2. Rule-based machine translation
1-1-3. Knowledge-based machine translation
1-1-4. Corpus-based machine translation
Statistical machine translation
Example-based machine translation
Text-based machine translation
1-2. The need for building parallel corpora
1-3. Research problem: building parallel corpora
1-4. Research goal: building a parallel corpus from a comparative corpus
1-5. Thesis outline
1-5-1. Chapter two: theoretical foundations
1-5-2. Chapter three: overview of related work
1-5-3. Chapter four: proposed model
1-5-4. Chapter five: evaluation and conclusion

2. Theoretical foundations
2-1. Corpus
2-1-1. Parallel corpus
2-1-2. Comparative corpus
2-2. Alignment
2-2-1. Document-level alignment
2-2-2. Sentence-level alignment
2-2-3. Word-level alignment (lexical alignment)
Lexical alignment using IBM models
2-3. Evaluation of machine translation
2-3-1. BLEU
2-3-2. NIST metric
2-3-3. Word error rate
2-3-4. Translation error rate (TER)

3. Overview of related work
3-1. Introduction
3-2. Building a parallel corpus from mutually translated texts
3-3. Extracting parallel sentences from the web
3-4. Extracting parallel sentences from comparative corpora
3-5. Identifying parallel sentences with a maximum entropy classifier
3-6. Building English-Persian parallel corpora

4. The proposed model
4-1. Introduction
4-2. Selecting candidate parallel sentence pairs
4-2-1. Common-words filter
Converting the character encoding
Detecting sentence and word boundaries
Stemming
Removing frequently used words
Disambiguation
Looking up meanings in the dictionary
Grouping the repeated words of a sentence with their number of occurrences
Algorithm for computing the common-word rate (from the source)
4-3. Selecting parallel sentence pairs from the candidate pairs
4-3-1. Maximum entropy classifier
4-3-2. General features
Features based on the lengths of the two sentences
Common-word rate
4-3-3. Features based on the word-level alignment of a sentence pair
Unaligned words
Fertility
Contiguous span
Alignment score
4-4. Increasing the precision of the extracted parallel sentence pairs
4-5. Model evaluation method

5. Evaluation and conclusion
5-1. Evaluation of the maximum entropy classifier
5-1-1. Evaluation of features
5-1-2. Domain sensitivity
5-2. Settings and experiments for building a parallel corpus from a comparative corpus
5-2-1. Comparative corpora used
University of Tehran Persian-English Comparable Corpus (UTPECC)
Comparative corpus built from Wikipedia articles
5-2-2. Parameter settings and tools used
Selecting candidate sentence pairs
Selecting parallel sentence pairs
Increasing the precision of the extracted sentence pairs
5-2-3. Evaluating the extracted parallel sentences with a machine translation system
5-3. Conclusion
5-4. Future work

References:

[1] S. Tripathi and J. K. Sarkhel, "Approaches to machine translation", Annals of Library and Information Studies, vol. 57, pp. 388-393, December 2010.

[2] A. Lopez, "Statistical machine translation", ACM Computing Surveys, vol. 40, no. 3, pp. 1-49, 2008.

[3] P. F. Brown, J. Cocke, S. A. Della-Pietra, V. J. Della-Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer and P. S. Roossin, "A statistical approach to machine translation", Comput Linguist, vol. 16, no. 2, pp. 79-85, 1990.

[4] F. J. Och and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation", in 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pp. 295-302, 2002.

[5] P. Koehn, "Europarl: a parallel corpus for statistical machine translation", in MT Summit X: the tenth machine translation summit, Phuket, Thailand, pp. 79-86, 2005.

[6] M. Mohaghegh, A. Sarrafzadeh and T. Moir, "Improved Language Modeling for English-Persian Statistical Machine Translation", Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation (COLING 2010), Beijing, pp. 75-82, August 2010.

[7] Supreme Council of Information and Communication Technology. (2013). Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

[8] A. Mansouri and H. Faili, "State-of-the-art English to Persian Statistical Machine Translation System", in 16th CSI International Symposium on Artificial Intelligence and Signal Processing, pp. 174-179. IEEE, Fars, 2012.

[9] T. Ishisaka, K. Yamamoto, M. Utiyama and E. Sumita, "Development of a Japanese-English software manual parallel corpus", MT Summit XII: proceedings of the twelfth machine translation summit, Ottawa, ON, Canada, pp. 254-259, 2009.

[10] M. T. Pilevar, A. H. Pilevar and H. Faili, "TEP: Tehran English-Persian Parallel Corpus", in: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. LNCS, vol. 6609, pp. 68-79. Springer, Heidelberg, 2011.

[11] F. Jabbari, S. Bakhshaei, S. M. Mohammadzadeh Ziabary and S. Khadivi, "Developing an Open-domain English-Farsi Translation System Using AFEC: Amirkabir Bilingual Farsi-English Corpus", Fourth Workshop on Computational Approaches to Arabic-Script-based Languages (AMTA 2012), San Diego, CA, USA, November 2012.

[12] J. Nie, M. Simard, P. Isabelle and R. Durand, "Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web", Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR '99), Berkeley, CA, pp. 74-81, 1999.

[13] P. Resnik and N. A. Smith, "The web as a parallel corpus", Comput Linguist, vol. 29, no. 3, pp. 349-380, 2003.

[14] Y. Zhang, K. Wu, J. Gao, and P. Vines, "Automatic acquisition of Chinese-English parallel corpus from the Web", Proceedings of 28th European Conference on Information Retrieval, pp. 420-431. Lecture Notes in Computer Science, vol. 3936, Springer, January 2006.

[15] D. W. Oard, "Alternative approaches for cross-language text retrieval", in AAAI symposium on cross-language text and speech retrieval, Stanford, CA, USA, pp. 154-162, 1997.

[16] J. Tiedemann, "Parallel Data, Tools and Interfaces in OPUS", in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012), 2012.

[17] R. Zajac, S. Helmreich and K. Megerdoomian, "Black-Box/Glass-Box Evaluation in Shiraz", Workshop on Machine Translation Evaluation at LREC-2000, Athens, Greece, 2000.

[18] R. S. Belvin, W. May, S. Narayanan, P. Georgiou and S. Ganjavi, "Creation of a Doctor-Patient Dialogue Corpus Using Standardized Patients", International Conference on Language Resources and Evaluation (LREC), 2004.

[19] B. Qasemizadeh and S. Rahimi, "The First Parallel Multilingual Corpus of Persian: Toward a Persian BLARK", the second workshop on Computational Approaches to Arabic Script-based Languages (CAASL-2), California, USA, 2007.

[20] M. Mohaghegh and A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian Statistical Machine translation", 10th International Conference on the Statistical Analysis of Textual Data (JADT2010), Rome, Italy, 2010.

[21] M. A. Farajian, "Pen: Parallel English-Persian News Corpus", Proceedings of the 2011 World Congress in Computer Science, Computer Engineering and Applied Computing, 2011.

[22] F. Jabbari, S. Bakhshaei, S. M.
