Presenting a feature-based model to analyze the sentiment in texts

Number of pages: 74 File Format: word File Code: 31019
Year: 2014 University Degree: Master's degree Category: Computer Engineering
  • Summary of Presenting a feature-based model to analyze the sentiment in texts

    Master's Thesis in Computer Engineering (Software)

    Chapter 1: Preface

    1-1- Introduction

    Some authors define data mining as a tool for finding useful information in large amounts of data. The data mining process draws on several research fields, including databases, machine learning and statistics. Databases are essential for analyzing large amounts of data. Machine learning is an area of artificial intelligence that develops techniques allowing computers to learn by analyzing data sets; most of these methods focus on symbolic data. Statistics deals with the analysis of empirical data and rests on statistical theory, in which uncertainty and randomness are modeled by probability theory. Today, many statistical methods are used in data mining. Text mining, in turn, uses techniques from information retrieval, information extraction and natural language processing, and connects them with the algorithms and methods of data mining, machine learning and statistics. Each research area defines text mining somewhat differently. Some of these definitions are given below:

    Text mining = information extraction: In this definition, text mining is equated with information extraction (extracting facts from text).

    Text mining = textual data discovery: Text mining can be defined as the application of methods and algorithms from machine learning and statistics to texts, with the aim of finding useful patterns. For this purpose, the texts must first be pre-processed. Many approaches use information extraction methods, natural language processing or some simple pre-processing to extract data from the texts; data mining algorithms can then be applied to the extracted data.

    Text mining = knowledge extraction process: This definition is fully explained in the previous section and will not be described here.

    In this research, we mostly treat text mining as textual data discovery and focus on methods for extracting useful patterns from text, in order to categorize text collections or extract useful information.

    In today's world, the problem is not a lack of information but a lack of the knowledge that can be obtained from it. Millions of web pages, millions of words in digital libraries and thousands of pages of information in every company are just a few of these information sources. Yet none of them can be singled out as a source of knowledge in itself. Knowledge is a distillation of information: the conclusion and result of thinking about and analyzing it.
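    The textual-data-discovery view described above (pre-process the texts, extract data from them, then apply data mining algorithms) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the stop-word list, the function names and the sample sentences are assumptions made only for illustration. The sketch turns each text into counts of unigrams and bigrams, a representation that a standard classifier could then consume.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; real systems use larger lists.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def ngram_features(tokens, n_max=2):
    """Count every n-gram of length 1..n_max in a list of tokens."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Illustrative documents (not taken from the thesis data set).
docs = [
    "The camera is great and the battery is great.",
    "The battery was terrible.",
]
features = [ngram_features(preprocess(d)) for d in docs]

for doc, feats in zip(docs, features):
    print(doc, dict(feats))
```

    Each document becomes a sparse mapping from n-gram to frequency; stacking these mappings over a shared vocabulary gives the structured table that data mining algorithms expect.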

    Data mining is a very efficient way to discover information in structured data stored in tables. It extracts patterns from transactions and groups and categorizes data, and it lets us find relationships between the data items that fill a database. Data mining has one limitation, however: it is not universally applicable. Most of our knowledge is completely unstructured, when it is digital at all. Digital libraries, news, e-books, many financial documents, scientific articles and almost anything else found on the web are unstructured, so data mining techniques cannot be applied to them directly. There are, however, three basic approaches to this vast amount of unstructured information: information retrieval, information extraction and natural language processing.

    Information retrieval: This is essentially concerned with retrieving documents. The usual task in information retrieval is to pull the most relevant texts and documents (or, in effect, words) out of a collection, according to the need expressed by the user. It does not find knowledge; it only hands over the pieces of text that it judges most relevant to the searcher's information need. In itself, this method gives us neither knowledge nor even information.

    Natural language processing: The general goal of natural language processing is to give computers a better understanding of natural language. Simple, robust techniques are used for fast text processing, and linguistic analysis techniques are also used to process the text.
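    The retrieval task described above, returning the documents most relevant to a user's query, can be sketched with a simple TF-IDF score. This is a generic textbook weighting chosen as an assumption for illustration; the thesis does not specify a retrieval model, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rank(query, documents):
    """Return document indices ordered from most to least relevant to the query."""
    doc_counts = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    scored = []
    for idx, counts in enumerate(doc_counts):
        score = 0.0
        for term in set(tokenize(query)):
            # Document frequency: number of documents containing the term.
            df = sum(1 for c in doc_counts if term in c)
            if df:
                idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
                score += counts[term] * idf
        scored.append((score, idx))
    # Highest score first; ties keep the lower index first.
    return [idx for _, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Illustrative collection (not taken from the thesis).
documents = [
    "Stock market report for the technology sector",
    "Customer review of a digital camera",
    "Camera prices on the stock market",
]
order = rank("camera review", documents)
print(order)  # [1, 2, 0]
```

    The output is only an ordering of documents, which matches the point made in the text: retrieval hands over relevant material but leaves the extraction of knowledge to later steps.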

    Information extraction: The goal of information extraction methods is to extract specific information from text documents. Information extraction can serve as a pre-processing phase in text mining. It maps natural language texts (e.g. reports, journal articles, newspapers, e-mails, web pages, any text database) to a predefined structured representation, or template, which, when filled, presents a selection of the key information in the original text. Once extracted, the information can be stored in a database for future use.

    Today, given the large amount of textual information available, text mining is a method of particular importance in both research and commerce. Commercial companies, producers of goods, service providers and politicians can all use text mining to obtain useful knowledge as feedback on their goods, services and performance. Among the applications of text mining, the following can be mentioned:

    1. Spam identification: The title and content of a received e-mail are analyzed to determine whether the e-mail is spam.

    2. Surveillance: This means secretly monitoring the behavior of a person or a group of people. A project called ENCODA monitors telephone, internet and other means of communication in order to identify terrorism.

    3. Pseudonym detection: Pseudonyms in medical care are analyzed to identify fraud. For example, an invoice may be submitted under the names John Smith, J. Smith and Smith, John. By this or other means, claimants can abuse the system and collect on many insurance claims under different aliases. Using text mining to detect these aliases can greatly help insurance companies uncover fraud.

    4. Summarization: Summarization is the process of extracting a set of core concepts from a text and presenting them in just a few lines. This makes it easier for users to check the contents of documents and helps them reach what they need more quickly.

    5. Relationships between concepts: Among the facts that can be obtained from a set of texts are the connections and dependencies between concepts. Such facts can indicate, for example, that the appearance of some words depends on the appearance of others, so that whenever we see the first set of words we can expect to see the second set as well. This idea is also borrowed from data mining in databases.

    6. Behavior analysis: To illustrate this application, assume that you are the manager of a company. Obviously, you should always monitor the activities of your competitors. The relevant information may come from the news, from stock market transactions or from documents produced by the competitor itself. Because information is growing exponentially, managing all these data sources by manual reading alone is not feasible. Text mining allows new behaviors and changes to be found automatically. What should be expected from text mining is that it tells you which items in a stream of news are related to your interests, which of them are new, what developments are taking place in your field, and what the current interests and behaviors are and how they are changing. With this information, managers can assess a competitor's situation and profit from what has been discovered.

    7. Sentiment analysis: In this application, the purpose of text mining is to identify the author's sentiment, that is, the degree of the author's satisfaction or dissatisfaction. This thesis studies text mining for the purpose of sentiment analysis in texts, so sentiment analysis is examined in more detail below.
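    Of the applications listed above, pseudonym detection lends itself to a short illustration. The Python sketch below groups the name variants from the example (John Smith, J. Smith and Smith, John) under one key built from the surname and the first initial. The keying heuristic and the function names are assumptions for illustration, not a method taken from the thesis, and the heuristic would also merge different people who share a surname and initial.

```python
from collections import defaultdict

def alias_key(name):
    """Reduce a personal name to a (surname, first initial) key."""
    name = name.strip()
    if "," in name:
        # "Smith, John": the surname comes first, given names follow the comma.
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        # "John Smith" or "J. Smith": the last word is taken as the surname.
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.lower(), given[:1].lower())

def group_aliases(names):
    """Group non-empty names that share the same alias key."""
    groups = defaultdict(list)
    for name in names:
        if name.strip():
            groups[alias_key(name)].append(name)
    return dict(groups)

# Illustrative claimant names; the first three come from the example in the text.
claims = ["John Smith", "J. Smith", "Smith, John", "Mary Jones"]
groups = group_aliases(claims)
print(groups[("smith", "j")])  # ['John Smith', 'J. Smith', 'Smith, John']
```

    Groups containing more than one spelling are candidates for manual review rather than proof of fraud.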

    All textual information can be classified into two categories: facts and opinions. Facts are objective statements about entities, events and their characteristics, describing what really exists or has happened in the outside world. Opinions are subjective expressions that convey people's views, evaluations or feelings about an entity, an event and their characteristics [23].
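    The distinction between facts and opinions can be made concrete with a toy lexicon-based check. In the Python sketch below, a sentence containing any word from a small polarity lexicon is labeled an opinion and its polarity values are summed; otherwise it is labeled a fact. The lexicon, the labeling rule and the sample sentences are hypothetical and far simpler than the feature-based model the thesis proposes.

```python
import re

# Hypothetical, tiny polarity lexicon for illustration; not taken from the thesis.
POLARITY = {
    "great": 1, "excellent": 1, "love": 1,
    "poor": -1, "terrible": -1, "hate": -1,
}

def analyze(sentence):
    """Label a sentence 'fact' or 'opinion' and return its summed polarity."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = [POLARITY[t] for t in tokens if t in POLARITY]
    label = "opinion" if hits else "fact"
    return label, sum(hits)

print(analyze("The phone has a 5 inch screen."))      # ('fact', 0)
print(analyze("The screen is great and I love it."))  # ('opinion', 2)
print(analyze("The battery is terrible."))            # ('opinion', -1)
```

    A rule this simple misreads negation ("not great") and opinions expressed without lexicon words, which is one reason learned feature-based approaches are used in practice.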

  • Contents & References of Presenting a feature-based model to analyze the sentiment in texts

    Chapter 1: Preface

    1-1- Introduction

    1-3- Sentiment analysis in text

    1-4- Objectives of the thesis

    1-5- Method of work

    1-6- Thesis structure

    Chapter 2: Related work

    2-1- Introduction

    2-2- Definition of the problem

    2-3- The first step of sentiment analysis in text

    2-4- Methods based on N-gram features

    2-5- Feature selection algorithms

    Chapter 3: The proposed method

    3-1- Preface

    3-2- Required resources

    3-3- The first proposed method

    3-3-1. Pre-processing of documents

    3-3-2. Part-of-speech tagging

    3-3-3. Feature vector extraction and feature combination

    3-3-4. Applying the feature selection algorithm

    3-4- The second proposed method

    3-5- The third proposed method

    3-5-1. Word polarity extraction and feature vector filtering

    Chapter 4: Implementation and results

    4-1- Introduction

    4-2- Data collection

    4-3- Data classification

    4-4- Results of the first method

    4-5- Results of the second method

    4-6- Results of the third method

    4-7- Comparison of the proposed method with previous methods

    4-8- Results of applying the proposed method to the Persian language

    4-9- Future work

    References and sources

     

    References:

    [1] A. Abbasi, S. France, Z. Zhang, H. Chen, "Selecting Attributes for Sentiment Classification Using Feature Relation Networks," IEEE Trans. Knowledge and Data Eng., vol. 23, pp. 447-462, 2011.

    [2] A. Abbasi, H. Chen, A. Salem, "Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums," ACM Trans. Information Systems, vol. 26, no. 3, article no. 12, 2008.

    [3] A. Abbasi, H. Chen, S. Thoms, T. Fu, "Affect Analysis of Web Forums and Blogs Using Correlation Ensembles," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 9, pp. 1168-1180, Sept. 2008.

    [4] B. Pang, L. Lee, S. Vaithyanathan, "Thumbs up? Sentiment Classification Using Machine Learning Techniques," Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), pp. 79-86, 2002.

    [5] B. Agarwal, N. Mittal, "Optimal Feature Selection Methods for Sentiment Analysis," 14th Int'l Conf. Intelligent Text Processing and Computational Linguistics, vol. 7817, pp. 13-24, 2013.

    [6] C.E. Shannon, "A Mathematical Theory of Communication," Bell Systems Technical J., vol. 27, no. 10, pp. 379-423, 1948.

    [7] C. Priyanka, G. Deepa, "Identifying the Best Feature Combination for Sentiment Analysis of Customer Reviews," Int'l Conf. Advances in Computing, Communications and Informatics (ICACCI), India, pp. 102-108, Aug. 2013.

    [8] C.E. Shannon, "A Mathematical Theory of Communication," Bell Systems Technical J., vol. 27, no. 10, pp. 379-423, 1948.

    [9] A. Esuli, F. Sebastiani, "SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining," Proc. 5th Conf. Language Resources and Evaluation (LREC'06), pp. 417-422, 2006.

    [10] E. Riloff, S. Patwardhan, J. Wiebe, "Feature Subsumption for Opinion Analysis," Proc. Conf. Empirical Methods in Natural Language Processing, pp. 440-448, 2006.

    [11] J.R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.

    [12] J. Wiebe, T. Wilson, R. Bruce, M. Bell, M. Martin, "Learning Subjective Language," Computational Linguistics, vol. 30, no. 3, pp. 277-308, 2004.

    [13] J. Blitzer, M. Dredze, F. Pereira, "Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification," Proc. Association for Computational Linguistics (ACL), pp. 440-447, 2007.

    [14] J. Yi, T. Nasukawa, R. Bunescu, W. Niblack, "Sentiment Analyzer: Extracting Sentiments about a Given Topic Using Natural Language Processing Techniques," Proc. Third IEEE Int'l Conf. Data Mining, pp. 427-434, 2003.

    [15] J.R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.

    [16] K. Tsutsumi, K. Shimada, T. Endo, "Movie Review Classification Based on Multiple Classifier," Proc. 21st Pacific Asia Conf. Language, Information, and Computation, pp. 481-488, 2007.

    [17] B. Liu, L. Zhang, "Mining Text Data," Springer, USA, 2012.

    [18] L. Yu, H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," Proc. 20th Int'l Conf. Machine Learning, pp. 856-863, 2003.

    [19] L. Yu, H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy," J. Machine Learning Research, vol. 5, pp. 1205-1224, 2004.

    [20] M. Gamon, "Sentiment Classification on Customer Feedback Data: Noisy Data, Large Feature Vectors, and the Role of Linguistic Analysis," Proc. 20th Int'l Conf. Computational Linguistics, pp. 841-847, 2004.

    [21] M. Hall, L.A. Smith, "Feature Subset Selection: A Correlation Based Filter Approach," Proc. Fourth Int'l Conf. Neural Information Processing and Intelligent Information Systems, pp. 855-858, 1997.

    [22] M. Ghiassi, J. Skinner, D. Zimbra, "Twitter Brand Sentiment Analysis: A Hybrid System Using N-gram Analysis and Dynamic Artificial Neural Network," Expert Systems with Applications, vol. 40, pp. 6266-6282, 2013.

    [23] B. Pang, L. Lee, "Opinion Mining and Sentiment Analysis," Foundations and Trends in Information Retrieval, vol. 2, nos. 1-2, pp. 1-135, 2008.

    [24] T. Zhang, D. Tao, X. Li, J. Yang, "Patch Alignment for Dimensionality Reduction," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 9, pp. 1299-1313, Sept. 2009.

    [25] V. Ng, S. Dasgupta, S.M.N. Arifin, "Examining the Role of Linguistic Knowledge Sources in the Automatic Identification and Classification of Reviews," Proc. Conf. Computational Linguistics, Assoc. for Computational Linguistics, pp. 611-618, 2006.

    [26] Z. Fei, J. Liu, G. Wu, "Sentiment Classification Using Phrase Patterns," Proc. Fourth IEEE Int'l Conf. Computer Information Technology, pp. 1147-1152, 2004.

    [27] WEKA, Open Source Machine Learning Software Weka, http://www.cs.waikato.ac.
