Multi-Class Sentiment Analysis using LSTM-CNN Network
✅ Paper Type: Free Essay | ✅ Subject: Computer Science |
✅ Wordcount: 5883 words | ✅ Published: 18th May 2020 |
Abstract: In the data-driven era, understanding customer feedback plays a vital role in improving the performance and efficiency of a product or system. Sentiment analysis is central to this task, especially when the feedback arrives as Big Data: it allows a machine to interpret what a person has conveyed in a text and to identify the sentiment expressed towards a particular product or service. Because the data are voluminous, machine learning techniques are needed to extract that sentiment. This paper focuses on fine-grained classification, using multiple sentiment classes and deep learning methods to give a more accurate picture of the sentiment in each piece of feedback.
Keywords: Natural Language Processing, Sentiment Analysis, Twitter, RNN, Neural Network
I. INTRODUCTION
As we move into the era of Web 2.0, reviews and feedback are a pivotal factor in understanding the performance and reliability of a product. Understanding the sentiment of these reviews therefore plays an important role, but because feedback is subjective, extracting the relevant information becomes a tedious process, especially when dealing with Big Data. The main purpose of sentiment analysis is to extract the attitude of the user towards a particular product or service using various algorithms and text mining techniques. In essence, sentiment analysis breaks the text into clusters based on particular sentiment words and aggregates a sentiment score from the polarity of each opinion. For example, words like “great” and “bad” express positive and negative opinions respectively. There can also be several levels of sentiment within a cluster: “great” and “excellent” both express a positive opinion, but “excellent” indicates a higher level of positivity than “great”. Words like “very” placed before a sentiment word also change the intensity of that word. When classes are subdivided into subclasses, accuracy tends to degrade significantly. So far, most work has revolved around models based on the bag-of-words concept. In this paper we analyse sentiment at the word level. By making the model learn sentiment at the word level, we can significantly improve the performance and sensitivity of the model.
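The lexicon-style scoring described in this paragraph can be illustrated with a minimal sketch. The lexicon entries, the scores and the intensifier weights are hypothetical values chosen for illustration; they are not taken from any resource used in this paper.

```python
# Hypothetical mini-lexicon: words and scores are illustrative assumptions.
LEXICON = {"great": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0}
# Hypothetical intensifiers that scale the sentiment word that follows them.
INTENSIFIERS = {"very": 1.5, "really": 1.5}


def lexicon_score(text):
    """Sum the polarity of known sentiment words, scaling each word by the
    intensifier that immediately precedes it (if any)."""
    score = 0.0
    boost = 1.0
    for token in text.lower().split():
        if token in INTENSIFIERS:
            boost = INTENSIFIERS[token]
            continue
        score += boost * LEXICON.get(token, 0.0)
        boost = 1.0  # an intensifier only affects the next token
    return score


print(lexicon_score("the acting was great"))           # 1.0
print(lexicon_score("the acting was very great"))      # 1.5
print(lexicon_score("excellent plot but bad ending"))  # 1.0
```

The last call shows why a single aggregated score hides mixed opinions: one strongly positive and one negative word collapse into a mildly positive total.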
In this paper, we further discuss fine-graining the classes and improving the sentiment analysis of text. We discuss the various machine learning algorithms used in sentiment analysis, as well as the hybrid approach that fuses the conventional lexicon approach with machine learning. The CNN-LSTM model is taken as the baseline for comparison with an LSTM-CNN model. The model is trained with a GloVe embedding layer and with a standard embedding layer learned from the input corpus. Word embedding layers convert words to vectors, which improves the efficiency of the system. GloVe (Global Vectors), which represents words based on their co-occurrence statistics, is used to obtain the word embeddings.
II. Related Works
Sentiment analysis is a popular area of current research, and scholars around the globe have carried out many studies on it. Tetsuya Nasukawa and Jeonghee Yi [1] discussed how classifying a whole document as positive or negative can be a poor approach when the opinion subject and the document subject are assumed to be the same. They proposed an alternative approach that determines the sentiment for a given subject by identifying the text fragments that express sentiment about that subject, rather than analyzing the whole document for polarity. Janyce Wiebe criticized subjective sentiment classification at the sentence or document level. Identifying only opinionated sentences is not sufficient, as two or more similar or contrasting opinions can be presented in a single sentence. Wiebe therefore proposed natural language systems that extract the text relevant to a particular sentiment, thereby distinguishing between credible and non-credible information. Flame detection systems identify intense tirades and emotional rants but ignore milder opinions. Wiebe proposed a fine-grained annotation scheme in which annotation is performed at the word or phrase level rather than at the document or sentence level [2].
Pang, Lee and Vaithyanathan [3] used Naïve Bayes, maximum entropy and support vector machines (SVM) for sentiment analysis. Interestingly, the SVM performed better than the other methods.
They used the IMDB database to study the sentiment of movie reviews, with positive, negative and neutral as the three sentiment classes. Unigrams were found to improve the efficiency of the model. In [4], Gamon used a linear SVM with a large feature set, which was then reduced according to the significance of each feature. He showed that analyzing noisy data results in poor predictions, and therefore proposed identifying the important features in the noisy data together with their sentiment polarity and applying feature reduction to optimize the performance of the model. Ruchika Aggarwal and Latika Gupta [5] proposed a hybrid approach, discussing how the lexicon-based approach and machine learning can be fused. This overcomes the difficulties of hand-crafted rules in the lexicon approach and produces soft decisions that would otherwise be extremely difficult, error-prone and time-consuming for humans. Anwar Alnawas and Nursal Arici [6] discussed the various approaches applied in sentiment analysis. Broadly, sentiment analysis can be done with either a linguistic approach or a machine learning approach. The linguistic approach is further divided into lexicon-based and corpus-based methods. It requires the creation of a dictionary using SentiWordNet. This dictionary contains opinion words that are matched against the words present in a review, and a polarity score is assigned based on the opinion. In [7] the authors illustrate the various algorithms used in sentiment analysis. The machine learning approach is based on analyzing the data and recognizing the patterns in it. However, this method is time-consuming and requires a large quantity of data. The hybrid approach combines both of the earlier methods and achieves good accuracy, since it has the supervised learning of the machine learning approach and the stability of the lexicon method. Haowei Zhang, Jin Wang et al. proposed a machine learning approach to classify Twitter data. They used an LSTM model within a recurrent neural network. The performance of this system was exceptional, and the model can be further improved by using different word embedding models and other neural networks [8].
Figure 1
III. Background
Figure 2
A. Long Short-Term Memory (LSTM)
In designing the LSTM architecture, Hochreiter [23] created a cell that can preserve memory states over long periods of time and regulates the flow of information into and out of the cell through non-linear gates. This mechanism mitigates the vanishing gradient problem of recurrent neural networks and enables the cell to remember long-distance dependencies in sequential data effectively [25].
Variable-length sentences are transformed into fixed-length vectors by vectorization and then fed to a Long Short-Term Memory cell recursively. For each input word $x_t$ of the sentence at time step $t$, the LSTM cell with memory dimension $l$ defines six quantities: the input gate $i_t$, forget gate $f_t$, output gate $o_t$, tanh layer $u_t$, memory cell $c_t$ and hidden state $h_t$, as follows (citing Tai [24]):

$$i_t = \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\right)$$
$$f_t = \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\right)$$
$$o_t = \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\right)$$
$$u_t = \tanh\left(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\right)$$
$$c_t = i_t \odot u_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$

Here $\odot$ and $\sigma$ are the element-wise multiplication and sigmoid operators, $W$ and $U$ are the weight matrices, and $b$ is the bias vector. The forget gate decides whether to remember or forget the previous information in the memory cell. The output gate decides how much information is exposed to the output. Figure 1 illustrates the LSTM architecture for capturing sentiment information over sequential vectors.
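As a minimal sketch of a single LSTM time step in the standard formulation (input, forget and output gates, a tanh candidate layer, the memory cell and the hidden state), the plain-Python code below may help. The function names, the `params` layout and the all-zero toy weights are assumptions made for illustration; a real model learns the weights during training.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(a) for a in affine(W, x, U, h_prev, b)]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    u = gate("u", math.tanh)  # tanh (candidate) layer
    # Memory cell: new candidate information plus what the forget gate keeps.
    c = [i_k * u_k + f_k * c_k for i_k, u_k, f_k, c_k in zip(i, u, f, c_prev)]
    # Hidden state: the part of the memory the output gate exposes.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Toy setup (assumption): input dimension 3, memory dimension 2, zero weights.
params = {name: (zeros(2, 3), zeros(2, 2), [0.0, 0.0]) for name in "ifou"}
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], params)
print(c)  # [0.5, -0.5]
```

With zero weights every gate evaluates to 0.5 and the candidate to 0, so the cell simply keeps half of its previous memory, which makes the role of the forget gate easy to see.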
Figure 3
Figure 2 shows the internal architecture of an LSTM cell. The LSTM cell remembers data from previous time steps and produces its result at the output gate from the overall information. For example, a sentence like “This movie is really good but the climax was bad” contains conflicting sentiments, which lead to an ambiguous result when given to a feed-forward network. The LSTM, however, is able to remember the sentiments across the whole review.
B. Convolutional Neural Network
A sentence of length $s$ is represented by a $d \times s$ matrix $S$, where $d$ is the dimension of the embedding. The convolutional neural network operates on this matrix through linear filters. A filter is represented by a weight matrix $W$ of length $d$ and size $h$, so $W$ has $d \times h$ parameters. With $\ast$ denoting the convolution operator, the feature map $O$ is given by

$$O_i = W \ast S_{i:i+h-1}$$

Here $i = 0, 1, 2, \ldots, s-h$, and $S_{i:i+h-1}$ denotes columns $i$ to $i+h-1$ of $S$. The feature map $O$ is passed to a pooling layer to extract the important features. Each max pooling layer captures the most important feature $v$ of the current feature map by choosing its highest value:

$$v = \max_{0 \le i \le s-h} O_i$$

Figure 3 explains the detailed process of the CNN, which uses multiple filters with varying filter sizes $h$ to obtain multiple max pooling values. The CNN features are obtained by concatenating the max pooling values. A fully connected (dense) layer connects these features and produces a high-level feature. The sigmoid activation is used to generate the probability distribution over the sentiment classes.
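The filter-then-pool step can be sketched in plain Python as follows. This is an illustration under simplifying assumptions: no bias or non-linearity is applied, and the toy sentence matrix and filter values are made up.

```python
def conv_feature_map(S, W):
    """Slide a d x h filter W over a d x s sentence matrix S and return the
    feature map values for positions i = 0 .. s - h."""
    d, s, h = len(S), len(S[0]), len(W[0])
    return [
        sum(W[r][k] * S[r][i + k] for r in range(d) for k in range(h))
        for i in range(s - h + 1)
    ]


def max_over_time(feature_map):
    """Keep only the strongest response of one filter."""
    return max(feature_map)


# Toy sentence matrix (assumption): embedding dimension d = 2, length s = 4.
S = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
# Two hypothetical filters with sizes h = 2 and h = 3.
W_a = [[1.0, 1.0],
       [0.0, 1.0]]
W_b = [[1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0]]

print(conv_feature_map(S, W_a))  # [2.0, 3.0, 6.0]
print(conv_feature_map(S, W_b))  # [2.0, 3.0]

# Concatenating one pooled value per filter gives the CNN feature vector.
features = [max_over_time(conv_feature_map(S, W)) for W in (W_a, W_b)]
print(features)  # [6.0, 3.0]
```

Note that a filter of size h yields s - h + 1 positions, so filters of different sizes produce feature maps of different lengths; max pooling reduces each of them to a single value, which is what makes concatenation possible.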
C. Regularization
A dropout layer is used for regularization on the penultimate layer, together with a constraint on the l2-norms of the weight vectors. The random dropping performed by the dropout layer prevents co-adaptation of the hidden units. The output $y$ is given by

$$y = w \cdot (z \circ r)$$

where $\circ$ is the element-wise multiplication operator, $r$ is the masking vector, $z$ is the output of the penultimate layer and $w$ is the weight vector.
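A small sketch of the masking step, assuming the mask entries are drawn independently and a unit is kept with probability `keep_prob` (the name and the sampling scheme are assumptions for illustration; no bias term is included):

```python
import random


def dropout_output(w, z, keep_prob, rng):
    """Compute the dot product of w with the masked vector z * r, where each
    entry of the mask r is 1 with probability keep_prob and 0 otherwise."""
    r = [1.0 if rng.random() < keep_prob else 0.0 for _ in z]
    y = sum(w_k * z_k * r_k for w_k, z_k, r_k in zip(w, z, r))
    return y, r


rng = random.Random(0)  # seeded so the sketch is repeatable
w = [1.0, 2.0, 3.0]
z = [0.5, 0.5, 0.5]

y_all, r_all = dropout_output(w, z, 1.0, rng)    # nothing is dropped
y_none, r_none = dropout_output(w, z, 0.0, rng)  # everything is dropped
print(y_all, r_all)    # 3.0 [1.0, 1.0, 1.0]
print(y_none, r_none)  # 0.0 [0.0, 0.0, 0.0]

y_half, r_half = dropout_output(w, z, 0.5, rng)  # a random subset survives
```

Because a different mask is drawn on every call, no hidden unit can rely on a specific other unit being present, which is the co-adaptation effect dropout is meant to prevent.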
Figure 4
- Research Plan
A. Problem Statement
Given a corpus of reviews, we aim to improve the precision of the classification and predict the sentiment accurately using an LSTM-CNN model. Increasing the number of classes decreases precision. Hence, in this paper I fine-tune the parameters to improve the efficiency.
B. Data
The data were collected from Rotten Tomatoes, a very well-known review website. The dataset consists of movie reviews with sentiment labels produced by Pang and Lee [1]. It contains more than 150,000 sentences with their sentiment scores. The sentiment classes are negative, somewhat negative, neutral, somewhat positive and positive.
Figure 5
In Figure 6 we can see the distribution of the sentiment classes. Class 0 is highly negative, 1 is negative, 2 is neutral, 3 is positive and 4 is highly positive. The distribution is roughly bell-shaped, with most sentences centred on class 2.
C. Baseline Model
The baseline chosen for comparison with our model is a Multinomial Naïve Bayes and support vector machine model [14]. The objective in choosing this baseline is to determine how the performance varies when the CNN and LSTM are used. Our LSTM-CNN model is designed to follow the CNN-LSTM model [13] closely, with differences in dimensions to suit the swapped configuration.
D. Approach
LSTM-CNN Model
The model can be separated into two parts: (i) feature representation, which encodes the raw data into vectors; and (ii) the classification layer, which tries to classify the labels based on the sentiments. Each model has its own mechanism for remembering and extracting features from the input vectors. In a CNN model, convolutional filters capture the features in the data, and the filter length determines how much data is extracted from the vectors at a time. In an LSTM model, however, the memory cell preserves information about the vectors for a certain period of time, so the final feature is based on all the vectors that pass through each individual LSTM cell.
The proposed model consists of an embedding layer and an LSTM layer, followed by a CNN with a max pooling layer and a dense layer that connects them all together and produces the output classification. The model layers are described below:
- Embedding layer
The embedding layer is the first layer the input is passed to in the model. It encodes each word into a high-dimensional vector space. I have used GloVe embeddings for the vector representation, since GloVe can capture both the global and the local statistics of the corpus. The pre-trained GloVe vectors are used to initialize the embedding weights. Every movie review is converted into a sequence of integers. We pad each sequence so that all the reviews have the same length, which avoids dimensionality issues.
- LSTM layer
The memory block of the LSTM cell contains the output and the various gates that control the output and manage how long memory is retained. The cell processes each vector and the sentiment associated with it and stores this for the next time step; through this process the overall sentiment of the sentence is determined. The output is passed to a dense layer and a dropout layer, which randomly drops units to avoid overfitting.
- Convolutional Layer
N filters are used to extract m-gram features from the vectors produced by the LSTM cell. A sliding window of width w means that w-gram features can be extracted. The output of the convolutional layer is sent to the max-pooling and dropout layers.
- Max-pooling and Dropout layer
The max pooling layer down-samples the output of the convolutional layer and consolidates the final features by taking the maximum value from the output of each filter. By discarding the smaller values, we decrease the computational load on the following layers. The highest values are also the most significant features for determining the sentiment.
To avoid overfitting, we include a dropout layer. The dropout layer randomly drops a certain proportion of the units passed to the next layers, reducing the chance of overfitting.
IV. Experiment and Evaluation
A. Data Pre-Processing
The dataset is provided by Rotten Tomatoes. It is a large dataset with 150,000 records, each annotated with its sentiment. The initial pre-processing involves removing punctuation. Stop words are not removed because they can change the sentiment of a sentence considerably. For example, the sentence “This is a good movie and I didn’t like it” would have a different meaning if the stop word “and” were removed.
After removing punctuation, the text is converted to lower case. This reduces the total number of distinct tokens and thus the computation. The next step tokenizes the words and converts them into sequences of integers. The words are thereby converted into vectors and, as mentioned earlier, we pad the sequences to make them all the same length.
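These pre-processing steps can be sketched with the standard library alone. The helper names, the two sample reviews and the choice to pad at the end of each sequence are assumptions for illustration; the Keras utilities used in practice may behave differently (for example, padding at the front).

```python
import string


def clean(text):
    """Strip ASCII punctuation and lower-case the text; stop words are kept."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower()


def build_vocab(texts):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for token in clean(text).split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode(text, vocab, max_len):
    """Convert a review to integer ids, truncated or zero-padded to max_len."""
    ids = [vocab[t] for t in clean(text).split() if t in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


reviews = ["A good movie!", "Not a good ending."]  # hypothetical samples
vocab = build_vocab(reviews)
print(vocab)  # {'a': 1, 'good': 2, 'movie': 3, 'not': 4, 'ending': 5}
print(encode(reviews[0], vocab, 6))  # [1, 2, 3, 0, 0, 0]
print(encode(reviews[1], vocab, 6))  # [4, 1, 2, 5, 0, 0]
```

Lower-casing merges "A" and "a" into one vocabulary entry, which is the reduction in distinct tokens mentioned above.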
We use the pre-trained GloVe vectors to set the weights of the embedding layer.
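One way to turn pre-trained vectors into embedding weights is sketched below. GloVe text files store one word per line followed by its vector components; here a few made-up 3-dimensional lines stand in for the real file, and the vocabulary is hypothetical.

```python
def parse_glove(lines):
    """Parse GloVe text format: a word followed by its vector components."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    return vectors


def embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector of the word with id i. Row 0 (padding) and
    words missing from the pre-trained vectors stay at zero."""
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Made-up stand-in for lines read from a GloVe file (real vectors are longer).
sample_lines = ["good 0.1 0.2 0.3", "movie 0.4 0.5 0.6"]
toy_vocab = {"good": 1, "movie": 2, "climax": 3}  # hypothetical vocabulary

vectors = parse_glove(sample_lines)
matrix = embedding_matrix(toy_vocab, vectors, dim=3)
for row in matrix:
    print(row)
```

The resulting matrix has one row per vocabulary id plus the padding row, so it can be passed as the initial weights of an embedding layer whose input dimension is `len(vocab) + 1`.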
B. Implementation
The model was implemented in Python using the Keras interface, which runs on top of TensorFlow as the backend. In this practicum I have used a bi-directional LSTM as the baseline model and the LSTM-CNN model to improve on its accuracy. The dataset is divided into training and test sets in an 80:20 proportion. The best-performing parameters are as follows: the number of epochs is set to 20 with a batch size of 128. The dimension of the LSTM cell is 100, matching the dimension of the embedding layer. The input fed into the cell has the shape (total number of inputs, length of each sequence, 1). The return-sequences option is set to true so that the LSTM outputs its hidden state at every time step, which the following layers require. The dense layer is also set to 100 dimensions.
The convolutional layer uses 32 filters with a filter size of three, and ReLU is used as the activation. The output of the CNN layer is sent to a dense layer whose dimension is set to the number of classes, which in our model is 5.
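A rough Keras sketch of the layer stack described above is given here for orientation; it is not the exact code used in the experiments. It requires TensorFlow to be installed. The dropout rate of 0.5, the softmax output, the Adam optimizer, the sparse categorical cross-entropy loss, the use of `GlobalMaxPooling1D` and the `vocab_size` and `max_len` values are all assumptions.

```python
from tensorflow.keras import Input, Model, layers


def build_lstm_cnn(vocab_size, max_len, embed_dim=100, num_classes=5):
    """Embedding -> LSTM -> Dense/Dropout -> Conv1D -> max pooling -> Dense."""
    inputs = Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    # return_sequences=True keeps one output per time step for the Conv1D layer.
    x = layers.LSTM(100, return_sequences=True)(x)
    x = layers.Dense(100)(x)
    x = layers.Dropout(0.5)(x)  # assumed rate
    x = layers.Conv1D(32, 3, activation="relu")(x)  # 32 filters of size 3
    x = layers.GlobalMaxPooling1D()(x)  # strongest response of each filter
    x = layers.Dropout(0.5)(x)  # assumed rate
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # assumed
    model = Model(inputs, outputs)
    model.compile(
        optimizer="adam",  # assumed optimizer
        loss="sparse_categorical_crossentropy",  # assumes integer class labels
        metrics=["accuracy"],
    )
    return model


model = build_lstm_cnn(vocab_size=1000, max_len=20)  # hypothetical sizes
model.summary()
```

Training with the settings reported above would then be a call along the lines of `model.fit(x_train, y_train, epochs=20, batch_size=128, validation_data=(x_test, y_test))`, where the array names are placeholders for the padded sequences and their labels.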
V. Results and Evaluation
The evaluation metric chosen for the model is Spearman’s correlation, and we also used the classification report from the sklearn toolkit to obtain the precision, recall and F1 score. The Spearman correlation coefficient measures the strength and direction (negative or positive) of the monotonic relationship between two variables. A positive relationship means that high ranks of one variable tend to go with high ranks of the other. Precision, accuracy, recall and F1 score are other good measures of the performance of the model.
Precision tells us what fraction of the instances predicted as belonging to a class actually belong to it. Recall tells us what fraction of the instances that actually belong to a class the model identifies. The F1 score is the harmonic mean of precision and recall, balancing the two.
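For a single class, these three scores can be computed from true and predicted labels as in the following sketch. The label lists are made-up toy data, and the function name is hypothetical; the sklearn classification report produces the same quantities for every class at once.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correct hits
    fp = sum(t != label and p == label for t, p in pairs)  # false alarms
    fn = sum(t == label and p != label for t, p in pairs)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (assumption): six reviews with classes 0 to 3.
y_true = [2, 2, 2, 1, 3, 0]
y_pred = [2, 2, 1, 1, 3, 1]

for label in (1, 2):
    p, r, f1 = per_class_metrics(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f1, 2))
# 1 0.33 1.0 0.5
# 2 1.0 0.67 0.8
```

Class 1 shows high recall with low precision: the model finds the only true class-1 review but also assigns that label to two reviews from other classes.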
| Class | Precision | Recall | F1 score |
| --- | --- | --- | --- |
| 0 | 0.02 | 0.74 | 0.03 |
| 1 | 0.64 | 0.50 | 0.56 |
| 2 | 0.78 | 0.75 | 0.76 |
| 3 | 0.63 | 0.55 | 0.59 |
| 4 | 0.01 | 0.02 | 0.20 |
| Avg/total | 0.71 | 0.64 | 0.67 |
Table 1: Model trained using GloVe
The Spearman correlation coefficient score for the model was 0.688.
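For reference, the Spearman coefficient is the Pearson correlation of the ranks of the two variables. The sketch below computes it with the standard library; the function names and the toy label lists are illustrative assumptions, and it does not reproduce the score reported above.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = mean_rank
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))


print(average_ranks([2, 2, 1, 3]))  # [2.5, 2.5, 1.0, 4.0]

# Toy true and predicted sentiment classes (assumption).
true_labels = [0, 1, 2, 3, 4, 2]
predicted = [1, 1, 2, 3, 3, 2]
print(round(spearman(true_labels, predicted), 3))
```

Because the sentiment classes are ordered from highly negative to highly positive, a rank-based measure rewards predictions that are close to the true class even when they are not exactly right, which plain accuracy does not.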
The accuracy of the model is given below:

| Dataset | Accuracy |
| --- | --- |
| Training Accuracy | 0.69 |
| Validation Accuracy | 0.65 |
Figure 6
Figure 9
Figure 7
Figures 7 and 8 show the loss and accuracy curves for the model trained using the GloVe embedding.
Below are the results of the model trained using its own embedding layer, learned with Keras and TensorFlow.
| Class | Precision | Recall | F1 score |
| --- | --- | --- | --- |
| 0 | 0.01 | 0.77 | 0.11 |
| 1 | 0.52 | 0.52 | 0.52 |
| 2 | 0.81 | 0.72 | 0.76 |
| 3 | 0.62 | 0.52 | 0.57 |
| 4 | 0.01 | 0.33 | 0.12 |
| Avg/total | 0.71 | 0.64 | 0.67 |
The Spearman correlation coefficient score for this model is 0.677.
Figure 8
Figures 9 and 10 show the performance of the model trained using its own embedding layer.
VI. Discussion and error analysis
We can infer that the model performs differently for different parameters, so choosing the right parameters requires a lot of trial and error. Grid search is a very useful technique for determining the best parameters. In [9] the authors fine-tuned the parameters using the sklearn grid search. Changing the order of the layers in the model also changes the performance significantly.
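The idea behind grid search is simply to evaluate every combination of candidate parameter values and keep the best one. A minimal sketch follows; `toy_score` is a hypothetical stand-in for training the model and returning its validation accuracy, and the parameter names and candidate values are assumptions.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid and return the
    best-scoring (params, score) pair."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at batch_size=128 and filters=32."""
    return -abs(params["batch_size"] - 128) - abs(params["filters"] - 32)


grid = {"batch_size": [64, 128, 256], "filters": [16, 32, 64]}
best, score = grid_search(grid, toy_score)
print(best)  # {'batch_size': 128, 'filters': 32}
```

The number of evaluations is the product of the list lengths (nine here), so with a real model each added parameter multiplies the training cost; this is why the grid is usually kept small.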
From Table 1 we can infer that the model has better recall for class 2. Looking at the precision, we can see that the model does not do well at distinguishing highly positive or highly negative labels; it is more likely to predict them as positive or negative. That is why the precision for class 0 and class 4 is low compared with class 1 and class 3. The performance of the model is better than that of the baseline model [14]. The model was successful in classifying the neutral, positive and negative sentiments and struggled with highly positive and highly negative ones, but it was still able to capture some of these sentiments accurately.
VII. CONCLUSION
This paper proposed an alternative, efficient approach to sentiment analysis with fine-grained classification using the LSTM-CNN model. The aim of the research was to build a sentiment classification and analysis model and implement it without reducing the accuracy and scores of the model. The research also provides insight into various word embedding models and shows how the efficiency changes with each. In future work I will attempt to train the model on other languages.
REFERENCES
[1] Tetsuya Nasukawa and Jeonghee Yi, “Sentiment Analysis: Capturing Favorability Using Natural Language Processing,” K-CAP ’03 Proceedings of the 2nd international conference on Knowledge capture, 2003.
[2] Janyce Wiebe, Theresa Wilson and Claire Cardie, “Annotating Expressions of Opinions and Emotions in Language,” Language Resources and Evaluation (2005) 39: 165. https://doi.org/10.1007/s10579-005-7880-9
[3] Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing, pp 79-86.
J. M. Soler, F. Cuartero, and M. Roblizo, “Twitter as a tool for predicting elections results,” in Proc. IEEE/ACM ASONAM, pp. 1194–1200, Aug. 2012.
[4] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas, “Short text classification in twitter to improve information filtering,” in Proc. 33rd Int. ACM SIGIR Conf. Research and development in information retrieval, pp. 841–842, July 2010.
[5] Ruchika Aggarwal, Latika Gupta, “A Hybrid Approach for Sentiment Analysis using Classification Algorithm,” in Proc. International Journal of Computer Science and Mobile Computing, pg.149 – 157, IJCSMC, Vol. 6, Issue. 6, June 2017.
[6] Anwar ALNAWAS, Nursal ARICI, “The Corpus Based Approach to Sentiment Analysis in Modern Standard Arabic and Arabic Dialects: A Literature Review,” in Proc. Journal of Polytechnic, 2018;21(2) pg- 461-470, Sept. 2018.
[7] Brian Keithl, Exequiel Fuentes and P Claudio Meneses, “Analyzing internet slang for sentiment mining,” in Proc. 2nd Vaagdevi Int. Conf. Inform. Technology for Real World Problems, pp. 9–11 Dec. 2010.
[8] Haowei Zhang, Jin Wang, Jixian Zhang, Xuejie Zhang, “YNU-HPCC at SemEval 2017 Task 4: Using A Multi-Channel CNN-LSTM Model for Sentiment Classification,” in Proc. 11th International Workshop on Semantic Evaluations (SemEval-2017), pages 796–801, Aug. 2017.
[9] Tai, Kai Sheng, Richard Socher, and Christopher D. Manning. “Improved semantic representations from tree-structured long short-term memory networks.” arXiv preprint arXiv:1503.00075, 2015.
[10] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9.8: 1735-1780, 1997.
[11] Goller, Christoph, and Andreas Kuchler. “Learning task-dependent distributed representations by backpropagation through structure.” Neural Networks, 1996., IEEE International Conference on. Vol. 1. IEEE, 1996.
[26] Boureau, Y-Lan, Jean Ponce, and Yann LeCun. “A theoretical analysis
Pang and L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115–124.
[12] You Zhang, Hang Yuan, Jin Wang, and Xuejie Zhang. 2017. Ynu-hpcc at emoint-2017: Using a cnn-lstm model for sentiment intensity prediction. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 200–204.
[13] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, page 1642. Citeseer, 2013.