Analyze customer feedback using aspect-based sentiment analysis

This article was published as part of the Data Science Blogathon

Online platforms have long given customers a way to tell companies how satisfied they are. Customer reviews are among the most trusted sources of genuine content for other users, and customer feedback serves as a third-party validation tool that builds trust in a brand. To understand this customer feedback about an entity, sentiment analysis is becoming an essential augmentation tool for any organization.

Sentiment analysis involves examining online conversations, such as tweets, blog posts, or comments about particular services or topics, and segregating user opinions into positive, negative, and neutral, allowing companies to identify customer sentiment towards products. It gives companies a deep pulse on how customers really feel about their brand, and it processes large amounts of data in an efficient and cost-effective way. By automatically analyzing customer feedback, from survey responses to social media conversations, brands can listen carefully to their customers and tailor products and services to meet their needs.

Sentiment analysis can be classified into fine-grained sentiment analysis, emotion detection, aspect-based sentiment analysis, and intent analysis. Fine-grained sentiment analysis deals with interpreting the polarity of a review, while emotion detection identifies the emotions a user expresses about a product.

Aspect-based sentiment analysis (ABSA) is a variety of sentiment analysis that helps in business improvement by identifying which characteristics of a product need to be improved, based on customer feedback, to make the product a bestseller. ABSA identifies the aspects mentioned in a review of a product and determines which kind of sentiment each aspect carries.

In this article, we will conduct ABSA on the SemEval 2014 laptop and restaurant datasets, as well as on multilingual datasets such as a Hindi dataset covering products like laptops, phones, restaurants, and hotels.

Data preprocessing

Tokenization: Tokenization is the division of a text paragraph into smaller chunks, either sentences (sentence tokenization) or words (word tokenization). The main drawback of word tokenization is out-of-vocabulary (OOV) words; to avoid OOV and still extract the information we need, sentence tokenization is used in this analysis.
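For example, a minimal sketch of both granularities with NLTK's tokenizers (the sample review is illustrative):

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

review = "The battery life is great. However, the screen is too dim."
sentences = sent_tokenize(review)               # sentence tokenization
tokens = [word_tokenize(s) for s in sentences]  # word tokenization per sentence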

Remove stopwords: After tokenization, stopwords are identified and removed from the reviews. Stopwords are the most common words in a language and may not add much information to a sentence or document. They are filtered out to minimize noise and improve the quality of the text data for better classification. NLTK contains a collection of stopwords for each supported language; the words in the text are compared against this stopword list, and matching words are removed to improve data quality and to make it easier to extract sentiment words.
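A short sketch of this filtering step using NLTK's English stopword list:

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['the', 'battery', 'life', 'is', 'great']
print([w for w in tokens if w.lower() not in stop_words])
# ['battery', 'life', 'great']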

Remove punctuation and special characters: After expanding contractions, special characters and punctuation are removed with regex functions (implemented in the clean_text function below). The main reason for doing this is that punctuation and special characters are usually not important when analyzing text and extracting NLP- and ML-based features.

Replacement of negations with antonyms: Replacing a negated word with its antonym decreases the dimensionality of the document's word-count matrix, so it is beneficial for compressing the vocabulary without losing its meaning, which saves memory.

from nltk.corpus import wordnet

class AntonymReplacer(object):
    def replace(self, word, pos=None):
        # collect all WordNet antonyms of the word
        antonyms = set()
        for syn in wordnet.synsets(word, pos=pos):
            for lemma in syn.lemmas():
                for antonym in lemma.antonyms():
                    antonyms.add(antonym.name())
        # only replace when the antonym is unambiguous
        if len(antonyms) == 1:
            return antonyms.pop()
        return None

    def replace_negations(self, sent):
        # scan the token list; collapse 'not <word>' into the word's antonym
        i, l = 0, len(sent)
        words = []
        while i < l:
            word = sent[i]
            if word == 'not' and i + 1 < l:
                ant = self.replace(sent[i + 1])
                if ant:
                    words.append(ant)
                    i += 2
                    continue
            words.append(word)
            i += 1
        return words
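A quick check of the replacer on a tokenized sentence (the classic WordNet example; replacement only happens when exactly one antonym is found):

replacer = AntonymReplacer()
print(replacer.replace_negations(['do', 'not', 'uglify', 'the', 'code']))
# ['do', 'beautify', 'the', 'code'] -- 'not uglify' collapses to 'beautify'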

Spell correction: Words with multiple repeated characters and misspellings occur due to human typing errors and should be corrected, since they otherwise fragment the vocabulary. For instance, entries like finallyyy or exactlyy are incorrect, but must nevertheless be corrected for later use.
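The article does not name a spell-correction tool; one possible sketch, assuming TextBlob is installed, first squeezes runs of repeated characters and then applies a dictionary-based corrector:

import re
from textblob import TextBlob  # assumption: tool not specified in the article

def reduce_repeats(text):
    # collapse 3+ repeated characters to 2, e.g. 'finallyyy' -> 'finallyy'
    # (keep doubles, since words like 'good' legitimately contain them)
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

def correct_spelling(text):
    return str(TextBlob(reduce_repeats(text)).correct())

print(correct_spelling('this is finallyyy heree'))  # e.g. 'this is finally here'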

Lemmatization: Lemmatization is the most common text preprocessing technique used for word normalization. Lemmatizing a word converts it to its basic meaningful form based on a morphological analysis of the word. Stemming is similar to lemmatization, but it does not take into account the context of the word in the sentence and only removes the suffix from words.
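A small illustration of the difference between the two:

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, PorterStemmer

print(WordNetLemmatizer().lemmatize('studies'))  # 'study' -- meaningful base form
print(PorterStemmer().stem('studies'))           # 'studi' -- bare suffix stripping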

import re
import pandas as pd
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
antreplacer = AntonymReplacer()

def clean_text(text):
    # lowercase, then expand contractions before stripping punctuation
    text = text.lower() if pd.notnull(text) else text
    text = re.sub(r"what's", "what is ", str(text))
    text = re.sub(r"'s", " ", str(text))
    text = re.sub(r"'ve", " have ", str(text))
    text = re.sub(r"can't", "cannot ", str(text))
    text = re.sub(r"ain't", "is not ", str(text))
    text = re.sub(r"won't", "will not ", str(text))
    text = re.sub(r"n't", " not ", str(text))
    text = re.sub(r"i'm", "i am ", str(text))
    text = re.sub(r"'re", " are ", str(text))
    text = re.sub(r"'d", " would ", str(text))
    text = re.sub(r"'ll", " will ", str(text))
    text = re.sub(r"'scuse", " excuse ", str(text))
    # drop non-word characters and squeeze whitespace
    text = re.sub(r'\W', ' ', str(text))
    text = re.sub(r'\s+', ' ', str(text))
    # remove punctuation and numbers
    text = re.sub('[^ a-zA-Z]', ' ', str(text))
    # single character removal
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', str(text))
    # replace negation words with antonyms, then lemmatize token by token
    words = antreplacer.replace_negations(text.split())
    text = ' '.join(lemmatizer.lemmatize(w) for w in words)
    # removing multiple spaces
    text = re.sub(r'\s+', ' ', str(text))
    return text.strip()

Classifier Models

Embedding is the method of representing the words in a sentence as vectors. The embedding technique we will use is GloVe, which constructs word co-occurrence matrices. English sentences are encoded with pre-trained GloVe embeddings, while embeddings for Hindi sentences are custom-trained on a 13M Hindi corpus.
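As a sketch, pre-trained GloVe vectors can be loaded into a dictionary like this (the file name glove.6B.300d.txt is an assumption; use whichever GloVe file you downloaded). Since a plain dict raises KeyError for missing words, it can be passed directly as the model argument of the function below:

import numpy as np

embeddings_index = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        # each line is a word followed by its 300-dimensional vector
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')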

def get_word2vec_embedding_matrix(model):
    # vocab_size and tokenizer (a fitted Keras Tokenizer) are defined elsewhere
    embedding_matrix = np.zeros((vocab_size, 300))
    for word, i in tokenizer.word_index.items():
        try:
            # look up the pre-trained vector for this word
            embedding_vector = model[word]
        except KeyError:
            # out-of-vocabulary words keep their all-zeros row
            embedding_vector = None
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
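The resulting matrix would typically initialize the model's Embedding layer (a sketch; Embedding is the Keras layer used in the model below):

embedding_matrix = get_word2vec_embedding_matrix(embeddings_index)
embedding_layer = Embedding(vocab_size, 300,
                            weights=[embedding_matrix],
                            trainable=False)  # keep the pre-trained vectors fixed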

After the words of each sentence are converted to vectors with GloVe embeddings, a bi-directional LSTM model and a CNN model are applied on top of the embedding layer to train and predict the aspect terms and sentiment terms, respectively. The 1000 most commonly used aspect terms are identified in the dataset, and the Bi-LSTM model is trained to classify among these aspect classes. Predicted aspect terms are labeled with BIO tags. The sentiment of each detected aspect term is then predicted using the CNN model, which classifies the review as positive, negative, or neutral.
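For instance, BIO tags for a made-up review sentence look like this:

# illustrative only: B = beginning of an aspect term, I = inside, O = outside
sentence = ['the', 'battery', 'life', 'is', 'great']
bio_tags = ['O',   'B',       'I',    'O',  'O']   # aspect term: 'battery life'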

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Flatten, Dropout, Dense

embed_dim = 128
lstm_out = 196
model = Sequential()
model.add(Embedding(10000, embed_dim, input_length=28))
model.add(Bidirectional(LSTM(lstm_out, return_sequences=True)))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(attention())  # custom Bahdanau attention layer, defined in the full code
model.add(Flatten())
model.add(Dropout(0.3))
model.add(Dense(1000, activation='softmax'))  # one unit per aspect class
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
model.summary()
history_object = model.fit(trainX, trainY, epochs=5,batch_size=8)
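The CNN sentiment classifier's code is not shown in the article; a minimal sketch of what such a model could look like, with illustrative layer sizes:

from keras.layers import Conv1D, GlobalMaxPooling1D

cnn_model = Sequential()
cnn_model.add(Embedding(10000, embed_dim, input_length=28))
cnn_model.add(Conv1D(128, 5, activation='relu'))  # local n-gram feature detectors
cnn_model.add(GlobalMaxPooling1D())
cnn_model.add(Dense(64, activation='relu'))
cnn_model.add(Dropout(0.3))
cnn_model.add(Dense(3, activation='softmax'))     # positive / negative / neutral
cnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])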

Summary

In this article, we applied various preprocessing techniques to text reviews and converted the words to vector representations using GloVe embeddings. The embedding layer is followed by a bi-directional LSTM layer to find the aspect terms in a sentence, and Bahdanau attention is applied to find the association between the target and the context words. The sentiment polarity of each aspect term found by the model above is then predicted using the CNN model, which classifies the aspect term as positive, negative, or neutral. Aspect terms predicted from the sentence are tagged with BIO tagging, namely Beginning, Inside, or Outside the aspect term.

The complete code for this mini-project is available here.

Final notes

I hope you enjoyed reading this article.

Happy learning!!

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
