This article was published as part of the Data Science Blogathon
In this work, we present how model performance relates to dataset size and to the number of target classes. We conducted our experiments on Amazon product reviews.
The dataset contains the title of the review, the review text and the rating. We have considered the rating as our output class. We have carried out three experiments: binary (polarity 0/1), three-class (positive, negative, neutral) and five-class (ratings 1 to 5). We have included three traditional machine learning models and three deep learning models.
Machine learning models
1. Logistic regression (LR)
2. Support Vector Machine (SVM)
3. Naive-Bayes (NB)
Deep learning models
1. Convolutional neural network (CNN)
2. Long short-term memory (LSTM)
3. Gated recurrent unit (GRU)
2. The data set
We used Amazon's product review dataset for our experiment. The dataset contains 3.6 million product reviews in text form.
Separate train (3M) and test (0.6M) files are provided in CSV format. Each instance has 3 attributes. The first attribute is the rating, between 1 and 5. The second attribute is the title of the review. The last one is the review text.
Table 1 shows some example instances. We have considered only 1.5M instances.
3. Experiment for analysis
We have already mentioned that the whole experiment was performed for binary classification, three-class and five-class. We have done some pre-processing steps on the dataset before passing it on to the classification models. Each experiment was performed incrementally.
We start our training with 50,000 instances and increase the training set up to 1.5 million instances. Finally, we record the performance measures of each model.
3.1.1 Label mapping:
We have considered the rating as the class for the review text. The rating ranges from 1 to 5. We need to map these ratings to the number of classes considered in each experiment.
a) Binary classification: –
Here, we assign ratings 1 and 2 to class 0, and ratings 4 and 5 to class 1. This form of classification can be treated as a sentiment classification problem, where reviews rated 1 and 2 are in the negative class and reviews rated 4 and 5 are in the positive class.
We have not considered reviews with a rating of 3 for the binary classification experiment. Therefore, we get fewer instances for training compared to the other two experiments.
b) Classification of three classes: –
Here, we extend our previous classification experiment. Now we consider rating 3 as a new separate class. The new assignment of ratings to classes is as follows: ratings 1 and 2 are assigned to class 0 (Negative), rating 3 is assigned to class 1 (Neutral), and ratings 4 and 5 are assigned to class 2 (Positive).
Class 1 has far fewer instances than classes 0 and 2, which creates a class imbalance problem. Therefore, we use micro-averaging when calculating performance measures.
c) Five-class classification: –
Here, we consider each rating as a separate class. The assignment is as follows: rating 1 is assigned to class 0, rating 2 to class 1, rating 3 to class 2, rating 4 to class 3, and rating 5 to class 4.
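The three label mappings described above can be sketched as a single function. This is a minimal illustration only: the function name `map_rating` and the use of `None` to mark reviews dropped from the binary experiment are assumptions, not part of the original experiment code.

```python
from typing import Optional


def map_rating(rating: int, num_classes: int) -> Optional[int]:
    """Map a 1-5 rating to a class label for the given experiment.

    Returns None when the review is excluded (rating 3 in the binary case).
    The function name and the None convention are illustrative assumptions.
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be in 1..5, got {rating}")
    if num_classes == 2:
        if rating == 3:
            return None                      # rating 3 is not used in the binary experiment
        return 0 if rating <= 2 else 1       # 1, 2 -> negative; 4, 5 -> positive
    if num_classes == 3:
        if rating <= 2:
            return 0                         # negative
        return 1 if rating == 3 else 2       # neutral, positive
    if num_classes == 5:
        return rating - 1                    # ratings 1..5 -> classes 0..4
    raise ValueError("num_classes must be 2, 3 or 5")
```

For the binary experiment, reviews for which the function returns `None` would be filtered out before training, which is why that experiment ends up with fewer instances.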
3.1.2 Review text pre-processing:
Amazon product reviews are in text format. We need to convert the text data to a numeric format, which can be used to train models.
For machine learning models, we convert the review text into TF-IDF vectors with the help of the sklearn library. Instead of taking each word individually, we use an n-gram model when creating the TF-IDF vectors. The n-gram range is set to 2 to 3, and the maximum number of features is set to 5000.
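To make the n-gram setting concrete, here is a small standard-library sketch of what a 2 to 3 word n-gram range and a feature cap mean. It is not the sklearn implementation: the whitespace tokenizer and the helper names `word_ngrams` and `build_vocabulary` are simplifying assumptions, and the TF-IDF weighting itself is not shown. In scikit-learn, these settings would presumably correspond to the `ngram_range` and `max_features` parameters of `TfidfVectorizer`.

```python
from collections import Counter
from typing import Iterable, List


def word_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return all word n-grams of length n_min..n_max (naive whitespace tokenizer)."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(texts: Iterable[str], max_features: int = 5000) -> List[str]:
    """Keep only the max_features most frequent n-grams across the corpus."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text))
    return [gram for gram, _ in counts.most_common(max_features)]
```

Each review is then represented by a vector with one TF-IDF weight per retained n-gram, so the feature dimension never exceeds the cap.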
For deep learning models, we need to convert the text sequence data to numeric sequence data. We apply word-to-vector modeling to convert each word to its equivalent vector.
The dataset contains a large number of words; thus, one-hot encoding is very inefficient. Here we have used a pre-trained word2vec model to represent each word with a column vector of size 300. We set the maximum sequence length to 70.
Reviews with a word length less than 70 are padded with leading zeros. For reviews with a word length greater than 70, we select the first 70 words for word2Vec processing.
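The padding and truncation rule can be illustrated with a short sketch. It assumes each review has already been turned into a list of integer word indices, which is a hypothetical intermediate representation; the helper name `pad_or_truncate` is likewise illustrative.

```python
from typing import List

MAX_LEN = 70  # maximum sequence length stated in the text


def pad_or_truncate(token_ids: List[int], max_len: int = MAX_LEN,
                    pad_value: int = 0) -> List[int]:
    """Left-pad short sequences with zeros and keep the first max_len items of long ones."""
    if len(token_ids) >= max_len:
        return token_ids[:max_len]           # keep only the first max_len words
    padding = [pad_value] * (max_len - len(token_ids))
    return padding + token_ids               # leading zeros, then the review
```

Every review then has exactly 70 positions, so a batch of reviews can be stacked into a fixed-size input for the deep learning models.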
3.2 Classification model training
We mentioned earlier that we have taken three traditional machine learning models (LR, SVM, NB) and three deep learning models (CNN, LSTM, GRU). The preprocessed text with label information is passed to the models for training.
In the beginning, we train the six models with 50,000 instances and test them with 5,000 instances. For each subsequent iteration, we add 50,000 and 5,000 more instances to the train and test sets, respectively. We perform 30 iterations; thus, in the last iteration the train and test sets contain 1.5M and 150,000 instances, respectively.
The training mentioned above is done for all three types of classification experiments.
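The incremental schedule can be written down in a few lines. This is only a sketch of the set sizes per iteration; the function name `incremental_schedule` is an assumption, and how instances are sampled in each iteration is not specified in the text.

```python
from typing import List, Tuple


def incremental_schedule(train_step: int = 50_000, test_step: int = 5_000,
                         iterations: int = 30) -> List[Tuple[int, int]]:
    """Return (train_size, test_size) for every iteration of the experiment."""
    return [(train_step * i, test_step * i) for i in range(1, iterations + 1)]


schedule = incremental_schedule()
first_train, first_test = schedule[0]     # sizes in the first iteration
last_train, last_test = schedule[-1]      # sizes in the last iteration
```

With the default arguments, the last entry matches the 1.5M training and 150,000 test instances stated above.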
3.3 Model configurations
We have used the default hyperparameter settings for all the traditional classifiers used in this experiment. For the CNN, the input size is 70 with an embedding size of 300. Layer dropout is set to 0.3. A 1-D convolution is applied to the input, with the convolution output size set to 100. The kernel size is kept at 3. The ReLU activation function is used in the convolution layer. For pooling, max pooling is used. The Adam optimizer is used with a cross-entropy loss function.
LSTM and GRU have the same hyperparameter settings. The output layer size changes with the experiment being run.
3.4 Performance measures
F1 score formula
We have taken the F1 score to analyze the performance of the classification models with different numbers of class labels and instance counts. There is a trade-off between Precision and Recall: if we try to improve Recall, Precision would be compromised, and vice versa.
The F1 score combines precision and recall in the form of a harmonic mean.
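For reference, the standard definition of the F1 score as the harmonic mean of precision and recall is:

```latex
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F1 score by maximizing only precision or only recall.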
Micro-averaged precision and recall formula
In the three-class and five-class experiments, we observe that the number of instances with rating 3 is much lower than for the other ratings, which creates a class imbalance problem.
Therefore, we use micro-averaging when calculating the performance measures. Micro-averaging takes care of class imbalance when calculating precision and recall. For detailed information on Precision and Recall, see the following link: wiki link.
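As a rough sketch of how micro-averaging works, the function below pools true positives, false positives and false negatives over all classes before computing precision, recall and F1. The function name and the example labels are illustrative assumptions and do not come from the experiments.

```python
from typing import Sequence, Tuple


def micro_precision_recall_f1(y_true: Sequence[int],
                              y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 for single-label classification."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1                      # correctly predicted as this class
            elif p == label:
                fp += 1                      # predicted this class, but wrong
            elif t == label:
                fn += 1                      # missed an instance of this class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for a three-class problem (0 = negative, 1 = neutral, 2 = positive).
example_true = [0, 0, 0, 0, 1, 2, 2, 2]
example_pred = [0, 0, 0, 2, 1, 2, 2, 0]
precision, recall, f1 = micro_precision_recall_f1(example_true, example_pred)
```

Note that when every instance has exactly one label, each misclassification counts once as a false positive and once as a false negative, so micro-averaged precision, recall and F1 all coincide with plain accuracy.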
4. Results and observations
In this section, we present the results of our experiments with different dataset sizes and numbers of class labels. A separate graph is presented for each experiment. The charts plot the F1 score against the test set size.
In addition, we provide Figure 5, which contains six subplots. Each subplot corresponds to a classifier. We present the rate of change between the performance scores of two experiments with respect to the variable size of the test set.
Figure 2 presents the performance of the classifiers in the binary classification task. In this case, the actual test set is smaller than the nominal test set size because reviews rated 3 were removed.
The machine learning classifiers (LR, SVM, NB) perform consistently, except for slight variations at the starting points.
The deep learning classifiers GRU and CNN start out with lower performance than SVM and LR. After the first three iterations, GRU and CNN consistently outperform the machine learning classifiers.
LSTM learns most effectively. It started with the lowest performance, but once the training set crosses 0.3M instances, LSTM shows continued growth and ends on par with GRU.
Fig 2. Binary classification performance analysis
Figure 3 presents the results of the three-class classification experiment. The performance of all classifiers degrades as the number of classes increases. Relative to one another, the classifiers perform much as they did in binary classification.
The only difference is in the performance of LSTM. Here, LSTM improves continuously, unlike in binary classification. LR performed slightly better than SVM, whereas LR and SVM performed equally in the binary classification experiment.
Fig 3. Three-class classification performance analysis
Figure 4 presents the results of the five-class classification experiment. The results follow the same trends that appeared in the binary and three-class classification experiments. Here the performance difference between LR and SVM increased a bit more.
Therefore, we can conclude that the performance gap between LR and SVM widens as the number of classes increases.
From the overall analysis, deep learning classifiers perform better as the training dataset size increases, while traditional classifiers show consistent performance.
Fig 4. Five-class classification performance analysis
Figure 5 presents the performance ratio with respect to the change in the number of classes and the size of the dataset. The figure contains six subplots, one for each classifier. Each subplot has two lines; one line shows the performance ratio between the three-class and binary classification for different dataset sizes.
The other line shows the performance ratio between the five-class and three-class classification for different dataset sizes. We have found that traditional classifiers have a constant rate of change. For traditional classifiers, the rate of change is independent of the size of the dataset. We cannot comment on the behavior of deep learning models with respect to changes in the number of classes and dataset size, because their rate of change follows a variable pattern.
In addition to the experimental analysis mentioned above, we have analyzed misclassified text data. We found interesting observations that affected the classification.
Sarcastic words: –
Customers wrote sarcastic reviews about the products. They gave a rating of 1 or 2, but they used a lot of positive polarity words. For instance, a customer gave a rating of 1 and wrote the following review: “Oh! What a fantastic charger I have!!”. The classifier got confused by these kinds of polarity words and phrases.
Use of high polarity words: –
Customers gave an average rating (3) but used very polarized words in their reviews, for instance: fantastic, tremendous, notable, pathetic, etc.
Use of rare words: –
Despite having a dataset of size 3.6M, we still find many uncommon words that affected the classification performance. Spelling errors, acronyms and short-form words used by reviewers are also important factors.
5. Final note
We have analyzed the performance of traditional machine learning and deep learning models with different dataset sizes and numbers of target classes.
We have found that traditional classifiers can learn better than deep learning classifiers if the dataset is small. As the size of the dataset increases, deep learning models get a performance boost.
We have investigated the rate of change in performance from the binary to the three-class problem and from the three-class to the five-class problem with variable dataset size.