Hey folks! In this article, I'll walk you through sentiment analysis of Amazon Electronics product reviews.
Before we move forward, let’s download the dataset that we'll use in this project.
You can download the dataset from here: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz. The download is about 1.2GB. The file is gzip-compressed, so you first need to decompress it; the decompressed file is around 2.5GB. A file this large may not open in Microsoft Excel.
If you still want to open it, you can use the Delimit software. Here is the download link: http://delimitware.com/download.html.
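If you only want a quick look at the data, decompressing and opening the whole file is not strictly necessary. Below is a minimal sketch, assuming the download is a gzip-compressed file with one JSON object per line; the helper names `iter_reviews` and `peek` are made up for this illustration.

```python
import gzip
import json
from itertools import islice


def iter_reviews(path):
    """Yield one review dict per line from a gzip-compressed JSON-lines file."""
    # "rt" opens the compressed stream in text mode, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def peek(path, n=5):
    """Return the first n reviews without loading the whole file into memory."""
    return list(islice(iter_reviews(path), n))


# Example call (assumes the file sits in the working directory):
# peek("Electronics_5.json.gz")
```

Because the file is read lazily, only the first few records are ever decompressed.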
The dataset contains these columns/features:
reviewerID: ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin: ID of the product, e.g. 0000013714
reviewerName: name of the reviewer
vote: helpful votes of the review
style: product metadata, e.g., “Format” is “Hardcover”
reviewText: text of the review
overall: rating of the product
summary: summary of the review
unixReviewTime: time of the review (unix time)
reviewTime: time of the review (raw)
image: images that users post after they have received the product
The dataset has lots of features, but for sentiment analysis, we need review and rating.
import numpy as np
import pandas as pd
import random
import os
import json
import sys
import gzip
from collections import defaultdict
import csv
import time
#nltk libraries and packages
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn
#ML-related libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection, naive_bayes, svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score as AUC
After reading the dataset, we create a smaller dataset with the id, review text, and rating of each product for sentiment analysis.
#reading the json file in a list
values=[]
with open("Electronics_5.json","r") as f:
    for i in f:
        values.append(json.loads(i))
print(values[:5])
We saved our filtered dataset in the Electronic_review.csv file.
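The filtering step itself is not shown here. One way it could look is sketched below, assuming the "id" column comes from `reviewerID` and that the CSV is written without a header row (which matches the `header = None` used when the file is read back). The function name `write_filtered_csv` is hypothetical.

```python
import csv
import json


def write_filtered_csv(json_lines_path, csv_path):
    """Copy only (id, reviewText, overall) from a JSON-lines file into a CSV file."""
    rows_written = 0
    with open(json_lines_path, "r", encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            line = line.strip()
            if not line:
                continue
            review = json.loads(line)
            # Assumption: reviews with a missing field are kept with an
            # empty value rather than dropped.
            writer.writerow([
                review.get("reviewerID", ""),  # assumed source of the "id" column
                review.get("reviewText", ""),
                review.get("overall", ""),
            ])
            rows_written += 1
    return rows_written


# Example call (file names taken from the article):
# write_filtered_csv("Electronics_5.json", "Electronic_review.csv")
```

Streaming line by line keeps memory use low, which matters for a 2.5GB input.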
Now we read our Electronic_review data into a data frame:
#read the dataset into a df
colnames = ["id","text","overall"]
df= pd.read_csv("Electronic_review.csv",names= colnames,header = None)
The division of sentiment, based on vote value, is as follows
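The exact mapping is not reproduced in this text. As an illustration only, here is one common scheme, assuming the value in question is the star rating in the `overall` column and that the three classes are 0 (negative), 1 (neutral), and 2 (positive). Both the thresholds and the function name `rating_to_sentiment` are assumptions, not the article's confirmed rule.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment class.

    Assumed scheme: below 3 stars -> 0 (negative),
    exactly 3 stars -> 1 (neutral), above 3 stars -> 2 (positive).
    """
    rating = float(rating)
    if rating < 3:
        return 0
    if rating > 3:
        return 2
    return 1


for stars in (1, 2, 3, 4, 5):
    print(stars, "->", rating_to_sentiment(stars))
```

With pandas, a mapping like this could be applied to a whole column, for example `df["overall"].apply(rating_to_sentiment)`, and stored in a `Sentiment` column.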
Let’s save this data frame as processedData.csv.
newdf.to_csv("processedData.csv",chunksize=100000)
Let’s see what our processed data looks like:
df = pd.read_csv("processedData.csv",nrows = 100000)
print(df.head(5))
Let’s import some important libraries:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn
import nltk
nltk.download("stopwords")
import re
nltk.download("punkt")
Now read processedData.csv:
df = pd.read_csv("processedData.csv")
Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting succeeds on some occasions but not always, which is why this approach has limitations.
Developing a stemmer is far simpler than building a lemmatizer. A lemmatizer requires deep linguistic knowledge to create the dictionaries that let the algorithm look up the proper form of a word. Once this is done, noise is reduced and the results of the information retrieval process are more accurate.
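To make the contrast concrete, here is a toy sketch in plain Python. It is not how NLTK's stemmers or its WordNetLemmatizer work internally: the suffix list and the tiny lemma dictionary are invented for this example, and they only illustrate rule-based cutting versus dictionary lookup.

```python
# Toy illustration only: real projects would use a tested stemmer and lemmatizer.
SUFFIXES = ("ing", "ies", "ed", "es", "s")  # hypothetical, deliberately short list

# Hypothetical hand-made lemma dictionary standing in for a real lexical database.
LEMMAS = {"studies": "study", "better": "good", "was": "be", "running": "run"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for w in ("studies", "running", "better", "was"):
    print(w, "->", naive_stem(w), "|", lookup_lemma(w))
```

The toy stemmer turns "studies" into "stud" and "running" into "runn", neither of which is a real word, and it cannot relate "better" to "good" at all. The lookup handles all four words, but only because someone wrote the dictionary entries by hand.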
lat_df = df[:100000]
lat_df.to_csv("CurrentUsedFile.csv")
We saved the first 100,000 rows of data as CurrentUsedFile.csv so that we can easily process the data.
#importing the new dataset
lat_df = pd.read_csv("CurrentUsedFile.csv")
print(lat_df.head(5))
#create x and y => x:textreview , y:sentiment
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(lat_df['reviewText_final'],lat_df['Sentiment'],test_size=0.2,random_state = 42)
print(Train_X.shape,Train_Y.shape)
print(Test_X.shape,Test_Y.shape)
from sklearn.preprocessing import label_binarize
Test_Y_binarise = label_binarize(Test_Y,classes = [0,1,2])
# Vectorize the words with the TF-IDF vectorizer. This measures how important a word is to a document relative to the whole corpus
from sklearn.feature_extraction.text import TfidfVectorizer
Tfidf_vect = TfidfVectorizer(max_features=500000) #tweak features based on the dataset
Tfidf_vect.fit(lat_df['reviewText_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
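To show what a TF-IDF weight expresses, here is a small from-scratch sketch using the textbook formula tf * log(N / df). This is a simplification: scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ. The function name `tf_idf` is made up for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the phone is great", "the phone is bad"]
for doc, scores in zip(docs, tf_idf(docs)):
    print(doc, "->", scores)
```

Words that occur in every document ("the", "phone", "is") get a weight of zero, while the words that distinguish the two reviews ("great", "bad") get a positive weight.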
Before going ahead, let’s create a model evaluation function:
def modelEvaluation(predictions, y_test_set):
    #Print model evaluation to predicted result
    print("\nAccuracy on validation set: {:.4f}".format(accuracy_score(y_test_set, predictions)))
    print("\nClassification report : \n", metrics.classification_report(y_test_set, predictions))
    print("\nConfusion Matrix : \n", metrics.confusion_matrix(y_test_set, predictions))
Naive Bayes Model:
# Classifier - Algorithm - Naive Bayes
# fit the training dataset on the classifier
import time
second=time.time()
Naive = naive_bayes.MultinomialNB()
historyNB = Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
modelEvaluation(predictions_NB, Test_Y)
from sklearn.metrics import precision_recall_fscore_support
a,b,c,d = precision_recall_fscore_support(Test_Y, predictions_NB, average='macro')
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)
print("Precision is: ",a)
print("Recall is: ",b)
print("F-1 Score is: ",c)
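The `average='macro'` setting used above computes each metric per class and then takes the unweighted mean, so a rare class counts as much as a common one. A from-scratch sketch of that idea is shown below; the function name is hypothetical, and scoring a class with no predictions or no samples as 0 is a convention assumed here.

```python
def macro_precision_recall(y_true, y_pred, classes):
    """Compute macro-averaged precision and recall from scratch."""
    precisions, recalls = [], []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        # Assumed convention: an undefined ratio (zero denominator) counts as 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    # Unweighted mean over classes: every class counts equally.
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
print(macro_precision_recall(y_true, y_pred, classes=[0, 1, 2]))
```

With imbalanced review data (typically far more positive than negative reviews), macro averages can be noticeably lower than accuracy, which is why it is useful to report both.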
Support Vector Machine (SVM) Model:
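The training step for this model is not shown in the text. Below is a minimal sketch of what it might look like, assuming a linear SVM (`svm.LinearSVC` from scikit-learn); the article does not say which SVM variant or parameters were used. The helper `fit_and_predict_svm` and the tiny corpus are invented so that the example runs on its own.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_and_predict_svm(train_features, train_labels, test_features):
    """Train a linear SVM and return its predictions for the test features."""
    # LinearSVC is an assumption; the article does not state which SVM it used.
    classifier = svm.LinearSVC()
    classifier.fit(train_features, train_labels)
    return classifier.predict(test_features)


# Tiny made-up corpus so the sketch is self-contained.
train_texts = ["great product love it", "terrible waste of money",
               "works great love it", "terrible broke quickly"]
train_labels = [2, 0, 2, 0]
test_texts = ["love this great product", "terrible waste"]

vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_texts)
test_features = vectorizer.transform(test_texts)
print(fit_and_predict_svm(train_features, train_labels, test_features))
```

In the article's flow, a call such as `predictions_SVM = fit_and_predict_svm(Train_X_Tfidf, Train_Y, Test_X_Tfidf)` would supply the `predictions_SVM` variable used in the lines that follow.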
asvm,bsvm,csvm,dsvm = precision_recall_fscore_support(Test_Y, predictions_SVM, average='macro')
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)
print("Precision is: ",asvm)
print("Recall is: ",bsvm)
Decision Tree Model:
third=time.time()
decTree = DecisionTreeClassifier()
decTree.fit(Train_X_Tfidf, Train_Y)
y_decTree_predicted = decTree.predict(Test_X_Tfidf)
modelEvaluation(y_decTree_predicted, Test_Y)
That’s all.
Also published on Medium's sameerbairwa