In this article, we build a machine-learning model to predict the sentiment of customer reviews based on historical data. It is a classification problem solved with Natural Language Processing (NLP).
The purpose of NLP is to teach computers to understand human language by developing algorithms and models that allow them to read and understand text. It also allows them to generate text.
The article contains six parts.
In each part, there is a Python snippet explaining how it works.
Overall, you will understand the steps to solve a sentiment analysis classification problem with NLP.
The first step in our Natural Language Processing journey is text preprocessing. It means transforming the original text into a cleaner form that our machine learning algorithms can process. Usually, we perform several actions, such as:
Tokenization: Breaking down the text into simple words (tokens). It facilitates text processing and analysis.
Remove punctuation: Punctuation does not contribute a lot to the meaning of the sentence. We can remove commas, periods, quotation marks, etc.
Remove stopwords: Stopwords are very common words such as articles, pronouns, and conjunctions. We can remove them, as we do punctuation.
Lowercasing/uppercasing: Usually, we lowercase text to avoid duplicating the same tokens. For example, tokens such as “Beatles” and “beatles” will be considered the same.
Stemming: Transform words to their base form. For example, “singing” and “sings” become “sing” (see the stemmer sketch after this list).
There are other transformations, such as removing special characters, converting numbers to their text representation, etc.
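Stemming is not included in the preprocessing function below; here is a minimal sketch of what it could look like with NLTK's PorterStemmer:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["singing", "sings", "running"]
# Reduce each word to its stem
print([stemmer.stem(word) for word in words])
# ['sing', 'sing', 'run']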
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
nltk.download('punkt')
nltk.download('stopwords')
def preprocess(text):
    # Break the text down into individual tokens
    tokens = word_tokenize(text)
    # Lowercase tokens and drop punctuation
    tokens = [token.lower() for token in tokens if token not in string.punctuation]
    # Drop common English stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Join the remaining tokens back into a single string
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
We start by importing the necessary libraries: stopwords, word_tokenize, and string. Our preprocessing function performs the tokenization, lowercasing, punctuation removal, and stopword removal. Finally, we join the tokens back into a sentence.
With the following text example:
text = "They have built a home, sweet home with a couple of kids running in the yard"
print(preprocess(text))
It transforms the raw text into “built home sweet home couple kids running yard”.
Computer programs understand numbers better than words. We need to transform our data into numbers.
There are several solutions; I will use CountVectorizer to convert our text data into vectors.
from sklearn.feature_extraction.text import CountVectorizer
data = [
    ("I like their pedagogy and training programs", "positive"),
    ("There are a lot of great features.", "positive"),
    ("I do not recommend this school.", "negative"),
    ("It is a classic school, nothing special.", "neutral")
]
texts, labels = zip(*data)
preprocessed_texts = [preprocess(text) for text in texts]
countVectorizer = CountVectorizer()
X = countVectorizer.fit_transform(preprocessed_texts)
To illustrate our use case, we store our four customer reviews (two positive, one negative, and one neutral) inside a variable.
We create our CountVectorizer and then call the fit_transform method.
fit_transform learns the vocabulary of the input data by analyzing the text and identifying unique words. It then transforms the input data into a numerical representation, creating a matrix that contains the number of occurrences of each vocabulary word in each sentence.
With our example, printing the vocabulary and the matrix X gives:
print(countVectorizer.get_feature_names_out())
print(X.toarray())
#print below
['and' 'are' 'classic' 'do' 'features' 'great' 'is' 'it' 'like' 'lot'
'not' 'nothing' 'of' 'pedagogy' 'programs' 'recommend' 'school' 'special'
'their' 'there' 'this' 'training']
[[1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1]
[0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0]
[0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0]
[0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0]]
Our matrix X contains a bunch of 0s and 1s. To understand it better, let's take the first row: [1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1], which corresponds to our first sentence: “I like their pedagogy and training programs”.
We have 22 words in our vocabulary; the first word is “and”. It appears once inside the first sentence, so we store 1. The second word in our vocabulary is “are”, and it does not appear inside the first sentence, so we store 0. We continue in the same way for each word and each sentence.
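You can verify this mapping programmatically: the vectorizer's vocabulary_ attribute maps each word to its column index in the matrix. A short sketch, reusing the countVectorizer and X defined above:
# Column index assigned to 'and'
print(countVectorizer.vocabulary_['and'])
# 1, because 'and' appears once in the first sentence
print(X.toarray()[0][countVectorizer.vocabulary_['and']])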
Now, we are going to train and evaluate our model. We use Logistic Regression. To have more meaningful results, I created my own dataset, customer_reviews.csv, available here. It contains 50 positive reviews, 50 negative, and 10 neutral.
So, we just need to replace the data variable we previously hardcoded with the new dataset.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

!wget -cv https://raw.githubusercontent.com/walterwhites/machine_learning/main/customer_reviews.csv

data = pd.read_csv('customer_reviews.csv')
texts = data['review'].tolist()
feeling = data['feeling'].tolist()
# Apply the same preprocessing as before, so training and prediction stay consistent
preprocessed_texts = [preprocess(text) for text in texts]
countVectorizer = CountVectorizer()
X = countVectorizer.fit_transform(preprocessed_texts)
Then we split our data into train and test datasets and call the fit method to train the model. I choose to put 30% of the data in the test dataset.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(X, feeling, test_size=0.3, random_state=40)
model = LogisticRegression()
model.fit(X_train, y_train)
After training our model, we need to evaluate it. We can use classification_report.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
It returns four metrics:
Precision: The ratio of correct predictions among all predictions made for a class. In our use case, a precision of 42% for the negative class means that when the model predicts a negative review, it is right 42% of the time.
Recall: The ratio of correctly predicted positive observations to the total actual positive observations. A recall of 78% for the positive class means the model finds 78% of the positive reviews inside the dataset.
F1-score: The F1-score is the harmonic mean of precision and recall (see the worked example after this list). It gives good insights when there is a class imbalance.
Support: Support indicates the number of samples belonging to each class.
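As a quick illustration of the harmonic mean, using the precision and recall figures quoted above (they belong to different classes in the real report, so this is purely to show the formula):
precision = 0.42
recall = 0.78
# Harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 2))
# 0.55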
We see that the evaluation is not great. After a little analysis, we understand it is due to our poor dataset: it does not contain many reviews.
To improve the performance of our model, we need more data. I prepared a wider dataset: https://github.com/walterwhites/machine_learning/blob/main/customer_reviews_wide.csv
We adapt the code to import the new dataset. Then we rerun the next steps.
!wget -cv https://raw.githubusercontent.com/walterwhites/machine_learning/main/customer_reviews_wide.csv
data = pd.read_csv('customer_reviews_wide.csv')
Now we re-evaluate our model.
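For completeness, here is the full re-run, reusing the exact same steps as before (assuming the wider CSV keeps the same review and feeling columns):
texts = data['review'].tolist()
feeling = data['feeling'].tolist()
# Same preprocessing and vectorization as before
preprocessed_texts = [preprocess(text) for text in texts]
X = countVectorizer.fit_transform(preprocessed_texts)
X_train, X_test, y_train, y_test = train_test_split(X, feeling, test_size=0.3, random_state=40)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))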
The results are much better: the program succeeds in detecting the sentiment of almost every customer review, with an F1-score up to 0.96 and an accuracy of 0.95.
We are going to test our model with unseen reviews.
def predict_feeling(text):
    # Apply the same preprocessing used during training
    preprocessed_text = preprocess(text)
    # Reuse the fitted vectorizer: transform, never fit_transform, on new data
    text_representation = countVectorizer.transform([preprocessed_text])
    # predict returns an array; take the single prediction
    feeling = model.predict(text_representation)[0]
    return feeling
john_feeling = "This website is normal."
paul_feeling = "I am so happy, the product I received is exceptional."
george_feeling = "I did not like the product I received, I asked for a refund"
predicted_feeling_john = predict_feeling(john_feeling)
predicted_feeling_paul = predict_feeling(paul_feeling)
predicted_feeling_george = predict_feeling(george_feeling)
print(predicted_feeling_john)
print(predicted_feeling_paul)
print(predicted_feeling_george)
“This website is normal.”: The program qualifies this review as neutral.
“I am so happy, the product I received is exceptional.”: The program qualifies this review as positive.
“I did not like the product I received, I asked for a refund”: The program qualifies this review as negative.
You can retrieve the full code inside my GitHub repo: https://github.com/walterwhites/machine_learning/blob/main/Analyse Customer Reviews with Natural Language Processing(NLP).ipynb
We saw how to solve our classification problem using NLP, with CountVectorizer and logistic regression. We could go further and handle more complexity in our data using BERT (Bidirectional Encoder Representations from Transformers).
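As a pointer, here is a minimal sketch using the Hugging Face transformers library (not covered in this article); the sentiment-analysis pipeline downloads a default pretrained model:
from transformers import pipeline

# Downloads a default pretrained sentiment model on first run
classifier = pipeline("sentiment-analysis")
print(classifier("I am so happy, the product I received is exceptional."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]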
We used precision, recall, and F1-score. They provide good insights into the performance of our model on our dataset.
Even a good classification score (e.g., equal to 1) does not mean the model is perfect. We can still struggle with overfitting.
To mitigate overfitting, you may use BERT, improve the dataset quality, or use other algorithms (e.g., Support Vector Machine (SVM), Random Forest, etc.) according to your use case and input data.
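Swapping the classifier is straightforward with scikit-learn. A minimal sketch, assuming the X_train and y_train variables from the training step above:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Train an SVM on the same bag-of-words features
svm_model = LinearSVC()
svm_model.fit(X_train, y_train)

# Or a Random Forest
forest_model = RandomForestClassifier()
forest_model.fit(X_train, y_train)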