This is the second and last part of my series which focuses on Anomaly Detection using Machine Learning. If you haven't already, I recommend you read my first article here which will introduce you to Anomaly Detection and its applications in the business world.
In this article, I will take you through a case study focused on credit card fraud detection. Credit card companies need to recognize fraudulent transactions so that customers are not charged for items they did not purchase. The main task is therefore to identify fraudulent credit card transactions using machine learning. We will use a Python library called PyOD, which was developed specifically for anomaly detection.
PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data. It has around 20 outlier detection algorithms (supervised and unsupervised). PyOD is developed with a comprehensive API to support multiple techniques and you can take a look at the official documentation of PyOD here.
If you are an anomaly detection professional or you want to learn more about anomaly detection then I recommend you try using the PyOD Toolkit.
PyOD has useful features such as :
Installing PyOD in Python
Let’s first install PyOD on our machines.
pip install pyod # normal install
pip install --pre pyod # pre-release version for new features
Alternatively, you can clone the repository and install it from source.
git clone https://github.com/yzhao062/pyod.git
cd pyod
pip install .
If you plan to use neural network-based models in PyOD, you have to install Keras and the other required libraries manually on your machine.
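Before moving on, it can help to confirm that the packages are visible to your Python environment. The snippet below is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and checking `keras` is only relevant if you plan to use the neural network-based models.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# "pyod" is needed for this article; "keras" only for neural network-based detectors
status = {name: is_installed(name) for name in ("pyod", "keras")}
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If `pyod` shows up as missing, re-run the pip command above in the same environment that your notebook uses.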
The dataset we will use contains transactions made by credit cards in September 2013 by European cardholders. The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.
Now let us see how we can use the PyOD library in this case study. We will start by importing the packages we need, such as pandas, numpy, sklearn and pyod.
# Import important packages
import pandas as pd
import numpy as np
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score,confusion_matrix
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
# Importing KNN module from PyOD
from pyod.models.knn import KNN
from pyod.models.ocsvm import OCSVM
# Import the utility function for model evaluation
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize
from sklearn.preprocessing import StandardScaler
from cf_matrix import make_confusion_matrix
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# set seed
np.random.seed(123)
The dataset for this case study can be downloaded here.
Let's load the dataset.
# Load the dataset from csv file by using pandas
data = pd.read_csv("creditcard.csv")
Check columns in the dataset.
# show columns
data.columns
The dataset contains 31 columns. Only three of them are directly interpretable: Time, Amount, and Class (fraud or not fraud). The remaining 28 columns were transformed using PCA dimensionality reduction in order to protect user identities.
# print the shape of the data
data.shape
(284807, 31)
The dataset contains 284807 rows and 31 columns as explained before.
# show the first five rows
data.head()
You can see all transformed columns are named from V1 to V28.
Let's check if we have any missing values in our dataset.
#check missing data
data.isnull().sum()
We don't have any missing values in our dataset.
Our target column, Class, contains two classes: fraud, labeled as 1, and not fraud, labeled as 0.
# determine the share of fraud cases in our file
data.Class.value_counts(normalize=True)
In this dataset, only 0.173% of transactions (492 in total) are fraudulent, while 99.83% (284,315 in total) are valid.
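The percentages can be reproduced from the two class counts alone. The short sketch below uses only the counts reported above and no other data.

```python
# Class counts reported for the full dataset
n_fraud = 492
n_valid = 284_315
n_total = n_fraud + n_valid  # 284,807 rows

fraud_share = n_fraud / n_total
valid_share = n_valid / n_total

print(f"fraud: {fraud_share:.3%}")
print(f"valid: {valid_share:.3%}")
print(f"valid transactions per fraud case: {n_valid / n_fraud:.0f}")
```

This imbalance is why plain accuracy is a poor yardstick here: a model that labels every transaction as valid would be right for the `valid_share` fraction of rows while catching no fraud at all.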
We can observe if variables in the dataset are correlated to each other by using the heatmap plot implemented in the seaborn library.
# find the correlation between the variables
corr = data.corr()
fig = plt.figure(figsize=(30,20))
sns.heatmap(corr, vmax=.8, square=True,annot=True)
The correlation heatmap above shows that the V11 variable has a strong positive correlation with the Class variable, while the V17 variable has a strong negative correlation with it.
Because valid transactions vastly outnumber fraud cases, we will use only the first 10,000 valid cases together with all 492 fraud cases to create our models.
# use sample of the dataset
positive = data[data["Class"]== 1]
negative = data[data["Class"]== 0]
print("positive:{}".format(len(positive)))
print("negative:{}".format(len(negative)))
new_data = pd.concat([positive,negative[:10000]])
#shuffling our dataset
new_data = new_data.sample(frac=1,random_state=42)
new_data.shape
positive:492
negative:284315
(10492, 31)
Now we have a total of 10,492 rows.
We will standardize the Amount variable by using the StandardScaler class from sklearn. StandardScaler rescales the data so that it has a mean of 0 and a standard deviation of 1. This changes the scale of the values but not the shape of their distribution.
#Normalising the amount column.
new_data['Amount'] = StandardScaler().fit_transform(new_data['Amount'].values.reshape(-1,1))
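To see what standardizing does and does not change, here is a small from-scratch sketch with numpy. It assumes a synthetic, right-skewed column standing in for Amount; `standardize` and `skewness` are hypothetical helpers written for this illustration, not sklearn functions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "amounts" (an assumption, not the real Amount column)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)


def standardize(values: np.ndarray) -> np.ndarray:
    """Z-score: subtract the mean, divide by the population standard deviation."""
    return (values - values.mean()) / values.std()


def skewness(values: np.ndarray) -> float:
    """Third standardized moment; close to 0 for a symmetric distribution."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)


scaled = standardize(amounts)
print(f"mean={scaled.mean():.3f}, std={scaled.std():.3f}")
print(f"skewness before={skewness(amounts):.3f}, after={skewness(scaled):.3f}")
```

The mean and standard deviation move to 0 and 1, but the skewness is the same before and after, so a skewed column stays skewed after scaling.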
Separate the dataset into independent variables and target variable (class variable).
Note: we are not going to use the Time variable in this article.
# split into independent variables and target variable
X = new_data.drop(['Time','Class'], axis=1)
y = new_data['Class']
# show the shape of x and y
print("X shape: {}".format(X.shape))
print("y shape: {}".format(y.shape))
X shape: (10492, 29)
y shape: (10492,)
Split the dataset into train and test sets. We will only use 20% of the dataset for the test set and the rest will be the train set.
#split the data into train and test
X_train, X_test, y_train,y_test = train_test_split(X,y, test_size = 0.2, stratify=y, random_state=42 )
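The stratify=y argument keeps the share of fraud cases roughly equal in both sets. The sketch below illustrates this with placeholder features and labels that only mimic the class counts of our sample (492 fraud, 10,000 valid); it is not run on the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the sampled dataset
y = np.array([1] * 492 + [0] * 10_000)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for name, labels in (("full", y), ("train", y_train), ("test", y_test)):
    print(f"{name:5s} rows={len(labels):5d} fraud={int(labels.sum()):3d} "
          f"share={labels.mean():.4f}")
```

Without stratification, a random split of such an imbalanced dataset could leave the test set with noticeably more or fewer fraud cases than expected, which makes the evaluation less reliable.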
We will create two outlier detectors from the PyOD library: the K-Nearest Neighbors detector and the One-class SVM detector.
In the KNN detector, the distance from an observation to its kth nearest neighbor is used as its outlier score.
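PyOD supports three kNN detectors, selected with the method parameter: largest (the distance to the kth neighbor, which is the default), mean (the average distance to the k neighbors), and median (the median distance to the k neighbors).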
# create the KNN model
clf_knn = KNN(contamination=0.172, n_neighbors = 5,n_jobs=-1)
clf_knn.fit(X_train)
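The parameters we passed into KNN() are:
contamination: the expected proportion of outliers in the data. The code above sets it to 0.172, so roughly 17% of the training rows are flagged as outliers.
n_neighbors: the number of neighbors to consider when measuring proximity.
n_jobs: the number of parallel jobs to run; -1 uses all available processors.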
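To make the scoring idea concrete, here is a from-scratch sketch with numpy. It is not PyOD's implementation: `knn_outlier_scores` is a hypothetical helper that uses a brute-force distance matrix, the data is synthetic, and the three method names simply mirror the options of PyOD's method parameter.

```python
import numpy as np


def knn_outlier_scores(X: np.ndarray, n_neighbors: int = 5,
                       method: str = "largest") -> np.ndarray:
    """Score each row by its distances to its n_neighbors nearest other rows."""
    # Pairwise Euclidean distances (fine for small arrays; O(n^2) memory)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    nearest = np.sort(dists, axis=1)[:, :n_neighbors]
    if method == "largest":
        return nearest[:, -1]  # distance to the kth nearest neighbor
    if method == "mean":
        return nearest.mean(axis=1)
    if method == "median":
        return np.median(nearest, axis=1)
    raise ValueError(f"unknown method: {method}")


rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outliers = np.array([[10.0, 10.0], [-10.0, 10.0]])  # two isolated points
X = np.vstack([inliers, outliers])

scores = knn_outlier_scores(X, n_neighbors=5)
print("mean inlier score:", scores[:200].mean())
print("outlier scores   :", scores[200:])
```

Points inside the dense cluster have close neighbors and therefore low scores, while the two isolated points are far from everything and receive the highest scores.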
After training our KNN detector, we can get the prediction labels and the outlier scores of the training data. The higher the score, the more abnormal the observation. Having both available directly on the fitted model makes PyOD a great utility for anomaly detection tasks.
# Get the prediction labels of the training data
y_train_pred = clf_knn.labels_ # binary labels (0: inliers, 1: outliers)
# Outlier scores
y_train_scores = clf_knn.decision_scores_
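The link between outlier scores, binary labels and the contamination value can be pictured with a percentile cut-off. The sketch below assumes the detector simply flags every score above that cut-off; `labels_from_scores` is a hypothetical helper and the scores are synthetic, not real model output.

```python
import numpy as np


def labels_from_scores(scores: np.ndarray, contamination: float):
    """Flag the top `contamination` fraction of scores as outliers (1)."""
    threshold = np.percentile(scores, 100 * (1 - contamination))
    labels = (scores > threshold).astype(int)
    return labels, float(threshold)


rng = np.random.default_rng(1)
scores = rng.exponential(scale=1.0, size=1_000)  # synthetic outlier scores

flagged = {}
for contamination in (0.0172, 0.172):
    labels, threshold = labels_from_scores(scores, contamination)
    flagged[contamination] = int(labels.sum())
    print(f"contamination={contamination}: threshold={threshold:.3f}, "
          f"flagged={flagged[contamination]}")
```

A larger contamination value lowers the cut-off and flags more rows. That tends to catch more fraud cases but also raises more false alarms, so the value is worth tuning for your own data.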
We can evaluate KNN() with respect to the training data. PyOD provides a handy function for this task called evaluate_print(). The default metrics are ROC and precision @ rank n. We pass the classifier name, the y_train values, and y_train_scores (the outlier scores returned by the fitted model).
# Evaluate on the training data
evaluate_print('KNN', y_train, y_train_scores)
KNN ROC:0.9566, precision @ rank n:0.5482
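For readers who want to know what these two numbers measure, here is a simplified from-scratch sketch. `roc_auc` and `precision_at_rank_n` are hypothetical helpers written for illustration on a tiny made-up example; PyOD's own implementation may differ in details such as tie handling.

```python
import numpy as np


def roc_auc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Probability that a random outlier scores higher than a random inlier."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as half
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


def precision_at_rank_n(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Share of true outliers among the n highest scores, n = number of true outliers."""
    n = int(y_true.sum())
    top_n = np.argsort(scores)[::-1][:n]
    return float(y_true[top_n].mean())


# Tiny made-up example: six inliers (0) and two outliers (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.8, 0.7])

print(f"ROC: {roc_auc(y_true, scores):.4f}")
print(f"precision @ rank n: {precision_at_rank_n(y_true, scores):.4f}")
```

In this example one inlier outranks both outliers, so ROC is 10/12 and only one of the top two scores is a true outlier. This shows how a model can have a high ROC and still a modest precision @ rank n, as we see with the KNN results.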
We see that the KNN() model has a good performance on the training data. Let’s plot the confusion matrix for the train set.
import scikitplot as skplt
# plot the confusion matrix for the train set
skplt.metrics.plot_confusion_matrix(y_train, y_train_pred, normalize=False, title="Confusion Matrix on Train Set")
plt.show()
372 fraud cases were predicted correctly and only 22 cases were predicted incorrectly as valid cases in the train set.
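If you prefer to read the four counts directly instead of from a plot, a confusion matrix is easy to compute by hand. The sketch below uses a hypothetical helper, `confusion_counts`, on a tiny made-up example rather than on the article's data.

```python
import numpy as np


def confusion_counts(y_true, y_pred) -> np.ndarray:
    """Return a 2x2 array: rows are true classes (0, 1), columns are predicted classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    matrix = np.zeros((2, 2), dtype=int)
    for true_label in (0, 1):
        for pred_label in (0, 1):
            matrix[true_label, pred_label] = np.sum(
                (y_true == true_label) & (y_pred == pred_label)
            )
    return matrix


# Tiny made-up example (not the article's data)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

matrix = confusion_counts(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()
print(matrix)
print(f"caught fraud (TP)={tp}, missed fraud (FN)={fn}, false alarms (FP)={fp}")
```

The bottom row of the matrix corresponds to the fraud cases: the right cell counts the fraud cases that were caught and the left cell counts the ones predicted as valid.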
We will use decision_function to predict anomaly scores of the test set using the fitted detector(KNN Detector) and evaluate the results.
y_test_scores = clf_knn.decision_function(X_test) # outlier scores
# Evaluate on the test data
evaluate_print('KNN', y_test,y_test_scores)
KNN ROC:0.9393, precision @ rank n:0.5408
Our KNN() model continues to perform well on the test set. Let’s plot the confusion matrix for the test set.
# plot the confusion matrix for the test set
y_preds = clf_knn.predict(X_test)
skplt.metrics.plot_confusion_matrix(y_test, y_preds, normalize=False,
                                    title="Confusion Matrix on Test Set")
plt.show()
87 fraud cases were predicted correctly and only 11 cases were predicted incorrectly as valid cases in the test set.
The One-class SVM detector is an unsupervised outlier detection algorithm. PyOD's implementation is a wrapper around scikit-learn's one-class SVM class with more functionality.
Let's create a One-class SVM model.
# create the OCSVM model
clf_ocsvm = OCSVM(contamination= 0.172)
clf_ocsvm.fit(X_train)
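To get a feel for how a one-class SVM separates normal points from unusual ones, here is a small sketch that uses scikit-learn's OneClassSVM directly on synthetic data. The data, the nu value and the sign flip are assumptions made for illustration, not the settings used by PyOD in this article.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
# Five isolated synthetic points far away from the main cluster
outliers = np.array([[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0],
                     [-10.0, -10.0], [0.0, 15.0]])
X = np.vstack([inliers, outliers])

# nu bounds the fraction of training points treated as errors (illustrative value)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X)

# scikit-learn returns positive values for inliers; negate so higher = more abnormal
scores = -model.decision_function(X)
print("mean inlier score :", scores[:200].mean())
print("mean outlier score:", scores[200:].mean())
```

The model learns a boundary around the dense region of the data, and points that fall far outside that boundary end up with the highest scores.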
After training our OCSVM Detector model, we can get the prediction labels on the training data and then get the outlier scores of the training data.
# Get the prediction labels of the training data
y_train_pred = clf_ocsvm.labels_ # binary labels (0: inliers, 1: outliers)
clf_name ='OCSVM'
# Outlier scores
y_train_scores = clf_ocsvm.decision_scores_
# Evaluate on the training data
evaluate_print(clf_name, y_train, y_train_scores)
OCSVM ROC:0.9651, precision @ rank n:0.7132
The OCSVM model performs better than the KNN model on the train set. Let’s plot the confusion matrix for the train set.
# plot the confusion matrix for the train set
skplt.metrics.plot_confusion_matrix(y_train, y_train_pred,
                                    normalize=False,
                                    title="Confusion Matrix on Train Set")
plt.show()
373 fraud cases were predicted correctly and only 21 cases were predicted incorrectly as valid cases in the train set.
We will use decision_function to predict anomaly scores of the test set using the fitted detector(OCSVM Detector) and evaluate the results.
y_test_scores = clf_ocsvm.decision_function(X_test) # outlier scores
# Evaluate on the test data
evaluate_print(clf_name, y_test,y_test_scores)
OCSVM ROC: 0.9571, precision @ rank n:0.6633
Our OCSVM model continues to perform well on the test set. Let’s plot the confusion matrix for the test set.
# plot the confusion matrix for the test set
y_preds = clf_ocsvm.predict(X_test)
skplt.metrics.plot_confusion_matrix(y_test, y_preds, normalize=False, title="Confusion Matrix on Test Set")
plt.show()
92 fraud cases were predicted correctly and only 6 cases were predicted incorrectly as valid cases in the test set.
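The fraud counts reported for the four confusion matrices can be turned into recall, the share of actual fraud cases that each detector caught. The sketch below uses only the counts quoted in this article; `recall` is a small helper written for this comparison.

```python
# (caught, missed) fraud counts reported for each confusion matrix
reported = {
    ("KNN", "train"): (372, 22),
    ("KNN", "test"): (87, 11),
    ("OCSVM", "train"): (373, 21),
    ("OCSVM", "test"): (92, 6),
}


def recall(caught: int, missed: int) -> float:
    """Share of actual fraud cases that the detector flagged."""
    return caught / (caught + missed)


recalls = {key: recall(*counts) for key, counts in reported.items()}
for (model, split), value in recalls.items():
    print(f"{model:5s} {split:5s} recall = {value:.4f}")
```

Note that these counts cover only the fraud row of each matrix. The number of valid transactions wrongly flagged as fraud is not quoted in the text, so precision cannot be derived from them.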
In general, when we compare these two models, we observe that the OCSVM model performs better than the KNN model. More can be done to improve the performance of the best model (OCSVM) at detecting fraudulent transactions. You can also try the other detector algorithms listed in the PyOD documentation.
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. As a business owner, you can avoid serious headaches and unwanted publicity by recognizing potentially fraudulent use of credit cards in your payment environment.
The source code for this article is available on Github.
https://github.com/Davisy/Credit-Card-Fraud-Detection-using-PYOD-Library
If you learned something new or enjoyed reading this article, please share it so that others can see it. I look forward to hearing about your experience using the PyOD library as well. I can also be reached on Twitter @Davis_McDavid
Also published at https://medium.com/analytics-vidhya/introduction-to-anomaly-detection-using-machine-learning-with-a-case-study-part-two-f78243f74d2f