Data is becoming increasingly important in almost every conceivable field and area. From business and healthcare to law enforcement and sports, data is central to their operations. It’s not enough to simply collect information, however. Instead, you need to make good use of it, and this is where data science comes into play. Anomaly detection is one of the most interesting applications of data science, and in this article, I will provide a brief overview of what it is, and how it can be used.
Anomaly detection involves identifying the differences, deviations, and exceptions from the norm in a dataset. It’s sometimes referred to as outlier detection (i.e., looking at a dataset to identify any outlying or unusual datapoints, data groups, or activity).
For example, credit card companies collect data on everything we purchase, including the amount of money we spend, where we spend it, what we spend it on, how frequently we make purchases, and more.
Anomaly detection makes this data not only useful but powerful. This is because anomaly detection algorithms analyse all the data above to identify fraudulent credit card activity within seconds of a transaction taking place.
There are many applications for anomaly detection:
Of course, there are many more applications of anomaly detection than those listed above. The crucial point is the fact that anomaly detection is becoming increasingly important and that it enables data to be used in ways it never was before.
Not all anomalies are equal. In fact, they can be split into three broad categories:
Point anomalies, Collective anomalies, Contextual anomalies
Let’s look at each in more detail.
A point anomaly is where a single datapoint stands out from the expected pattern, range, or norm. In other words, the datapoint is unexpected.
Examples of point anomalies.
We can again turn to credit card spending as an example. A point anomaly would be a single, high-value, and unusual spend on a credit card. For example, normal spending on a credit card might be small to medium-sized amounts, typically within the same geographical area and usually to purchase a similar group of products.
A point anomaly would be if that credit card was then used to make a single high-value purchase for an item not previously purchased and purchased in a location where the credit card has not been used before.
This point anomaly could indicate fraudulent activity.
Scikit-learn has good support for point anomaly detection algorithms, such as LOF.
A collective anomaly occurs where single datapoints looked at in isolation appear normal. When you look at a group of these datapoints, however, unexpected patterns, behaviours, or results become clear.
An irregular heart beat is an example of a collective anomaly.
Those unexpected occurrences could be, for example, events occurring in an order or in a combination that is unexpected.
We can use potential credit card fraud as our example again. This would occur where multiple purchases appear to fit within normal spending activity when looked at individually. When you look at these purchases as a group, however, unusual patterns and behaviour can appear.
Instead of looking at specific datapoints or groups of data, an algorithm looking for contextual anomalies will be interested in unexpected results that come from what appears to be normal activity.
The crucial element here is context: Are the results out of context?
Example of contextual anomaly detection using the Twitter AnomalyDetection package in R
A good example in this instance is a network intrusion attempt. An algorithm looking for contextual anomalies will have a baseline of activity that provides it with normal parameters. This could, for example, show the expected levels of traffic accessing the network at various times of the day.
Traffic might be at its lowest in the early hours of the morning. Therefore, a spike in traffic at 3 a.m. is a contextual anomaly that warrants further action and/or investigation as it could indicate a network intrusion attempt.
In other words, accessing the network in itself was a completely normal event. Having such high levels of traffic at three in the morning, however, was unusual; it was an anomaly.
Anomaly detection is one of the most interesting applications of data science. There are many applications of anomaly detection like fraud detection, social media monitoring, and more. Anomaly detection makes this data not only useful but powerful.