An accessible introduction to ML - no programming or math required. By the end of this tutorial, you’ll have implemented your first algorithm without touching a single line of code. You’ll use Machine Learning techniques to classify real data using basic functions in Excel. You don’t have to be a genius or a programmer to understand machine learning. Despite the popularized applications of self-driving cars, killer robots, and facial recognition, the foundations of machine learning (ML) are quite simple. This is a chance to get your feet wet and understand the power of these new techniques.
All of the data scientists are probably cringing at the title of this tutorial. Excel is generally considered to be a terrible tool for serious data analytics. It does not scale to process the large datasets we deal with in the real world and it lacks some key functionality of programming languages and machine learning libraries. You’ll see a lot of the formulas given in this tutorial are complicated to accommodate for the shortfalls and peculiarities of Excel. The reason I’m using Excel is to make this introduction accessible for non-programmers since most of us have basic knowledge of the tool. Those that choose to pursue Machine Learning and Data Science more seriously will eventually upgrade to using Python or R, but there’s no harm in starting simple.
The end goal of this tutorial is to use Machine Learning to build a classification model on a set of real data using an implementation of the k-nearest neighbors (KNN) algorithm. Don’t get overwhelmed, let’s break down what that means bit by bit.
Machine Learning is a collection of techniques to optimize models. In other words, Machine Learning takes the models we’ve built and uses real world data to “learn” how to fine tune the parameters of the model to be most useful in a real world scenario based on the training data. In this tutorial we’ll be applying machine learning to a classification model. Don’t worry if you’re not fully clear right now, by the end of the tutorial you’ll know exactly what I’m talking about.
Training Set vs Test Set
Machine Learning algorithms adapt the model based on a set of training data. Training data is a data set that contains all of the variables we have available as well as the correct classification. Training sets can be developed in a variety of ways but in this tutorial, we’ll be using a training set that was classified by a human expert. It’s important to remember that machine learning models are only as good as the training data. The more accurate your training data and the more of it you have the better. In other words — garbage in, garbage out.
A test set is typically a subset of the training data in that it also contains all variables and the correct classifications. The difference is in how we use it. While the training set helps to develop the model, the test set tries it out in a real world scenario and sees how well it fares. There are lots of complicated ways to measure error and test models but as long as you get the basic idea we can keep going.
Classification Models
A Classification Model is simply a mathematical tool to determine what category or class of something you’re dealing with based on a set of variables or inputs. For example, if I wanted to classify whether an animal was a cat or a fish, I might use variables such as whether or not the animal swims, whether or not it has fur, and whether or not it eats to determine which class it falls under. You’ll notice two things. Firstly, the more variables you have the better. With more information, you can be more confident that your classification is correct.
Secondly, some variables are more useful or predictive than others. Take the last example, whether or not the animal eats. The casual observer knows that both fish and cats eat, so having this piece of data isn’t useful in determining the class of the animal. The goal of machine learning in this context is to create the most useful classification model given the available data and to weed out the inputs that don’t improve the effectiveness of the model.
K Nearest Neighbors
K-Nearest Neighbors (KNN) is a specific type of Classification Model. The intuition is simple to understand. The model takes all of the data available about an unknown data point and compares it to a training set of data to determine which points in that training set the unknown point is most similar, or closest, to. The idea is that the unknown data point will most likely fall under the same class as the known data points it is most similar to. KNN is simply a mathematical way to determine the similarity between two data points.
The Iris Data Set
For this tutorial, we’ll be using a classic data set used to teach machine learning called the Iris Data Set. This is a collection of data about three species of the Iris flower and four pieces of data about them: sepal length, sepal width, petal length, and petal width.
The data set has already been prepared to make it easy for beginners to jump right in. You can download the data in a compatible excel format at this link by clicking “download zip” in the top right and opening the contents in Excel.
As I mentioned, this data set is meant to be simple to work with. Each of the first 4 columns (A-D)is a dimension, or feature, of the data. The fifth column, E, is the variety or the class of the flower. Each row is its own record, or data point. As you can see we have 150 known data points to work with.
We have an important decision to make: how do we want to segregate this data set into a training set and a test set. Given a bigger data set, there are optimization techniques we could use to make this decision. Since this data set is small and made for beginners, we’re just going to split it 70/30 as a matter of convention. In other words, we will use 70% of the data, or 105 data points as a training set and the remaining 45 data points as a test set.
We will now use Excel to randomly sample 70% of the data. First, add a column to your sheet called “Random Value” and use the RAND() function to randomly select a value between 0 and 1. Keep in mind that the RAND() function will re-select a new number each time your sheet recalculates.
To avoid that, after generating my numbers I’m going to copy them (Ctrl+C) and then special paste over them as values (Ctrl+Shift+V) so that they stay fixed. We will start in cell F2 and drag down to the last data point.
=RAND()
Next I’m going to rank them 1 to 150 using Excel’s RANK() function, starting in cell G2 as shown below and dragging all the way down to the last data point. Make sure to lock the reference frame as shown by hitting F4 or adding the $ signs manually or else this formula won’t work as we intend.
=RANK(F2, $F$2:$F$15)
We now have a unique value between 1 and 150 for each data point. Because we want 105 values for our training set, we’re going to add one more column and select the values ranked 1 through 105 for our training set using a quick IF() function.
Otherwise, we will add the value to our test set. Again, we will start at H2 and drag down to the last data point.
=IF(G2<=105,”Training”, “Test”)
At this point your data set should be set up like the screenshot. Remember that because we each took a different random sample the specific values in columns F-H will look different for you. You should also take a minute to add filters for our next step.
Next we will break our two sets of data into their own worksheets (or tabs) to keep things organized. Create a new worksheet called “Training Set” and filter for the “Training” data in the original worksheet. Copy this data along with the headers and paste it into your “Training Set.” You should have 106 rows (105 values + the header row). Do the same for the worksheet “Test Set.” You should have 46 rows (45 values + the header row).
At this point you can get rid of the “Iris” worksheet and delete columns F-H in both of your remaining worksheets since we’ve already segregated our data. Finally, I will add an “ID” column at the start of each worksheet and label each data point 1–105 and 1–45 respectively by simply typing in the number (dragging down the fill handle will be your friend here to save you work).
This will help us do our calculations in the next sections. Make sure each of your sets is arranged like the example below.
Our data is now ready and we can proceed to build our model. As a reminder, this model works by comparing the unknown data point we wish to classify to its nearest, or most similar, neighbors. To do that we will need to take each point in our test set and calculate its distance to each point in the training set.
The Concept of Distance
Distance is the way mathematicians determine which points are most similar in an n-dimensional space. The intuition is that the smaller the distance between the points the more similar they are. Most of us are used to calculating distance in a 2-dimensional space, such as an x,y coordinate system or using longitude and latitude.
There are several ways to calculate distance but to keep it simple we’re going to use the Euclidean distance. Below is a visualization of the Euclidean distance formula in a 2-dimensional space. As you can see, the formula works by creating a right triangle between two points and determining the length of the hypotenuse, the longest side of the triangle, as identified by the arrow.
Our data set is 4-dimensional. It’s difficult for us to visualize spaces beyond 3 dimensions, but regardless of whether or not you can visualize it we can still calculate the distance between two points the same way regardless of the number of dimensions.
Here is the generic formula for Euclidean Distance:
In plain language this is saying is that the Euclidean distance between two points, q & p, can be determined by taking each dimension for each point, starting with the first dimension, and squaring the difference between them iteratively until you’ve done so for all dimensions and added the differences together.
Then we take the square root of that sum and we have the Euclidean distance. It sounds complicated but you’ll see that it is actually quite simple to use once we get back into our data.
Calculating The Distance
In our workbook, create a new worksheet called “Distance.” Our goal for this sheet is to create a 45X105 matrix of the distances between each data point in the test set and the training set. In our case, each row will correspond to one data point in the test set and each column will correspond to one data point in the training set. Starting in A2 and working down line by line until you hit A46, fill each cell with the numbers 1–45.
Again, the fill handle is useful here so you don’t have to type the numbers one by one. Now, working from B1 and then column by column horizontally across until you hit DB1, fill each column with the numbers 1–105. Your matrix should look something like the screenshot below which shows a small portion of it.
Before moving on, you’ll need to convert your matrix to a Table so we can keep things organized. Select your entire matrix and hit Ctrl+T and then name the table “Distance_Table” and select to create the table with Headers. Next, you’ll want to name your first Column “Test ID” by typing that into cell A1.
Now that our table is set up we can start our calculations. We’ll start in cell B2 which will calculate the distance between the first point in our Training Set (ID #1) and the first point in our Test Set (ID #1). We can apply the Euclidean distance formula quickly by using the VLOOKUP function in excel to find the values for each dimension and then preforming the calculations as necessary.
It’s best to copy and paste this formula into your formula bar in Cell B2 as it handles a couple peculiarities of the Table feature in Excel, but make sure you understand that all this formula is doing is applying the Euclidean Distance formula we discussed earlier. As written you can then drag this to fill your entire table.
=SQRT(((VLOOKUP(NUMBERVALUE(Distance_Table[[#Headers],[1]]), ‘Training Set’!$A$1:$F$106, 2, FALSE)-VLOOKUP(Distance_Table[@[Test ID]:[Test ID]], ‘Test Set’!$A$1:$F$46, 2, FALSE)) ^ 2+(VLOOKUP(NUMBERVALUE(Distance_Table[[#Headers],[1]]), ‘Training Set’!$A$1:$F$106, 3, FALSE)-VLOOKUP(Distance_Table[@[Test ID]:[Test ID]], ‘Test Set’!$A$1:$F$46, 3, FALSE)) ^ 2+(VLOOKUP(NUMBERVALUE(Distance_Table[[#Headers],[1]]), ‘Training Set’!$A$1:$F$106, 4, FALSE)-VLOOKUP(Distance_Table[@[Test ID]:[Test ID]], ‘Test Set’!$A$1:$F$46, 4, FALSE)) ^ 2+(VLOOKUP(NUMBERVALUE(Distance_Table[[#Headers],[1]]), ‘Training Set’!$A$1:$F$106, 5, FALSE)-VLOOKUP(Distance_Table[@[Test ID]:[Test ID]], ‘Test Set’!$A$1:$F$46, 5, FALSE)) ^ 2))
You should end up with something like this:
Finding Nearest Neighbors
At this stage we have calculated the distance between every point in our test set and every point in our training set. Now we need to identify the closest neighbors to each point in our test set. Create a new worksheet called “Nearest Neighbors” and starting at A2 work down line by line to fill the cells with the numbers 1–45 to correspond with the points in our Test Set. Our columns are not going to represent the Training Set like they have on previous sheets. Instead, these are going to represent the 6 closest neighbors, starting with the 1st closest and then the second closest and so on. The 1st closest neighbor has the smallest distance, the 2nd closest neighbor has the second smallest distance and so on. Your sheet should look like this:
As we have before we will write a formula in cell B2 that can be dragged to fill the rest of our matrix. Our approach is to identify the smallest value in our corresponding row (2) in the distance table, find the column number for that value, and then return the column name since that will give us the ID of the value in the Training Set.
We will use a combination of the Index and the Match functions to achieve this. Note that we’re able to make this formula simple because we had the foresight to set up our Distance matrix as a table in Excel and so we can easily pull in the headers.
=INDEX(Distance_Table[#Headers], MATCH(SMALL(Distance!$B2:$DB2, 1), Distance!2:2, FALSE))
Drag this formula to fill the top row of your nearest neighbors matrix. You will need to manually adjust the bold value in the SMALL() function to represent the neighbor we’re looking for. So for example, to find the second nearest neighbor the formula would be as follows.
=INDEX(Distance_Table[#Headers], MATCH(SMALL(Distance!$B2:$DB2, 2), Distance!2:2, FALSE))
Remember your values will be different since your random sample used to form the Test Set is different from mine.
At this stage, I usually take a minute to double check one of the rows manually when feasible just to make sure my formulas are working as expected. At scale you’ll want to use automated testing, but for now we’re keeping it simple.
We have one last step: we need to identify the classification of each of our nearest neighbors. We’ll go back to the formula in B2 and modify it to do a VLOOKUP of the ID in the Training Set and return the classification. We’ll then drag that to fill the matrix.
=VLOOKUP(NUMBERVALUE(INDEX(Distance_Table[#Headers], MATCH(SMALL(Distance!$B2:$DB2, 1), Distance!2:2, FALSE))), ‘Training Set’!$A$1:$F$106, 6, FALSE)
Taking A Step Back
Let’s take a step back and look at what we’ve accomplished. You’ve now identified for each point in your test set the classification for the 6 nearest neighbors. You will likely notice that for all or almost all of your data points the 6 nearest neighbors will all fall into the same classification. This means that our data set his highly clustered. In our case, our data is highly clustered for two reasons.
Firstly, as we discussed at the start of the tutorial the data set is designed to be easy to work with. Secondly, this is a low-dimensional data set since we are only working with 4 dimensions. As you deal with real-world data, you will typically find that it is far less clustered especially as the number of dimensions increases. The less clustered your data, the larger the training set will need to be to build a useful model.
If our data was always as neatly clustered as the Iris Data Set there would be no need for machine learning. We would simply find the nearest neighbor using our formula and use that to determine the classification of each unknown data point. Since this is not usually the case, machine learning helps us more accurately predict the classification of an unknown data point by looking at multiple neighbors at once.
But how many neighbors should we look at? That’s where the “K” in K-Nearest Neighbors comes in. K describes the number of neighbors we’ll consider when predicting the classification of an unknown data point.
Too Few or Too Many Neighbors
Intuitively, it’s important to understand why this problem is tricky. It is possible to look at too few neighbors and also too many neighbors. Especially as the number of dimensions increase, it is possible that the nearest neighbor is not always the correct classification. Looking at too few neighbors limits the amount of information your model has available to make its determination.
Considering too many neighbors will actually degrade the quality of the information your model uses as an input. This is because as more neighbors are introduced you are also introducing noise to the data. Just think about it — it wouldn’t make sense to consider all 104 neighbors in our example! See a visual representation of this concept below.
Thus this becomes a classic optimization problem where we attempt to find the K value that gives the most information without being too high or too low.
Using Your Test Set
For this tutorial, we’ll use a very simple process of trial & error to determine the optimal K value. Before we move on, I recommend looking at your Nearest Neighbors worksheet and making a guess as to what the best k value might be, just for fun. We’ll find out soon enough if you’re right!
Setting Up The Algorithm
An algorithm is just a set of steps for a computer to repeat over and over again according to a defined set of rules. In this case, we will tell the computer to try different K values, calculate the rate of error for each one using our test set, and then ultimately return the value that produces the lowest error rate.
To do this we’ll need to create a new worksheet called “KNN Model.” We’ll set it up as follows, labeling rows A4 through A48 with 1–45 for each of our test data points.
Let's start with the predicted value in Column B. We need this formula to adjust based on the K value. In the case taht the K value is 1, the formula is simple, we just take the closest neighbor.
=’Nearest Neighbors’!B2
In the case that the K Value is greater than 1, we’re going to take the most common neighbor that appears. If the occurrence of neighbors is equally distributed, for example if 3 of the neighbors are Setosa and 3 of the neighbors are Virginica when K=6, we’ll side with the classification of the closest neighbor.
The formula for K=2 would be as follows. We use IFERROR because this formula returns an error when there are two neighbors that occur an equal number of times for the given K value.
=IFERROR(INDEX(‘Nearest Neighbors’!B2:C2,MODE(MATCH(‘Nearest Neighbors’!B2:C2,’Nearest Neighbors’!B2:C2,0))), ‘Nearest Neighbors’!B2)
You’ll want to use the expanded formula below in cell B4 which enables you to use K values up to and including K=6. No need to worry about the specifics of this formula, just copy and paste it. By the way, having to use complicated, finicky, and hard to understand formulas like these are one of the limitations of Excel I was referring to earlier.
This would have been a piece of cake in Python. Note that this formula will return an error if there is a no value in K or a value not between 1 and 6. You should copy this formula from cell B4 down Column B.
=IFS($B$1=1, ‘Nearest Neighbors’!B2, $B$1=2, IFERROR(INDEX(‘Nearest Neighbors’!$B$2:$C$2,MODE(MATCH(‘Nearest Neighbors’!$B$2:$C$2,’Nearest Neighbors’!$B$2:$C$2,0))), ‘Nearest Neighbors’!B2), $B$1=3, IFERROR(INDEX(‘Nearest Neighbors’!$B$2:$D$2,MODE(MATCH(‘Nearest Neighbors’!$B$2:$D$2,’Nearest Neighbors’!$B$2:$D$2,0))), ‘Nearest Neighbors’!B2), $B$1=4, IFERROR(INDEX(‘Nearest Neighbors’!$B$2:$E$2,MODE(MATCH(‘Nearest Neighbors’!$B$2:$E$2,’Nearest Neighbors’!$B$2:$E$2,0))), ‘Nearest Neighbors’!B2), $B$1=5, IFERROR(INDEX(‘Nearest Neighbors’!$B$2:$F$2,MODE(MATCH(‘Nearest Neighbors’!$B$2:$F$2,’Nearest Neighbors’!$B$2:$F$2,0))), ‘Nearest Neighbors’!B2),$B$1=6, IFERROR(INDEX(‘Nearest Neighbors’!$B$2:$G$2,MODE(MATCH(‘Nearest Neighbors’!$B$2:$G$2,’Nearest Neighbors’!$B$2:$G$2,0))), ‘Nearest Neighbors’!B2))
Next, we want to pull in the actual, known classification of each test point so we can determine if our model was right or not. For this we use a quick VLOOKUP in Column C, starting in cell C4 and dragging down.
=VLOOKUP(A4, ‘Test Set’!$A$1:$F$46, 6, FALSE)
Then we’ll set up a formula in Column D to return a 1 if the prediction was incorrect, or in error, and a 0 if the prediction was correct. You’ll start in cell D4 and drag the formula down.
=IF(B4=C4, 0, 1)
Finally we’ll calculate the error rate by dividing the number of errors by the total number of data points, using this formula in cell B2. As a matter of convention we will format this as a percentage.
=SUM(D4:D48)/COUNT(D4:D48)
Running the Algorithm
We’re now ready to run our algorithm for different K values. Because we’re only testing 6 values, we could do it by hand. But that would be no fun and more importantly doesn’t scale. You’ll need to enable the Solver Add-In for Excel following the instructions in this article before we proceed.
Now, navigate to the Data ribbon and click the Solver button. The solver button does the trial and error for us automatically according to our instructions. You’ll have a dialogue box of parameters, or instructions, which you’ll want to set up as shown below.
We’re setting it up so that it seeks to minimize the error rate while testing values between 1 and 6, only testing integer values.
Excel will spin for a minute and you may see it flash a few values on your screen before getting this dialogue box. You should click OK to Keep Solver Solution.
Interpreting The Error Rate and Solver Solution
Many optimization algorithms have multiple solutions due to the fact that the data has multiple minima or maxima. This happened in my case. In fact, in my particular case, all integer values 1 through 6 represent minima with an error rate of approximately 2%. So what do we do now?
A few things run through my head. First, this test set isn’t very good. The model didn’t gain any optimization benefits from the test set and as such, I would probably re-do the test set and try again to see if I get different results. I’d also consider using more sophisticated methods of testing such as cross validation.
At an error rate this low in my test set, I also start to worry about over-fitting. Over-fitting is a problem that occurs in machine learning when a model is too tailored to the nuances of a particular training or test data set. When a model is over-fit it is not as predictive or effective when encountering new data in the wild.
Of course, with an academic data set like this we’d expect our error rate to be fairly low.
The next consideration is which value to choose if I have identified several minima. While the test wasn’t effective in this particular example, generally I would pick the lowest number of neighbors that is at a minima to conserve computing resources. My model will run faster if it has to consider fewer neighbors. It won’t make a difference with a small data set but decisions like this conserve substantial resources at scale.
Kudos! You’ve learned the basics of machine learning and implemented the KNN algorithm all without leaving the confines of Excel. Remember that Excel is merely a tool and that the important part is that you understand the intuition and concepts that make this approach work.
Understanding the fundamentals will help you as you dive deeper into Data Science and Machine Learning and start to develop your own models.
Previously published at https://towardsdatascience.com/machine-learning-made-simple-with-excel-1bffd7901502