Categorical encoding is undoubtedly an integral part of data pre-processing in machine learning. Although the most common techniques are label encoding and one-hot encoding, there are many other efficient methods that students and beginners often overlook when preparing data for a statistical model. Target encoding is one such efficient technique, widely accepted as one of the pre-eminent approaches to handling categorical variables. In this article, we will discuss the intuition, a worked example, and the pros and cons of this technique in a cogent manner.
Mean target encoding is a special kind of categorical data encoding technique used as part of the feature engineering process in machine learning.
Machine learning algorithms cannot work with categorical variables stored as strings. A categorical variable is one that has two or more categories. For example, gender is a categorical variable having two categories (male and female).
Any machine learning algorithm has to represent these variables as vectors in a multidimensional space for purposes such as generating hyperplanes, calculating distances, and grouping data points.
When the variables are stored as strings, as in the example above, they cannot be placed in that multidimensional space at uniquely identifiable points on the axes. Therefore, we have to convert (encode) them into numerical values corresponding to each category. Target mean encoding is one such technique. Let us understand how it works through a real-life example.
Consider that we need to predict whether a person is male or female using three independent variables-
Our data looks like this -
Here, “Age” and “Salary” are numerical columns, so they need no encoding. However, “Profession” is a categorical column with 3 categories (Engineer, Lawyer and Nurse), so we need to apply target mean encoding to it.
In the target column, i.e. “Gender”, we have 2 classes - “Male” and “Female”. We can treat “Male” as the positive class and “Female” as the negative class for this problem.
First of all, for each of the “Profession” categories, we need to measure how frequently the positive and negative target classes occur.
For example,
Let us take the case of Engineers.
We have a total of 5 engineers, and 4 of them are male (the positive class). Hence, the fraction of the positive class for this category is 4/5 = 0.8.
Similarly, we can do this task for the Lawyers
Fraction = ½ = 0.5
The same process has to be done for the Nurses
Fraction = 1/3 ≈ 0.33
The final inference after this procedure will look like this-
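This computation can be sketched in pandas. The data frame below is a hypothetical reconstruction matching the stated counts (5 engineers with 4 males, 2 lawyers with 1 male, 3 nurses with 1 male); the mean of a binary target per category is exactly the fraction of the positive class.

```python
import pandas as pd

# Hypothetical training data matching the worked example:
# 5 Engineers (4 male), 2 Lawyers (1 male), 3 Nurses (1 male)
df = pd.DataFrame({
    "Profession": ["Engineer"] * 5 + ["Lawyer"] * 2 + ["Nurse"] * 3,
    "Gender":     ["Male"] * 4 + ["Female"]
                + ["Male", "Female"]
                + ["Male", "Female", "Female"],
})

# Treat "Male" as the positive class (1) and "Female" as negative (0)
target = (df["Gender"] == "Male").astype(int)

# Mean of the binary target per category = fraction of positives
means = target.groupby(df["Profession"]).mean()
print(means)  # Engineer 0.8, Lawyer 0.5, Nurse ~0.33
```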
Now, we can just replace these numerical values for each category in the feature column-
We can eliminate the original column since we have the encoded values with us-
Now, we can see that all the features are turned into numerical columns. Hence, we can proceed with this data for implementing any machine learning algorithm.
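Concretely, the replacement step might look like this in pandas. The rows and the encoding map are illustrative, taken from the worked example; overwriting the column in place also takes care of dropping the original string values.

```python
import pandas as pd

# Hypothetical rows with the example's feature columns
df = pd.DataFrame({
    "Age":        [25, 32, 41],
    "Salary":     [50000, 72000, 61000],
    "Profession": ["Engineer", "Lawyer", "Nurse"],
})

# Mean target values derived above (fraction of positive class per category)
encoding = {"Engineer": 0.8, "Lawyer": 0.5, "Nurse": 0.33}

# Replace each category with its encoded value; the original
# string column is overwritten, so no separate drop is needed
df["Profession"] = df["Profession"].map(encoding)
print(df)
```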
Notice that we have used the target column values to encode our feature column. Naturally, you may wonder how we can encode the feature column of the test data, since in the test data we will not have access to the target column values.
The answer is simple.
We do all of the above on the training set. Once we derive the mean target value for each category like this-
We can simply replace the categories in the test data with these same mean target values.
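A minimal sketch of applying the training-set encoding to test data, assuming the mean target values computed above. The global-mean fallback for categories unseen in training (such as the hypothetical "Doctor" below) is a common convention, not part of the original walkthrough.

```python
import pandas as pd

# Mean target values learned on the training set (from the example)
encoding = {"Engineer": 0.8, "Lawyer": 0.5, "Nurse": 0.33}
global_mean = 0.6  # overall fraction of the positive class in training (6/10)

test = pd.DataFrame({"Profession": ["Nurse", "Engineer", "Doctor"]})

# Categories never seen in training ("Doctor" here) have no learned
# mean, so they fall back to the global mean
test["Profession"] = test["Profession"].map(encoding).fillna(global_mean)
print(test)
```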
Since the encoded values are derived from the target column, there is a risk of target leakage. Researchers should take appropriate precautions to make sure that the target information is not fully exposed to the feature variable. Many practitioners use cross-validation to overcome this challenge.
For example,
Let’s assume that we split our data into 4 folds. We keep one fold as the holdout set and calculate the mean target values using the remaining 3 folds.
Then, substitute these values for the categories present in the holdout set.
Repeat this procedure for all other folds-
Finally, calculate the mean value of each category.
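The fold-by-fold procedure described above can be sketched as follows. The data frame values are illustrative (same category counts as the worked example), and using scikit-learn's `KFold` for the splits is an implementation assumption, not something prescribed by the article.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical training frame with the same category counts as the example
df = pd.DataFrame({
    "Profession": ["Engineer"] * 5 + ["Lawyer"] * 2 + ["Nurse"] * 3,
    "target":     [1, 1, 1, 1, 0,   1, 0,   1, 0, 0],
})
global_mean = df["target"].mean()

encoded = pd.Series(index=df.index, dtype=float)
kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, hold_idx in kf.split(df):
    # Mean target per category, computed only on the other folds
    fold_means = df.iloc[train_idx].groupby("Profession")["target"].mean()
    # Substitute into the holdout fold; a category absent from the
    # training folds falls back to the global mean
    encoded.iloc[hold_idx] = (
        df["Profession"].iloc[hold_idx].map(fold_means).fillna(global_mean).values
    )

df["Profession_encoded"] = encoded
print(df)
```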
Let's say the data looks like this after the k-fold cross-validation-
Take the mean value for each of the categories -
Engineer : (0.7 + 0.8 + 0.8 + 0.9 + 0.9)/5 = 0.82
Nurse : (0.4 + 0.4 + 0.3)/3 ≈ 0.37
Lawyer : (0.5 + 0.6)/2 = 0.55
Substitute these values for the corresponding categories-
Now, this data can be used for further model-building procedures.
To summarize, encoding categorical data is an unavoidable part of feature engineering. It is essential to understand encoding types beyond the popular ones because, in real-world problems, we often have to choose a specific encoding method for the model to work properly. It should be noted that different encoders can change the model's results. Hence, selecting the encoding type for a particular dataset/problem is one of the critical decisions a researcher makes.
I hope this article gave you a basic understanding of the concept behind target encoding. This technique is available as a built-in library in most data-science-oriented programming languages, such as Python and R, so it is easy to implement once you understand the theoretical intuition. I have added links to some advanced materials in the references section, where you can dive deep into the more complex calculations if you are interested.