Hinge Loss - A Steadfast Loss Evaluation Function for the SVM Classification Models in AI & ML

by Sanjay Kumar, January 4th, 2023


Machine learning is, at its core, an optimisation problem. Researchers use an algebraic measure called a “loss” in order to optimise the machine learning space defined by a specific use case. A loss can be seen as a distance between the true values of the problem and the values predicted by the model. The greater the loss, the larger the errors made on the data. Most performance evaluation metrics, such as accuracy, precision, recall and F1 score, are indirect derivations of loss functions. Researchers have implemented many loss functions, such as-


  • For regression problems - Mean Squared Error, Mean Absolute Error, Huber Loss, Log-Cosh Loss, Quantile Loss etc.
  • For binary classification problems - Binary Cross-Entropy Loss, Hinge Loss etc.
  • For multi-class classification problems - Multi-Class Cross-Entropy Loss, KL-Divergence etc.


In this article, I will introduce you to a loss metric called “hinge loss”, which is discussed in some of the most recommended textbooks on predictive modelling. I hope the explanation, both visual and mathematical, will be lucid enough to help beginners in the machine learning field.

The concept behind the Hinge loss

Hinge loss is a function popularly used in support vector machine algorithms to measure the distance of data points from the decision boundary. This distance reflects how likely a prediction is to be incorrect and is used to evaluate the model's performance.
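As a minimal sketch of the idea (the formulation max(0, 1 − y·f(x)) is the standard one; the variable names are mine):

```python
def hinge_loss(y_true, score):
    """Hinge loss for a single data point.

    y_true -- the actual class label, +1 or -1
    score  -- the raw model output f(x), i.e. the signed distance
              of the point from the decision boundary
    """
    return max(0.0, 1.0 - y_true * score)

# A confident, correct prediction costs nothing; a prediction on the
# wrong side of the boundary is penalised linearly.
print(hinge_loss(+1, 2.5))   # 0.0
print(hinge_loss(+1, -1.5))  # 2.5
```

Note that even a correct prediction is penalised slightly when its score lies inside the margin (for example, `hinge_loss(+1, 0.5)` is 0.5), which is what pushes SVM training towards confident decision boundaries.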


Some of the other popularly used loss functions in classification algorithms are-

  • Gini Impurity
  • Logarithmic loss
  • Misclassification error etc.


The support vector machine is a supervised machine learning algorithm that is popularly used for predicting the category of labelled data points.


For example-

  • Predicting whether a person is male or female
  • Predicting whether the fruit is an apple or orange
  • Predicting whether a student will pass or fail the exams etc.


SVM uses an imaginary plane that can travel across multiple dimensions for its prediction purpose. These imaginary planes are called hyperplanes. It is very difficult to imagine higher dimensions, since the human brain is naturally capable of visualising only up to 3 dimensions.


Let’s take a simple example to understand this scenario.

We have a classification problem to predict whether a student will pass or fail the examination. We have the following features as independent variables-

  • Marks in internal exams
  • Marks in projects
  • Attendance percentage


So, these 3 independent variables become 3 dimensions of a space like this-


Image source: Illustrated by the author


Let’s consider that our data points look like this where-

  • The green colour represents the students who passed the examination

  • The red colour represents the students who failed the examination


Now, SVM will create a hyperplane that travels through these 3 dimensions to differentiate the failed and passed students-

Image source: Illustrated by the author


So, technically, the model now understands that every data point that falls on one side of the hyperplane belongs to the students who passed the exams, and every point on the other side belongs to the students who failed. This hyperplane is called the decision boundary or maximum-margin hyperplane. The distance from a data point to the decision boundary shows the strength of the prediction.


The following image shows better visualization-

Image source: Illustrated by the author


Logically,

  • If the distance between the decision boundary and the data point is relatively large then it means that the model is somewhat confident about its prediction.
  • If the distance between the decision boundary and the data point is relatively low then it means that the model is less confident about its prediction.
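This notion of confidence can be made concrete. For a hyperplane w·x + b = 0, the signed distance of a point x from it is (w·x + b) / ‖w‖; the sign tells which side of the boundary the point falls on, and the magnitude reflects confidence. A hypothetical sketch (the weights and points below are made up purely for illustration):

```python
import numpy as np

def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Assumed weights and bias for a 3-D feature space
# (internal marks, project marks, attendance) -- illustrative only.
w = np.array([1.0, 2.0, 2.0])
b = -3.0

near = signed_distance(w, b, np.array([1.0, 1.0, 0.5]))  # close to the boundary
far = signed_distance(w, b, np.array([3.0, 3.0, 3.0]))   # deep in the positive side
print(near, far)  # the larger the magnitude, the more confident the prediction
```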


The following image will give you a better intuition -


Image source: Illustrated by the author


Here,

The value of the decision boundary is zero.

  • “+” indicates a prediction of the positive class.
  • “-” indicates a prediction of the negative class.
  • y·f(x) is the margin of the prediction: the true label y multiplied by the raw model output f(x). The hinge loss itself is max(0, 1 − y·f(x)).


There are 2 primary scenarios where the researchers use hinge loss in SVM-


Scenario 1 (In training data): To optimally build a model in multi-dimensional space which reduces the misclassification and strengthens the decision-making ability.

It also helps to build the best-fit decision boundary by selecting, out of many candidate boundaries, the one with the minimum hinge loss via trial and error or hyperparameter tuning (this is similar to finding the best-fit line in linear regression during training).


Scenario 2 (In testing data): To evaluate the performance of the SVM model.

Let us understand the calculation of hinge loss in SVM with respect to scenario 1. The below image is a visual representation of the hinge loss function.

Image source: Illustrated by the author


  • The X-axis shows the distance from the decision boundary to the data point.
  • The Y-axis shows the loss size or penalty that the hinge loss function calculates for a specific data point.


The dotted line marks the value 1 on the X-axis. If a data point is correctly predicted by the model and its margin from the decision boundary is greater than 1, then the loss is zero.


If the data point is placed exactly at the decision boundary then the hinge loss will have a value of 1 (Obviously, the distance between the decision boundary and the data point will be zero).

Depending on which side of the decision boundary a data point falls, there are 2 possibilities-


Image source: Illustrated by the author


Possibility 1: The data point lies in the positive direction from the decision boundary (data point 1 in the above image).

Possibility 2: The data point lies in the negative direction from the decision boundary (data point 3 in the above image).


In possibility 1, the hinge loss does not increase, i.e. the loss value will be low or zero.

For example,

Let’s assume that the value of the decision boundary is 0 and the score of the data point is +2.5, with the actual class being positive. This is a correct, confident prediction: the margin is +2.5, which is beyond 1, so the hinge loss is zero (this is depicted as data point 1 in the above image).


In possibility 2, the hinge loss increases rapidly, i.e. the loss value will be high.

For example,

Let’s assume that the value of the decision boundary is 0 and the score of the data point is -1.5, while the actual class is positive, so the prediction is wrong. The margin is -1.5, and the hinge loss is 1 - (-1.5) = 2.5, which is high (this is depicted as data point 3 in the above image).


Let us calculate the hinge loss for these 2 possibilities-

Image source: Illustrated by the author

As we discussed earlier,

  • The value of the hinge loss is 1 when the score of the data point is 0, i.e. the point sits exactly on the decision boundary.

  • According to possibility 1, if the score of the data point is +2.5 (and the actual class is positive), then the hinge loss of that prediction is zero.

  • According to possibility 2, if the score of the data point is -1.5 (while the actual class is positive), then the hinge loss of that prediction is 1 - (-1.5) = 2.5.
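These three values can be checked with the standard formula max(0, 1 − y·f(x)), assuming (as in the figure) that the true class is positive in all three cases:

```python
def hinge_loss(y_true, score):
    # Standard hinge loss: zero beyond the margin, growing linearly
    # as the point moves back through and past the decision boundary.
    return max(0.0, 1.0 - y_true * score)

print(hinge_loss(+1, 0.0))   # point on the boundary         -> 1.0
print(hinge_loss(+1, 2.5))   # possibility 1 (positive side) -> 0.0
print(hinge_loss(+1, -1.5))  # possibility 2 (negative side) -> 2.5
```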


Similarly, the hinge loss for every data point can be calculated for a model. When an SVM model is constructed in a multidimensional plane, we should always try to minimise the hinge loss as much as possible to increase the predictive ability of the model.
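To make "minimising the hinge loss" concrete, here is a small sketch of sub-gradient descent on the average hinge loss over a synthetic two-dimensional dataset (the data, learning rate and iteration count are assumptions of mine, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly linearly separable data: class +1 when x0 + x1 > 0.
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.zeros(2)   # hyperplane weights
b = 0.0           # hyperplane bias
lr = 0.1          # learning rate

for _ in range(200):
    margins = y * (X @ w + b)
    viol = margins < 1  # points that are misclassified or inside the margin
    # Sub-gradient of the average hinge loss; only violating points contribute.
    grad_w = -(y[viol, None] * X[viol]).sum(axis=0) / len(y)
    grad_b = -y[viol].sum() / len(y)
    w -= lr * grad_w
    b -= lr * grad_b

avg_loss = np.maximum(0.0, 1.0 - y * (X @ w + b)).mean()
print(f"average hinge loss after training: {avg_loss:.4f}")
```

Starting from w = 0 the average loss is exactly 1 (every margin is 0); each step shifts the boundary so that fewer points violate the margin, driving the loss down.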


A real-time example with a sample dataset

Imagine that we have a binary classification problem to predict whether a student will pass/fail the examination based on the following predictor variables-


  • Attendance percentage
  • Marks scored in the internal exams
  • Marks scored in the assignments


We trained our model with 1000 records and now we have the following table as the test data-


Image source: Illustrated by the author


We evaluate the model using the following test data and make predictions. Our predictions are as follows-


The predicted value will be a number between -1 and +1, with a margin of 0.2. If the value is less than or equal to zero, the predicted class is taken as -1; if the value is greater than zero, the predicted class is taken as +1.


Since the margin is 0.2 and the decision boundary is 0, the hinge loss is calculated for –

  • All incorrect predictions

  • All correct predictions within the range of [-0.2, +0.2]
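A small sketch of this bookkeeping (the function name and the margin default are mine):

```python
def contributes_hinge_loss(score, actual_class, margin=0.2):
    """Decide whether a test point incurs hinge loss under the scheme above:
    every misclassified point does, and so does every correctly classified
    point whose score falls inside the margin band [-0.2, +0.2]."""
    predicted_class = -1 if score <= 0 else +1
    if predicted_class != actual_class:
        return True
    return -margin <= score <= margin

print(contributes_hinge_loss(0.9, +1))   # confident and correct  -> False
print(contributes_hinge_loss(0.1, +1))   # correct, inside margin -> True
print(contributes_hinge_loss(-0.4, +1))  # misclassified          -> True
```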


    Image source: Illustrated by the author

The Hinge loss is as follows-

Image source: Illustrated by the author


  • The data points with unique Ids 1, 5, 7, 8 and 9 are predicted correctly, so there is no hinge loss for those instances.
  • The data points with unique Ids 4, 6 and 10 are predicted incorrectly, so hinge loss is calculated for those data points.
  • The data points with unique Ids 2 and 3 are predicted correctly, but their values fall within the margin, so hinge loss is calculated for them as well.
  • The total hinge loss of the model is the mean of all these values = (1.6 + 1.8 + 3.1 + 3 + 5) / 10 = 1.45
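The final figure can be reproduced directly. The five non-zero per-point losses below are read off the table in the image (they belong to the points with Ids 2, 3, 4, 6 and 10 mentioned above); the remaining five points contribute zero:

```python
# Non-zero hinge losses from the test table; the other 5 points incur no loss.
losses = [1.6, 1.8, 3.1, 3.0, 5.0]
n_test_points = 10

# Total hinge loss of the model = mean over ALL test points,
# including the zero-loss ones.
total_hinge_loss = sum(losses) / n_test_points
print(round(total_hinge_loss, 2))  # 1.45
```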


Conclusion

Although hinge loss might look a little complicated at first, I hope this article has given you a fundamental intuition about the concept. Many complex use cases that demand a binary classification algorithm (especially support vector machines) can be solved optimally using this technique. The metric is available as a built-in function in most data-science-oriented languages such as Python and R, so it is easy to implement once you understand the theoretical intuition. I have added links to some advanced materials in the references section, where you can dive deeper into the calculations if you are interested.

References