1. Initialize the ensemble with a first simple algorithm b0.
2. On each iteration, build a shift vector s = (s1, ..., sl), where si is the anti-gradient of the loss function at the current ensemble prediction for the i-th training object.
3. Fit the next base algorithm bN to this shift vector on the training sample, so that bN(xi) ≈ si.
4. Add bN to the ensemble.
Simple Python implementation using decision trees as weak learners:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoosting:
    def __init__(self, n_estimators=100, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.trees = []
        self.initial_prediction = None

    def _negative_gradient(self, y_true, y_pred):
        # For MSE loss, the anti-gradient is simply the residual y - F(x)
        return y_true - y_pred

    def fit(self, X, y):
        n = len(y)
        # Start from a constant model: the mean of the targets
        self.initial_prediction = np.mean(y)
        self.trees = []
        F = np.full(n, self.initial_prediction)
        for m in range(self.n_estimators):
            # Fit a shallow tree to the anti-gradient (the residuals)
            residuals = self._negative_gradient(y, F)
            tree = DecisionTreeRegressor(max_depth=2)
            tree.fit(X, residuals)
            self.trees.append(tree)
            # Shift the ensemble prediction towards the targets
            F += self.learning_rate * tree.predict(X)
        return self

    def predict(self, X):
        F = np.full(len(X), self.initial_prediction)
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        return F
# Usage example (load_your_data() and X_test stand in for your own dataset)
X, y = load_your_data()
gb = GradientBoosting(n_estimators=100, learning_rate=0.1)
gb.fit(X, y)
y_pred = gb.predict(X_test)
This is a basic implementation for demonstration purposes. Something similar, with additional optimizations, can be found in the scikit-learn library as sklearn.ensemble.GradientBoostingRegressor or sklearn.ensemble.GradientBoostingClassifier.
The concept of boosting was introduced by Robert Schapire in 1990, but it gained widespread attention with the development of the AdaBoost algorithm by Yoav Freund and Robert Schapire in 1995.
AdaBoost, short for Adaptive Boosting, is an ensemble learning method that combines multiple weak classifiers to form a strong classifier.
It works by iteratively training classifiers on weighted training data, where the weights are updated based on the errors made by the previous classifiers.
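As an illustration of that procedure, a minimal sketch using scikit-learn's AdaBoostClassifier; the synthetic dataset and parameter values are illustrative, and by default the classifier boosts depth-1 decision trees (stumps):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, used only for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# AdaBoost reweights the training samples after each round,
# focusing later stumps on the examples misclassified so far
model = AdaBoostClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))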
In 1999, Jerome H. Friedman introduced the Gradient Boosting Machines (GBM) algorithm. GBM is another ensemble learning method that uses gradient descent to optimize the loss function.
Instead of adjusting the weights of training instances, GBM adds new decision trees to the model to correct the residual errors made by the existing trees. This way, GBM combines multiple weak decision trees into a single powerful model.
XGBoost builds directly on this idea. While the core concept of boosting remains the same, building an ensemble of weak learners (decision trees) sequentially to minimize the loss function, XGBoost incorporates a number of enhancements for better performance and efficiency, most notably built-in regularization. A basic usage example:
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# California housing data (load_boston is no longer available in scikit-learn)
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)

model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
LightGBM takes a different approach to tree structure: it grows trees leaf-wise, always splitting the leaf with the maximum delta loss. A basic usage example:
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# California housing data (load_boston is no longer available in scikit-learn)
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)

train_data = lgb.Dataset(X_train, label=y_train)
params = {"objective": "regression", "metric": "rmse", "learning_rate": 0.1, "num_leaves": 31}
model = lgb.train(params, train_data, num_boost_round=100)
predictions = model.predict(X_test)
CatBoost uses a symmetric (oblivious) tree growth algorithm, where every node at the same depth has the same split condition. This structure reduces the complexity of the model and speeds up the training process, while still allowing for accurate predictions.
It also uses ordered boosting to reduce overfitting. In traditional gradient boosting, the models are trained sequentially, with each model trying to correct the errors made by its predecessors. Because the residuals for each sample are computed with models that were themselves fitted on that same sample, this process can lead to overfitting, especially when the training dataset is small or noisy.
To address this issue, ordered boosting changes how the residuals are estimated. CatBoost works with random permutations of the training data: for every sample, the residual is computed using a model trained only on the samples that precede it in the permutation, so no sample contributes to the model that evaluates it.
Repeating this over several permutations and combining the results gives a less biased estimate of the residuals, and therefore a more faithful picture of how the model behaves on unseen data, which reduces overfitting. A basic usage example:
import catboost as cb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# California housing data (load_boston is no longer available in scikit-learn)
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)

model = cb.CatBoostRegressor(iterations=100, learning_rate=0.1, depth=6)
model.fit(X_train, y_train, verbose=False)
predictions = model.predict(X_test)
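Ordered boosting can also be requested explicitly through the boosting_type parameter; a minimal sketch continuing from the snippet above, with illustrative parameter values:

# Explicitly request ordered boosting (values here are illustrative, not tuned)
ordered_model = cb.CatBoostRegressor(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    boosting_type="Ordered",
)
ordered_model.fit(X_train, y_train, verbose=False)
ordered_predictions = ordered_model.predict(X_test)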
In conclusion, the evolution of gradient boosting techniques, from the pioneering AdaBoost to the more recent and sophisticated XGBoost, LightGBM, and CatBoost, has led to significant improvements in both the performance and efficiency of ensemble learning methods.
CatBoost is particularly useful when dealing with categorical features, XGBoost is known for its high performance and regularization capabilities, and LightGBM is designed for efficiency and speed.
Choosing the right library depends on the specific problem you are trying to solve and the characteristics of your dataset. It is generally a good idea to try all three to see which one performs best for your particular use case.
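As a rough sketch of that advice, the snippet below fits all three libraries on the same split and compares test RMSE; the dataset and hyperparameters are illustrative and untuned:

import numpy as np
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=42)

# Comparable default-ish settings for each library
models = {
    "XGBoost": xgb.XGBRegressor(n_estimators=100, learning_rate=0.1),
    "LightGBM": lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1),
    "CatBoost": cb.CatBoostRegressor(iterations=100, learning_rate=0.1, verbose=False),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: test RMSE = {rmse:.3f}")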
And, as we continue to push the boundaries of what is possible with machine learning, the lessons learned from the development and application of gradient-boosting techniques will no doubt continue to guide and inspire future innovations in the field.