Tree-based Models in Python

Learn how to use tree-based models and ensembles for regression and classification with scikit-learn in Python (notes from DataCamp).

Classification and Regression Trees

Classification and Regression Trees (CART) are a set of supervised learning models used for problems involving classification and regression.

  • Decision-Tree: data structure consisting of a hierarchy of nodes
  • Node: question or prediction
    • Root: no parent node, question giving rise to two children nodes.
    • Internal node: one parent node, question giving rise to two children nodes.
    • Leaf: one parent node, no children nodes → prediction.

Classification Tree

  • A sequence of if-else questions about individual features; the tree learns patterns from the features in such a way as to produce the purest leaves.
    • In the end, in each leaf, one class-label is predominant.
  • Objective: infer class labels
  • Able to capture non-linear relationships between features and labels
  • Don’t require feature scaling (e.g. standardization)

Decision Regions

  • Decision region: region in the feature space where all instances are assigned to one class label.
  • Decision Boundary: surface separating different decision regions (visualized in the sketch below).
    • Linear boundary
    • Non-linear boundary
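To make the idea concrete, here is a minimal sketch of plotting the decision regions of a fitted tree. It assumes a classifier dt already trained on exactly two numeric features (so X_train has shape (n, 2)) and numeric class labels; the names X_train, y_train, and dt follow the example below.

# Minimal decision-region sketch, assuming dt is fitted on two numeric features
import numpy as np
import matplotlib.pyplot as plt

# Build a grid spanning the 2-D feature space
x0 = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 200)
x1 = np.linspace(X_train[:, 1].min(), X_train[:, 1].max(), 200)
xx0, xx1 = np.meshgrid(x0, x1)

# Predict a class label for every grid point and color the regions
Z = dt.predict(np.c_[xx0.ravel(), xx1.ravel()]).reshape(xx0.shape)
plt.contourf(xx0, xx1, Z, alpha = 0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c = y_train, s = 10)
plt.show()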
# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Import accuracy_score
from sklearn.metrics import accuracy_score

# Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6
dt = DecisionTreeClassifier(max_depth = 6, random_state = 1)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict test set labels
y_pred = dt.predict(X_test)
print(y_pred[0:5])

# Compute test set accuracy
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))

Information Gain

How does a classification tree learn?

  • The nodes of a classification tree are grown recursively: whether a node becomes an internal node or a leaf depends on the state of its predecessors.
  • To produce the purest leaves possible, at each node a tree asks a question involving one feature $f$ and a split-point $sp$.
  • But how does it know which feature and which split-point to pick? By maximizing information gain, i.e. maximizing $IG(node)$.
    • If the tree is unconstrained and $IG(node) = 0$, the node is declared a leaf.
    • If the tree is constrained, e.g. max_depth set to 2, it stops at the set depth regardless of the value of $IG(node)$.

The tree considers that every node contains information and aims at maximizing the information gain obtained after each split.

$$IG(f, sp) = I(parent) - \Big(\frac{N_{left}}{N}I(left) + \frac{N_{right}}{N}I(right)\Big)$$

To measure the information gain, we measure the impurity of a node (see the numeric sketch after the list below):

  • Criteria to measure the impurity of a node $I(node)$:
    • gini index
    • entropy
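To make the criteria concrete, here is a small numeric sketch of both impurity measures and the resulting information gain for a made-up split: a parent node with 40 positives and 40 negatives, split into children (30+, 10−) and (10+, 30−).

import numpy as np

def gini(p):
    # Gini index of a node with class proportions p
    return 1 - np.sum(np.asarray(p)**2)

def entropy(p):
    # Entropy of a node with class proportions p (log base 2)
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Parent: 40+/40-; left child: 30+/10-; right child: 10+/30-
IG_gini = gini([0.5, 0.5]) - (40/80 * gini([0.75, 0.25]) + 40/80 * gini([0.25, 0.75]))
IG_entropy = entropy([0.5, 0.5]) - (40/80 * entropy([0.75, 0.25]) + 40/80 * entropy([0.25, 0.75]))

print('IG (gini): {:.3f}'.format(IG_gini))       # 0.125
print('IG (entropy): {:.3f}'.format(IG_entropy)) # 0.189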

Compare accuracy between models trained with entropy and gini index:

# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Instantiate dt_entropy, set 'entropy' as the information criterion
dt_entropy = DecisionTreeClassifier(max_depth = 8, criterion = 'entropy', random_state = 1)

# Fit dt_entropy to the training set
dt_entropy.fit(X_train, y_train)

# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score

# Use dt_entropy to predict test set labels
y_pred = dt_entropy.predict(X_test)

# Evaluate accuracy_entropy
accuracy_entropy = accuracy_score(y_test, y_pred)

# Print accuracy_entropy
print('Accuracy achieved by using entropy: ', accuracy_entropy)

# Print accuracy_gini (computed analogously with a tree trained using criterion = 'gini')
print('Accuracy achieved by using the gini index: ', accuracy_gini)

# output:
# Accuracy achieved by using entropy: 0.929824561404
# Accuracy achieved by using the gini index: 0.929824561404

Notice how the two models achieve exactly the same accuracy. Most of the time, the gini index and entropy lead to the same results. The gini index is slightly faster to compute and is the default criterion used in the DecisionTreeClassifier model of scikit-learn.

Regression Tree

Regression trees make nonlinear predictions: they capture nonlinear relationships between the features and a continuous target.

When a regression tree is trained on a dataset, the impurity of a node is measured using the mean squared error of the targets in that node:
$$I(node) = \text{MSE}(node) = \frac{1}{N_{node}}\sum\limits_{i\in node}(y^{(i)} - \hat{y}_{node})^2$$
$$\hat{y}_{node} = \frac{1}{N_{node}}\sum\limits_{i\in node}y^{(i)} \quad \text{(mean target value of the node)}$$
The regression tree looks for the splits that produce leaves where, in each leaf, the target values are on average as close as possible to the mean of the labels in that leaf.

When making predictions, a new instance traverses the tree until it reaches a leaf, and its predicted target value is the average of the target values contained in that leaf:
$$\hat{y}_{pred}(leaf) = \frac{1}{N_{leaf}}\sum\limits_{i \in leaf}y^{(i)}$$
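A tiny sketch on made-up 1-D data verifying that a leaf's prediction is exactly the mean of the targets routed to it (the dataset and names here are hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D dataset: targets cluster around two levels
X_toy = np.array([[1.], [2.], [3.], [10.], [11.], [12.]])
y_toy = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

# A depth-1 tree creates two leaves, one per cluster
stump = DecisionTreeRegressor(max_depth = 1).fit(X_toy, y_toy)

# Each prediction equals the mean target of the instance's leaf
print(stump.predict([[2.], [11.]]))  # [1. 5.]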

# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor

# Instantiate dt
dt = DecisionTreeRegressor(max_depth = 8,
                           min_samples_leaf = 0.13, # stopping criterion: minimum fraction of training instances per leaf
                           random_state = 3)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute y_pred
y_pred = dt.predict(X_test)

# Compute mse_dt
mse_dt = MSE(y_test, y_pred)

# Compute rmse_dt
rmse_dt = mse_dt**(1/2)

# Predict test set labels with a linear regression baseline
# (assumes a LinearRegression model 'lr' already fitted on the same training set)
y_pred_lr = lr.predict(X_test)

# Compute mse_lr
mse_lr = MSE(y_test, y_pred_lr)

# Compute rmse_lr
rmse_lr = mse_lr**(1/2)

# Print rmse_lr
print('Linear Regression test set RMSE: {:.2f}'.format(rmse_lr))

# Print rmse_dt
print('Regression Tree test set RMSE: {:.2f}'.format(rmse_dt))

Bias-Variance Tradeoff

In supervised learning, we make the assumption that there is a mapping $f$ between features and labels.
$$y = f(x)$$
Here, $f$ is an unknown function that you want to determine.

In reality, data generation is always accompanied by randomness, or noise. Therefore, our goal is to find a model $\hat{f}$ that best approximates $f$.

  • $\hat{f}$ can be Logistic Regression, Decision Tree, Neural Network…

In the process of training the model, we want to discard the noise as much as possible, while still achieving a low predictive error on unseen datasets.

Generalization Error

Generalization Error of $\hat{f}$: Does $\hat{f}$ generalize well on unseen data?

It can be decomposed as follows:
$$\text{GE of } \hat{f} = \text{bias}^2 + \text{variance} + \text{irreducible error}$$

Bias: error term that tells you, on average, how much $\hat{f} \neq f$

  • High bias models lead to underfitting, which means $\hat{f}$ is not flexible enough to approximate $f$.
  • When underfitting, the training set error is roughly equal to the test set error, and both errors are relatively high.

Variance: error term that tells you how much $\hat{f}$ is inconsistent over different training sets.

  • High variance models lead to overfitting, which means $\hat{f}(x)$ fits the training set noise. Thus it has very low predictive power on unseen data.
  • When overfitting, the test set error is much higher than the training set error.

Irreducible error: the error contribution of noise; it is a constant that cannot be reduced.

Model Complexity: sets the flexibility of $\hat{f}$ to approximate the true function $f$, like setting maximum tree depth, minimum samples per leaf…

  • When the model complexity increases, the variance increases while the bias decreases, and vice versa.
  • The best model complexity corresponds to the lowest generalization error.
  • This is known as the Bias-Variance Tradeoff (illustrated with a validation curve in the sketch below)
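As a sketch of how this tradeoff can be traced in practice, the snippet below uses scikit-learn's validation_curve on an assumed regression problem (X_train and y_train as in the later examples; CV scores are introduced in the next subsection). The best max_depth is the one with the lowest CV RMSE.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import validation_curve

depths = [1, 2, 4, 6, 8, 10]
train_scores, cv_scores = validation_curve(
    DecisionTreeRegressor(random_state = 1), X_train, y_train,
    param_name = 'max_depth', param_range = depths,
    cv = 5, scoring = 'neg_mean_squared_error')

# Convert negative MSEs to RMSEs; the best depth has the lowest CV RMSE
train_rmse = np.sqrt(-train_scores.mean(axis = 1))
cv_rmse = np.sqrt(-cv_scores.mean(axis = 1))
for d, tr, cv_ in zip(depths, train_rmse, cv_rmse):
    print('max_depth = {}: train RMSE {:.2f}, CV RMSE {:.2f}'.format(d, tr, cv_))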

Evaluate Performance

We cannot directly calculate the generalization error of a model because:
  • $f$ is unknown
  • usually you only have one dataset
  • noise is unpredictable

To estimate the generalization error of a model,

  • split the data to training and test sets,
  • fit $\hat{f}$ to the training set,
  • evaluate the error of $\hat{f}$ on the unseen test set
  • generalization error of $\hat{f} \approx$ test set error of $\hat{f}$

Since the test set should not be touched until we are confident about $\hat{f}$’s performance, and we only want to use it to evaluate the final performance, we introduce cross-validation.

  • K-Fold CV: $\text{CV error} = \frac{E_1+…+E_{k}}{k}$
  • Hold-Out CV

Diagnosis:

  1. If CV error of $\hat{f}$ > training set error of $\hat{f}$:
    • $\hat{f}$ suffers from high variance and is overfitting.
    • Solutions:
      • Decrease model complexity (decrease max depth, increase min samples per leaf…)
      • Gather more data
  2. If CV error of $\hat{f} \approx$ training set error of $\hat{f} \gg$ desired error:
    • $\hat{f}$ suffers from high bias and is underfitting
      • Solutions:
        • Increase model complexity (increase max depth, decrease min samples per leaf…)
        • Gather more relevant features
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = SEED)

# Import DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth = 4, min_samples_leaf = 0.26, random_state = SEED)

from sklearn.model_selection import cross_val_score

# Compute the array containing the 10-fold CV MSEs
## Note that since cross_val_score only evaluates the negative MSE,
## its output is multiplied by negative one to obtain the MSEs.
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv = 10,
                                  scoring = 'neg_mean_squared_error',
                                  n_jobs = -1) # use all CPUs for the computation

# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# CV RMSE: 5.14

# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
## output:
# Train RMSE: 5.15

Ensemble Learning

  • CART’s advantages
    • Simple to understand and interpret
    • Easy to use
    • Flexibility: ability to describe non-linear dependencies
    • Simple preprocessing: no need to standardize or normalize features
  • CART’s limitations
    • Classification can only produce orthogonal decision boundaries (rectangular)
    • Sensitive to small variations in the training set
    • High variance: unconstrained CARTs may overfit the training set
  • Solution: ensemble learning

Basics of Ensemble Learning

  • Train different models on the same dataset.
  • Let each model make its prediction.
  • Meta-model: aggregates predictions of individual models.
  • Final prediction: more robust and less prone to errors.
  • Best results: models are skillful in different ways.

Voting Classifier

EXAMPLE: Binary classification task

  • $N$ classifiers make predictions: $P_1, P_2, …, P_N$ with $P_i = 0$ or $1$
  • Meta-model prediction: hard voting (majority vote), as sketched below
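A minimal numpy sketch of hard voting on hypothetical predictions (the full scikit-learn example follows):

import numpy as np

# Hypothetical predictions of N = 3 classifiers for 4 test instances
P = np.array([[1, 0, 1, 1],   # classifier 1
              [0, 0, 1, 1],   # classifier 2
              [1, 1, 0, 1]])  # classifier 3

# Hard voting: predict 1 when more than half of the classifiers vote 1
meta_pred = (P.sum(axis = 0) > P.shape[0] / 2).astype(int)
print(meta_pred)  # [1 0 1 1]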
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

# Set seed for reproducibility
SEED = 1

# Instantiate lr
lr = LogisticRegression(random_state = SEED)

# Instantiate knn
knn = KNN(n_neighbors = 27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf = 0.13, random_state = SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict y_pred
    y_pred = clf.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

# output:
# Logistic Regression : 0.747
# K Nearest Neighbours : 0.724
# Classification Tree : 0.730

# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators = classifiers)

# Fit vc to the training set
vc.fit(X_train, y_train)

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

# output:
# Voting Classifier: 0.753

Bagging and Random Forests

Bagging

Bagging: Bootstrap Aggregation

  • Uses a technique known as the bootstrap: drawing samples with replacement, multiple times, at a fixed size
  • Reduces the variance of individual models in the ensemble.
  • Train each model on a bootstrap sample of the training set
  • Output a final prediction:
    • Classification: aggregates predictions by majority voting. BaggingClassifier
    • Regression: aggregates predictions through averaging. BaggingRegressor
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate dt
dt = DecisionTreeClassifier(random_state = 1)

# Instantiate bc
bc = BaggingClassifier(base_estimator = dt,
                       n_estimators = 50, # number of bootstrap samples
                       random_state = 1)

from sklearn.metrics import accuracy_score

# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate acc_test
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test))

# output:
# Test set accuracy of bc: 0.71
# A single tree dt would have achieved an accuracy of 63%
# which is 8% lower than bc's accuracy

Out-of-Bag Evaluation

When bootstrapping, many training instances are never sampled (on average about 37%, as illustrated below). These out-of-bag (OOB) instances can therefore be used to evaluate the model, instead of using CV.

$$\text{Final OOB Score} = \frac{OOB_1+…+OOB_N}{N}$$

In scikit-learn, the OOB score corresponds to accuracy for classifiers and to the $R^2$ score for regressors.
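A quick numpy sketch of why roughly a third of the training set is out-of-bag: when sampling $N$ instances with replacement, the chance an instance is never drawn is $(1 - 1/N)^N \approx e^{-1} \approx 0.37$.

import numpy as np

# Sample N instances with replacement from a training set of size N
N = 10000
rng = np.random.default_rng(1)
bootstrap = rng.integers(0, N, size = N)

# Fraction of instances never drawn: roughly 1/e, i.e. ~37% are OOB
oob_fraction = 1 - np.unique(bootstrap).size / N
print('OOB fraction: {:.2f}'.format(oob_fraction))  # ~0.37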

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf = 8, random_state = 1)

# Instantiate bc
bc = BaggingClassifier(base_estimator = dt,
                       n_estimators = 50,
                       oob_score = True,
                       random_state = 1)

from sklearn.metrics import accuracy_score

# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)

# Evaluate OOB accuracy
acc_oob = bc.oob_score_

# Print acc_test and acc_oob
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))

# output:
# Test set accuracy: 0.698, OOB accuracy: 0.704

Random Forest

Recall we talked above that for Bagging:

  • Base estimator can be: Decision Tree, LogisticRegression, Neural Net, …
  • Each estimator is trained on a distinct bootstrap sample of the training set
  • Estimators use all features for training and prediction

Random Forest:

  • is an ensemble method
  • Base estimator: Decision Tree
  • Each estimator is trained on a different bootstrap sample having the same size as the training set
  • RF introduces further randomization in the training of individual trees
    • $d$ features are sampled at each node without replacement ($d$ < total number of features $D$)
      • The default value of $d$ is $\sqrt{D}$, where $D$ stands for the total number of features.
  • Output a final prediction:
    • Classification: aggregates predictions by majority voting. RandomForestClassifier
    • Regression: aggregates predictions through averaging. RandomForestRegressor

In general, a random forest achieves a lower variance than an individual tree.

Feature Importance

  • Tree-based methods enable measuring the importance of each feature in prediction.
  • Intuitively, a feature’s importance measures how much the tree nodes use that feature to reduce impurity (as a weighted average over nodes)
  • Can be accessed using the attribute feature_importances_
  • It is expressed as a fraction (all importances sum to one) indicating the weight of that feature in training and prediction
  • Visualization of feature importances:
    import pandas as pd
    import matplotlib.pyplot as plt

    # Create a pd.Series of feature importances
    # (assumes a fitted random forest 'rf' and a DataFrame X of features)
    importances_rf = pd.Series(rf.feature_importances_, index = X.columns)

    # Sort importances_rf
    sorted_importances_rf = importances_rf.sort_values()

    # Make a horizontal bar plot
    sorted_importances_rf.plot(kind = 'barh', color = 'lightgreen')
    plt.show()

Performing RF regression

# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Instantiate rf
rf = RandomForestRegressor(n_estimators = 25,
                           random_state = 2)

# Fit rf to the training set
rf.fit(X_train, y_train)

# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Predict the test set labels
y_pred = rf.predict(X_test)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))

#output:
# Test set RMSE of rf: 51.97

Boosting

Boosting refers to an ensemble method in which many predictors are trained and each predictor learns from the errors of its predecessor.

Boosting:

  • Ensemble method combining several weak learners to form a strong learner
    • Weak learner: model doing slightly better than random guessing
      • e.g. Decision stump (CART whose max depth is 1)
  • Train predictors sequentially
  • Each predictor tries to correct its predecessor

AdaBoost

  • Stands for Adaptive Boosting
  • Each predictor pays more attention to the instances wrongly predicted by its predecessor, which is achieved by changing the weights of training instances after each individual model.
  • Each predictor is assigned a coefficient $\alpha$, which depends on the predictor’s training error
  • Details (see the from-scratch sketch after this list):
    • Suppose we have $N$ predictors in total.
    • $\text{predictor}_1$ is trained on the initial dataset $(X, y)$, and its training error $e_1$ is determined.
    • We use $e_1$ to determine $\alpha_1$, which is $\text{predictor}_1$’s coefficient.
    • $\eta\alpha_1$ is then used to determine the weights $W_2$ of the training instances for $\text{predictor}_2$.
      • Here, the incorrectly predicted instances gain a higher weight, which forces the next predictor to pay more attention to these instances.
    • The above process is repeated sequentially, until the $N$ predictors forming the ensemble are trained.
  • An important parameter: learning rate $\eta$
    • $\eta \in [0,1]$
    • It is used to shrink the coefficient $\alpha$ of a trained predictor
    • Tradeoff between $\eta$ and the number of predictors
      • Small $\eta$ should be compensated by a greater number of predictors.
  • Output a final prediction:
    • Classification: aggregates predictions by weighted majority voting. AdaBoostClassifier
    • Regression: aggregates predictions through weighted averaging. AdaBoostRegressor
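A from-scratch sketch of the weight-update loop, for intuition only: a simplified discrete-AdaBoost formulation on made-up data with labels recoded to {-1, +1}, assuming the weighted error stays in (0, 0.5). Scikit-learn's AdaBoostClassifier, used below, is the real implementation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Made-up data; labels recoded to {-1, +1}, as the weight update below assumes
X_toy, y_toy = make_classification(n_samples = 200, random_state = 1)
y_toy = 2 * y_toy - 1

eta, N_pred = 0.5, 10
w = np.ones(len(X_toy)) / len(X_toy)  # W_1: start from uniform instance weights
stumps, alphas = [], []

for _ in range(N_pred):
    stump = DecisionTreeClassifier(max_depth = 1)  # decision stump as weak learner
    stump.fit(X_toy, y_toy, sample_weight = w)
    y_hat = stump.predict(X_toy)
    err = w[y_hat != y_toy].sum()                # weighted training error e_i (assumed in (0, 0.5))
    alpha = eta * 0.5 * np.log((1 - err) / err)  # predictor coefficient, shrunk by eta
    w = w * np.exp(-alpha * y_toy * y_hat)       # misclassified instances gain weight
    w = w / w.sum()                              # renormalize to obtain the next weights
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted majority vote of the stumps
scores = sum(a * s.predict(X_toy) for a, s in zip(alphas, stumps))
print('Ensemble training accuracy:', np.mean(np.sign(scores) == y_toy))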

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

# Instantiate dt
dt = DecisionTreeClassifier(max_depth = 2, random_state=1)

# Instantiate ada
ada = AdaBoostClassifier(base_estimator = dt, n_estimators = 180, random_state = 1)

# Fit ada to the training set
ada.fit(X_train, y_train)

# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:, 1]

# Import roc_auc_score
from sklearn.metrics import roc_auc_score

# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print roc_auc_score
print('ROC AUC score: {:.2f}'.format(ada_roc_auc))

#output:
# ROC AUC score: 0.71

Gradient Boosting

  • Sequential correction of predecessor’s errors.
  • Does not tweak the weights of training instances.
  • Each predictor is trained using its predecessor’s residual errors as labels.
  • Gradient Boosted Trees: a CART is used as a base learner.

Gradient Boosted Trees for Regression:

  • Details (see the from-scratch sketch after this list):
    • The ensemble consists of $N$ trees
    • $Tree_1$ is trained using the feature matrix $X$ and the dataset labels $y$
    • The predictions $\hat{y}_1$ are used to determine the training set residual errors $r_1$
    • $Tree_2$ is then trained using the feature matrix $X$ and the residual errors $r_1$ as labels.
    • The predicted residuals, shrunk by the learning rate ($\eta r_1$), are then used to determine the next residuals $r_2$ (the residuals of the residuals)
    • This process is repeated until all of the $N$ trees forming the ensemble are trained.
  • Important parameter: learning rate $\eta$
    • $\eta \in [0,1]$
    • It is used to shrink the predictions of each tree, whose labels are essentially the residuals $r$ of the previously trained tree
    • Tradeoff between $\eta$ and the number of predictors
      • Small $\eta$ should be compensated by a greater number of predictors.
  • Output a final prediction:
    • Classification:
      • not covered in this course
      • GradientBoostingClassifier
    • Regression:
      • $y_{pred} = y_1 + \eta r_1 + … + \eta r_N$
      • GradientBoostingRegressor
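A from-scratch sketch of this recipe for regression, assuming X_train, y_train, and X_test from the earlier examples (a simplified squared-loss formulation; sklearn's GradientBoostingRegressor, used below, is the real implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

eta, N_trees = 0.1, 100
y_tr = np.asarray(y_train)

# y_1: the initial prediction is the mean of the training labels
base = y_tr.mean()
pred = np.full(len(y_tr), base)
trees = []

for _ in range(N_trees):
    r = y_tr - pred                           # current residual errors
    tree = DecisionTreeRegressor(max_depth = 4).fit(X_train, r)
    pred = pred + eta * tree.predict(X_train) # shrunk correction
    trees.append(tree)

# Final prediction: y_pred = y_1 + eta*r_1 + ... + eta*r_N
y_pred = base + eta * sum(t.predict(X_test) for t in trees)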

# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate gb
gb = GradientBoostingRegressor(max_depth = 4,
                               n_estimators = 200,
                               random_state = 2)

# Fit gb to the training set
gb.fit(X_train, y_train)

# Predict test set labels
y_pred = gb.predict(X_test)

# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute MSE
mse_test = MSE(y_test, y_pred)

# Compute RMSE
rmse_test = mse_test**(1/2)

# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))

#output:
# Test set RMSE of gb: 52.065

Stochastic Gradient Boosting

Gradient Boosting has its limitations:

  • GB involves an exhaustive search procedure.
    • Each CART is trained to find the best split points and features, which may lead to CARTs that use the same split points and possibly the same features.

Stochastic Gradient Boosting helps avoid this from happening:

  • Each tree is trained on a random subset of rows of the training data.
  • The sampled instances (40%-80% of the training set) are sampled without replacement.
  • Features are sampled (without replacement) when choosing split points.
  • Result: further ensemble diversity.
  • Effect: adding further variance to the ensemble of trees.

Details of SGB:

  • We randomly sample only a fraction of the training set without replacement to feed into the first predictor $Tree_1$.
  • Then, when it comes to features, we randomly sample a fraction of the features without replacement when considering a split.
  • Once a tree is trained, predictions ($\hat{y_1}$) are made and the residual errors $r_1$ on the training set can be calculated.
  • These residual errors are then multiplied by learning rate $\eta$ and fed to the next tree in the ensemble.
  • This procedure is repeated sequentially until all the trees in the ensemble are trained.
  • The prediction procedure of SGB is similar to GB.

# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate sgbr
sgbr = GradientBoostingRegressor(max_depth = 4,
                                 subsample = 0.9, # fraction of the training set used by each tree
                                 max_features = 0.75, # fraction of features used when making a split
                                 n_estimators = 200,
                                 random_state = 2)

# Fit sgbr to the training set
sgbr.fit(X_train, y_train)

# Predict test set labels
y_pred = sgbr.predict(X_test)

# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute test set MSE
mse_test = MSE(y_test, y_pred)

# Compute test set RMSE
rmse_test = mse_test**(1/2)

# Print rmse_test
print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test))
# output:
# Test set RMSE of sgbr: 49.979

# The stochastic gradient boosting regressor achieves
# a lower test set RMSE than the gradient boosting regressor
# (which was 52.065)

Model Tuning

  • Parameters: learned from data
    • CART example: split-point of a node, split-feature of a node, …
  • Hyperparameters: not learned from data, set prior to training
    • CART example: max_depth , min_samples_leaf , splitting criterion …

Hyperparameter Tuning

  • Problem: search for a set of optimal hyperparameters for a learning algorithm.
  • Solution: find a set of optimal hyperparameters that results in an optimal model, which yields an optimal score.
    • Score: in sklearn, defaults to accuracy (classification) and $R^2$ (regression).
  • Cross validation is used to estimate the generalization performance.
  • Possible approaches:
    • Grid Search
    • Random Search
    • Bayesian Optimization
    • Genetic Algorithms

Grid search cross validation

  • Manually set a grid of discrete hyperparameter values.
  • Set a metric for scoring model performance.
  • Search exhaustively through the grid.
  • For each set of hyperparameters, evaluate each model’s CV score.
  • The optimal hyperparameters are those of the model achieving the best CV score.
  • It suffers from the curse of dimensionality
    • it is computationally expensive
    • and sometimes leads to only a very slight improvement
    • We should always weigh the impact of tuning on the whole project!

Using model.get_params() shows you which hyperparameters a model exposes (see the sketch below).

If refit is set to True in the grid search, the best model is automatically retrained on the entire training set at the end.

verbose controls verbosity: the higher the value, the more messages are printed.
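A short sketch tying these together, assuming the classification tree dt from earlier (the parameter grid here is hypothetical):

from sklearn.model_selection import GridSearchCV

# Inspect the hyperparameters the model exposes
print(dt.get_params())

grid = GridSearchCV(estimator = dt,
                    param_grid = {'max_depth': [2, 3, 4]}, # hypothetical grid
                    scoring = 'accuracy',
                    cv = 5,
                    refit = True,   # retrain the best model on all of X_train
                    verbose = 1)    # higher values print more progress messages
grid.fit(X_train, y_train)
print(grid.best_params_)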

Tuning CART

# Define params_dt
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]
}

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator = dt,
                       param_grid = params_dt,
                       scoring = 'roc_auc',
                       cv = 5,
                       n_jobs = -1)

# Fit grid_dt to the training set
grid_dt.fit(X_train, y_train)

# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

#output:
# Test set ROC AUC score: 0.610
# An untuned classification-tree would achieve a ROC AUC score of 0.54

Tuning Random Forest

# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators': [100, 350, 500],
    'max_features': ['log2', 'auto', 'sqrt'],
    'min_samples_leaf': [2, 10, 30]
}

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator = rf,
                       param_grid = params_rf,
                       scoring = 'neg_mean_squared_error',
                       cv = 3,
                       verbose = 1,
                       n_jobs = -1)

# Fit grid_rf to the training set
grid_rf.fit(X_train, y_train)

# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))

#output:
# Test RMSE of best model: 50.569