Learn how to use tree-based models and ensembles for regression and classification with scikit-learn in Python (DataCamp).
Classification and Regression Trees
Classification and Regression Trees (CART) are a set of supervised learning models used for problems involving classification and regression.
- Decision-Tree: data structure consisting of a hierarchy of nodes
- Node: question or prediction
- Root: no parent node; a question giving rise to two child nodes.
- Internal node: one parent node; a question giving rise to two child nodes.
- Leaf: one parent node, no child nodes → prediction.
Classification Tree
- Sequence of if-else questions about individual features; the tree learns patterns from the features in such a way as to produce the purest leaves.
- In the end, in each leaf, one class-label is predominant.
- Objective: infer class labels
- Able to capture non-linear relationships between features and labels
- Doesn’t require feature scaling (e.g. standardization)
Decision Regions
- Decision region: region in the feature space where all instances are assigned to one class label.
- Decision Boundary: surface separating different decision regions.
- Linear boundary
- Non-linear boundary
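A minimal sketch of fitting a classification tree, using scikit-learn's built-in breast cancer dataset as a stand-in for the course data; the split sizes and max_depth are illustrative:

# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer  # stand-in dataset (assumption)

# Load the data and split it into 80% train / 20% test
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Instantiate a classification tree with a maximum depth of 2
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# Fit dt to the training set and predict the test set labels
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print(y_pred[0:5])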
Information Gain
How does a classification-Tree learn?
- The nodes of a classification tree are grown recursively: whether a node becomes an internal node or a leaf depends on the state of its predecessors.
- To produce the purest leaves possible, at each node, a tree asks a question involving one feature $f$ and a split-point $sp$.
- But how does it know which feature and which split-point to pick? It does so by maximizing information gain, i.e. maximizing $IG(node)$.
- If the tree is unconstrained and $IG(node) = 0$, the node is declared a leaf.
- If the tree is constrained, e.g. max_depth set to 2, it stops growing at that depth regardless of the value of $IG(node)$.
The tree considers that every node contains information and aims at maximizing the information gain obtained after each split.
$$IG(f, sp) = I(parent) - \Big(\frac{N_{left}}{N}I(left) + \frac{N_{right}}{N}I(right)\Big)$$
To measure the information gain, we measure the impurity of a node:
- Criteria to measure the impurity of a node $I(node)$:
- gini index
- entropy
- …
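A minimal worked sketch (not from the course) of how the gini index, entropy, and information gain could be computed for a candidate split; the helper functions and the toy labels are hypothetical:

import numpy as np

def gini(labels):
    # Gini index: 1 - sum(p_k^2) over the class proportions p_k in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_k * log2(p_k)) over the class proportions p_k in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right, impurity=gini):
    # IG = I(parent) - (N_left/N * I(left) + N_right/N * I(right))
    n = len(parent)
    return impurity(parent) - (len(left) / n * impurity(left)
                               + len(right) / n * impurity(right))

# Toy node: 8 samples of class 0 and 4 of class 1, split into two children
parent = np.array([0] * 8 + [1] * 4)
left, right = parent[:6], parent[6:]
print('IG (gini):    {:.3f}'.format(information_gain(parent, left, right, gini)))
print('IG (entropy): {:.3f}'.format(information_gain(parent, left, right, entropy)))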
Compare accuracy between models trained with entropy and gini index:

# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier
# Instantiate dt_entropy, set 'entropy' as the information criterion
dt_entropy = DecisionTreeClassifier(max_depth = 8, criterion = 'entropy', random_state = 1)
# Fit dt_entropy to the training set
dt_entropy.fit(X_train, y_train)
# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score
# Use dt_entropy to predict test set labels
y_pred = dt_entropy.predict(X_test)
# Evaluate accuracy_entropy
accuracy_entropy = accuracy_score(y_test, y_pred)
# Print accuracy_entropy
print('Accuracy achieved by using entropy: ', accuracy_entropy)
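# (dt_gini and accuracy_gini are assumed to have been computed in the same way
#  as accuracy_entropy above, using a tree trained with criterion='gini')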
# Print accuracy_gini
print('Accuracy achieved by using the gini index: ', accuracy_gini)
# output:
# Accuracy achieved by using entropy: 0.929824561404
# Accuracy achieved by using the gini index: 0.929824561404
Notice how the two models achieve exactly the same accuracy. Most of the time, the gini index and entropy lead to the same results. The gini index is slightly faster to compute and is the default criterion used in the DecisionTreeClassifier model of scikit-learn.
Regression Tree
Tree-based models help to make nonlinear predictions.
When a regression tree is trained on a dataset, the impurity of a node is measured using the mean-squared error of the targets in that node:
$$I(node) = \text{MSE}(node) = \frac{1}{N_{node}}\sum\limits_{i\in node}(y^{(i)} - \hat{y}_{node})^2$$
$$\underbrace{\hat{y}_{node}}_{\text{mean target value}} = \frac{1}{N_{node}}\sum\limits_{i\in node}y^{(i)}$$
The regression tree tries to find the splits that produce leaves where, in each leaf, the target values are on average as close as possible to the mean value of the labels in that particular leaf.
When making predictions, a new instance traverses the tree until it reaches a certain leaf, and its predicted target value is computed as the average of the target values of the training instances contained in that leaf.
$$\hat{y}_{pred}(leaf) = \frac{1}{N_{leaf}}\sum\limits_{i \in leaf}y^{(i)}$$
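A minimal sketch of training a regression tree, using scikit-learn's built-in diabetes dataset as a stand-in for the course data; the hyperparameters are illustrative:

# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.datasets import load_diabetes  # stand-in dataset (assumption)

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)

# Instantiate a regression tree: each leaf must contain at least 10% of the training data
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)

# Fit dt to the training set, predict test set labels and compute the test set RMSE
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
rmse_dt = MSE(y_test, y_pred) ** 0.5
print('Test set RMSE of dt: {:.2f}'.format(rmse_dt))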
Bias-Variance Tradeoff
In supervised learning, we make the assumption that there is a mapping $f$ between features and labels.
$$y = f(x)$$
Here, $f$ is an unknown function that you want to determine.
In reality, data generation is always accompanied by randomness or noise. Therefore, our goal is to find a model $\hat{f}$ that best approximates $f$.
- $\hat{f}$ can be Logistic Regression, Decision Tree, Neural Network…
In the process of training the model, we want to discard as much noise as possible, while also hoping to achieve a low predictive error on unseen datasets.
Generalization Error
Generalization Error of $\hat{f}$: Does $\hat{f}$ generalize well on unseen data?
It can be decomposed as follows:
$$\text{GE of } \hat{f} = \text{bias}^2 + \text{variance} + \text{irreducible error}$$
Bias: error term that tells you, on average, how much $\hat{f} \neq f$
- High bias models lead to underfitting, which means $\hat{f}$ is not flexible enough to approximate $f$.
- When underfitting, the training set error is roughly equal to the test set error, and both errors are relatively high.
Variance: error term that tells you how much $\hat{f}$ is inconsistent over different training sets.
- High variance models lead to overfitting, which means $\hat{f}(x)$ fits the training set noise. Thus it has very low predictive power on unseen data.
- When overfitting, the test set error is much higher than the training set error.
Irreducible error: the error contribution of noise; it is always a constant.
Model Complexity: sets the flexibility of $\hat{f}$ to approximate the true function $f$, like setting maximum tree depth, minimum samples per leaf…
- When the model complexity increases, the variance increases while the bias decreases, and vice versa.
- The best model complexity corresponds to the lowest generalization error.
- This is known as the Bias-Variance Tradeoff
Evaluate Performance
We cannot directly calculate the generalization error of a model because:
- $f$ is unknown
- usually you only have one dataset
- noise is unpredictable
To estimate the generalization error of a model,
- split the data to training and test sets,
- fit $\hat{f}$ to the training set,
- evaluate the error of $\hat{f}$ on the unseen test set
- generalization error of $\hat{f} \approx$ test set error of $\hat{f}$
Since the test set should not be touched until we are confident about $\hat{f}$’s performance, and we only want to use it to evaluate $\hat{f}$’s final performance, we introduce cross-validation.
- K-Fold CV: $\text{CV error} = \frac{E_1+…+E_{k}}{k}$
- Hold-Out CV
Diagnosis:
- If CV error of $\hat{f}$ > training set error of $\hat{f}$:
- $\hat{f}$ suffers from high variance and is overfitting.
- Solutions:
- Decrease model complexity (decrease max depth, increase min samples per leaf…)
- Gather more data
- If CV error of $\hat{f} \approx$ training set error of $\hat{f} >>$ desired error:
- $\hat{f}$ suffers from high bias and is underfitting
- Solutions:
- Increase model complexity (increase max depth, decrease min samples per leaf…)
- Gather more relevant features
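A minimal sketch of diagnosing bias/variance by comparing the 10-fold CV error with the training error, again using the diabetes dataset as a stand-in; the tree hyperparameters are arbitrary:

# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.datasets import load_diabetes  # stand-in dataset (assumption)

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=1)

# 10-fold CV MSE: scikit-learn returns negative MSE, so flip the sign
MSE_CV = -cross_val_score(dt, X_train, y_train, cv=10,
                          scoring='neg_mean_squared_error', n_jobs=-1)

# Training set MSE
dt.fit(X_train, y_train)
MSE_train = MSE(y_train, dt.predict(X_train))

# CV MSE much higher than training MSE suggests high variance (overfitting);
# both high and close to each other suggests high bias (underfitting)
print('CV MSE: {:.2f}'.format(MSE_CV.mean()))
print('Train MSE: {:.2f}'.format(MSE_train))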
Ensemble Learning
- CART’s advantages
- Simple to understand and interpret
- Easy to use
- Flexibility: ability to describe non-linear dependencies
- Simple preprocessing: no need to standardize or normalize features
- CART’s limitations
- Classification can only produce orthogonal decision boundaries (rectangular)
- Sensitive to small variations in the training set
- High variance: unconstrained CARTs may overfit the training set
- Solution: ensemble learning
Basics of Ensemble Learning
- Train different models on the same dataset.
- Let each model make its prediction.
- Meta-model: aggregates the predictions of individual models.
- Final prediction: more robust and less prone to errors.
- Best results: models are skillful in different ways.
Voting Classifier
Example: binary classification task
- $N$ classifiers make predictions: $P_1, P_2, …, P_N$ with $P_i = 0$ or $1$
- Meta-model prediction: hard voting (majority vote)
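A minimal sketch of hard voting with three different classifiers, using the breast cancer dataset as a stand-in; the individual models and their hyperparameters are illustrative:

# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Import the three classifiers and the voting meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.datasets import load_breast_cancer  # stand-in dataset (assumption)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Define the list of (name, classifier) tuples
classifiers = [('Logistic Regression', LogisticRegression(max_iter=10000, random_state=1)),
               ('K Nearest Neighbours', KNeighborsClassifier(n_neighbors=27)),
               ('Classification Tree', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=1))]

# Evaluate each individual classifier
for clf_name, clf in classifiers:
    clf.fit(X_train, y_train)
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, clf.predict(X_test))))

# Instantiate a VotingClassifier (hard / majority voting by default) and evaluate it
vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test, vc.predict(X_test))))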
Bagging and Random Forests
Bagging
Bagging: Bootstrap Aggregation
- Uses a technique known as the bootstrap (sampling with replacement, multiple times, at a fixed size)
- Reduces variance of individual models in the ensemble.
- Train each model on a bootstrap subset of the training set
- Output a final prediction:
  - Classification: aggregates predictions by majority voting (BaggingClassifier).
  - Regression: aggregates predictions through averaging (BaggingRegressor).
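A minimal sketch of bagging classification trees, using the breast cancer dataset as a stand-in; the base estimator is passed positionally because its keyword name differs across scikit-learn versions:

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer  # stand-in dataset (assumption)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Base estimator: an unconstrained classification tree
dt = DecisionTreeClassifier(random_state=1)

# Bagging ensemble of 50 trees, each trained on a bootstrap sample
bc = BaggingClassifier(dt, n_estimators=50, random_state=1)

# Fit bc to the training set and evaluate test set accuracy
bc.fit(X_train, y_train)
acc_test = accuracy_score(y_test, bc.predict(X_test))
print('Test set accuracy of bc: {:.3f}'.format(acc_test))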
Out-of-Bag Evaluation
When bootstrapping, many training instances are never sampled for a given model. These unused (out-of-bag) instances can therefore be used to evaluate the performance of the model, instead of using CV.
$$\text{Final OOB Score} = \frac{OOB_1+…+OOB_N}{N}$$
In scikit-learn, the OOB score corresponds to accuracy for classifiers and to the R-squared score for regressors.
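A minimal sketch of OOB evaluation with a bagging classifier, on the same stand-in dataset; the hyperparameters are illustrative:

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer  # stand-in dataset (assumption)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Bagging classifier with OOB evaluation enabled
dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1)
bc = BaggingClassifier(dt, n_estimators=50, oob_score=True, random_state=1)
bc.fit(X_train, y_train)

# Compare the test set accuracy with the OOB accuracy
acc_test = accuracy_score(y_test, bc.predict(X_test))
acc_oob = bc.oob_score_
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))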
Random Forest
Recall from above that for bagging:
- Base estimator can be: Decision Tree, LogisticRegression, Neural Net, …
- Each estimator is trained on a distinct bootstrap sample of the training set
- Estimators use all features for training and prediction
Random Forest:
- is an ensemble method
- Base estimator: Decision Tree
- Each estimator is trained on a different bootstrap sample having the same size as the training set
- RF introduces further randomization in the training of individual trees
  - $d$ features are sampled at each node without replacement ($d <$ total number of features $D$)
  - The default value of $d$ is $\sqrt{D}$.
- Output a final prediction:
  - Classification: aggregates predictions by majority voting (RandomForestClassifier).
  - Regression: aggregates predictions through averaging (RandomForestRegressor).
In general, a random forest achieves a lower variance than an individual tree.
Feature Importance
- Tree-based methods enable measuring the importance of each feature in prediction.
- Intuitively, it measures how much the tree nodes use a particular feature (as a weighted average) to reduce impurity
- Can be accessed using the attribute feature_importances_
- It is expressed as a percentage indicating the weight of that feature in training and prediction
- Visualization of feature importances:
import pandas as pd
import matplotlib.pyplot as plt
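# rf is assumed to be an already fitted tree-based model (e.g. a RandomForestRegressor)
# and X a pandas DataFrame whose columns are the feature names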
# Create a pd.Series of feature importances
importances_rf = pd.Series(rf.feature_importances_, index = X.columns)
# Sort importances_rf
sorted_importances_rf = importances_rf.sort_values()
# Make a horizontal bar plot
sorted_importances_rf.plot(kind = 'barh', color = 'lightgreen')
plt.show()
Performing RF regression:

# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
# Instantiate rf
rf = RandomForestRegressor(n_estimators = 25, random_state = 2)
# Fit rf to the training set
rf.fit(X_train, y_train)
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE
# Predict the test set labels
y_pred = rf.predict(X_test)
# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)
# Print rmse_test
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
#output:
# Test set RMSE of rf: 51.97
Boosting
Boosting refers to an ensemble method in which many predictors are trained and each predictor learns from the errors of its predecessor.
Boosting:
- Ensemble method combining several weak learners to form a strong learner
  - Weak learner: a model doing slightly better than random guessing
  - e.g. decision stump (a CART whose max depth is 1)
- Train predictors sequentially
- Each predictor tries to correct its predecessor
AdaBoost
- Stands for Adaptive Boosting
- Each predictor pays more attention to the instances wrongly predicted by its predecessor, which is achieved by changing the weights of training instances after each individual model.
- Each predictor is assigned a coefficient $\alpha$, which depends on the predictor’s training error.
- Details:
- Suppose we have $N$ predictors in total.
- The $\text{predictor}_1$ is trained on the initial dataset $(X, y)$, and the training error $e_1$ is determined.
- We use $e_1$ to determine $\alpha_1$, which is $\text{predictor}_1$’s coefficient.
- $\eta\alpha_1$ is then used to determine the weights $W_2$ of the training instances for $\text{predictor}_2$.
- Here, the incorrectly predicted instances will gain a higher weight, which would force the next predictor to pay more attention to these instances.
- The above process is repeated sequentially, until $N$ predictors forming the ensemble are trained.
- An important parameter: learning rate $\eta$
- $\eta \in [0,1]$
- It is used to shrink the coefficient $\alpha$ of a trained predictor
- Tradeoff between $\eta$ and the number of predictors
- Small $\eta$ should be compensated by a greater number of predictors.
- Output a final prediction:
  - Classification: aggregates predictions by majority voting (AdaBoostClassifier).
  - Regression: aggregates predictions through averaging (AdaBoostRegressor).
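A minimal sketch of AdaBoost with shallow trees as weak learners, evaluated with ROC AUC on the breast cancer stand-in dataset; the number of estimators is illustrative, and the base estimator is passed positionally since its keyword name differs across scikit-learn versions:

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer  # stand-in dataset (assumption)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Weak learner: a shallow classification tree
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# AdaBoost ensemble of 180 weak learners trained sequentially
ada = AdaBoostClassifier(dt, n_estimators=180, random_state=1)
ada.fit(X_train, y_train)

# Evaluate with ROC AUC using the predicted positive-class probabilities
y_pred_proba = ada.predict_proba(X_test)[:, 1]
print('ROC AUC score: {:.2f}'.format(roc_auc_score(y_test, y_pred_proba)))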
Gradient Boosting
- Sequential correction of predecessor’s errors.
- Does not tweak the weights of training instances.
- Each predictor is trained using its predecessor’s residual errors as labels.
- Gradient Boosted Trees: a CART is used as a base learner.
Gradient Boosted Trees for Regression:
- Details:
- The ensemble consists of $N$ trees.
- $Tree_1$ is trained using the feature matrix $X$ and the dataset labels $y$.
- The predictions $\hat{y}_1$ are used to determine the training set residual errors $r_1$.
- $Tree_2$ is then trained using the feature matrix $X$ and the residual errors $r_1$ as labels.
- The predicted residuals, multiplied by the learning rate ($\eta r_1$), are then used to determine the residuals of residuals, which are labeled $r_2$.
- This process is repeated until all of the $N$ trees forming the ensemble are trained.
- Important parameter: learning rate $\eta$
- $\eta \in [0,1]$
- It is used to shrink the labels of each tree, which are essentially the residuals $r$ of the previously trained tree
- Tradeoff between $\eta$ and the number of predictors
- Small $\eta$ should be compensated by a greater number of predictors.
- Output a final prediction:
  - Classification (GradientBoostingClassifier): not covered in this course.
  - Regression (GradientBoostingRegressor): $y_{pred} = y_1 + \eta r_1 + … + \eta r_N$
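A minimal sketch of gradient boosted regression trees on the diabetes stand-in dataset; max_depth and n_estimators are illustrative:

# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes  # stand-in dataset (assumption)

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Gradient boosted trees: 200 shallow CARTs trained sequentially on residuals
gb = GradientBoostingRegressor(max_depth=4, n_estimators=200, random_state=2)
gb.fit(X_train, y_train)

# Compute the test set RMSE
y_pred = gb.predict(X_test)
rmse_test = MSE(y_test, y_pred) ** 0.5
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))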
Stochastic Gradient Boosting
Gradient Boosting has its limitations:
- GB involves an exhaustive search procedure.
- Each CART is trained to find the best split points and features, which may lead to CARTs that use the same split points and possibly the same features.
Stochastic Gradient Boosting helps avoid this from happening:
- Each tree is trained on a random subset of rows of the training data.
- The sampled instances (40%-80% of the training set) are sampled without replacement.
- Features are sampled (without replacement) when choosing split points.
- Result: further ensemble diversity.
- Effect: adding further variance to the ensemble of trees.
Details of SGB:
- We randomly sample only a fraction of the training set, without replacement, to feed into the first predictor $Tree_1$.
- Then, for features, we randomly sample a fraction of the features, without replacement, when considering making a split.
- Once a tree is trained, predictions ($\hat{y_1}$) are made and the residual errors $r_1$ on the training set can be calculated.
- These residual errors are then multiplied by learning rate $\eta$ and fed to the next tree in the ensemble.
- This procedure is repeated sequentially until all the trees in the ensemble are trained.
- The prediction procedure of SGB is similar to GB.
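A minimal sketch of stochastic gradient boosting, identical to the previous example except that rows and features are subsampled; the subsample and max_features values are illustrative:

# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes  # stand-in dataset (assumption)

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Stochastic GB: each tree sees 90% of the rows (subsample) and 75% of the
# features (max_features), both sampled without replacement
sgbr = GradientBoostingRegressor(max_depth=4, n_estimators=200,
                                 subsample=0.9, max_features=0.75,
                                 random_state=2)
sgbr.fit(X_train, y_train)

# Compute the test set RMSE
y_pred = sgbr.predict(X_test)
rmse_test = MSE(y_test, y_pred) ** 0.5
print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test))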
Model Tuning
- Parameters: learned from data
- CART example: split-point of a node, split-feature of a node, …
- Hyperparameters: not learned from data, set prior to training
- CART example: max_depth, min_samples_leaf, splitting criterion, …
Hyperparameter Tuning
- Problem: search for a set of optimal hyperparameters for a learning algorithm.
- Solution: find a set of optimal hyperparameters that results in an optimal model, which yields an optimal score.
- Score: in sklearn, defaults to accuracy (classification) and R² (regression).
- Cross validation is used to estimate the generalization performance.
- Possible approaches:
- Grid Search
- Random Search
- Bayesian Optimization
- Genetic Algorithms
- …
Grid search cross validation
- Manually set a grid of discrete hyperparameter values.
- Set a metric for scoring model performance.
- Search exhaustively through the grid.
- For each set of hyperparameters, evaluate each model’s CV score.
- The optimal hyperparameters are those of the model achieving the best CV score.
- It suffers from the curse of dimensionality
- it is computationally expensive
- and sometimes leads to only a very slight improvement
- We should always weigh the impact of tuning on the whole project!
Using model.get_params() helps you see what hyperparameters the model has.
If refit is set to True in grid search, the best model will, in the end, already have been retrained on the entire training set.
verbose controls verbosity: the higher the value, the more messages are printed.
Tuning CART
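A minimal sketch of grid search CV for a classification tree, using the breast cancer stand-in dataset; the grid values and the ROC AUC scoring are illustrative:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.datasets import load_breast_cancer  # stand-in dataset (assumption)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Define params_dt
params_dt = {'max_depth': [2, 3, 4],
             'min_samples_leaf': [0.04, 0.06, 0.08]}

# Exhaustive grid search, scored with ROC AUC and 5-fold CV
dt = DecisionTreeClassifier(random_state=1)
grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt,
                       scoring='roc_auc', cv=5, n_jobs=-1)
grid_dt.fit(X_train, y_train)

# Best hyperparameters and test set performance of the refitted best model
print('Best hyperparameters:', grid_dt.best_params_)
y_pred_proba = grid_dt.best_estimator_.predict_proba(X_test)[:, 1]
print('Test set ROC AUC: {:.3f}'.format(roc_auc_score(y_test, y_pred_proba)))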
Tuning Random Forest
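A minimal sketch of grid search CV for a random forest regressor on the diabetes stand-in dataset; the grid values are illustrative:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.datasets import load_diabetes  # stand-in dataset (assumption)

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Define the dictionary 'params_rf'
params_rf = {'n_estimators': [100, 350, 500],
             'max_features': ['log2', 'sqrt'],
             'min_samples_leaf': [2, 10, 30]}

# Grid search with 3-fold CV, scored by negative MSE
rf = RandomForestRegressor(random_state=2)
grid_rf = GridSearchCV(estimator=rf, param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3, verbose=1, n_jobs=-1)
grid_rf.fit(X_train, y_train)

# Evaluate the refitted best model on the test set
best_model = grid_rf.best_estimator_
y_pred = best_model.predict(X_test)
print('Best hyperparameters:', grid_rf.best_params_)
print('Test set RMSE of best model: {:.3f}'.format(MSE(y_test, y_pred) ** 0.5))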