Supervised learning in Python with scikit-learn (DataCamp).
What is machine learning?
- The art and science of:
- Giving computers the ability to learn to make decisions from data
- … without being explicitly programmed.
- e.g.
- Learning to predict whether an email is spam or not.
- Clustering Wikipedia entries into different categories
Supervised learning: predict the target variable, given the predictor variables
- Key features:
- Automate time-consuming or expensive manual tasks (e.g. a doctor's diagnosis)
- Make predictions about the future (e.g. will a customer click or not?)
- Need labeled data
- Historical data with labels
- Experiments to get labeled data
- Crowd-sourcing labeled data
- Types:
- Classification: Target variable consists of categories
- Regression: Target variable is continuous
Unsupervised learning: Uncovering hidden patterns from unlabeled data
- e.g.
- Grouping customers into distinct categories (Clustering)
Reinforcement learning: software agents interact with an environment.
- Key features:
- Learn how to optimize their behavior
- Given a system of rewards and punishments
- Draws inspiration from behavioral psychology
- Applications
- Economics
- Genetics
- Game playing
All machine learning models are implemented as Python classes:
- They implement the algorithms for learning and predicting
- Store the information learned from the data
- Training a model on the data = 'fitting' a model to the data: the `.fit()` method
- To predict the labels of new data: the `.predict()` method
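A minimal sketch of this fit/predict interface, using a k-NN classifier (introduced below) and assuming a feature matrix `X`, labels `y`, and new observations `X_new` already exist:

```python
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the model (a Python class), then fit it to the data
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

# The fitted object stores what it learned and can predict labels for new data
y_pred = knn.predict(X_new)
```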
Classification
EDA
```python
from sklearn import datasets
```
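A minimal EDA sketch, assuming scikit-learn's built-in iris dataset (a Bunch object with `.data`, `.target`, and `.feature_names` attributes):

```python
from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

# Load a Bunch object: .data holds the features, .target holds the labels
iris = datasets.load_iris()
print(type(iris), iris.data.shape, iris.target_names)

# Put the features into a DataFrame for quick visual EDA
df = pd.DataFrame(iris.data, columns=iris.feature_names)
pd.plotting.scatter_matrix(df, c=iris.target, figsize=(8, 8), marker='D')
plt.show()
```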
KNN
K-Nearest Neighbors
- Basic idea: Predict the label of a data point by
- Looking at the `k` closest labeled data points
- Taking a majority vote
- Model Complexity
- larger k = smoother decision boundary = less complex model
- smaller k = more complex model = can lead to overfitting and sensitivity to noise
- Use a model complexity curve to decide the best k
Measuring Model Performance
- Split data into training and test set
- Fit the classifier on the training set
- Make predictions on the test set
- Compare predictions with the known labels
```python
# Import necessary modules
```
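A minimal sketch of that workflow (split, fit, predict, score); the dataset `X`, `y` and the split settings are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out part of the data so the model is evaluated on unseen points
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit on the training set, predict on the test set
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Compare predictions with the known labels via accuracy
print(knn.score(X_test, y_test))
```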
Model Complexity curve:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
```
Regression
Importing data
```python
# Import numpy and pandas
```
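A sketch of the importing/reshaping step, assuming the course's Gapminder CSV and its `fertility`/`life` columns (file name and column names are assumptions):

```python
import numpy as np
import pandas as pd

# Read the CSV into a DataFrame
df = pd.read_csv('gapminder.csv')

# Pull out a single feature and the target as NumPy arrays
y = df['life'].values
X_fertility = df['fertility'].values

# scikit-learn expects 2-D feature arrays, so reshape the 1-D arrays
y = y.reshape(-1, 1)
X_fertility = X_fertility.reshape(-1, 1)
```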
Linear Regression
$y = ax + b$, where
- $y$ = target
- $x$ = single feature
- $a$, $b$ = parameters of the model
- How to choose $a$ and $b$?
- Define an error/loss/cost function for any given line
- Choose the line that minimizes the error function
Using only one feature to predict:

```python
import numpy as np
import matplotlib.pyplot as plt

# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(
    min(X_fertility),
    max(X_fertility)
).reshape(-1, 1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
```
Using the whole dataset to predict, with a training and testing set split:

```python
# Import necessary modules
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
```
Cross-validation
To prevent your metric of choice from depending on a particular train/test split, we can use k-fold cross-validation. The larger the k, the more computationally expensive it is, so it is common practice to choose 5 or 10 folds.

```python
# Import the necessary modules
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
```
You can use `%timeit` to compare how long 3-fold CV takes versus 10-fold CV by running the following with `cv = 3` and `cv = 10`:

```python
%timeit cross_val_score(reg, X, y, cv = ____)
```
Regularization
Linear regression is essentially minimizing a loss function, during which it needs to choose a coefficient $a_i$ for each feature variable. If we allow these coefficients (parameters) to become very large, we can get overfitting, especially in high-dimensional spaces. For this reason, it is common practice to alter the loss function so that it penalizes large coefficients.
Ridge Regression
$L2$ regularization
- Loss function = OLS loss function + $\alpha * \sum\limits_{i=1}^na^2_i$
With this loss function, when minimizing the loss function to fit to our data, models are penalized for coefficients with a large magnitude: large positive and negative coefficients.
Therefore, $\alpha$ (which is often called $\lambda$ in the wild) is a hyperparameter here that needs tuning. $\alpha$ controls model complexity.
- $\alpha = 0$: original OLS. Can lead to overfitting
- $\alpha = \infty$: can lead to underfitting
```python
from sklearn.linear_model import Ridge
```
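A minimal ridge regression sketch; the `alpha` value and the `X_train`/`X_test` arrays are assumptions:

```python
from sklearn.linear_model import Ridge

# Instantiate ridge regression with a chosen regularization strength
ridge = Ridge(alpha=0.1)

# Fit on the training data and report R^2 on the test data
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
print(ridge.score(X_test, y_test))
```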
Lasso Regression
$L1$ regularization
- Loss function = OLS loss function + $\alpha * \sum\limits_{i=1}^n|a_i|$
Lasso regression can be used to select important features of a dataset, because it shrinks the coefficients of less important features to exactly 0. The features whose coefficients are not shrunk to zero are 'selected' by the LASSO algorithm. Therefore, it is a very useful method in practice for reporting purposes.
```python
# Import Lasso
```

```python
# Import necessary modules
```
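A sketch of the lasso feature-selection idea described above, assuming a feature matrix `X`, target `y`, and a list of feature names `df_columns`; `alpha=0.4` is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor
lasso = Lasso(alpha=0.4)

# Fit to the data and extract the coefficients
lasso.fit(X, y)
lasso_coef = lasso.coef_

# Plot the coefficients: features shrunk to exactly 0 are effectively dropped
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns, rotation=60)
plt.ylabel('Coefficients')
plt.show()
```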
Elastic Net
A combination of $L1$ and $L2$ regularization, weighted by the L1 ratio:
- Loss function = OLS loss function + $a * L1 + b * L2$
In scikit-learn, this is controlled by the `l1_ratio` parameter: an `l1_ratio` of 1 corresponds to a pure $L1$ penalty, and anything lower is a combination of $L1$ and $L2$.
```python
# Import necessary modules
```
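A minimal ElasticNet sketch (the `alpha`/`l1_ratio` values and the split arrays are assumptions); a full pipeline version tuned with GridSearchCV appears at the end of these notes:

```python
from sklearn.linear_model import ElasticNet

# Instantiate an elastic net with a mixed L1/L2 penalty
elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)

# Fit on the training data and report R^2 on the test data
elastic_net.fit(X_train, y_train)
print(elastic_net.score(X_test, y_test))
```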
Fine-tuning the Model
Classification Model Evaluation
Normally, accuracy is a good enough metric to evaluate performance. However, for an imbalanced dataset, where we are trying to predict a small number of positively labeled samples, accuracy cannot reflect the true "performance" of the model.
Confusion matrix
| | Predicted: Spam Email | Predicted: Real Email |
|---|---|---|
| Actual: Spam Email | True Positive | False Negative |
| Actual: Real Email | False Positive | True Negative |
- TP rate = $\frac{tp}{tp + fn}$
- FP rate = $\frac{fp}{fp + tn}$ = 1 - Specificity
- Accuracy = $\frac{tp + tn}{tp + tn + fp + fn}$
- Precision = $\frac{tp}{tp + fp}$
- High precision means: of the emails predicted as spam, most really are spam.
- Recall = $\frac{tp}{tp + fn}$, which is also the True Positive Rate and Sensitivity.
- High recall means: of the actual spam emails, most were predicted correctly.
- F1 Score = $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, the harmonic mean of precision and recall.
- The classification report has one row per class (plus an averages row) and an additional support column. The support gives the number of samples of the true response that lie in that class.
- Sensitivity = recall = TP rate = $\frac{tp}{tp + fn}$
- Specificity = selectivity = TN rate = $\frac{tn}{tn + fp}$
```python
# Import packages
```
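A sketch of computing these metrics with scikit-learn, assuming test labels `y_test` and a fitted classifier's predictions `y_pred`:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Rows are the actual classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))

# Precision, recall, F1, and support for each class
print(classification_report(y_test, y_pred))
```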
Logistic Regression
Logistic regression for binary classification
- Logistic regression outputs probabilities.
- If the probability ‘p’ is greater than 0.5, the data is labeled ‘1’
- If the probability ‘p’ is less than 0.5, the data is labeled ‘0’
- The above rules create a linear decision boundary.
- 0.5 is actually a threshold and can be changed according to different scenarios. Here, we use the ROC curve to help us decide the threshold.
```python
# Import the necessary modules
```
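A minimal logistic regression sketch following the usual fit/predict pattern; the split parameters are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate and fit the classifier
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict class labels (threshold 0.5) and class probabilities
y_pred = logreg.predict(X_test)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
```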
ROC Curve
- X-axis: FP rate, Y-axis: TP rate
- When the threshold = 0, the model predicts 1 for all data points, so the FP and TP rates are both 1.
- When the threshold = 1, the model predicts 0 for all data points, so the FP and TP rates are both 0.
- If we vary the threshold between the two extremes, we get a series of different FP and TP rates. The curve traced out by trying all possible thresholds is called the Receiver Operating Characteristic curve, or ROC curve.
```python
# Import necessary modules
```
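A sketch of plotting the ROC curve from predicted probabilities; `y_test` and `y_pred_prob` from the logistic regression sketch above are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Compute FP rate and TP rate for every possible threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot the ROC curve against the diagonal "random guess" line
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
```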
A precision-recall curve can serve a similar purpose.
Area under the ROC curve (AUC)
Larger area under the ROC curve = better model
The idea is that if we could produce a model with a TP rate of 1 and an FP rate of 0, it would be magnificent! Hence the area under the ROC curve, commonly denoted AUC, is another popular metric for classification models.
```python
# Import necessary modules
```
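A sketch of computing AUC, both on a hold-out set and with cross-validation; `logreg`, `y_test`, and `y_pred_prob` are assumptions carried over from above:

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# AUC on the held-out test set
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))

# AUC estimated with 5-fold cross-validation
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))
```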
Hyperparameter Tuning
Hyperparameters are the ones that cannot be learned by fitting the model. The steps to choose the correct parameters are as follows:
- Try a bunch of different hyperparameter values
- Fit all of them separately
- See how well each performs (make sure to use CV to avoid overfitting)
- Choose the best performing one
When implementing the tuning, we generally use the grid search method.
Like the $\alpha$ parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: $C$. $C$ controls the inverse of the regularization strength, and this is what you will tune below.
A large $C$ can lead to an overfit model, while a small $C$ can lead to an underfit model.
In addition to $C$, logistic regression has a `penalty` hyperparameter which specifies whether to use $L1$ or $L2$ regularization.

```python
# Import necessary modules
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
logreg = LogisticRegression()

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))
```
GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use `RandomizedSearchCV`, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.
```python
# Import necessary modules
```
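A sketch of randomized search, here over a decision tree's hyperparameters; the estimator, the sampled ranges, and a feature matrix `X` with enough columns for the sampled `max_features` values are all assumptions:

```python
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Specify distributions to sample hyperparameters from
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a decision tree classifier
tree = DecisionTreeClassifier()

# Sample a fixed number of settings from the distributions and cross-validate each
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
```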
Hold-out Set
How well can the model perform on never before seen data, given my scoring function of choice?
We should split the dataset into training and test sets before training the model and carry out cross-validation only on the training set. After the hyperparameters are finalized, the test set (hold-out set) lets us evaluate the final performance of the model on data it has never seen.
Preprocessing and Pipelines
Categorical features
Dummy coding:
- 0: Observation that does not belong to this category
- 1: Observation that belongs to this category
- Remember to remove one dummy variable after coding to avoid duplicating information: if there are only 3 categories and we know an observation is not A or B, then it must be C. This can be achieved by specifying `drop_first=True` in `pd.get_dummies()`.
```python
# Import necessary modules
```
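A sketch of dummy coding with pandas, assuming a DataFrame `df` with a categorical 'Region' column:

```python
import pandas as pd

# Create dummy variables, dropping the first category to avoid redundancy
df_region = pd.get_dummies(df, columns=['Region'], drop_first=True)

# Inspect the new binary columns
print(df_region.columns)
```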
Missing data
Even when `df.info()` suggests that no columns contain nulls, missing data can still hide in other forms, e.g. 0, '?', or '.'.
Ways to deal with missing data:
- Dropping missing data: `df.dropna()` (generally a bad idea)
- Imputing missing data
- Using the mean of the non-missing entries
```python
# Convert '?' to NaN
```
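A sketch of converting placeholder values to NaN so they register as missing; the DataFrame `df` and the '?' placeholder are assumptions:

```python
import numpy as np

# Convert '?' to NaN so pandas and scikit-learn recognize it as missing
df[df == '?'] = np.nan

# Count the missing values per column
print(df.isnull().sum())
```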
Pipeline: each step but the last must be a transformer, and the last must be an estimator, such as a classifier or a regressor.
```python
# Import the Imputer module
```
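A sketch of an imputation + classification pipeline, written with the course-era `Imputer` transformer (newer scikit-learn versions use `SimpleImputer` from `sklearn.impute` instead); the SVC estimator and the data arrays are assumptions:

```python
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# First step: a transformer that fills NaNs with the column mean
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

# Last step: an estimator (here a support vector classifier)
clf = SVC()

# Chain the steps into a pipeline and use it like any other model
pipeline = Pipeline([('imputation', imp), ('SVM', clf)])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```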
Centering and Scaling
Why do we need scaling/normalization?
- Many models use some form of distance to inform them. Therefore, features on larger scales can unduly influence the model.
- We want features to be on a similar scale.
Ways to normalize data:
- Standardization: subtract the mean and divide by the standard deviation.
- All features are centered around zero and have variance one.
- Min-max scaling: subtract the minimum and divide by the range.
- Minimum zero and maximum one.
- Normalizing so that the data ranges from -1 to +1.
```python
# Import scale
```
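A sketch of standardizing a feature array with `scale`; the feature matrix `X` is an assumption:

```python
import numpy as np
from sklearn.preprocessing import scale

# Standardize: subtract the column mean, divide by the column standard deviation
X_scaled = scale(X)

# Compare the spread before and after scaling
print(np.mean(X), np.std(X))
print(np.mean(X_scaled), np.std(X_scaled))
```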
Using a pipeline to scale the data together with modeling:

```python
# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
```
All together
Pipeline for classification
It is time now to piece together everything you have learned so far into a pipeline for classification! Your job in this exercise is to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.
You’ll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are $C$ and $\gamma$ (gamma). $C$ controls the regularization strength. It is analogous to the $C$ you tuned for logistic regression, while $\gamma$ controls the kernel coefficient.
```python
from sklearn.pipeline import Pipeline
```
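A sketch of the scaling + SVM pipeline tuned with grid search over $C$ and $\gamma$; the parameter grid, split settings, and the wine-quality arrays `X`, `y` are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

# Setup the pipeline: scale first, then classify
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]
pipeline = Pipeline(steps)

# Hyperparameter space: step name + '__' + parameter name
parameters = {'SVM__C': [1, 10, 100],
              'SVM__gamma': [0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

# Grid search over the pipeline, then evaluate on the hold-out set
cv = GridSearchCV(pipeline, parameters, cv=3)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))
```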
Pipeline for regression
Your job is to build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the `l1_ratio` of your ElasticNet using GridSearchCV.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio': np.linspace(0, 1, 30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
```