Supervised Learning with scikit-learn

Supervised learning in Python with scikit-learn (DataCamp).

What is machine learning?

  • The art and science of:
    • Giving computers the ability to learn to make decisions from data
    • … without being explicitly programmed.
  • e.g.
    • Learning to predict whether an email is spam or not.
    • Clustering Wikipedia entries into different categories

Supervised learning: predict the target variable, given the predictor variables

  • Key features:
    • Automate time-consuming or expensive manual tasks (e.g. a doctor’s diagnosis)
    • Make predictions about the future (e.g. will a customer click or not?)
    • Need labeled data
      • Historical data with labels
      • Experiments to get labeled data
      • Crowd-sourcing labeled data
  • Types:
    • Classification: Target variable consists of categories
    • Regression: Target variable is continuous

Unsupervised learning: Uncovering hidden patterns from unlabeled data

  • e.g.
    • Grouping customers into distinct categories (Clustering)

Reinforcement learning: software agents interact with an environment.

  • Key features:
    • Learn how to optimize their behavior
    • Given a system of rewards and punishments
    • Draws inspiration from behavioral psychology
  • Applications
    • Economics
    • Genetics
    • Game playing

All machine learning models in scikit-learn are implemented as Python classes:

  • They implement the algorithms for learning and predicting
  • Store the information learned from the data
  • Training a model on the data = ‘fitting’ a model to the data: the .fit() method
  • To predict the labels of new data: the .predict() method (see the minimal sketch below).
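A minimal sketch of this fit/predict pattern (the classifier and the iris data here are just stand-ins for illustration):

# Minimal sketch of the fit/predict pattern (illustrative only)
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target              # features and labels

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X, y)                              # 'fit' the model to the labeled data
print(knn.predict(X[:3]))                  # 'predict' labels for (here, already known) samples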

Classification

EDA

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Use plot style 'ggplot'
plt.style.use('ggplot')

# Load iris dataset
iris = datasets.load_iris()

# Check what information is included
print(iris.keys())
## 'data', data columns
## 'target_names', name of the target variable
## 'DESCR', description of the dataset
## 'feature_names', names of the features
## 'target' target variable

# Put the arrays into one DataFrame
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns = iris.feature_names)

# Visual EDA.
# Diagonal panels are histograms, off-diagonal panels are scatter plots
_ = pd.plotting.scatter_matrix(
    df,                # data
    c = y,             # color points by the target variable
    figsize = [8, 8],  # figure size
    s = 150,           # marker size
    marker = 'D')      # marker shape

# seaborn countplot (uses a different dataset as illustration, with 'party' as the target)
import seaborn as sns

plt.figure()
sns.countplot(
    x = 'education',   # one feature name
    hue = 'party',     # target variable
    data = df,
    palette = 'RdBu')  # red/blue color palette
plt.xticks([0, 1], ['No', 'Yes']) # label the 0/1 votes as 'No' and 'Yes'
plt.show()

# A seaborn heatmap shows the correlation between variables
sns.heatmap(df.corr(), square = True, cmap = 'RdYlGn')

KNN

K-Nearest Neighbors

  • Basic idea: Predict the label of a data point by
    • Looking at the k closest labeled data points
    • Taking a majority vote
  • Model Complexity
    • larger k = smoother decision boundary = less complex model
    • smaller k = more complex model = can lead to overfitting and sensitivity to noise
    • Use a model complexity curve to decide on the best k

Measuring Model Performance

  • Split data into training and test set
  • Fit the classifier on the training set
  • Make predictions on the test set
  • Compare predictions with the known labels
# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.2,   # fraction of data held out for the test set
    random_state = 42, # set random seed
    stratify = y)      # keep the label proportions in the train and test sets the same as in the full dataset

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 7)

# Fit the classifier to the training data
knn.fit(X_train, y_train) # Requirements: features as a NumPy array or DataFrame of continuous values, with no missing data

# Print the accuracy
print(knn.score(X_test, y_test))

Model Complexity curve

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors = k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()

Regression

Importing data

# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df.life.values # .values returns a NumPy array
X = df.fertility.values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape)) #(139, )

# Reshape X and y (since we are only using one feature to predict)
y = y.reshape(-1, 1)
X = X.reshape(-1, 1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape)) # (139, 1)

Linear Regression

  • y = ax + b
    • y = target
    • x = single feature
    • a, b = parameters of model
  • How to choose a and b?
    • Define an error/loss/cost function for any given line
    • Choose the line that minimizes the error function
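For ordinary least squares (OLS), this error function is the residual sum of squares, i.e. the sum of squared vertical distances between each point and the line:

$\text{RSS} = \sum\limits_{i=1}^{n}\big(y_i - (a x_i + b)\big)^2$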

Using only one feature to predict.

# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the regressor: reg
reg = LinearRegression()

# Create the prediction space
prediction_space = np.linspace(
    min(X_fertility),
    max(X_fertility)
).reshape(-1, 1)

# Fit the model to the data
reg.fit(X_fertility, y)

# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)

# Print R^2
print(reg.score(X_fertility, y))

# Plot regression line
plt.plot(prediction_space, y_pred, color = 'black', linewidth = 3)
plt.show()

Using all features to predict, with a training and test set split.

# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

Cross-validation

To prevent your metric of choice from depending on one particular train/test split, we can use k-fold cross-validation. The larger k is, the more computationally expensive the procedure becomes; therefore, it is common practice to choose 5 or 10 folds.

# Import the necessary modules
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv = 5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

You can use %timeit to compare how long 3-fold CV takes versus 10-fold CV by executing the following with cv = 3 and then cv = 10:

%timeit cross_val_score(reg, X, y, cv = ____)

Regularization

Linear regression essentially minimizes a loss function, in the process choosing a coefficient $a_i$ for each feature variable. If we allow these coefficients (parameters) to become very large, we can get overfitting, especially in high-dimensional spaces. For this reason, it is common practice to alter the loss function so that it penalizes large coefficients.

Ridge Regression

$L2$ regularization

  • Loss function = OLS loss function + $\alpha * \sum\limits_{i=1}^na^2_i$

With this loss function, when minimizing the loss function to fit to our data, models are penalized for coefficients with a large magnitude: large positive and negative coefficients.

Therefore, $\alpha$ (which is often called $\lambda$ in the wild) is a hyperparameter here that needs tuning. $\alpha$ controls model complexity.

  • $\alpha = 0$: original OLS. Can lead to overfitting
  • $\alpha = \infty$: can lead to underfitting
from sklearn.linear_model import Ridge
ridge = Ridge(
    alpha = 0.1,
    normalize = True) # put all variables on the same scale

Lasso Regression

$L1$ regularization

  • Loss function = OLS loss function + $\alpha * \sum\limits_{i=1}^n|a_i|$

Lasso regression can be used to select important features of a dataset, because it shrinks the coefficients of less important features to exactly 0. The features whose coefficients are not shrunk to zero are ‘selected’ by the LASSO algorithm. This makes it a very useful method in practice for reporting purposes.

# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regressor: lasso
lasso = Lasso(alpha = 0.4, normalize = True)

# Fit the regressor to the data
lasso.fit(X, y)

# Extract and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)

# plotting the coefficients as a function of feature name
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation = 60)
_ = plt.ylabel('Coefficients')
plt.show()
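Below, 10-fold cross-validation is used to see how the ridge regression CV score varies over a range of $\alpha$ values: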
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
ridge = Ridge(normalize = True)

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha

    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv = 10)

    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))

    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Define a function to draw the plot
def display_plot(cv_scores, cv_scores_std):
    # Convert to arrays so the elementwise arithmetic below works
    cv_scores = np.array(cv_scores)
    std_error = np.array(cv_scores_std) / np.sqrt(10)

    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alpha_space, cv_scores)
    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha = 0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle = '--', color = '.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()

# Display the plot to see how cv scores vary across different alpha
display_plot(ridge_scores, ridge_scores_std)

Elastic Net

L1 ratio regularization

  • Loss function = OLS loss function + $a \cdot L1 + b \cdot L2$

In scikit-learn, this term is represented by the 'l1_ratio' parameter: An 'l1_ratio' of 1 corresponds to an L1 penalty, and anything lower is a combination of $L1$ and $L2$.

# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv = 5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)

print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

Fine-tuning the Model

Classification Model Evaluation

Normally, accuracy is a good enough metric to evaluate performance. However, for an imbalanced dataset, where the positively labeled class makes up only a small fraction of the data, accuracy alone cannot reflect the true performance of the model. For example, if only 1% of emails are spam, a classifier that labels every email as not spam is 99% accurate yet useless.

Confusion matrix

                     Predicted: Spam Email   Predicted: Real Email
Actual: Spam Email   True Positive           False Negative
Actual: Real Email   False Positive          True Negative
  • TP rate = $\frac{tp}{tp + fn}$
  • FP rate = $\frac{fp}{fp + tn}$ = 1 - Specificity
  • Accuracy = $\frac{tp + tn}{tp + tn + fp + fn}$
  • Precision = $\frac{tp}{tp + fp}$
    • High precision means: In emails that are predicted as spam, most of them are predicted correctly.
  • Recall = $\frac{tp}{tp + fn}$, which is also the True Positive Rate and Sensitivity.
    • High recall means: of the actual spam emails, most of them were predicted correctly.
  • F1 Score = $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, the harmonic mean of precision and recall.
  • The classification report contains one row per class (plus averages) and an additional support column. The support gives the number of samples of the true response that lie in that class.
  • Sensitivity = recall = TP rate = $\frac{tp}{tp + fn}$
  • Specificity = selectivity = TN rate = $\frac{tn}{tn + fp}$
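A small worked sketch of these formulas (the labels below are made up purely for illustration; note that scikit-learn orders the confusion matrix by label value, so class 0 comes first):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = spam, 0 = real email
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))  # [[5 1]   rows = actual (0 then 1),
                                         #  [1 3]]  columns = predicted (0 then 1)

# Here tp = 3, fn = 1, fp = 1, tn = 5
print(precision_score(y_true, y_pred))   # tp / (tp + fp) = 3/4 = 0.75
print(recall_score(y_true, y_pred))      # tp / (tp + fn) = 3/4 = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of the two = 0.75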
# Import packages
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Logistic Regression

Logistic regression for binary classification

  • Logistic regression outputs probabilities.
    • If the probability ‘p’ is greater than 0.5, the data is labeled ‘1’
    • If the probability ‘p’ is less than 0.5, the data is labeled ‘0’
  • The above rules create a linear decision boundary.
  • 0.5 is just a default threshold and can be changed to suit the scenario (see the sketch below). Here, we use the ROC curve to help us decide on a threshold.
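A quick sketch of changing the threshold by hand (this assumes a fitted classifier logreg and test features X_test, as in the snippet that follows):

# Probability of class '1' for each test sample (assumes logreg is already fitted)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# .predict() is (roughly) equivalent to a 0.5 cut-off ...
y_pred_default = (y_pred_prob > 0.5).astype(int)

# ... but we can choose a different threshold, e.g. 0.3 to catch more positives
y_pred_lenient = (y_pred_prob > 0.3).astype(int)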
# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

ROC Curve

  • X-axis: FP rate, Y-axis: TP rate
  • When threshold = 0, the model predicts 1 for all data points, so the FP and TP rates are both 1.
  • When threshold = 1, the model predicts 0 for all data points, so the FP and TP rates are both 0.
  • If we vary the threshold between the two extremes, we get a series of different FP and TP rates. The curve traced out by trying all possible thresholds is called the Receiver Operating Characteristic curve, or ROC curve.
# Import necessary modules
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

A precision-recall curve can serve a similar purpose.

Area under the ROC curve (AUC)

Larger area under the ROC curve = better model
The idea: a model with TP rate = 1 and FP rate = 0 would be ideal, so the area under the ROC curve, commonly denoted AUC, is another popular metric for classification models.

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Compute and print AUC score
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))

# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv = 5, scoring = 'roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))

Hyperparameter Tuning

Hyperparameters are parameters that cannot be learned by fitting the model. The steps to choose good values are as follows:

  1. Try a bunch of different hyperparameter values
  2. Fit all of them separately
  3. See how well each performs (make sure to use CV to avoid overfitting)
  4. Choose the best performing one

When implementing the tuning, we generally use the grid search method.

Like the $\alpha$ parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: $C$. $C$ controls the inverse of the regularization strength, and this is what you will tune below.

A large $C$ can lead to an overfit model, while a small $C$ can lead to an underfit model.

In addition to $C$, logistic regression has a 'penalty' hyperparameter which specifies whether to use $L1$ or $L2$ regularization.

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
logreg = LogisticRegression()

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv = 5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.

# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),   # scipy's randint(1, 9) samples integers N with 1 <= N <= 8
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv = 5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Hold-out Set

How well can the model perform on never before seen data, given my scoring function of choice?

We should split the dataset into training and test sets before training the model and carrying out cross-validation, so that once the hyperparameters are finalized we can evaluate the final model on data it has never seen: the test set, also called the hold-out set.
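A sketch of that workflow (X and y are assumed to be existing feature and target arrays; the Ridge estimator and alpha grid are just placeholders; the point is that the hold-out set is only touched once, at the very end):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, GridSearchCV

# 1. Carve out the hold-out set first
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size = 0.3, random_state = 42)

# 2. Tune hyperparameters with cross-validation on the training portion only
gm_cv = GridSearchCV(Ridge(), {'alpha': np.logspace(-4, 0, 10)}, cv = 5)
gm_cv.fit(X_train, y_train)

# 3. Report the final score on data the model has never seen
print(gm_cv.score(X_holdout, y_holdout))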

Preprocessing and Pipelines

Categorical features

Dummy coding:

  • 0: Observation that does not belong to this category
  • 1: Observation that belongs to this category
  • Remember to remove one dummy column after encoding, to avoid duplicating information: if there are only 3 categories and we know an observation is neither A nor B, then it must be C. This can be achieved by specifying drop_first=True in pd.get_dummies().
# Import necessary modules
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first = True)

# Print the new columns of df_region
print(df_region.columns)

Missing data

Even when df.info() suggests that no column contains nulls, missing data can still hide in other forms, for instance 0, '?', '!', etc.

Ways to deal with missing data:

  1. Dropping missing data: df.dropna() (generally a bad idea)
  2. Imputing missing data
    1. Using the mean of the non-missing entries (see the sketch after the dropping example below)
# Convert '?' to NaN
df[df == '?'] = np.nan

# Print the number of NaNs
print(df.isnull().sum())

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop missing values and print shape of new DataFrame
df = df.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))

Pipeline: each step but the last must be a transformer, and the last must be an estimator, such as a classifier or a regressor.

# Import the necessary modules
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Setup the Imputation transformer: imp
imp = Imputer(missing_values = 'NaN', strategy = 'most_frequent', axis = 0)

# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
         ('SVM', clf)]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))

Centering and Scaling

Why we need scaling/normalizing?

  • Many models use some form of distance to inform them. Therefore, features on larger scales can unduly influence the model.
  • We want features to be on a similar scale.

Ways to normalize data:

  • Standardization: Subtract the mean and divide by the standard deviation.
    • All features are centered around zero and have variance one.
  • Subtract the minimum and divide by the range.
    • Minimum zero and maximum one
  • Normalize the range from -1 to +1
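A sketch of the last two options using scikit-learn's MinMaxScaler (the toy array is made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_toy = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0]])

# Subtract the column minimum and divide by the range: each feature ends up in [0, 1]
print(MinMaxScaler().fit_transform(X_toy))

# feature_range = (-1, 1) rescales each feature to the range -1 to +1 instead
print(MinMaxScaler(feature_range = (-1, 1)).fit_transform(X_toy))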
# Import scale
from sklearn.preprocessing import scale

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X)))
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled)))
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

Using a pipeline to scale the data together with the modeling step.

# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))

All together

Pipeline for classification

It is time now to piece together everything you have learned so far into a pipeline for classification! Your job in this exercise is to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.

You’ll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are $C$ and $\gamma$ (gamma). $C$ controls the regularization strength. It is analogous to the $C$ you tuned for logistic regression, while $\gamma$ controls the kernel coefficient.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Setup the pipeline
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space, 'step_name__parameter_name'
parameters = {'SVM__C': [1, 10, 100],
              'SVM__gamma': [0.1, 0.01]}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv = 3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Pipeline for regression

Your job is to build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the l1_ratio of your ElasticNet using GridSearchCV.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Setup the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters, cv = 3)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))