NLP project in Python to classify items into different categories. (DataCamp).

Introduction to the Challange

Budgets for schools are huge, complex, and not standardized. Hundredds of hours each year are spent manually labelling.

Goal: Build a machine learning algorithm that can autiomate the process
Dataset:
- Line-item: text description of each item
- 9 Target variables: labels like ‘Textbooks’, ‘Math’, ‘Middle School’
Supervised classification problem.

We want to build a human-in-loop machine learning system. We don’t want to make direct prediction on whether each item is A or B, but want to say that “we are 60% sure that is belongs to A. If not, we are 30% sure it belongs to B…”.

Basic EDA

I removed the frequently used ones for simplicity. e.g. df.describe(), df.info()…, etc.

import pandas as pd
df = pd.read_csv('TrainingData.csv')

# Counts the number of different data types
df.dtypes.value_counts()

# Set columns that should be categorical variables
LABELS = ['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']

# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis = 0)

# Print the converted dtypes
print(df[LABELS].dtypes)

Check the number of labels under each category:

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique)

# Plot number of unique values for each label
num_unique_labels.plot(kind = 'bar')

# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')

# Display the plot
plt.show()

Performance Measurement

The metric used in this problem is log loss. It is a loss function and a measure of error. Our goal is to minimize the error with our model.

Log Loss for binary classification
- Actual value: $y$ = {1 = yes, 0 = no}
- Prediction (Probability that the value is 1): $p$
- $logloss = -\frac{1}{N}\sum\limits_{i=1}^{N}(y_i\log(p_i) + (1-y_i)\log(1-p_i))$
- The function sets that it is better to be less confident than confident and wrong, i.e., a high probability is assigned to the incorrect class.

import numpy as np
def compute_log_loss(predicted, actual, eps = 1e-14):
    """ Computes the logarithmic loss between predicted and
    actual when these are 1D arrays.

    :param predicted: The predicted probabilities as floats between 0-1
    :param actual: The actual binary labels. Either 0 or 1.
    :param eps (optional): log(0) is inf, so we need to offset our
    predicted values slightly by eps from 0 or 1.
    """
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted)
              + (1 - actual) 
              * np.log(1 - predicted))

    return loss

We can test the logic of our loss function by the following tests:

# Set up cases for testing
actual_labels = np.array([1., 1., 1., 1., 1., 0., 0., 0., 0., 0.])
correct_confident = np.array([0.95, 0.95, 0.95, 0.95, 0.95, 0.05, 0.05, 0.05, 0.05, 0.05])
correct_not_confident = np.array([0.65, 0.65, 0.65, 0.65, 0.65, 0.35, 0.35, 0.35, 0.35, 0.35])
wrong_not_confident = np.array([0.35, 0.35, 0.35, 0.35, 0.35, 0.65, 0.65, 0.65, 0.65, 0.65])
wrong_confident = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.95, 0.95, 0.95, 0.95, 0.95])

# Compute and print log loss for 1st case
correct_confident_loss = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident_loss)) 

# Compute log loss for 2nd case
correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident_loss)) 

# Compute and print log loss for 3rd case
wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss)) 

# Compute and print log loss for 4th case
wrong_confident_loss = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident_loss)) 

# Compute and print log loss for actual labels
actual_labels_loss = compute_log_loss(actual_labels, actual_labels)
print("Log loss, actual labels: {}".format(actual_labels_loss)) 

# <script.py> output:
#     Log loss, correct and confident: 0.05129329438755058
#     Log loss, correct and not confident: 0.4307829160924542
#     Log loss, wrong and not confident: 1.049822124498678
#     Log loss, wrong and confident: 2.9957322735539904
#     Log loss, actual labels: 9.99200722162646e-15

We can see that log loss penalizes highly confident wrong answers much more than any other type. This will be a good metric to use on your models.

Create a Simple Model

Many more things can go wrong in complex models. It is always a good approach to start with a very simple model, which can gives a sense of how challenging the problem is, and how much signal can we pull out using basic methods.

Model with Numeric Data Only

Basic model outline:

Train basic model on numeric data only: we want to go from raw data to predictions quickly.
Multi-class logistic regression
- Train classifier on each label separately and use those to predict

Splitting the multi-class dataset is a little tricky in this case.

We have multiple target variables, and some of the labels have very few data points.
Solution: StratifiedShuffleSplit –> multilabel_train_test_split() (detailedly set in this link)
OneVsRestClassifier()
- Treats each column of y independently
- Fits a separate classifier for each of the columns

The first step is to split the data into a training set and a test set. Some labels don’t occur very often, but we want to make sure that they appear in both the training and the test sets. We provide a function that will make sure at least min_count examples of each label appear in each split: multilabel_train_test_split.

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Create the DataFrame: numeric_data_only
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])

# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(
    numeric_data_only,
    label_dummies,
    size = 0.2, 
    seed = 123)

# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Load the holdout data: holdout
holdout = pd.read_csv('HoldoutData.csv', index_col = 0)

# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))

# Format predictions in DataFrame: prediction_df
prediction_df = pd.DataFrame(
    columns = pd.get_dummies(df[LABELS]).columns,
    index = holdout.index,
    data = predictions)

# Save prediction_df to csv
prediction_df.to_csv('predictions.csv')

# Submit the predictions for scoring: score
score = score_submission(pred_path = 'predictions.csv')

# Print score
print('Your model, trained with numeric data only, yields logloss score: {}'.format(score))

Introduction to NLP

Tokenization

Splitting a string into segments
- The rule of separation can be customized, e.g. by space, or comma, or a combination of several ones.
Store segments as list
e.g. ‘Natural Language Processing’ –> [‘Natural’, ‘Language’, ‘Processing’]

Bag of Words

Count the number of times a particular token appears.
But discards information about word order.

CountVectorizer()

Tokenizes all strings
Builds a ‘vocabulary’

Counts the occurences of each token in the vocabulary.

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: TOKENS_ALPHANUMERIC (creating tokens that contain only alphanumeric characters)
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' #refer to resources online for how to define this pattern

# Fill missing values in df.Position_Extra
df.Position_Extra.fillna('', inplace = True)

# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern = TOKENS_ALPHANUMERIC)

# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)

# Print the number of tokens and first 15 tokens
msg = "There are {} tokens in Position_Extra if we split on non-alpha numeric"
print(msg.format(len(vec_alphanumeric.get_feature_names())))
print(vec_alphanumeric.get_feature_names()[:15])

In order to get a bag-of-words representation for all of the text data in our DataFrame, you must first convert the text data in each row of the DataFrame into a single string.

In the previous exercise, this wasn’t necessary because you only looked at one column of data, so each row was already just a single string. CountVectorizer expects each row to just be a single string, so in order to use all of the text columns, you’ll need a method to turn a list of strings into a single string.

In this exercise, you’ll complete the function definition combine_text_columns(). When completed, this function will convert all training text data in your DataFrame to a single string per row that can be passed to the vectorizer object and made into a bag-of-words using the .fit_transform() method.

Note that the function uses NUMERIC_COLUMNS and LABELS to determine which columns to drop. These lists have been loaded into the workspace.

# Define combine_text_columns()
def combine_text_columns(data_frame, to_drop = NUMERIC_COLUMNS + LABELS):
    """ converts all text in each row of data_frame to single vector """
    
    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist()) #取交集
    text_data = data_frame.drop(to_drop, axis = 1)
    
    # Replace nans with blanks
    text_data.fillna("", inplace = True)
    
    # Join all text items in a row that have a space in between
    return text_data.apply(lambda x: " ".join(x), axis = 1)

# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the basic token pattern
TOKENS_BASIC = '\\S+(?=\\s+)'

# Create the alphanumeric token pattern
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate basic CountVectorizer: vec_basic
vec_basic = CountVectorizer(token_pattern = TOKENS_BASIC)

# Instantiate alphanumeric CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern = TOKENS_ALPHANUMERIC)

# Create the text vector
text_vector = combine_text_columns(df)

# Fit and transform vec_basic
vec_basic.fit_transform(text_vector)

# Print number of tokens of vec_basic
print("There are {} tokens in the dataset".format(len(vec_basic.get_feature_names())))

# Fit and transform vec_alphanumeric
vec_alphanumeric.fit_transform(text_vector)

# Print number of tokens of vec_alphanumeric
print("There are {} alpha-numeric tokens in the dataset".format(len(vec_alphanumeric.get_feature_names())))

N-grams

N-gram means to include n consecutive words in each segment.
Maintain the information about word order.

Model Improvements

Pipeline

Pipeline is a repeatable way to go from raw data to trained model.

Pipeline object takes sequential list of steps. Output of one step is input to next step
We can even have a sub-pipeline as one of the steps
Each step is a tuple with two elements:
- Name: string
- Transform: obj implementing .fit() and .transform()

# Import Pipeline
from sklearn.pipeline import Pipeline

# Import other necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Import the Imputer object
from sklearn.preprocessing import Imputer

# Create training and test sets using only numeric data
X_train, X_test, y_train, y_test = train_test_split(
    sample_df[['numeric', 'with_missing']],
    pd.get_dummies(sample_df['label']), 
    random_state = 456)

# Insantiate Pipeline object: pl
pl = Pipeline([
        ('imp', Imputer()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all numeric, incl nans: ", accuracy)

Preprocessing Multiple Dtypes

We definitely want to use all available features in one pipeline.

Problem: the pipeline steps for numeric and text preprocessing can’t follow each other.
- e.g., output of CountVectorizer can’t be input to Imputer.
Solution: Function Transformer() & FeartuerUnion()
Function Transformer
- Turns a Python function into an object that a scikit-learn pipeline can understand
- Need to write two functions for pipeline preprocessing
  - Take entire DF, return numeric columns
  - Take entire DF, return text columns
- Can then preprocess numeric and text data in separate pipelines.
- validate = False indicates no need to check the input’s data types or missing values.
FeatureUnion
- Combine two sets of features together as a single array, which will be the input to our classifier.

# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Obtain the text data: get_text_data
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)

# Obtain the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[['numeric', 'with_missing']], validate = False)

# Fit and transform the text data: just_text_data
just_text_data = get_text_data.fit_transform(sample_df)

# Fit and transform the numeric data: just_numeric_data
just_numeric_data = get_numeric_data.fit_transform(sample_df)

# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Split using ALL data in sample_df
X_train, X_test, y_train, y_test = train_test_split(
    sample_df[['numeric', 'with_missing', 'text']],
    pd.get_dummies(sample_df['label']), 
    random_state = 22)

# Create a FeatureUnion with nested pipeline: process_and_join_features
process_and_join_features = FeatureUnion(
    transformer_list = [
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('imputer', Imputer())
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vectorizer', CountVectorizer())
        ]))
    ]
)

# Instantiate nested pipeline: pl
pl = Pipeline([
        ('union', process_and_join_features),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit pl to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all data: ", accuracy)

Choose a Classification Model

The flexibility of the pipeline structure allows us to quickly try different models, since we only need to edit the model step, and leave the preprocessing steps unchanged.

# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])

# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]

# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(
    df[NON_LABELS],
    dummy_labels,
    0.2, 
    seed = 123)

# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate = False)

# Preprocess the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate = False)

# Complete the pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

Change to Random Forest with one parameter specified.

# Import random forest classifer
from sklearn.ensemble import RandomForestClassifier

# Edit model step in pipeline
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ('clf', RandomForestClassifier(n_estimators = 15))
    ])

# Fit to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

Expert Tricks

Text Preprocessing

NLP tricks for text data
- Tokenize on punctuation to avoid hyphens, underscores, etc.
- Include unigrams and bi-grams in the model to capture important information involving multiple tokens - e.g., ‘middle school’

Special functions: You’ll notice a couple of new steps provided in the pipeline in this and many of the remaining exercises. Specifically, the dim_red step following the vectorizer step , and the scale step preceeding the clf (classification) step.

These have been added in order to account for the fact that you’re using a reduced-size sample of the full dataset in this course. To make sure the models perform as the expert competition winner intended, we have to apply a dimensionality reduction technique, which is what the dim_red step does, and we have to scale the features to lie between -1 and 1, which is what the scale step does.

The dim_red step uses a scikit-learn function called SelectKBest(), applying something called the chi-squared test to select the K “best” features. The scale step uses a scikit-learn function called MaxAbsScaler() in order to squash the relevant features into the interval -1 to 1.

# Import pipeline
from sklearn.pipeline import Pipeline

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Import other preprocessing modules
from sklearn.preprocessing import Imputer
from sklearn.feature_selection import chi2, SelectKBest

# Select 300 best features
chi_k = 300

# Import functional utilities
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.pipeline import FeatureUnion

# Perform preprocessing
get_text_data = FunctionTransformer(combine_text_columns, validate = False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate = False)

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer(
                        token_pattern = TOKENS_ALPHANUMERIC,
                        ngram_range = (1, 2))),
                    ('dim_red', SelectKBest(chi2, chi_k))
                ]))
             ]
        )),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

Interaction Terms

Interatiomn terms let us mathametically describe when tokens appear together. In scikit-learn, it is called PolynomialFeartures()

from sklearn.preprocessing import PolynomialFeartures
interaction = PolynomialFeatures(
    degree = 2,
    interaction_only = True, #does not multiple the column by itself
    include_bias = False #whether to include bias term in our model
)

Bias term allows model to have non-zero y value when x value is zero.

e.g. A baby has weight once it’s born.

The number of interaction terms grows exponentially. Our vectorizer saves memory by using a sparse matrix. However, PolynomialFeartures does not support sparse matrices. SparseInteractions() does. You can get the code for SparseInteractions at this GitHub Gist.


# Instantiate pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                   ngram_range=(1, 2))),  
                    ('dim_red', SelectKBest(chi2, chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree = 2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

Hashing

Adding new features may cause enormous increase in array size. As the array grows, we need more computational power to complete our calculation. The “Hashing” trick is a way of increasing memory efficiency, by limiting the size of the matrix without sacrificing too much model accuracy.

A hash function takes an imput, in this case a token, and outputs a hash value. For example, the input may be a string and the hash value may be an integer. The original paper about the hashing function demonstrates that even if two tokens hashed to the same value, there is very little effect on model accuracy in real world problems.

Hashing is extremely useful when it comes to dimension reduction. Some problems are memory-bound and not easily parallelizable, and hashing enforces a fixed length computation instead of using a mutable datatype (like a dictionary). Here, instead of using the CountVectorizer(), which creates the bag of words representation, we change to HashingVectorizer().

In the end, the model that won the competition is the simple Logistic Regression. It shows that it is not the complex algorithm that matters the most, but the feature constructions and the implementing tricks.

The scikit-learn implementation of HashingVectorizer:

# Import HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer 

# Get text data: text_data
text_data = combine_text_columns(X_train)

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 

# Instantiate the HashingVectorizer: hashing_vec
hashing_vec = HashingVectorizer(token_pattern = TOKENS_ALPHANUMERIC)

# Fit and transform the Hashing Vectorizer
hashed_text = hashing_vec.fit_transform(text_data)

# Create DataFrame and print the head
hashed_df = pd.DataFrame(hashed_text.data)
print(hashed_df.head())

Using HashingVectorizer in a pipeline:

# Import the hashing vectorizer
from sklearn.feature_extraction.text import HashingVectorizer

# Instantiate the winning model pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', HashingVectorizer(
                        token_pattern = TOKENS_ALPHANUMERIC,
                        non_negative = True, 
                        norm = None, 
                        binary = False,
                        ngram_range = (1, 2))),
                    ('dim_red', SelectKBest(chi2, chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree = 2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

If you want to use this model locally, this Jupyter notebook contains all the code you’ve worked so hard on. You can now take that code and build on it!

To Do Better

NLP: stemming, stop-word removel
Model: RandomForest, k-NN, Naive Bayes
Numeric Preprocessing: Imputation strategies
Optimization: Grid search over pipeline objects
Experiment with new scikit-learn techniques