An NLP project in Python to classify school budget items into different categories (DataCamp).
Introduction to the Challenge
Budgets for schools are huge, complex, and not standardized. Hundreds of hours each year are spent manually labelling them.
- Goal: Build a machine learning algorithm that can automate the process
- Dataset:
- Line-item: text description of each item
- 9 Target variables: labels like ‘Textbooks’, ‘Math’, ‘Middle School’
- Supervised classification problem.
We want to build a human-in-the-loop machine learning system. We don’t want to make a hard prediction on whether each item is A or B; instead, we want to say “we are 60% sure that it belongs to A; if not, we are 30% sure it belongs to B…”.
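In scikit-learn terms this means reading predict_proba() rather than predict(); a minimal illustration on toy data (not from the course):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# Toy binary problem standing in for one budget label
X, y = make_classification(random_state = 0)
clf = LogisticRegression().fit(X, y)
# Probabilities per class ("we are p% sure"), not hard 0/1 calls
print(clf.predict_proba(X[:1]))
print(clf.predict(X[:1]))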
Basic EDA
I omit the most frequently used commands for simplicity, e.g. df.describe(), df.info(), etc.
import pandas as pd
df = pd.read_csv('TrainingData.csv')
# Counts the number of different data types
df.dtypes.value_counts()
# Set columns that should be categorical variables
LABELS = ['Function',
'Use',
'Sharing',
'Reporting',
'Student_Type',
'Position_Type',
'Object_Type',
'Pre_K',
'Operating_Status']
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')
# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis = 0)
# Print the converted dtypes
print(df[LABELS].dtypes)
Check the number of labels under each category:
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique)
# Plot number of unique values for each label
num_unique_labels.plot(kind = 'bar')
# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')
# Display the plot
plt.show()
Performance Measurement
The metric used in this problem is log loss. It is a loss function and a measure of error. Our goal is to minimize the error with our model.
- Log Loss for binary classification
- Actual value: $y$ = {1 = yes, 0 = no}
- Prediction (Probability that the value is 1): $p$
- $logloss = -\frac{1}{N}\sum\limits_{i=1}^{N}(y_i\log(p_i) + (1-y_i)\log(1-p_i))$
- The function penalizes being confident and wrong, i.e., assigning a high probability to the incorrect class, much more heavily than being unconfident. For example, with $y = 0$, predicting $p = 0.9$ costs $-\log(0.1) \approx 2.30$, while the less confident $p = 0.5$ costs only $-\log(0.5) \approx 0.69$.
import numpy as np
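Only the import survived from this code block; the course supplies a compute_log_loss() helper used in the tests below. A minimal sketch consistent with those calls (the clipping value eps is an assumption; it prevents log(0) and reproduces the ~1e-14 loss printed for perfect predictions):
def compute_log_loss(predicted, actual, eps = 1e-14):
    """Compute log loss between predicted probabilities and 0/1 actual labels."""
    # Clip probabilities away from exactly 0 and 1 so np.log never sees 0
    predicted = np.clip(predicted, eps, 1 - eps)
    return -np.mean(actual * np.log(predicted)
                    + (1 - actual) * np.log(1 - predicted))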
We can test the logic of our loss function with the following cases:
# Set up cases for testing
actual_labels = np.array([1., 1., 1., 1., 1., 0., 0., 0., 0., 0.])
correct_confident = np.array([0.95, 0.95, 0.95, 0.95, 0.95, 0.05, 0.05, 0.05, 0.05, 0.05])
correct_not_confident = np.array([0.65, 0.65, 0.65, 0.65, 0.65, 0.35, 0.35, 0.35, 0.35, 0.35])
wrong_not_confident = np.array([0.35, 0.35, 0.35, 0.35, 0.35, 0.65, 0.65, 0.65, 0.65, 0.65])
wrong_confident = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.95, 0.95, 0.95, 0.95, 0.95])
# Compute and print log loss for 1st case
correct_confident_loss = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident_loss))
# Compute log loss for 2nd case
correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident_loss))
# Compute and print log loss for 3rd case
wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss))
# Compute and print log loss for 4th case
wrong_confident_loss = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident_loss))
# Compute and print log loss for actual labels
actual_labels_loss = compute_log_loss(actual_labels, actual_labels)
print("Log loss, actual labels: {}".format(actual_labels_loss))
# <script.py> output:
# Log loss, correct and confident: 0.05129329438755058
# Log loss, correct and not confident: 0.4307829160924542
# Log loss, wrong and not confident: 1.049822124498678
# Log loss, wrong and confident: 2.9957322735539904
# Log loss, actual labels: 9.99200722162646e-15
We can see that log loss penalizes highly confident wrong answers much more than any other type, which makes it a good metric for our models.
Create a Simple Model
Many more things can go wrong in complex models. It is always a good approach to start with a very simple model, which gives a sense of how challenging the problem is and how much signal we can pull out using basic methods.
Model with Numeric Data Only
Basic model outline:
- Train basic model on numeric data only: we want to go from raw data to predictions quickly.
- Multi-class logistic regression
- Train classifier on each label separately and use those to predict
Splitting the multi-class dataset is a little tricky in this case.
- We have multiple target variables, and some of the labels have very few data points.
- Solution: StratifiedShuffleSplit –> multilabel_train_test_split() (implementation details in this link)
OneVsRestClassifier()
- Treats each column of y independently
- Fits a separate classifier for each of the columns
The first step is to split the data into a training set and a test set. Some labels don’t occur very often, but we want to make sure that they appear in both the training and the test sets. The course provides a function that makes sure at least min_count examples of each label appear in each split: multilabel_train_test_split.
# Import classifiers
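Only the first comment of this block survived. A sketch of the numeric-only baseline consistent with the outline above (NUMERIC_COLUMNS and the -1000 sentinel fill are assumptions based on the course setup; multilabel_train_test_split is the course helper):
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Use only the numeric columns and fill missing values with a sentinel
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)
# Convert the categorical labels to dummy (0/1) variables
label_dummies = pd.get_dummies(df[LABELS])
# Split so that rare labels appear in both train and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(
    numeric_data_only, label_dummies, size = 0.2, seed = 123)
# Fit one logistic regression per label column
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)
print("Accuracy: {}".format(clf.score(X_test, y_test)))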
Introduction to NLP
Tokenization
- Splitting a string into segments
- The separation rule can be customized, e.g. split by space, by comma, or by a combination of the two.
- Store the segments as a list
- e.g. ‘Natural Language Processing’ –> [‘Natural’, ‘Language’, ‘Processing’]
Bag of Words
- Counts the number of times a particular token appears
- But discards information about word order
CountVectorizer()
- Tokenizes all strings
- Builds a ‘vocabulary’
- Counts the occurrences of each token in the vocabulary.
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Create the token pattern: TOKENS_ALPHANUMERIC (creating tokens that contain only alphanumeric characters)
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' #refer to resources online for how to define this pattern
# Fill missing values in df.Position_Extra
df.Position_Extra.fillna('', inplace = True)
# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern = TOKENS_ALPHANUMERIC)
# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)
# Print the number of tokens and first 15 tokens
msg = "There are {} tokens in Position_Extra if we split on non-alpha numeric"
print(msg.format(len(vec_alphanumeric.get_feature_names())))
print(vec_alphanumeric.get_feature_names()[:15])
In order to get a bag-of-words representation for all of the text data in our DataFrame, we must first convert the text data in each row into a single string. In the previous exercise this wasn’t necessary, because we only looked at one column of data, so each row was already a single string. CountVectorizer expects each row to be a single string, so in order to use all of the text columns, we need a method to turn a list of strings into a single string.
The function combine_text_columns() converts all training text data in the DataFrame to a single string per row, which can be passed to the vectorizer object and made into a bag-of-words using the .fit_transform() method. Note that the function uses NUMERIC_COLUMNS and LABELS to determine which columns to drop.
# Define combine_text_columns()
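Only the defining comment survived here; a sketch matching the description above (assuming NUMERIC_COLUMNS and LABELS are the column lists mentioned):
def combine_text_columns(data_frame, to_drop = NUMERIC_COLUMNS + LABELS):
    """Convert all text columns in each row of data_frame to a single string."""
    # Drop non-text columns that are present in this DataFrame
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis = 1)
    # Replace NaNs with empty strings so the join below works
    text_data.fillna('', inplace = True)
    # Join all text items in a row, separated by spaces
    return text_data.apply(lambda x: " ".join(x), axis = 1)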
N-grams
- An n-gram includes n consecutive words in each segment.
- N-grams maintain some information about word order.
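For instance (an illustration, not from the course code), CountVectorizer can emit unigrams and bigrams together via its ngram_range parameter:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(ngram_range = (1, 2))  # unigrams and bigrams
vec.fit(['natural language processing'])
print(vec.get_feature_names())
# ['language', 'language processing', 'natural', 'natural language', 'processing']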
Model Improvements
Pipeline
Pipeline is a repeatable way to go from raw data to trained model.
- Pipeline object takes sequential list of steps. Output of one step is input to next step
- We can even have a sub-pipeline as one of the steps
- Each step is a tuple with two elements:
- Name: string
- Transform: object implementing .fit() and .transform()
# Import Pipeline
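Only the import comment survived; a sketch of a minimal numeric-only pipeline in the spirit of this step (Imputer is the era-appropriate scikit-learn imputer, as used later in this post; X_train, y_train as split above):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
pl = Pipeline([
    ('imp', Imputer()),                                  # step 1: fill missing values
    ('clf', OneVsRestClassifier(LogisticRegression()))   # step 2: classify each label
])
pl.fit(X_train, y_train)
print("Accuracy: ", pl.score(X_test, y_test))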
Preprocessing Multiple Dtypes
We definitely want to use all available features in one pipeline.
- Problem: the pipeline steps for numeric and text preprocessing can’t follow each other.
- e.g., the output of CountVectorizer can’t be the input to Imputer.
- Solution: FunctionTransformer() & FeatureUnion()
- FunctionTransformer
- Turns a Python function into an object that a scikit-learn pipeline can understand
- Need to write two functions for pipeline preprocessing
- Take entire DataFrame, return numeric columns
- Take entire DataFrame, return text columns
- Can then preprocess numeric and text data in separate pipelines.
- validate = False indicates there is no need to check the input’s data types or for missing values.
- FeatureUnion
- Combines the two sets of features into a single array, which will be the input to our classifier.
# Import FunctionTransformer
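Only the import comment survived; the two selector objects described above can be sketched as follows (combine_text_columns and NUMERIC_COLUMNS as defined earlier):
from sklearn.preprocessing import FunctionTransformer
# Obtain the text data: one combined string per row
get_text_data = FunctionTransformer(combine_text_columns, validate = False)
# Obtain the numeric data: the numeric columns only
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate = False)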
Choose a Classification Model
The flexibility of the pipeline structure allows us to quickly try different models, since we only need to edit the model step, and leave the preprocessing steps unchanged.
# Import FunctionTransformer
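Only the first comment of that cell survived; it presumably assembled the full FeatureUnion pipeline with OneVsRestClassifier(LogisticRegression()) as the 'clf' step. As an aside (not from the course), scikit-learn can also swap a named step in place:
from sklearn.ensemble import RandomForestClassifier
# Replace only the 'clf' step; all preprocessing steps stay unchanged
pl.set_params(clf = RandomForestClassifier(n_estimators = 15))
pl.fit(X_train, y_train)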
Change to a random forest with one parameter specified:
# Import random forest classifier
from sklearn.ensemble import RandomForestClassifier
# Edit model step in pipeline
pl = Pipeline([
('union', FeatureUnion(
transformer_list = [
('numeric_features', Pipeline([
('selector', get_numeric_data),
('imputer', Imputer())
])),
('text_features', Pipeline([
('selector', get_text_data),
('vectorizer', CountVectorizer())
]))
]
)),
('clf', RandomForestClassifier(n_estimators = 15))
])
# Fit to the training data
pl.fit(X_train, y_train)
# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)
Expert Tricks
Text Preprocessing
- NLP tricks for text data
- Tokenize on punctuation to avoid hyphens, underscores, etc.
- Include unigrams and bi-grams in the model to capture important information involving multiple tokens - e.g., ‘middle school’
Special functions: you’ll notice a couple of new steps provided in the pipeline in this and many of the remaining exercises; specifically, the dim_red step following the vectorizer step, and the scale step preceding the clf (classification) step.
These have been added to account for the fact that a reduced-size sample of the full dataset is used in the course. To make sure the models perform as the expert competition winner intended, we have to apply a dimensionality reduction technique, which is what the dim_red step does, and we have to scale the features to lie between -1 and 1, which is what the scale step does.
The dim_red step uses a scikit-learn function called SelectKBest(), applying the chi-squared test to select the K “best” features. The scale step uses a scikit-learn function called MaxAbsScaler() to squash the relevant features into the interval [-1, 1].
# Import pipeline
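Only the import comment survived; a sketch of the pipeline with the dim_red and scale steps added (get_numeric_data, get_text_data, and TOKENS_ALPHANUMERIC as defined earlier; chi_k, the number of features to keep, is a course constant assumed here):
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, MaxAbsScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
pl = Pipeline([
    ('union', FeatureUnion(transformer_list = [
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('imputer', Imputer())
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vectorizer', CountVectorizer(token_pattern = TOKENS_ALPHANUMERIC,
                                           ngram_range = (1, 2))),
            ('dim_red', SelectKBest(chi2, chi_k))   # keep the chi_k best features
        ]))
    ])),
    ('scale', MaxAbsScaler()),                      # squash features into [-1, 1]
    ('clf', OneVsRestClassifier(LogisticRegression()))
])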
Interaction Terms
Interaction terms let us mathematically describe when tokens appear together. In scikit-learn, this is implemented as PolynomialFeatures().
from sklearn.preprocessing import PolynomialFeatures
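A small illustration (not from the course) of degree-2 interaction features, where the third output column is the product of the two inputs:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[1, 0],
              [1, 1]])
interaction = PolynomialFeatures(degree = 2, interaction_only = True, include_bias = False)
print(interaction.fit_transform(X))
# [[1. 0. 0.]
#  [1. 1. 1.]]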
The bias term allows the model to have a non-zero y value when the x values are zero.
- e.g., a baby already has weight at birth.
The number of interaction terms grows exponentially. Our vectorizer saves memory by using a sparse matrix; however, PolynomialFeatures does not support sparse matrices, while SparseInteractions() does. You can get the code for SparseInteractions at this GitHub Gist.
Hashing
Adding new features may cause enormous increase in array size. As the array grows, we need more computational power to complete our calculation. The “Hashing” trick is a way of increasing memory efficiency, by limiting the size of the matrix without sacrificing too much model accuracy.
A hash function takes an input, in this case a token, and outputs a hash value. For example, the input may be a string and the hash value may be an integer. The original paper on the hashing trick demonstrates that even if two tokens hash to the same value, there is very little effect on model accuracy in real-world problems.
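A toy sketch (not from the course) of the idea, mapping any token into a fixed number of columns:
n_features = 16   # fixed, chosen up front; the matrix can never grow past this
def hash_token(token):
    # Collisions are possible and tolerated; accuracy barely suffers.
    # (Python salts hash() per process; scikit-learn uses a stable hash internally.)
    return hash(token) % n_features
print(hash_token('textbooks'), hash_token('middle school'))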
Hashing is extremely useful when it comes to dimensionality reduction. Some problems are memory-bound and not easily parallelizable, and hashing enforces a fixed-length computation instead of using a mutable datatype (like a dictionary). Here, instead of using CountVectorizer(), which creates the bag-of-words representation, we switch to HashingVectorizer().
In the end, the model that won the competition was a simple logistic regression. This shows that it is not the complexity of the algorithm that matters most, but the feature construction and the implementation tricks.
The scikit-learn implementation of HashingVectorizer:
# Import HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
# Get text data: text_data
text_data = combine_text_columns(X_train)
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
# Instantiate the HashingVectorizer: hashing_vec
hashing_vec = HashingVectorizer(token_pattern = TOKENS_ALPHANUMERIC)
# Fit and transform the Hashing Vectorizer
hashed_text = hashing_vec.fit_transform(text_data)
# Create DataFrame and print the head
hashed_df = pd.DataFrame(hashed_text.data)
print(hashed_df.head())
Using HashingVectorizer in a pipeline:
# Import the hashing vectorizer
from sklearn.feature_extraction.text import HashingVectorizer
# Instantiate the winning model pipeline: pl
pl = Pipeline([
('union', FeatureUnion(
transformer_list = [
('numeric_features', Pipeline([
('selector', get_numeric_data),
('imputer', Imputer())
])),
('text_features', Pipeline([
('selector', get_text_data),
('vectorizer', HashingVectorizer(
token_pattern = TOKENS_ALPHANUMERIC,
non_negative = True,
norm = None,
binary = False,
ngram_range = (1, 2))),
('dim_red', SelectKBest(chi2, chi_k))
]))
]
)),
('int', SparseInteractions(degree = 2)),
('scale', MaxAbsScaler()),
('clf', OneVsRestClassifier(LogisticRegression()))
])
If you want to use this model locally, this Jupyter notebook contains all the code you’ve worked so hard on. You can now take that code and build on it!
To Do Better
- NLP: stemming, stop-word removal
- Model: RandomForest, k-NN, Naive Bayes
- Numeric Preprocessing: Imputation strategies
- Optimization: Grid search over pipeline objects
- Experiment with new scikit-learn techniques