Unsupervised Learning in Python

Learn how to cluster, transform, visualize, and extract insights from unlabeled datasets using scikit-learn and scipy (DataCamp).

Unsupervised learning finds patterns in data, but without a specific prediction task in mind.

  • e.g. clustering customers by their purchase patterns

Clustering

K-means clustering

  • Finds clusters of samples
  • Number of clusters must be specified
  • New samples can be assigned to existing clusters
  • k-means remembers the mean of each cluster (the “centroids”)
  • Finds the nearest centroid to each new sample
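
A minimal sketch of this fit/predict workflow, using a tiny made-up 2-D array (the values and names below are illustrative, not the course data):

# Minimal fit/predict sketch on toy data (not the course dataset)
import numpy as np
from sklearn.cluster import KMeans

samples = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 8.8]])
new_samples = np.array([[1.1, 2.1], [7.9, 9.1]])

model = KMeans(n_clusters = 2)
model.fit(samples)                  # learns the cluster centroids
print(model.cluster_centers_)       # the centroids k-means remembers
print(model.predict(new_samples))   # each new sample gets the label of its nearest centroid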

Performance Evaluation

Possible methods:

  1. Compare the clusters to existing labels using a cross-tabulation (if labels are available).

In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.

You have the array samples of grain samples, and a list varieties giving the grain variety for each sample.

# Perform the necessary imports
from sklearn.cluster import KMeans
import pandas as pd

# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters = 3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)
  2. Measure how spread out the clusters are (inertia)
    • Inertia: the sum of squared distances from each sample to the centroid of its cluster
    • It is available after fit() as the attribute inertia_
    • A good clustering has tight clusters (low inertia), but also doesn’t have too many clusters.
    • k-means attempts to minimize the inertia when choosing clusters, and more clusters means lower inertia.
      • What is the best number of clusters?
        • A good rule of thumb is to choose the elbow in the inertia plot, the point where the inertia begins to decrease more slowly.
# Perform the necessary imports
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters = k)

    # Fit model to samples
    model.fit(samples)

    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

Improve Performance

Transforming features

Clustering does not work well on features that have very different variances, since in K-means, feature variance = feature influence.

To solve this problem, we can use StandardScaler to transform each feature to have mean 0 and variance 1.

# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters = 4)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)

# Fit the pipeline to samples
pipeline.fit(samples)

# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels':labels, 'species':species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

# Display ct
print(ct)

MaxAbsScaler and Normalizer are other scalers available in the library.

Note that Normalizer() is different from StandardScaler(). StandardScaler() standardizes each feature (column) by removing the mean and scaling to unit variance, while Normalizer() rescales each sample (row) independently of the others to unit norm. More generally, “normalization” usually means rescaling values into a range such as [0, 1], while “standardization” means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).
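
A minimal sketch of the difference, using a tiny made-up array (values are illustrative only):

# Toy comparison of StandardScaler (per column) vs Normalizer (per row)
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

print(StandardScaler().fit_transform(X))  # each column gets mean 0 and variance 1
print(Normalizer().fit_transform(X))      # each row is rescaled to unit (L2) norm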

# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
import pandas as pd

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(10)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)

# Predict the cluster labels: labels
labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values('labels'))

Exploratory Visualizations

t-SNE

  • t-SNE = “t-distributed stochastic neighbor embedding”
  • Maps samples to 2D (or 3D) space
  • Map approximately preserves nearness of samples
  • Great for inspecting datasets

Technical details about t-SNE:

  • Only has a fit_transform() method, which simultaneously fits and transforms the data.
    • Does not have fit() and transform(), which means it cannot extend the map to include new data points, and must start over each time.
  • The learning rate is important for obtaining a good map
    • Try values between 50 and 200
    • A wrong choice will cause the points to bunch together
  • The axes of the t-SNE plot do not have any interpretable meaning, and are different every time t-SNE is applied, even on the same data.
    • However, the clusters’ positions relative to each other are the same.
# Import TSNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Create a TSNE instance: model
model = TSNE(learning_rate = 200)

# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(samples)

# Select the 0th feature: xs
xs = tsne_features[:,0]

# Select the 1st feature: ys
ys = tsne_features[:,1]

# Scatter plot, coloring by variety_numbers
plt.scatter(xs, ys, c = variety_numbers)
plt.show()

t-SNE can also be used to quickly extract insights from high-dimensional data, e.g. to see which samples are close to each other:

# Import TSNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Create a TSNE instance: model
model = TSNE(learning_rate = 50)

# Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)

# Select the 0th feature: xs
xs = tsne_features[:, 0]

# Select the 1st feature: ys
ys = tsne_features[:, 1]

# Scatter plot
plt.scatter(xs, ys, alpha = 0.5)

# Annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
plt.show()

Hierarchical Clustering

Arranges samples into hierarchical clusters.

Dendrogram
A dendrogram groups the samples into larger and larger clusters, and should be read from the bottom up. Each vertical line represents a cluster, and a joining of lines indicates a merging of clusters.

  • Steps for “agglomerative” hierarchical clustering:
    1. Every country begins in a separate cluster
    2. At each step, the two closest clusters are merged
    3. Continue until all countries are in a single cluster
  • “Divisive clustering” works the other way around.
# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Import normalize
from sklearn.preprocessing import normalize

# Normalize the movements: normalized_movements
# (Note: hierarchical clustering cannot be put in a scikit-learn pipeline,
# so we normalize separately)
normalized_movements = normalize(movements)

# Calculate the linkage: mergings
mergings = linkage(normalized_movements, method = 'complete')

# Plot the dendrogram
dendrogram(
    mergings,
    labels = companies,
    leaf_rotation = 90,
    leaf_font_size = 6
)
plt.show()
  • Height on the dendrogram (y-axis) = distance between the merging clusters.
  • If we specify a maximum height, we can extract flat cluster labels by cutting the dendrogram at that height, so that clusters further apart than that distance are not merged.
  • How the distance between clusters is measured is specified by the 'method' parameter in linkage().
    • In complete linkage, the distance between clusters is the distance between the furthest points of the clusters.
    • In single linkage, the distance between clusters is the distance between the closest points of the clusters.

Using a different method will create a different dendrogram on the same dataset.

We can also extract the cluster labels using the fcluster() function.

# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Calculate the linkage: mergings
mergings = linkage(samples, method = 'single')

# Plot the dendrogram
dendrogram(
    mergings,
    labels = country_names,
    leaf_rotation = 90,
    leaf_font_size = 6
)
plt.show()

# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Use fcluster to extract labels: labels
# (specify a maximum height of 6)
labels = fcluster(mergings, 6, criterion = 'distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)

Dimension Reduction

Basics:

  • More efficient storage and computation
  • Remove less-informative “noise” features, which cause problems for prediction tasks, e.g. classification, regression

PCA

PCA = “Principal Component Analysis”, a fundamental dimension reduction technique. It learns the “principal components” of the data, which are the directions in which the samples vary the most.

  • Steps:

    1. Decorrelation
      • Rotate the data samples to be aligned with the axes (this decorrelates the data); these axes are also the directions of the principal components.
        • Note that we are talking about the axes of the point cloud here, not the actual x and y axes of the plot.
      • Shift the data samples so they have mean 0
      • No information is lost
    2. Dimension reduction
      • Discard the low-variance “noisy” features, and preserve only the high-variance “informative” features.
      • We only need to specify the number of principal components to perform the reduction.
  • Technical Details

    • PCA is a scikit-learn component like KMeans or StandardScaler
      • fit() learns the transformation from given data
      • transform() applies the learned transformation, and can also be applied to new data
    • PCs are available as components_ attribute of PCA object

Step 1: Decorrelation

# Perform the necessary imports
from sklearn.decomposition import PCA
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# Create PCA instance: model
model = PCA()

# Apply the fit_transform method of model to grains: pca_features
pca_features = model.fit_transform(grains)

# Assign 0th column of pca_features: xs
xs = pca_features[:,0]

# Assign 1st column of pca_features: ys
ys = pca_features[:,1]

# Scatter plot xs vs ys
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Calculate the Pearson correlation of xs and ys
correlation, pvalue = pearsonr(xs, ys)

# Display the correlation
print(correlation)  # approximately 0

Step 2: Dimension Reduction

  • Intrinsic Dimension
    • Intrinsic dimension = number of features needed to approximate the dataset. It helps to answer “What is the most compact representation of the samples?”
    • PCA identifies the intrinsic dimension, which is the number of PCA features with significant variance
    • There is not always one correct answer when choosing the number of PCs.
# Perform the necessary imports
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Make a scatter plot of the untransformed points
plt.scatter(grains[:,0], grains[:,1])

# Create a PCA instance: model
model = PCA()

# Fit model to points
model.fit(grains)

# Get the mean of the grain samples: mean
mean = model.mean_

# Get the first principal component: first_pc
first_pc = model.components_[0, :]

# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color = 'red', width = 0.01)

# Keep axes on same scale
plt.axis('equal')
plt.show()

Variance of each PC

# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Create scaler: scaler
scaler = StandardScaler()

# Create a PCA instance: pca
pca = PCA()

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)

# Fit the pipeline to 'samples'
pipeline.fit(samples)

# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()

Performing reduction

# Import PCA
from sklearn.decomposition import PCA

# Create a PCA model with 2 components: pca
pca = PCA(n_components = 2)

# Fit the PCA instance to the scaled samples
pca.fit(scaled_samples)

# Transform the scaled samples: pca_features
pca_features = pca.transform(scaled_samples)

# Print the shape of pca_features
print(pca_features.shape)

Special Case

Sometimes PCA cannot be applied directly to a particular type of dataset, such as tf-idf word-frequency arrays, which are large and sparse.

To create such an array, use TfidfVectorizer from sklearn. It transforms a list of documents into a word-frequency array, which it outputs as a csr_matrix. It has fit() and transform() methods like other sklearn objects.

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()

# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)

# Print result of toarray() method
print(csr_mat.toarray())

# Get the words: words
words = tfidf.get_feature_names()

# Print words
print(words)

This array is “sparse”, because most entries are zero. We can use scipy.sparse.csr_matrix instead of a NumPy array. Since csr_matrix remembers only the non-zero entries, it saves a lot of memory.
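
A minimal sketch of how csr_matrix stores only the non-zero entries (toy array, not the course data):

# Toy example: csr_matrix keeps only the non-zero entries
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0.0, 0.5, 0.0], [0.0, 0.0, 0.2]])
sparse = csr_matrix(dense)

print(sparse.nnz)        # number of stored (non-zero) entries: 2
print(sparse.toarray())  # convert back to a dense NumPy array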

However, scikit-learn’s PCA doesn’t support csr_matrix. We use scikit-learn’s TruncatedSVD instead, which performs the same kind of transformation and accepts csr_matrix input.

# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components = 50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters = 6)

# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)

# Import pandas
import pandas as pd

# Fit the pipeline to articles
pipeline.fit(articles)

# Calculate the cluster labels: labels
labels = pipeline.predict(articles)

# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})

# Display df sorted by cluster label
print(df.sort_values('label'))

NMF

NMF = “non-negative matrix factorization”, another dimension reduction technique. Compared with PCA, NMF models are interpretable. However, all sample features must be non-negative (>= 0). NMF achieves its interpretability by decomposing samples as sums of their parts.

Basics:

  • NMF has components, just like PCA has principal components.
  • Dimension of components = number of features in each sample
  • Reconstruction of sample:
    • nmf_features @ components ≈ original samples (a matrix product), which can be performed with the @ operator in Python 3.5+ (see the sketch after these lists)
    • This is the “Matrix Factorization” in NMF

Technical details:

  • Follows fit() / transform() pattern
  • Must specify number of components e.g. NMF(n_components = 2)
  • Works with NumPy arrays and with csr_matrix
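
A minimal sketch of the reconstruction on a tiny made-up non-negative array (not the articles data):

# Toy check of the matrix-factorization reconstruction
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1.0, 0.5, 0.0], [0.0, 0.5, 1.0], [1.0, 1.0, 1.0]])

model = NMF(n_components = 2)
features = model.fit_transform(X)

# The product of the two factors approximately recovers the original samples
print(features @ model.components_)
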
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components = 6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Import pandas
import pandas as pd

# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index = titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway', :])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington', :])

Notice that for both actors, NMF feature 3 has by far the highest value. This means that both articles are reconstructed using mainly the 3rd NMF component; as the next section shows, NMF components represent topics (for instance, acting!).

Word Frequencies

NMF decomposes documents as combinations of common topics (or “themes”).

  • Suppose we have a word-frequency array with one row per document and one column per word (e.g. 4 words).
  • Measure presence of words in each document using “tf-idf”
    • “tf” = frequency of word in document
      • e.g. If 10% of the words are “datacamp”, then the “tf” of the word “datacamp” will be 0.1
    • “idf” = a weighting scheme that reduces the influence of frequent words, like “the”
  • NMF components represent topics, and NMF features combine topics into documents
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

Previously, you saw that the 3rd NMF feature value was high for the articles about actors Anne Hathaway and Denzel Washington. In this exercise, identify the topic of the corresponding NMF component.

The NMF model you built earlier is available as model, while words is a list of the words that label the columns of the word-frequency array.

# Import pandas
import pandas as pd

# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_, columns = words)

# Print the shape of the DataFrame
print(components_df.shape)

# Select row 3: component
component = components_df.iloc[3,:]

# Print result of nlargest
print(component.nlargest())

Encoded Images

NMF decomposes images as combinations of common patterns.

  • NMF components represent patterns that frequently occur in the images, and NMF features combine those patterns into a whole image.
  • “Grayscale” image = no colors, only shades of gray.
    • Since there are only shades of grey, it can be encoded by the brightness of every pixel.
      • Brightness $\in[0,1]$, with 0 as black, 1 as white.
    • Therefore, we can convert a 2D image to a 2D array, with each number representing one pixel.
    • We can then flatten the arrays by enumerating the entries (see the sketch after this list):
      • Read the 2D array row-by-row and from left to right.
    • Then, for a collection of images of the same size, we can further encode them as a 2D array:
      • Each row corresponds to an image
      • Each column corresponds to a location of a pixel
  • The entries are all non-negative, so we can apply NMF.
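
A minimal sketch of flattening a tiny 2-D “image” row by row (toy array, not the LED-digit data):

# Toy example: flatten a 2x3 grayscale "image" row by row, then recover it
import numpy as np

image = np.array([[0.0, 1.0, 0.5],
                  [1.0, 0.0, 0.5]])

flat = image.flatten()       # reads the array row by row, left to right
print(flat)                  # 0, 1, 0.5, 1, 0, 0.5
print(flat.reshape((2, 3)))  # reshaping recovers the original image
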
# Import pyplot
from matplotlib import pyplot as plt

# Select the 0th row: digit
digit = samples[0, :]

# Print digit: an array of 0 and 1s
print(digit)

# Reshape digit to a 13x8 array: bitmap
bitmap = digit.reshape((13, 8))

# Print bitmap
print(bitmap)

# Use plt.imshow to display bitmap
plt.imshow(bitmap, cmap = 'gray', interpolation = 'nearest')
plt.colorbar()
plt.show()

NMF expresses each digit as a sum of its components:

# Import NMF
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

# Define a function to show a component as an image
def show_as_image(sample):
    bitmap = sample.reshape((13, 8))
    plt.figure()
    plt.imshow(bitmap, cmap = 'gray', interpolation = 'nearest')
    plt.colorbar()
    plt.show()

# Create an NMF model: model
model = NMF(7)

# Apply fit_transform to samples: features
features = model.fit_transform(samples)

# Call show_as_image on each component
for component in model.components_:
    show_as_image(component)

# Assign the 0th row of features: digit_features
digit_features = features[0, :]

# Print digit_features
print(digit_features)

If we do the same thing with PCA, we notice that the PCA components do not represent meaningful parts of the LED digit images:

# Import PCA
from sklearn.decomposition import PCA

# Create a PCA instance: model
model = PCA(7)

# Apply fit_transform to samples: features
features = model.fit_transform(samples)

# Call show_as_image on each component
for component in model.components_:
    show_as_image(component)

Building a Recommender System

Suppose your task is to recommend articles similar to the article currently being read by a customer.

Since similar articles should have similar topics, we can apply NMF to the word-frequency array, and use the resulting NMF features, whose values describe the topics. Therefore, similar documents should have similar NMF feature values.

  • Problem: although different versions of the same document have similar topic proportions, the exact feature values may be different.
    • E.g. one version may use many meaningless words to convey the same thing.
  • Solution:
    • On a scatter plot, all these similar versions of the same document lie on a single line passing through the origin.
    • Therefore, when comparing two documents, it’s a good idea to compare these lines.
    • We can achieve that by looking at the cosine similarity, which uses the angle between the lines (see the sketch after this list).
      • Higher values indicate greater similarity
      • max = 1
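
A minimal sketch of cosine similarity between two toy feature vectors (illustrative values only):

# Toy cosine similarity: vectors on the same line through the origin score 1.0
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different length

cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # 1.0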

In this exercise and the next, you’ll use what you’ve learned about NMF to recommend popular music artists! You are given a sparse array artists whose rows correspond to artists and whose columns correspond to users. The entries give the number of times each artist was listened to by each user.

In this exercise, build a pipeline and transform the array into normalized NMF features. The first step in the pipeline, MaxAbsScaler, transforms the data so that all users have the same influence on the model, regardless of how many different artists they’ve listened to. In the next exercise, you’ll use the resulting normalized NMF features for recommendation!

# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline

# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()

# Create an NMF model: nmf
nmf = NMF(20)

# Create a Normalizer: normalizer
normalizer = Normalizer()

# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)

# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)

# Import pandas
import pandas as pd

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index = artist_names)

# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen', :]

# Compute cosine similarities: similarities
similarities = df.dot(artist)

# Display those with highest cosine similarity
print(similarities.nlargest())