Learn how to cluster, transform, visualize, and extract insights from unlabeled datasets using scikit-learn and scipy (DataCamp).
Unsupervised learning finds patterns in data, but without a specific prediction task in mind.
- e.g. clustering customers by their purchase patterns
Clustering
K-means clustering
- Finds clusters of samples
- Number of clusters must be specified
- New samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the “centroids”)
- Finds the nearest centroid to each new sample
Performance Evaluation
Possible methods:
- Compare the clusters to existing labels using a cross-tabulation (if labels are available).
In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.
You have the array `samples` of grain samples, and a list `varieties` giving the grain variety for each sample.
```python
# Create a KMeans model with 3 clusters: model
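# NOTE: the rest of this snippet is a sketch; `samples` and `varieties` are
# assumed to be provided as described in the exercise.
from sklearn.cluster import KMeans
import pandas as pd

# Create the model and obtain cluster labels in one step
model = KMeans(n_clusters=3)
labels = model.fit_predict(samples)

# Cross-tabulate the cluster labels against the known grain varieties
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)
```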
- Measure how spread out the clusters are (inertia)
  - Inertia: the distance from each sample to the centroid of its cluster
  - It is available after `fit()` as the attribute `inertia_`
- A good clustering has tight clusters, which means the inertia is low, but also doesn’t have too many clusters.
- k-means attempts to minimize the inertia when choosing clusters, and more clusters means lower inertia.
- What is the best number of clusters?
  - A good rule of thumb is to choose the “elbow” in the inertia plot, where the inertia begins to decrease more slowly.
```python
ks = range(1, 6)
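# Sketch of the elbow plot; assumes `samples` holds the grain measurements.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
for k in ks:
    # Fit a KMeans model with k clusters and record its inertia
    model = KMeans(n_clusters=k)
    model.fit(samples)
    inertias.append(model.inertia_)

# Plot ks vs. inertias and look for the "elbow"
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
```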
Improve Performance
Transforming features
Clustering does not work well on features that have very different variances, since in K-means, feature variance = feature influence.
To solve this problem, we can use `StandardScaler` to transform each feature to have mean 0 and variance 1.
```python
# Perform the necessary imports
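# Sketch of a scaling + clustering pipeline; `samples` and the label list
# `species` are assumed to be provided by the exercise.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Chain the scaler and KMeans so scaling is applied before clustering
scaler = StandardScaler()
kmeans = KMeans(n_clusters=4)
pipeline = make_pipeline(scaler, kmeans)

# Fit the pipeline and obtain cluster labels
labels = pipeline.fit_predict(samples)

# Compare clusters to the known labels with a cross-tabulation
df = pd.DataFrame({'labels': labels, 'species': species})
print(pd.crosstab(df['labels'], df['species']))
```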
`MaxAbsScaler` and `Normalizer` are other scalers available in the library.
Note that `Normalizer()` is different from `StandardScaler()`. While `StandardScaler()` standardizes features by removing the mean and scaling to unit variance, `Normalizer()` rescales each sample independently of the others. Normalization typically means rescaling the values into a range of [0, 1], while standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).
```python
# Perform the necessary imports
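# Sketch of a Normalizer + KMeans pipeline on daily stock price movements;
# `movements` and `companies` are assumed to be provided by the exercise.
import pandas as pd
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Normalizer rescales each sample (each company's movements) independently
normalizer = Normalizer()
kmeans = KMeans(n_clusters=10)
pipeline = make_pipeline(normalizer, kmeans)

# Fit the pipeline and predict the cluster labels
labels = pipeline.fit_predict(movements)

# List companies alongside their cluster labels, sorted by cluster
df = pd.DataFrame({'labels': labels, 'companies': companies})
print(df.sort_values('labels'))
```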
Exploratory Visualizations
t-SNE
- t-SNE = “t-distributed stochastic neighbor embedding”
- Maps samples to 2D space (or 3D)
- Map approximately preserves nearness of samples
- Great for inspecting datasets
Technical details about t-SNE:
- Only has a `fit_transform()` method, which simultaneously fits and transforms the data.
- Does not have separate `fit()` and `transform()` methods, which means it cannot extend the map to include new data points, and must start over each time.
- The learning rate is important for producing a good embedding
  - Try values between 50 and 200
  - A wrong choice will lead to points bunching together
- The axes of the t-SNE plot do not have any interpretable meaning, and are different every time t-SNE is applied, even on the same data.
- However, the clusters’ positions relative to each other are the same.
```python
# Import TSNE
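# Sketch of a t-SNE scatter plot of the grain samples; `samples` and the
# colouring list `variety_numbers` are assumed from the exercise.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Create a TSNE instance with a learning rate in the suggested 50-200 range
model = TSNE(learning_rate=200)

# Fit and transform in one step (t-SNE has no separate fit/transform)
tsne_features = model.fit_transform(samples)

# Scatter the two t-SNE dimensions, coloured by grain variety
xs = tsne_features[:, 0]
ys = tsne_features[:, 1]
plt.scatter(xs, ys, c=variety_numbers)
plt.show()
```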
Or, to quickly extract insights from high-dimensional data (e.g. to see how some samples are close to each other):
```python
# Import TSNE
from sklearn.manifold import TSNE
# Create a TSNE instance: model
model = TSNE(learning_rate = 50)
# Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)
# Select the 0th feature: xs
xs = tsne_features[:, 0]
# Select the 1th feature: ys
ys = tsne_features[:, 1]
# Scatter plot
plt.scatter(xs, ys, alpha = 0.5)
# Annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
plt.show()
```
Hierarchical Clustering
Arranges samples into hierarchical clusters.
Dendrogram
It groups the samples into larger and larger clusters, and should be read from the bottom up. Each vertical line represents a cluster, and a joining of lines indicates a merging of clusters.
- Steps for “agglomerative” hierarchical clustering:
  - Every country begins in a separate cluster
  - At each step, the two closest clusters are merged
  - Continue until all countries are in a single cluster
- “Divisive clustering” works the other way around.
```python
# Perform the necessary imports
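# Sketch of agglomerative clustering of the grain samples; `samples` and
# `varieties` are assumed from the earlier exercises.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Calculate the linkage using complete linkage: mergings
mergings = linkage(samples, method='complete')

# Plot the dendrogram, labelling the leaves with the grain varieties
dendrogram(mergings, labels=varieties, leaf_rotation=90, leaf_font_size=6)
plt.show()
```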
- Height on dendrogram (y-axis) = distance between merging clusters.
- If we specify a maximum height, we can ask the model to stop merging clusters once the distance between them reaches the specified value.
- How the distance between clusters is measured is specified by the `method` parameter in `linkage()`.
  - In complete linkage, the distance between clusters is the distance between the furthest points of the clusters.
  - In single linkage, the distance between clusters is the distance between the closest points of the clusters.
Using different methods will create different dendrograms on the same dataset.
We can also extract the cluster labels using the `fcluster` function.
```python
# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# Calculate the linkage: mergings
mergings = linkage(samples, method = 'single')
# Plot the dendrogram
dendrogram(
    mergings,
    labels = country_names,
    leaf_rotation = 90,
    leaf_font_size = 6
)
plt.show()
# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster
# Use fcluster to extract labels: labels
# Specify a maximum height of 6
labels = fcluster(mergings, 6, criterion = 'distance')
# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])
# Display ct
print(ct)
```
Dimension Reduction
Basics:
- More efficient storage and computation
- Remove less-informative “noise” features, which cause problems for prediction tasks, e.g. classification, regression
PCA
PCA = “Principal Component Analysis” is a fundamental dimension reduction technique. It learns the “principal components” of the data, which are the directions in which the samples vary the most.
Steps:
- Decorrelation
  - Rotates the data samples to be aligned with the axes (which decorrelates the data); these axes are the directions of the principal components.
    - Note that we are talking about the axes of the point cloud here, not the actual x and y axes of the plot.
  - Shifts the data samples so they have mean 0
  - No information is lost
- Dimension reduction
  - Discards the low-variance “noisy” features and preserves only the high-variance “informative” features.
  - We only need to specify the number of PCs to perform the reduction.
Technical Details
- PCA is a scikit-learn component like `KMeans` or `StandardScaler`
  - `fit()` learns the transformation from the given data
  - `transform()` applies the learned transformation, and can also be applied to new data
- PCs are available as the `components_` attribute of the PCA object
Step 1: Decorrelation
```python
# Import PCA
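# Sketch of decorrelating two correlated measurements with PCA; `grains` is
# assumed to be a 2-column array of grain width and length.
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

# Create a PCA instance and decorrelate the data: pca_features
model = PCA()
pca_features = model.fit_transform(grains)

# Scatter the decorrelated points
xs = pca_features[:, 0]
ys = pca_features[:, 1]
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# The Pearson correlation of the decorrelated features is approximately 0
correlation, pvalue = pearsonr(xs, ys)
print(correlation)
```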
Step 2: Dimension Reduction
- Intrinsic dimension
  - Intrinsic dimension = the number of features needed to approximate the dataset. It helps to answer “What is the most compact representation of the samples?”
  - PCA identifies the intrinsic dimension: it is the number of PCA features with significant variance.
  - There is not always one correct answer when choosing the number of PCs.
```python
# Make a scatter plot of the untransformed points
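# Sketch of plotting the first principal component as an arrow over the data;
# `grains` is assumed as in the previous snippet.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

plt.scatter(grains[:, 0], grains[:, 1])

# Fit PCA and get the mean of the data and the first principal component
model = PCA()
model.fit(grains)
mean = model.mean_
first_pc = model.components_[0, :]

# Draw the first principal component as an arrow starting at the mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
plt.axis('equal')
plt.show()
```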
Variance of each PC:
```python
# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
# Create scaler: scaler
scaler = StandardScaler()
# Create a PCA instance: pca
pca = PCA()
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)
# Fit the pipeline to 'samples'
pipeline.fit(samples)
# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
```
Performing the reduction:
```python
# Import PCA
from sklearn.decomposition import PCA
# Create a PCA model with 2 components: pca
pca = PCA(n_components = 2)
# Fit the PCA instance to the scaled samples
pca.fit(scaled_samples)
# Transform the scaled samples: pca_features
pca_features = pca.transform(scaled_samples)
# Print the shape of pca_features
print(pca_features.shape)
```
Special Case
Sometimes PCA cannot be applied directly to a particular type of dataset, such as tf-idf word-frequency arrays.
For these, use the `TfidfVectorizer` from sklearn. It transforms a list of documents into a word-frequency array, which it outputs as a `csr_matrix`. It has `fit()` and `transform()` methods like other sklearn objects.
```python
# Import TfidfVectorizer
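# Sketch of TfidfVectorizer on a toy list of documents.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']

# Create a TfidfVectorizer and transform the documents into a csr_matrix
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)

# Print the tf-idf array and the words labelling its columns
print(csr_mat.toarray())
print(tfidf.get_feature_names_out())  # get_feature_names() in older scikit-learn versions
```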
This array is “sparse”, because most entries are zero, so we can use a `scipy.sparse.csr_matrix` instead of a NumPy array. Since `csr_matrix` stores only the non-zero entries, it saves a lot of memory.
However, scikit-learn’s PCA doesn’t support `csr_matrix`. We use scikit-learn’s `TruncatedSVD` instead, which performs the same transformation as PCA.
```python
# Perform the necessary imports
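# Sketch of a TruncatedSVD + KMeans pipeline for a sparse word-frequency array;
# `articles` (a csr_matrix) and `titles` are assumed from the exercise.
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# TruncatedSVD accepts csr_matrix input, unlike PCA
svd = TruncatedSVD(n_components=50)
kmeans = KMeans(n_clusters=6)
pipeline = make_pipeline(svd, kmeans)

# Fit the pipeline and obtain a cluster label for each article
labels = pipeline.fit_predict(articles)

# Show which articles ended up in which cluster
df = pd.DataFrame({'label': labels, 'article': titles})
print(df.sort_values('label'))
```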
NMF
NMF = “non-negative matrix factorization” is also a dimension reduction technique. Compared with PCA, NMF models are interpretable; NMF achieves this interpretability by decomposing samples as sums of their parts. However, all sample features must be non-negative (>= 0).
Basics:
- NMF has components, just like PCA has principal components.
- Dimension of components = number of features in each sample
- Reconstruction of a sample: `nmf_features` multiplied by `components` approximately recovers the original sample (a product of matrices, which can be performed with the `@` operator in Python 3.5+)
  - This is the “matrix factorization” in NMF
Technical details:
- Follows the `fit()` / `transform()` pattern
- Must specify the number of components, e.g. `NMF(n_components = 2)`
- Works with NumPy arrays and with `csr_matrix`
```python
# Import NMF
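# Sketch of fitting NMF to the tf-idf `articles` array and inspecting two rows;
# `articles` and `titles` are assumed from the Wikipedia-articles exercise.
import pandas as pd
from sklearn.decomposition import NMF

# Create an NMF instance with 6 components and compute the NMF features
model = NMF(n_components=6)
nmf_features = model.fit_transform(articles)

# Put the NMF features in a DataFrame indexed by article title
df = pd.DataFrame(nmf_features, index=titles)

# Inspect the feature values for the two actors discussed below
print(df.loc['Anne Hathaway'])
print(df.loc['Denzel Washington'])
```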
Notice that for both actors, NMF feature 3 has by far the highest value. This means that both articles are reconstructed using mainly the 3rd NMF component. In the next video, you’ll see why: NMF components represent topics (for instance, acting!).
Word Frequencies
NMF decomposes documents as combinations of common topics (or “themes”).
- Suppose we have a word-frequency array: 4 words, many documents.
- Measure the presence of words in each document using “tf-idf”:
  - “tf” = frequency of the word in the document
    - e.g. if 10% of the words in a document are “datacamp”, then the “tf” of “datacamp” for that document is 0.1
  - “idf” = a weighting scheme that reduces the influence of frequent words, like “the”
- NMF components represent topics, and NMF features combine topics into documents
```python
# Import NMF
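# Sketch showing the shapes involved; `articles` is assumed to be the tf-idf
# word-frequency array with one row per document.
from sklearn.decomposition import NMF

model = NMF(n_components=6)

# NMF features: one row per document, one column per topic
nmf_features = model.fit_transform(articles)
print(nmf_features.shape)

# NMF components: one row per topic, one column per word
print(model.components_.shape)
```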
Previously, you saw that the 3rd NMF feature value was high for the articles about actors Anne Hathaway and Denzel Washington. In this exercise, identify the topic of the corresponding NMF component.
The NMF model you built earlier is available as `model`, while `words` is a list of the words that label the columns of the word-frequency array.
```python
# Import pandas
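# Sketch of inspecting component 3; `model` and `words` are as described above.
import pandas as pd

# Create a DataFrame of the NMF components, with words labelling the columns
components_df = pd.DataFrame(model.components_, columns=words)

# Select the 3rd component and print its five largest word weights (its "topic")
component = components_df.iloc[3]
print(component.nlargest())
```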
Encoded Images
NMF decomposes images as combinations of common patterns.
- NMF components represent patterns that frequently occur in the images, and NMF features combine the patterns into a whole image.
- “Grayscale” image = no colors, only shades of gray.
- Since there are only shades of grey, it can be encoded by the brightness of every pixel.
- Brightness $\in[0,1]$, with 0 as black, 1 as white.
- Therefore, we can convert a 2D image to a 2D array, with each number representing one pixel.
- We can then flatten the arrays by enumerating the entries:
- Read the 2D array row-by-row and from left to right.
- Then, for a collection of images of the same size, we can further encode them as a 2D array:
- Each row corresponds to an image
- Each column corresponds to a location of a pixel
- The entries are all non-negative, so NMF can be applied.
```python
# Import pyplot
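# Sketch of displaying one LED-digit sample as an image; `samples` is assumed
# to hold the flattened 13x8 grayscale images, one per row.
import matplotlib.pyplot as plt

# Select the 0th row (one flattened image) and reshape it back to 13 x 8
digit = samples[0, :]
bitmap = digit.reshape(13, 8)

# Display the bitmap in grayscale with a colorbar
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()
```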
NMF expresses each digit as a sum of the components:
```python
# Import NMF
from sklearn.decomposition import NMF
# Define function to show component as image:
def show_as_image(sample):
    bitmap = sample.reshape((13, 8))
    plt.figure()
    plt.imshow(bitmap, cmap = 'gray', interpolation = 'nearest')
    plt.colorbar()
    plt.show()
# Create an NMF model: model
model = NMF(7)
# Apply fit_transform to samples: features
features = model.fit_transform(samples)
# Call show_as_image on each component
for component in model.components_:
    show_as_image(component)
# Assign the 0th row of features: digit_features
digit_features = features[0, :]
# Print digit_features
print(digit_features)
```
If we do the same thing with PCA, we notice that the PCA components do not represent meaningful parts of the LED digit images:
```python
# Import PCA
from sklearn.decomposition import PCA
# Create a PCA instance: model
model = PCA(7)
# Apply fit_transform to samples: features
features = model.fit_transform(samples)
# Call show_as_image on each component
for component in model.components_:
    show_as_image(component)
```
Building a Recommender System
Suppose your task is to recommend articles similar to the article currently being read by a customer.
Since similar articles should have similar topics, we can apply NMF to the word-frequency array, and use the resulting NMF features, whose values describe the topics. Therefore, similar documents should have similar NMF feature values.
- Problem: although different versions of the same document have similar topic proportions, the exact feature values may be different.
- E.g. one version may use many meaningless words to convey the same thing.
- Solution:
- On a scatter plot, all these similar versions of the same document lie on a single line passing through the origin.
- Therefore, when comparing two documents, it’s a good idea to compare these lines.
- We can achieve that by looking at the cosine similarity, which uses the angle between the lines.
- Higher values indicate greater similarity
- max = 1
In this exercise and the next, you’ll use what you’ve learned about NMF to recommend popular music artists! You are given a sparse array `artists` whose rows correspond to artists and whose columns correspond to users. The entries give the number of times each artist was listened to by each user.
In this exercise, build a pipeline and transform the array into normalized NMF features. The first step in the pipeline, `MaxAbsScaler`, transforms the data so that all users have the same influence on the model, regardless of how many different artists they’ve listened to. In the next exercise, you’ll use the resulting normalized NMF features for recommendation!
```python
# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline
# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()
# Create an NMF model: nmf
nmf = NMF(20)
# Create a Normalizer: normalizer
normalizer = Normalizer()
# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)
# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)
# Import pandas
import pandas as pd
# Create a DataFrame: df
df = pd.DataFrame(norm_features, index = artist_names)
# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen', :]
# Compute cosine similarities: similarities
similarities = df.dot(artist)
# Display those with highest cosine similarity
print(similarities.nlargest())
```