DataCamp course notes on data cleaning.
Common Data Problems
- Inconsistent column names (capitalization)
- Missing data
- Outliers
- Duplicate rows (can bias analysis and should be dropped)
- Untidy data
- Need to process data across columns
- Column types can signal unexpected data values
import pandas as pd
Exploratory Data Analysis
Frequency count
value_counts() is a method used here to count the number of unique values in a column, in descending order.

df.info() #shows the data types, non-null counts, and column names
#Before this step, make sure there are no special characters in the column names (here 'continent' is the column name)
df.continent.value_counts(dropna = False) #dropna = False also counts the number of NA values, if any
df['continent'].value_counts(dropna = False) #equivalent bracket notation
Summary Statistics
On numeric data, we can use df.describe() to easily get the count, mean, standard deviation, and the five-number summary (min, 25%, 50%, 75%, max). When the difference between max and min is rather large, it's good to plot the data on a log scale.
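For example, a minimal sketch on a hypothetical df:

print(df.describe()) #count, mean, std, min, 25%, 50%, 75%, max for each numeric column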
Visual EDA
Visual EDA can easily reveal abnormalities.
To look at the frequencies:
- Boxplots for discrete data counts
# x = continent, y = population
df.boxplot(column = 'population',
           by = 'continent', rot = 90) #rotate the x labels by 90 degrees
plt.show()
Outliers are points shown above or below the whiskers.
- Histograms for continuous data counts

import matplotlib.pyplot as plt
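# a minimal sketch (assumes a numeric 'population' column)
df['population'].plot(kind = 'hist')
plt.show()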
Notice that not all outliers are errors; some may be valid.
Scatter plots are used to capture the relationship between two numeric variables and flag potentially bad data that cannot be found by looking at one variable alone.

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create and display the first scatter plot
df.plot(kind = 'scatter',
        x = 'initial_cost',
        y = 'total_est_fee',
        rot = 70)
plt.show()

# Create and display the second scatter plot
df_subset.plot(kind = 'scatter',
               x = 'initial_cost',
               y = 'total_est_fee',
               rot = 70)
plt.show()
Tidy Data
Principles of tidy data:
- Columns represent separate variables
- Rows represent individual observations
- Each type of observational unit forms its own table
There are data formats that are better for reporting and formats that are better for analysis.
Melting Data
Here, we fix the problem of column names containing values instead of variable names with pd.melt(). In melting, we turn columns into rows.

pd.melt(frame = df,
        id_vars = 'name', #columns that you don't want to melt
        value_vars = ['treatment a', 'treatment b'], #if not specified, melt will use all the columns except the `id_vars`
        var_name = 'treatment', #column name for the melted value_vars
        value_name = 'result') #column name for the values
Pivoting Data
In contrast, we can use the pivot() method to turn unique values into separate columns, especially when multiple variables are stored in the same column.

weather_tidy = weather.pivot(index = 'date', #column that you don't want to pivot
                             columns = 'element', #pivoting column
                             values = 'value')
Sometimes there are duplicate entries with different values, which will lead to an error when pivoting. Here, we use the pivot_table() method instead, which has a parameter that specifies how to deal with duplicate values, e.g. aggregating them by taking their average.

import numpy as np
weather_tidy = weather.pivot_table(index = 'date', #column that you don't want to pivot
                                   columns = 'element', #pivoting column
                                   values = 'value',
                                   aggfunc = np.mean) #the function used to aggregate duplicates
To deal with a hierarchical index (MultiIndex) in a DataFrame (it lets you group columns or rows by another variable), we can use the .reset_index() method to flatten it back into a regular index.
airquality_pivot = airquality_pivot.reset_index()
Parsing Out Data into Separate Columns
Parsing out multiple values in one string into separate columns:
df['sex'] = df.variable.str[0] #extract the first character of the variable's value to form a new column 'sex'
If there is a delimiter between the values that we want, we can use .str.split() to split each string into a list and then assign the pieces to different columns with .str.get().

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars = ['Date', 'Day'], var_name = 'type_country', value_name='counts')
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
# Print the head of ebola_melt
print(ebola_melt.head())
Combining Data for Analysis
Concatenating data
Since data may not always come in one huge file, we sometimes need to combine multiple files and then clean the data, or vice versa.
We can easily concatenate DataFrames with pd.concat(). If we don't specify ignore_index = True, the row indexes remain the same as they were in the separate DataFrames, which will affect future slicing by index with .loc.

# Combining rows
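# a minimal sketch (assumes two DataFrames, df1 and df2, with the same columns)
row_concat = pd.concat([df1, df2]) #the original row indexes are kept
row_concat = pd.concat([df1, df2], ignore_index = True) #rebuild a clean 0..n-1 index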
Globbing
In order to concatenate DataFrames:
- They must be in a list
- They can be loaded individually if there are only a few datasets
When there are too many files to concatenate, we can use the glob function to find files based on a pattern. Globbing is a simple way for Python to do pattern matching on file names. We can use wildcards like * and ? to specify the file name pattern we are looking for.
A wildcard is a symbol that matches an arbitrary number of characters.
- * matches any string, e.g. *.csv matches any CSV file.
- ? matches exactly one character, e.g. file_?.csv matches file_a.csv, file_b.csv, and so on.
Then globbing will return a list of file names, which can be used to load files into separate DataFrames.
import glob
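# a minimal sketch (assumes the matching CSV files all share the same columns)
import pandas as pd

csv_files = glob.glob('*.csv') #list of file names matching the pattern
frames = [pd.read_csv(f) for f in csv_files] #load each file into its own DataFrame
combined = pd.concat(frames, ignore_index = True)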
Merging data
Concatenation simply stacks DataFrames, so it only works well when the row or column order is the same across them. When the order differs, we need to combine the data by merging, which is very similar to joining tables in SQL. Merging combines disparate datasets based on common key columns.

import pandas as pd
pd.merge(left = state_populations, #left df
         right = state_codes, #right df
         on = None, #specify this when the key column has the same name in both dfs
         left_on = 'state', #left df's key column name
         right_on = 'name') #right df's key column name
Types of merges
- one-to-one merge: there are no duplicate values in the key column
- one-to-many/many-to-one merge: there are duplicate values in the key column
- many-to-many: when both DataFrames do not have unique keys for a merge. What happens here is that for each duplicated key, every pairwise combination will be created.
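For example, a minimal many-to-many sketch with two small, hypothetical DataFrames:

left = pd.DataFrame({'key': ['a', 'a'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'a'], 'y': [3, 4]})
print(pd.merge(left, right, on = 'key')) #4 rows: every pairwise combination of the duplicated key 'a'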
Cleaning Data for Analysis
Data Type
The DataFrame attribute df.dtypes or the method df.info() can be called to check each column's data type. We can use the .astype() method to convert one data type to another.

df['treatment b'] = df['treatment b'].astype(str) #to string
df['sex'] = df['sex'].astype('category') #categorical variable
df.sex = df.sex.astype('category') #same as above
df['treatment a'] = pd.to_numeric(df['treatment a'], errors = 'coerce') #to numeric; coerce non-convertible values to NaN
Categorical dtype has several advantages:
- it makes the DataFrame smaller in memory (see the sketch below)
- it makes the column usable by other Python libraries for analysis
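A quick way to see the memory saving, sketched on a hypothetical 'sex' column:

print(df['sex'].memory_usage(deep = True)) #bytes used by the object-dtype column
df['sex'] = df['sex'].astype('category')
print(df['sex'].memory_usage(deep = True)) #usually much smaller after conversion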
Regular Expression to Match Strings
Match patterns with regular expressions from the re library:
- \d* matches any number with any digits, e.g. 17
- \$\d* escapes the dollar sign with \ (the dollar sign matches the end of a string if not escaped), e.g. $17
- \$\d*\.\d* escapes the dollar sign and the dot (the dot matches any one character if not escaped), e.g. $17.00 or $17.000
- \$\d*\.\d{2} allows only two digits after the decimal point, e.g. $17.89
- ^\$\d*\.\d{2}$ anchors the pattern to start at the beginning of the value and end at the end of the value, so only exact matches pass; otherwise a string like "I have $17.89 USD" would also be matched
- \d{3}-\d{3}-\d{4} -> '123-456-7890'
- [A-Z]\w* -> 'Australia' ([A-Z] matches any one capital letter, \w* matches an arbitrary number of alphanumeric characters)
import re
pattern = re.compile('\$\d*\.\d{2}') #save the pattern
# Pass the string we want to match, and this will return a match object
result = pattern.match('$17.89')
# Convert the match object into a boolean, which returns True
bool(result)
# Find the numeric values: matches
matches = re.findall('\d+', 'the recipe calls for 10 strawberries and 1 banana') #`+` is used so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.
# Print the matches
print(matches) #['10', '1']
Using Functions
The apply() method can be used to apply a function across rows or columns of a DataFrame.

import numpy as np
df.apply(np.mean, axis = 0) #apply the mean function down each column
df.apply(np.mean, axis = 1) #apply the mean function across each row
To use apply() in an actual case:

import re
from numpy import NaN

pattern = re.compile('^\$\d*\.\d{2}$')

def diff_money(row, pattern):
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']
    if bool(pattern.match(icost)) and bool(pattern.match(tef)):
        icost = float(icost.replace('$', ''))
        tef = float(tef.replace('$', ''))
        return icost - tef
    else:
        return NaN #or return np.nan if numpy is imported as np

df_subset['diff'] = df_subset.apply(diff_money, axis = 1, pattern = pattern) #replaces the need for a for loop; each row of df_subset is passed to diff_money as `row`
Using a lambda function to simplify the process:

# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', '')) #delete the dollar sign
# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall('\d+\.\d+', x)[0]) #extract only the number
# Print the head of tips
print(tips.head()) #the two approaches above yield the same result
Duplicate and Missing Data
For duplicate rows, the .drop_duplicates() method can easily delete them.
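A minimal sketch on a hypothetical df:

df = df.drop_duplicates() #keeps only the first occurrence of each duplicated row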
For missing data, .info() can be used to check for the existence of missing values in each column of a DataFrame. There are several ways to deal with them:
- Leave them as they are.
- Drop them: .dropna() can be used to drop all rows that contain a missing value.
- Fill them in: we can use .fillna() to fill in the value we want. We can also fill missing values with a summary statistic, such as the median (good in the presence of outliers) or the mean.
# Fill the missing value with the string 'missing'
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
# Fill the missing value in two columns with 0
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
# Fill the missing value with the mean of the column
tips_nan['tip'] = tips_nan['tip'].fillna(tips_nan['tip'].mean())
Testing with Assert
We can use assert statements to programmatically check our data. For example, after we drop or fill the NaNs, we expect no missing values; we can write an assert statement to verify this. In this way, we can catch errors early, before conducting analysis on our data.

assert 1 == 1 #returns nothing since it's true
assert 1 == 2 #returns an AssertionError
#`.notnull()` checks whether each value is non-null; `.all()` requires it to be True for every value in the 'Close' column
assert google.Close.notnull().all()
# Assert that there are no missing values
assert pd.notnull(ebola).all().all() #The first `.all()` method will return a True or False for each column, while the second `.all()` method will return a single True or False.
# Assert that all values are >= 0
assert (ebola >= 0).all().all()
Case Study
Gapminder data: consists of life expectancy by country and year.
Let's first sum up what we have learnt so far in this course:

# Useful methods
import pandas as pd
df = pd.read_csv('my_data.csv')
df.head()
df.info()
df.columns
df.describe()
df.column.value_counts()
df.column.plot('hist')

# For data quality
def cleaning_function(row_data):
    #data cleaning steps
    return ...
df.apply(cleaning_function, axis = 1)
assert (df.column_data > 0).all()

# Combining data
pd.merge(df1, df2, ...)
pd.concat([df1, df2, df3, ...])

# Data types
df.dtypes
df['column'] = pd.to_numeric(df['column'])
df['column'] = df['column'].astype(str)
df['new_column'] = df['column_1'] + df['column_2']
df['new_column'] = df.apply(my_function, axis = 1) #apply my_function to each row

# Save the file
df.to_csv('file_name.csv')
Exploratory Analysis
# First touch
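# a sketch of a first look at the 19th-century data
# (assumption: the file name g1800s.csv and the DataFrame name g1800s are illustrative)
import pandas as pd
import matplotlib.pyplot as plt

g1800s = pd.read_csv('g1800s.csv')
g1800s.info()

# Scatter plot of life expectancy in 1800 vs 1899; countries on the diagonal did not change
g1800s.plot(kind = 'scatter', x = '1800', y = '1899')
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')
plt.show()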
There are a surprising number of countries that fall on the diagonal line in the scatter plot. In fact, examining the DataFrame reveals that the life expectancy for 140 of the 260 countries did not change at all in the 19th century! This is possibly a result of not having access to the data for all the years back then. In this way, visualizing your data can help you uncover insights as well as diagnose it for errors.
Checking Assumptions
Since you are given life expectancy level data by country and year, you could ask questions about how much the average life expectancy changes over each year.
Before continuing, however, it’s important to make sure that the following assumptions about the data are true:
- ‘Life expectancy’ is the first column (index 0) of the DataFrame.
- The other columns contain either null or numeric values.
- The numeric values are all greater than or equal to 0.
- There is only one instance of each country.
def check_null_or_valid(row_data):
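    # a sketch of a possible body: drop NAs, coerce to numeric, check that values are >= 0
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    return numeric >= 0

# assumption: the 19th-century data is loaded as g1800s
assert g1800s.columns[0] == 'Life expectancy' #first column is 'Life expectancy'
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis = 1).all().all() #values are null or numeric and >= 0
assert g1800s['Life expectancy'].value_counts().max() == 1 #only one instance of each country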
Assembling the data
# Concatenate the DataFrames row-wise
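# assumption: g1800s, g1900s, and g2000s each hold one century of year columns plus 'Life expectancy'
gapminder = pd.concat([g1800s, g1900s, g2000s]) #years missing from a given century become NaN
print(gapminder.shape)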
All the Gapminder data, from 1800 to 2016, is now contained in one DataFrame.
Reshaping the data
# Melt gapminder: gapminder_melt
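# assumption: the 'Life expectancy' column holds the country names and the remaining columns are years
gapminder_melt = pd.melt(gapminder, id_vars = 'Life expectancy',
                         var_name = 'year', value_name = 'life_expectancy')
gapminder_melt = gapminder_melt.rename(columns = {'Life expectancy': 'country'})
gapminder_melt['year'] = pd.to_numeric(gapminder_melt['year']) #convert the year column to numeric
print(gapminder_melt.head())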
Data correctness
Having tidied your DataFrame and checked the data types, your next task in the data cleaning process is to look at the 'country'
column to see if there are any special or invalid characters you may need to deal with.
It is reasonable to assume that country names will contain:
- The set of lower and upper case letters.
- Whitespace between words.
- Periods for any abbreviations.
To confirm that this is the case, you can leverage the power of regular expressions again. For common operations like this, pandas has a built-in string method - str.contains() - which takes a regular expression pattern and applies it to the Series, returning True if there is a match and False otherwise.
Since here you want to find the values that do not match, you have to invert the Boolean, which can be done with ~. This Boolean Series can then be used to get the Series of countries that have invalid names.
Write a regular expression that tests your assumptions of what characters belong in countries:
- Anchor the pattern to match exactly what you want by placing ^ at the beginning and $ at the end.
- Use A-Za-z to match the set of lower and upper case letters, \. to match periods, and \s to match whitespace between words.
# Create the series of countries: countries
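# a sketch following the steps above (assumes the tidy DataFrame gapminder_melt)
countries = gapminder_melt['country'].drop_duplicates()

# Write the regular expression and find countries that do NOT match it
pattern = '^[A-Za-z\.\s]*$'
mask = countries.str.contains(pattern)
mask_inverse = ~mask
invalid_countries = countries[mask_inverse]
print(invalid_countries)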
Data aggregation and visualization
# Add first subplot
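# a sketch (assumes the tidy gapminder_melt with numeric 'year' and 'life_expectancy' columns)
gapminder_melt = gapminder_melt.dropna() #drop rows with missing life expectancy before plotting
plt.subplot(2, 1, 1)
gapminder_melt.life_expectancy.plot(kind = 'hist')

# Group by year and compute the mean life expectancy, then plot it in the second subplot
gapminder_agg = gapminder_melt.groupby('year')['life_expectancy'].mean()
plt.subplot(2, 1, 2)
gapminder_agg.plot()
plt.title('Life expectancy over the years')
plt.xlabel('Year')
plt.ylabel('Life expectancy')

plt.tight_layout()
plt.show()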
Looking at the line plot, it seems like life expectancy has, as expected, increased over the years. There is a surprising dip around 1920 that may be worth further investigation!