DataCamp - Cleaning Data in Python

Datacamp course notes on data cleaning.

Common Data Problems

  • Inconsistent column names (capitalization)
  • Missing data
  • Outliers
  • Duplicate rows (can bias analysis and should be dropped)
  • Untidy data
  • Need to process data across columns
  • Column types can signal unexpected data values

import pandas as pd
df.head()
df.tail()
df.columns  # returns column names
df.shape    # returns dimension
df.info()   # additional info about the df

Exploratory Data Analysis

Frequency count

value_counts() is a method that counts the occurrences of each unique value in a column and returns them in descending order.

df.info()  # shows data types, non-null counts, and column names

# Before this step, make sure there are no special characters in the column names (here 'continent' is the column name).
df.continent.value_counts(dropna = False)  # dropna = False also counts the number of NaN values, if any

df['continent'].value_counts(dropna = False)  # equivalent bracket notation

Summary Statistics

On numeric data, we can use df.describe() to easily get the count, mean, std, and the five-number summary (min, 25%, 50%, 75%, max).

When the difference between max and min is rather large, it’s good to plot the data on a log scale.
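
A minimal sketch, continuing with the DataFrame df and the numeric 'population' column used in the examples below (assumed here for illustration):

import matplotlib.pyplot as plt

print(df['population'].describe())  # count, mean, std, min, 25%, 50%, 75%, max

# When max is orders of magnitude larger than min, a log scale is easier to read
df['population'].plot(kind = 'hist', logx = True)
plt.show()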

Visual EDA

Visualizations can easily reveal abnormalities in the data.
To look at the frequencies:

  • Boxplots to visualize summary statistics of a numeric column grouped by a discrete variable
    # x = continent, y = population
    df.boxplot(column = 'population',
               by = 'continent',
               rot = 90)  # rotate the x labels by 90 degrees
    plt.show()

Outliers are the points shown above or below the whiskers.

  • Histograms for continuous data counts

import matplotlib.pyplot as plt
df.population.plot('hist')
plt.show()

# customizing
df['Existing Zoning Sqft'].plot(kind = 'hist',
                                rot = 70,     # rotate the x labels by 70 degrees
                                logx = True,  # log x axis
                                logy = True)  # log y axis
plt.show()

# slice out the outliers
df[df.population > 1000000000]

Notice that not all outliers are errors; some may be valid values.

Scatter plots are used to capture the relationship between 2 numeric variables and flag potentially bad data that cannot be found by looking at 1 variable.

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create and display the first scatter plot
df.plot(kind = 'scatter',
        x = 'initial_cost',
        y = 'total_est_fee',
        rot = 70)
plt.show()

# Create and display the second scatter plot
df_subset.plot(kind = 'scatter',
               x = 'initial_cost',
               y = 'total_est_fee',
               rot = 70)
plt.show()

Tidy Data

Principles of tidy data:

  1. Columns represent separate variables
  2. Rows represent individual observations
  3. Observational units form tables

There are data formats that are better for reporting and formats that are better for analysis.
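
As a minimal sketch (with hypothetical treatment data), here is the same information in a wide, report-friendly layout versus a long, analysis-friendly layout:

import pandas as pd

# Wide / report-friendly: the treatment labels live in the column names
wide = pd.DataFrame({'name': ['Daniel', 'John'],
                     'treatment a': [4, 12],
                     'treatment b': [42, 31]})

# Long / analysis-friendly (tidy): each variable is a column, each observation is a row
long = pd.DataFrame({'name': ['Daniel', 'John', 'Daniel', 'John'],
                     'treatment': ['a', 'a', 'b', 'b'],
                     'result': [4, 12, 42, 31]})

pd.melt(), introduced below, converts the first layout into the second.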

Melting Data

Here, we fix the problem of column names containing values instead of variable names by using pd.melt(). In melting, we turn columns into rows.

pd.melt(frame = df,
        id_vars = 'name',  # columns that you don't want to melt
        value_vars = ['treatment a', 'treatment b'],  # if not specified, melt uses all columns except the `id_vars`
        var_name = 'treatment',  # column name for the melted variable names
        value_name = 'result')   # column name for the melted values

Pivoting Data

In contrast, we can use the .pivot() method to turn unique values into separate columns, which is especially useful when multiple variables are stored in the same column.

weather_tidy = weather.pivot(index = 'date',      # column that you don't want to pivot
                             columns = 'element', # pivoting column
                             values = 'value')

Sometimes there are duplicate entries with different values, which will lead to an error when pivoting. In that case we use the .pivot_table() method, which has a parameter specifying how to deal with duplicate values, e.g. aggregating them by taking their mean.

import numpy as np

weather_tidy = weather.pivot_table(index = 'date',      # column that you don't want to pivot
                                   columns = 'element', # pivoting column
                                   values = 'value',
                                   aggfunc = np.mean)   # the function used to aggregate duplicates

To deal with a hierarchical index in a DataFrame (hierarchical indexes allow you to group columns or rows by another variable), we can use the .reset_index() method to flatten it back into ordinary columns.

airquality_pivot = airquality_pivot.reset_index()

Parsing Out Data into Separate Columns

Parsing out multiple values in one string into separate columns:

df['sex'] = df.variable.str[0] #extract the first character of the variable's value to form a new column 'sex'

If there is a delimiter between the values that we want, we can use .str.split() to split them into lists and then assign the pieces to different columns with .str.get().

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars = ['Date', 'Day'], var_name = 'type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

# Print the head of ebola_melt
print(ebola_melt.head())

Combining Data for Analysis

Concatenating data

Since data may not always come in 1 huge file, we sometimes need to combine them and then clean the data, or vice versa.

We can easily concatenate DataFrames with pd.concat(). If we don't specify ignore_index = True, the row indices remain the same as they were in the separate DataFrames, which will affect future index-based slicing with .loc.

# Combining rows
concatenated = pd.concat([weather_p1, weather_p2], ignore_index = True)

# Combining columns
ebola_tidy = pd.concat([ebola_melt, status_country], axis = 1)  # the default axis = 0 concatenates rows; axis = 1 concatenates columns

Globbing

In order to concatenate DataFrames:

  • They must be in a list
  • They can be loaded individually if there are only a few datasets

When there are too many files to concatenate, we can use the glob() function from the glob module to find files based on a pattern. Globbing is a simple way for Python to do pattern matching on file names. We can use wildcards like * and ? to specify the file name pattern we are looking for.

A wildcard is a symbol that stands in for unknown characters in a file name pattern.

  • * matches any string, e.g. *.csv matches any CSV file
  • ? matches exactly one character, e.g. file_?.csv matches file_a.csv, file_b.csv, and so on

Then globbing will return a list of file names, which can be used to load files into separate DataFrames.

import glob
import pandas as pd

csv_files = glob.glob('*.csv')  # returns a list of file names
list_data = []
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)  # builds a list of DataFrames

pd.concat(list_data)  # concatenate the list of DataFrames into a single DataFrame

Merging data

Concatenation only works when the row or column order is the same across DataFrames. When the order differs, we need to combine the data by merging, which is very similar to joining tables in SQL. Merging combines disparate datasets based on common key columns.

import pandas as pd

pd.merge(left = state_populations,  # left df
         right = state_codes,       # right df
         on = None,                 # use `on` when the key column name is the same in both dfs
         left_on = 'state',         # left df's key column name
         right_on = 'name')         # right df's key column name

Types of merges

  1. one-to-one merge: there are no duplicate values in the key column of either DataFrame
  2. one-to-many / many-to-one merge: one DataFrame has duplicate values in the key column
  3. many-to-many merge: both DataFrames have duplicate keys. For each duplicated key, every pairwise combination of rows is created (see the sketch below).
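
A minimal sketch with two toy DataFrames (hypothetical data, not from the course) showing how duplicate keys multiply rows in a many-to-many merge:

import pandas as pd

# Both key columns contain the duplicated key 'DR-1'
site = pd.DataFrame({'name': ['DR-1', 'DR-1', 'DR-3'],
                     'lat': [-49.85, -49.85, -47.15]})
visited = pd.DataFrame({'name': ['DR-1', 'DR-1', 'DR-3'],
                        'date': ['1927-02-08', '1927-02-10', '1939-01-07']})

m = pd.merge(left = site, right = visited, on = 'name')
print(m.shape)  # (5, 3): the two 'DR-1' rows pair up 2 x 2 = 4 times, plus one 'DR-3' row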

Cleaning Data for Analysis

Data Type

A DataFrame's dtypes attribute or its .info() method can be called to check each column's data type. We can use the .astype() method to convert one data type to another.

df['treatment b'] = df['treatment b'].astype(str)  # to string

df['sex'] = df['sex'].astype('category')  # categorical variable
df.sex = df.sex.astype('category')        # same as above

df['treatment a'] = pd.to_numeric(df['treatment a'], errors = 'coerce')  # to numeric; coerce non-convertible values into NaN

Categorical dtype has several advantages:

  1. It makes the DataFrame smaller in memory (see the sketch below).
  2. It can be utilized by other Python libraries for analysis.
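
A minimal sketch of the memory benefit, using a hypothetical DataFrame with a highly repetitive string column:

import pandas as pd

df = pd.DataFrame({'sex': ['Male', 'Female'] * 50000})
before = df['sex'].memory_usage(deep = True)

df['sex'] = df['sex'].astype('category')
after = df['sex'].memory_usage(deep = True)

print(before, after)  # the categorical column uses far less memory than the string column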

Regular Expression to Match Strings

Match pattern by regular expressions from re library:

  1. \d* matches any number with any number of digits, e.g. 17
  2. \$\d* -> escape the dollar sign with \. The dollar sign matches the end of a string if not escaped. e.g. $17
  3. \$\d*\.\d* escape the dollar sign and the dot. The dot . matches any one character if not escaped. e.g. $17.00 or $17.000
  4. \$\d*\.\d{2} only two digits after the decimal point. e.g. $17.89
  5. ^\$\d*\.\d{2}$ anchors the match at the beginning and the end of the value. This way, it will only do exact matching; otherwise, a string like “I have 17.89 USD” would also be matched.
  6. \d{3}-\d{3}-\d{4} -> ‘123-456-7890’
  7. [A-Z]\w* -> ‘Australia’. [A-Z] matches any one capital letter; \w* matches an arbitrary number of alphanumeric characters.
    import re
    pattern = re.compile('\$\d*\.\d{2}')  # save the pattern

    # Pass the string we want to match; this returns a match object
    result = pattern.match('$17.89')

    # Convert the match object into a boolean, which returns True
    bool(result)

    # Find the numeric values: matches
    matches = re.findall('\d+', 'the recipe calls for 10 strawberries and 1 banana')
    # `+` matches the previous element one or more times, so 10 is treated as one number rather than as 1 and 0

    # Print the matches
    print(matches)  # ['10', '1']

Using Functions

The .apply() method can be used to apply a function across the rows or columns of a DataFrame.

df.apply(np.mean, axis = 0)  # apply the mean function to each column
df.apply(np.mean, axis = 1)  # apply the mean function to each row

To utilize apply() in an actual case:

import re
from numpy import NaN

pattern = re.compile('^\$\d*\.\d{2}$')

def diff_money(row, pattern):
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']

    if bool(pattern.match(icost)) and bool(pattern.match(tef)):
        icost = icost.replace('$', '')
        tef = tef.replace('$', '')
        icost = float(icost)
        tef = float(tef)
        return icost - tef
    else:
        return NaN  # or return np.nan if only numpy is imported

df_subset['diff'] = df_subset.apply(diff_money, axis = 1, pattern = pattern)  # replaces the need for a for loop; each row of df_subset is passed in as `row`

Using lambda function to simplify the process:

# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))  # delete the dollar sign

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall('\d+\.\d+', x)[0])  # extract only the number

# Print the head of tips
print(tips.head())  # the two approaches above yield the same results

Duplicate and Missing Data

Duplicate rows can easily be deleted with the .drop_duplicates() method.
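
A minimal sketch, using a hypothetical DataFrame with one fully duplicated row:

import pandas as pd

df = pd.DataFrame({'name': ['Jane', 'Jane', 'John'],
                   'treatment': ['a', 'a', 'b']})

df = df.drop_duplicates()  # keeps only the first occurrence of each duplicated row
print(df.shape)  # (2, 2)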

For missing data, .info() can be used to check for the existence of missing values in each column in a dataframe. There are several ways to deal with them:

  1. Leave them as they were
  2. Drop them. .dropna() can be used to drop all rows that contain a missing value.
  3. Fill in the missing values. We can use .fillna() to fill in the value we want. We can also fill the missing values with a summary statistic, such as the median (good in the presence of outliers) or the mean.
    # Fill the missing values with the string 'missing'
    tips_nan['sex'] = tips_nan['sex'].fillna('missing')

    # Fill the missing values in two columns with 0
    tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)

    # Fill the missing values with the mean of the column
    tips_nan['tip'] = tips_nan['tip'].fillna(tips_nan['tip'].mean())

Testing with Assert

We can use assert statement to programmatically check our data. For example, after we drop or fill the NaNs, we expect no missing values. We can write an assert statement to verify this. In this way, we can detect early warnings and errors before conducting analysis on our data.

assert 1 == 1  # passes silently since the condition is true
assert 1 == 2  # raises an AssertionError

# `.notnull()` returns a Boolean for each value in the 'Close' column; `.all()` returns True only if every value is True
assert google.Close.notnull().all()

# Assert that there are no missing values
assert pd.notnull(ebola).all().all()  # the first `.all()` returns a True or False for each column; the second `.all()` reduces those to a single True or False

# Assert that all values are >= 0
assert (ebola >= 0).all().all()

Case Study

The Gapminder data consists of life expectancy by country and year.
Let's first sum up what we have learned so far in this course:

# Useful methods
import pandas as pd
df = pd.read_csv('my_data.csv')
df.head()
df.info()
df.columns
df.describe()
df.column.value_counts()
df.column.plot('hist')

# For data quality
def cleaning_function(row_data):
    # data cleaning steps
    return ...
df.apply(cleaning_function, axis = 1)
assert (df.column_data > 0).all()

# Combining data
pd.merge(df1, df2, ...)
pd.concat([df1, df2, df3, ...])

# Data types
df.dtypes
df['column'] = pd.to_numeric(df['column'])
df['column'] = df['column'].astype(str)
df['new_column'] = df['column_1'] + df['column_2']
df['new_column'] = df.apply(my_function, axis = 1)  # row-wise

# Save the file
df.to_csv('file_name.csv')

Exploratory Analysis

# First touch
g1800s.info()
g1800s.head()
g1800s.describe() #summary statistics column wise

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Create the scatter plot
g1800s.plot(kind = 'scatter', x = '1800', y = '1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()

There are a surprising number of countries that fall on the diagonal line in the scatter plot. In fact, examining the DataFrame reveals that the life expectancy for 140 of the 260 countries did not change at all in the 19th century! This is possibly a result of not having access to the data for all the years back then. In this way, visualizing your data can help you uncover insights as well as diagnose it for errors.

Checking Assumptions

Since you are given life expectancy level data by country and year, you could ask questions about how much the average life expectancy changes over each year.

Before continuing, however, it’s important to make sure that the following assumptions about the data are true:

  • ‘Life expectancy’ is the first column (index 0) of the DataFrame.
  • The other columns contain either null or numeric values.
  • The numeric values are all greater than or equal to 0.
  • There is only one instance of each country.

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()       # drop the missing values in the row
    numeric = pd.to_numeric(no_na)  # convert the remaining values to numeric
    ge0 = numeric >= 0              # confirm all values are greater than or equal to 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'

# Check whether the values in each row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis = 1).all().all()

# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts()[0] == 1  # value_counts() counts each unique value and sorts the counts from high to low, so if the first count is 1 there are no duplicates in the column

Assembling the data

# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder.shape)

# Print the head of gapminder
print(gapminder.head())

All the Gapminder data, from 1800 to 2016, is now contained in one DataFrame.

Reshaping the data

# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(frame = gapminder,
                         id_vars = 'Life expectancy')

# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']

# Print the head of gapminder_melt
print(gapminder_melt.head())

# Continue working with the melted DataFrame under the name `gapminder`
gapminder = gapminder_melt

# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder.year)

# Test if country is of type object
assert gapminder.country.dtypes == np.object

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

Data correctness

Having tidied your DataFrame and checked the data types, your next task in the data cleaning process is to look at the 'country' column to see if there are any special or invalid characters you may need to deal with.

It is reasonable to assume that country names will contain:

  • The set of lower and upper case letters.
  • Whitespace between words.
  • Periods for any abbreviations.

To confirm that this is the case, you can leverage the power of regular expressions again. For common operations like this, pandas provides a string method, .str.contains(), which takes a regular expression pattern, applies it to a Series, and returns True if there is a match and False otherwise.

Since here you want to find the values that do not match, you have to invert the boolean, which can be done using ~. This Boolean series can then be used to get the Series of countries that have invalid names.

Write a regular expression that tests your assumptions of what characters belong in countries:

  • Anchor the pattern to match exactly what you want by placing a ^ in the beginning and $ in the end.
  • Use A-Za-z to match the set of lower and upper case letters, \. to match periods, and \s to match whitespace between words.

# Create the series of countries: countries
countries = gapminder.country

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
pattern = '^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)

# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()

# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

# Drop the missing values
gapminder = gapminder.dropna(axis = 0, how = 'any') #default setting

# Print the shape of gapminder
print(gapminder.shape)

Data aggregation and visualization

# Add first subplot
plt.subplot(2, 1, 1)

# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind = 'hist')

# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()

# Print the head of gapminder_agg
print(gapminder_agg.head())

# Print the tail of gapminder_agg
print(gapminder_agg.tail())

# Add second subplot
plt.subplot(2, 1, 2)

# Create a line plot of life expectancy per year
gapminder_agg.plot(kind = 'line') #default setting

# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
plt.show()

# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')

Looking at the line plot, it seems like life expectancy has, as expected, increased over the years. There is a surprising dip around 1920 that may be worth further investigation!