DataCamp - Pandas Foundation 2

Datacamp course notes on pandas including extracting and transforming data, advanced indexing, rearranging and reshaping data, and grouping data.

Extracting and Transforming Data

Indexing DataFrames

Find the row position and column position of a certain (x, y)

import numpy as np
# Assign the row position of election.loc['Bedford']: x
x = np.arange(election.iloc[:, 0].count())[election.index == 'Bedford'][0]

# Assign the column position of election['winner']: y
y = np.arange(election.iloc[0, :].count())[election.columns == 'winner'][0]

# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner']) #returns True

The difference between df['eggs'] and df[['eggs']] is that the former yields a pandas Series, while the latter yields a DataFrame.
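
A minimal sketch (with a small made-up df) illustrating the difference:

import pandas as pd

df = pd.DataFrame({'eggs': [47, 110], 'salt': [12, 50]})
print(type(df['eggs']))   # <class 'pandas.core.series.Series'>
print(type(df[['eggs']])) # <class 'pandas.core.frame.DataFrame'>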

To slice the row labels in reverse order:

# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter', :]

# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1, :]

Filtering DataFrames

& and | can be used to combine boolean conditions for filtering (wrap each condition in parentheses).
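
A small sketch (assuming the election DataFrame also has a 'margin' column, as in the course) combining two conditions:

# Rows with high turnout but a narrow margin of victory
high_turnout_narrow_margin = election[(election.turnout > 70) & (election.margin < 1)]
print(high_turnout_narrow_margin.head())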

Select non-zero entries:

  • df.loc[:, df.all()] selects the columns with no zero entries (i.e., excludes any column containing a zero)
  • df.loc[:, df.any()] selects the columns with at least one non-zero entry (i.e., excludes all-zero columns)

Select non-null entries:

  • df.loc[:, df.isnull().any()] selects the columns that have at least one NaN value
  • df.loc[:, df.notnull().all()] selects the columns that have no NaN values (see the sketch below)
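
A minimal sketch (with a small made-up df) of the zero/NaN based column selections above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'eggs': [47, 0, 110],
                   'salt': [np.nan, 50.0, 12.0],
                   'spam': [0, 0, 0]})

print(df.loc[:, df.all()])           # keeps only 'salt' (the only column with no zero entries)
print(df.loc[:, df.any()])           # drops only 'spam' (the all-zero column)
print(df.loc[:, df.isnull().any()])  # keeps only 'salt' (the only column with a NaN)
print(df.loc[:, df.notnull().all()]) # keeps 'eggs' and 'spam' (the columns with no NaNs)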

Drop rows with any NaNs:

  • df.dropna(how = 'any') drops all the rows that have at least one NaN value
  • df.dropna(how = 'all') drops all the rows that have only NaN values

Drop columns with NaNs:

  • titanic.dropna(thresh = 1000, axis = 'columns') drops the columns that have fewer than 1000 non-missing values; thresh sets the minimum number of non-NaN values a column needs in order to be kept (see the sketch below)
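
A quick sketch (with a tiny made-up df) of the .dropna() variants; here thresh = 2 requires at least 2 non-NaN values per column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 5.0],
                   'C': [np.nan, 2.0, 3.0]})

print(df.dropna(how = 'any'))                  # keeps only the last row (the only one without NaNs)
print(df.dropna(how = 'all'))                  # drops nothing (no row is entirely NaN)
print(df.dropna(thresh = 2, axis = 'columns')) # drops 'B' (only 1 non-NaN value, below thresh)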

Modifying a column based on another
df.eggs[df.salt > 55] += 5 adds 5 to the value in column 'eggs' wherever the value in column 'salt' is greater than 55.

Transforming DataFrames

When transforming a DataFrame, we should first try to achieve the desired result with vectorized functions, since they are much faster than for-loops. Note that .apply() and .map() also perform Python for-loops over the data stored in a pandas Series or DataFrame.

To convert sales numbers into dozens unit:

  • df.floordiv(12) applies to every entry in the DataFrame
  • np.floor_divide(df, 12) does the same thing, but comes from the NumPy package

Or we can write our own function to do the same:

def dozens(n):
    return n // 12 #floor divide

df.apply(dozens)

#or use a lambda function
df.apply(lambda n: n // 12)

To transform strings:

  • .str.upper()

To transform an index, there is no .apply() method, but there is a .map() method:

  • df.index.str.upper() and df.index.map(str.upper) are equivalent

The .map() method is used to transform values according to a Python dictionary look-up. The example below uses a dictionary to fill a new column based on the values of the existing 'winner' column.

# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue', 'Romney':'red'}

# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election.winner.map(red_vs_blue)

# Print the output of election.head()
print(election.head())

To transform the data into zscore:

# Import zscore from scipy.stats
from scipy.stats import zscore

# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election['turnout'])

# Print the type of turnout_zscore
print(type(turnout_zscore)) #1-d numpy array

# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore

If we cannot perform the desired operations with vectorized functions, we can always fall back on for-loops.


Advanced Indexing

Pandas Data Structures

  • Key building blocks:

    1. Indexes: sequence of labels. An Index itself also has a name attribute, which can be accessed and modified through .index.name

      # Assign the string 'MONTHS' to sales.index.name
      sales.index.name = 'MONTHS'

      # Assign the string 'PRODUCTS' to sales.columns.name
      sales.columns.name = 'PRODUCTS'
    2. Series: 1D array with index

    3. DataFrames: 2D array with Series as columns
  • Indexes:
    • Immutable (like dictionary keys). Note: an index can only be modified by replacing it all at once, not element by element (see the sketch after this list)
    • Homogeneous in data type (like NumPy arrays)
      # Create the list of new indexes: new_idx
      new_idx = [i.upper() for i in sales.index]

      # Assign new_idx to sales.index
      sales.index = new_idx
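
A minimal sketch showing that modifying a single index label raises an error, whereas the wholesale replacement above is allowed:

try:
    sales.index[0] = 'JANUARY'
except TypeError as e:
    print(e) # Index does not support mutable operations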

Hierarchical indexing

We can use two index columns 'A' and 'B' together to uniquely identify each row with .set_index(['A', 'B']), and then sort the index with .sort_index(). Sorting the index is very useful for indexing and slicing, as shown in the following example.

# Creating hierarchical index
stocks = stocks.set_index(['Symbol', 'Date']) #hierarchical order
print(stocks.index.name) #returns None
print(stocks.index.names) #returns ['Symbol', 'Date']

# Sorting index
stocks = stocks.sort_index()

# Indexing
stocks.loc[('CSCO', '2016-10-04')]
stocks.loc[('CSCO', '2016-10-04'), 'Volume']

# Slicing (Outermost index)
stocks.loc['AAPL']
stocks.loc['CSCO':'MSFT']
stocks.loc[(slice('CSCO', 'MSFT'), '2016-10-05'), :]
stocks.loc[(slice('CSCO', 'MSFT'), '2016-10-05'), 'Close']

# Slicing (Innermost index)
stocks.loc[('CSCO', ['2016-10-05', '2016-10-03']), :]

# Slicing (Both indexes)
stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :] #select all symbols for the two days
# We need slice() here because a bare ':' is not valid Python syntax inside a tuple; slice(None) selects all labels at that level.


Rearranging and Reshaping Data

Pivoting DataFrames

Pivoting can turn cell values into column and row labels with the .pivot() method.

df.pivot(index = 'A',   #column 'A' for row index labels
         columns = 'B', #column 'B' for column labels
         values = 'C')  #column 'C' for values. If we leave out this argument, all remaining columns will be used as values

Stacking and Unstacking DataFrames

We can perform stacking and unstacking when the dataframe has a hierarchical index.

Unstacking is to make the data frame shorter and wider, while stacking is to make the data frame longer and thinner.

trials = trials.set_index(['treatment', 'gender'])
trials.unstack(level = 'gender') #to specify the index that we want to move to column names. can create hierarchical columns
trials.unstack(level = 1) #same as above

stacked = trials.stack(level = 'gender') #reverses the above operation, and creates a hierarchical index

# To switch the order of the hierarchical index levels:
swapped = stacked.swaplevel(0, 1)
sorted_trials = swapped.sort_index()

Melting DataFrame

Melting can put column names back into cell values by pd.melt().

pd.melt(df, id_vars = ['col_to_be_fixed'],      #the square brackets are necessary here
        value_vars = ['col_to_be_cell_values'], #the square brackets are necessary here
        var_name = 'value_vars_name',
        value_name = 'values_name')

Example:

# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index() #so that the original index is separated out as a normal column

# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars = ['weekday'], value_name = 'visitors')

Obtaining key-value pairs with melt()

Sometimes, all you need is some key-value pairs, and the context does not matter. If said context is in the index, you can easily obtain what you want. For example, in the users DataFrame, the visitors and signups columns lend themselves well to being represented as key-value pairs. So if you created a hierarchical index with 'city' and 'weekday' columns as the index, you can easily extract key-value pairs for the 'visitors' and 'signups' columns by melting users and specifying col_level = 0.

# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])

# Obtain the key-value pairs: kv_pairs
kv_pairs = pd.melt(users_idx, col_level = 0)

Pivot Tables

Pivot tables are useful when the identifier columns contain duplicate entries, so the corresponding values need to be aggregated during pivoting. The default aggregation is the mean.

df.pivot_table(index = 'col_to_be_row_index',
               columns = 'col_to_be_column_index',
               values = 'col_to_be_values',
               aggfunc = 'count', #can also be a function object, e.g. aggfunc = len or sum (no quotes)
               margins = True)    #adds an 'All' row/column containing the aggregate over all entries

Tips:

  1. To verify df1 is equal to df2: df1.equals(df2)

Grouping Data

Categoricals and Groupby

The logic behind sales.groupby('weekday').count():

  • Split by ‘weekday’
  • Apply count() function on each group
  • Combine counts per group

Groupby also works together with many other reducing functions, such as mean(), std(), sum(), first(), last(), min() and max().

sales = pd.DataFrame(
    {
        'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
        'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
        'bread': [139, 237, 326, 456],
        'butter': [20, 45, 70, 98]
    }
) #index starts from 0

sales.groupby('weekday').count() #count the entries in each column for Mon and Sun respectively

sales.groupby('weekday')[['bread', 'butter']].sum() #sum the bread and butter for Mon and Sun respectively

sales.groupby(['weekday', 'city'])['bread'].sum() #multilevel groupby, index sorted automatically

We can also use another pandas Series as the groupby criterion:

customers = pd.Series(['Dave', 'Alice', 'Bob', 'Alice']) #same index as sales, which starts from zero
sales.groupby(customers)['bread'].sum() #the series is treated as a column of the dataframe, matched by index

Categorical data’s advantages:

  1. Uses less memory
  2. Speeds up operations like groupby()

.unique() method returns all the unique categorical values in a column.
.value_counts() method returns the count of each unique value.
.astype('category') method converts a column to the categorical dtype.
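
A minimal sketch (reusing the sales DataFrame above) of these three methods and the memory effect of the categorical dtype:

print(sales['weekday'].unique())        # ['Sun' 'Mon']
print(sales['weekday'].value_counts())  # Sun: 2, Mon: 2

sales['weekday'] = sales['weekday'].astype('category')
print(sales['weekday'].dtype)           # category
print(sales.memory_usage(deep = True))  # with many repeated values, a categorical column uses less memory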

Groupby and Aggregation

Using the .agg() method, we can perform multiple aggregation functions at once.

sales.groupby('city')[['bread', 'butter']].agg(['max', 'sum'])

The result is displayed using a multi-level column index.

Besides the built-in reducing functions, .agg() also accepts user-defined functions.

def data_range(series):
    return series.max() - series.min()

sales.groupby('weekday')[['bread', 'butter']].agg(data_range)

The .agg() method also accepts dictionaries, which let us specify a different aggregation function for each column.

sales.groupby(customers)[['bread', 'butter']].agg({'bread':'sum', 'butter': data_range})

Example:

# Group titanic by 'pclass': by_class
by_class = titanic.groupby('pclass')

# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max', 'median'])

# Print the maximum age in each class
print(aggregated.loc[:, ('age','max')])

# Print the median fare in each class
print(aggregated.loc[:, ('fare', 'median')])

To use datetime object as groupby criterion:

# Read file: sales
sales = pd.read_csv('sales.csv', index_col = 'Date', parse_dates = True)

# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime('%a')) #by weekday

# Create sum: units_sum
units_sum = by_day['Units'].sum()

# Print units_sum
print(units_sum)

Tips:
.strftime() method can convert a datetime object into the desired string format, such as day of week (%a: abbreviated, %A: full name), day of month (%d), month (%b: abbreviated, %B: full name) and year (%y: two-digit, %Y: four-digit), etc.
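
A small sketch (with a made-up pair of dates) of a few of these format codes:

import pandas as pd

idx = pd.to_datetime(['2015-02-02', '2015-02-03'])
print(idx.strftime('%a')) # Mon, Tue
print(idx.strftime('%B')) # February, February
print(idx.strftime('%Y')) # 2015, 2015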

Groupby and Transformation

We often want to group data and apply a distinct transformation to each group. Instead of aggregating after grouping, we can apply a transformation: it changes the entries according to a specified function while keeping the original index, so the result has the same shape as the input. As an example, we can compute the z-score.

def zscore(series):
    return (series - series.mean()) / series.std()

zscore(auto['mpg']).head() #apply the function to the column 'mpg'; this normalizes each value against all the values in the column
auto.groupby('yr')['mpg'].transform(zscore).head() #with groupby and transform, each value is normalized against the values of its own year

If the desired function to perform is too complicated for .transform(), we can use .apply() instead.

def zscore_with_year_and_name(group):
    df = pd.DataFrame(
        {
            'mpg': zscore(group['mpg']),
            'year': group['yr'],
            'name': group['name']
        }
    )
    return df

auto.groupby('yr').apply(zscore_with_year_and_name).head()

Filling Missing Data (imputation) by Group

Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered. Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior). We’ve practiced using the .dropna() method to drop missing values. Now, we will practice imputing missing values. We can use .groupby() and .transform() to fill missing data appropriately for each group.

In a way, imputing missing values intelligently is often preferable to dropping them entirely.

Our job is to fill in missing 'age' values for passengers on the Titanic with the median age for their 'sex' and 'pclass'. To do this, we group by the 'sex' and 'pclass' columns and transform each group with a custom function that calls .fillna() to impute the median value.

# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex', 'pclass'])

# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

# Impute age and assign to titanic['age']
titanic.age = by_sex_class.age.transform(impute_median)

# Print the output of titanic.tail(10)
print(titanic.tail(10))

Groupby and Filtering

We already know how to perform a calculation on a certain column after grouping. But what if we want to filter the rows by some criterion before aggregating? For example, we may want to know the mean 'mpg' by year for Chevrolet cars, and separately for non-Chevrolet cars.

splitting = auto.groupby('yr')
type(splitting) #groupby object
type(splitting.groups) #dict
print(splitting.groups.keys()) #returns the groupby keys (the years); the values are the corresponding rows of the original df

# We can iterate over the splitting object and carry out filtering and computations
for group_name, group in splitting: #key and value
    avg = group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
    print(group_name, avg)

# To simplify the process, we can rewrite the above for loop as a dictionary comprehension
chevy_means = {year:group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean() for year, group in splitting}
pd.Series(chevy_means) #all the mean 'mpg' for chevrolet car

# Boolean groupby
chevy = auto['name'].str.contains('chevrolet') #boolean series with same index in the groupby
auto.groupby(['yr', chevy])['mpg'].mean() #returns for both chevy and non-chevy

Another example: analyze survival rates from the ‘C’ deck, which contained the most passengers, based on ‘sex’ group.

# Create a groupby object using titanic over the 'sex' column: by_sex
by_sex = titanic.groupby('sex')

# Defining function: c_deck_survival
def c_deck_survival(gr):
    c_passengers = gr['cabin'].str.startswith('C').fillna(False) #C deck; fill all the NaNs with False
    return gr.loc[c_passengers, 'survived'].mean()

# Call by_sex.apply with the function c_deck_survival and print the result
c_surv_by_sex = by_sex.apply(c_deck_survival)

# Print the survival rates
print(c_surv_by_sex)

Filtering and Grouping with .filter()

We can also use groupby with the .filter() method to remove whole groups of rows from a DataFrame based on a boolean condition.
In this exercise, you’ll take the February sales data and remove entries from companies that purchased less than 35 Units in the whole month.

# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)

# Group sales by 'Company': by_company
by_company = sales.groupby('Company')

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
print(by_com_sum)

# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)
print(by_com_filt)

Filtering and Grouping with .map()

# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})

# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10).survived.mean()
print(survived_mean_1)

# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'pclass']).survived.mean()
print(survived_mean_2)

Case Study

Using .value_counts() for ranking

# Select the 'NOC' column of medals: country_names
country_names = medals.NOC

# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()

# Print top 15 countries ranked by medals
print(medal_counts.head(15))

Using .pivot_table() to count medals by type

# Construct the pivot table: counted
counted = medals.pivot_table(index = 'NOC', columns = 'Medal', values = 'Athlete', aggfunc = 'count')

# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis = 'columns') #row sum

# Sort counted by the 'totals' column
counted = counted.sort_values('totals', ascending = False)

# Print the top 15 rows of counted
print(counted.head(15))

Applying .drop_duplicates()

The duplicates can be dropped using the .drop_duplicates() method, leaving behind the unique observations.

# Select columns: ev_gen
ev_gen = medals.loc[:, ['Event_gender', 'Gender']]

# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()

# Print ev_gen_uniques
print(ev_gen_uniques)

Finding possible errors with .groupby()

You will now use .groupby() to continue your exploration. Your job is to group by ‘Event_gender’ and ‘Gender’ and count the rows.

You will see that there is only one suspicious row: This is likely a data error.

# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender', 'Gender'])

# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()

# Print medal_count_by_gender
print(medal_count_by_gender)

Locating suspicious data

# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')

# Create a DataFrame with the suspicious row: suspect
suspect = medals[sus]

# Print suspect
print(suspect)

.idxmax() and .idxmin() return the row or column label where the maximum or minimum value is located. By default, they return the row label; to compare column-wise, specify .idxmax(axis = 'columns').
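
A small sketch (with a made-up medal-count DataFrame) of both directions:

import pandas as pd

df = pd.DataFrame({'bronze': [2, 5], 'gold': [7, 1]}, index = ['USA', 'URS'])
print(df.idxmax())                 # for each column, the row label holding its maximum
print(df.idxmax(axis = 'columns')) # for each row, the column label holding its maximum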

Using .nunique() to rank by distinct sports

You may want to know which countries won medals in the most distinct sports. The .nunique() method is the principal aggregation here. Given a categorical Series S, S.nunique() returns the number of distinct categories.

# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby('NOC')

# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped.Sport.nunique()

# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending = False)

# Print the top 15 rows of Nsports
print(Nsports.head(15))

Using boolean series to filter the result (.isin())

To aggregate the number of distinct sports in which the USA and the USSR won medals during the Cold War years:

# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (medals.Edition >= 1952) & (medals.Edition <= 1988) #boolean series

# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA', 'URS']) #boolean series

# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals[during_cold_war & is_usa_urs]

# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby('NOC')

# Create Nsports
Nsports = country_grouped.Sport.nunique().sort_values(ascending = False)

# Print Nsports
print(Nsports)

For this exercise, you want to see which country, the USA or the USSR, won the most medals consistently over the Cold War period.

# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index = 'Edition', columns = 'NOC', values = 'Athlete', aggfunc = 'count')

# Slice medals_won_by_country: cold_war_usa_usr_medals
cold_war_usa_usr_medals = medals_won_by_country.loc[1952:1988, ['USA', 'URS']]

# Create most_medals: the winner for each year
most_medals = cold_war_usa_usr_medals.idxmax(axis = 'columns')

# Print most_medals.value_counts()
print(most_medals.value_counts())

Reshaping DataFrames for Visualization

To visualize the medal counts by ‘Edition’ for the USA. The DataFrame has been pre-loaded for you as medals.

# Create the DataFrame: usa
usa = medals.loc[medals.NOC == 'USA']

# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level = 'Medal')

# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
usa_medals_by_year.plot(kind = 'area')
plt.show()

You may have noticed that the medals are ordered according to a lexicographic (dictionary) ordering: Bronze < Gold < Silver. However, you would prefer an ordering consistent with the Olympic rules: Bronze < Silver < Gold.

You can achieve this using Categorical types. In this final exercise, after redefining the ‘Medal’ column of the DataFrame medals, you will repeat the area plot from the previous exercise to see the new ordering.

# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values = medals.Medal,
                              categories = ['Bronze', 'Silver', 'Gold'],
                              ordered = True)

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()