DataCamp - Merging DataFrames with Pandas

DataCamp course notes on merging datasets with pandas.

Preparing Data

Reading Multiple Files

pandas provides the following tools for loading in datasets:

  • pd.read_csv for CSV files
    • dataframe = pd.read_csv(filepath)
    • dozens of optional input parameters
  • Other data import tools:
    • pd.read_excel()
    • pd.read_html()
    • pd.read_json()

To read multiple data files, we can use a for loop:

import pandas as pd
filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))
dataframes[0] #'sales-jan-2015.csv'
dataframes[1] #'sales-feb-2015.csv'

Or simply a list comprehension:

filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = [pd.read_csv(f) for f in filenames]

Or use glob to load files with similar names:
glob() returns an iterable object 'filenames' containing all matching filenames in the current directory.

from glob import glob
filenames = glob('sales*.csv') #match any filename that starts with the prefix 'sales' and ends with the suffix '.csv'
dataframes = [pd.read_csv(f) for f in filenames]

Another example:

medal_types = ['bronze', 'silver', 'gold']
medals = []
for medal in medal_types:

    file_name = "%s_top5.csv" % medal

    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col = 'Country')

    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals: medals
medals = pd.concat(medals, keys = ['bronze', 'silver', 'gold'])

# Print medals in entirety
print(medals)

Reindexing DataFrames

The index is a privileged column in Pandas providing convenient access to Series or DataFrame rows.
‘indexes’ vs. ‘indices’

  • indices: many index labels within an index data structure
  • indexes: many pandas index data structures.

We can access the index directly via the .index attribute. To reindex a dataframe, we can use .reindex():

ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
w_mean3 = w_mean.reindex(w_max.index)

Note that we can also use another dataframe's index to reindex the current dataframe. Indices that do not exist in the current dataframe produce rows of NaN, which can easily be dropped via .dropna(). This is normally the first step after merging dataframes. We can also forward-fill or backward-fill the NaNs by chaining .ffill() or .bfill() after the reindexing.
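As a runnable sketch of reindexing with fill (the series values and month labels here are invented for illustration):

```python
import pandas as pd

# Quarterly mean temperatures, deliberately missing 'Apr' (invented data)
w_mean = pd.Series([32.1, 75.4, 49.0], index=['Jan', 'Jul', 'Oct'])

ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)          # 'Apr' row becomes NaN
w_mean3 = w_mean.reindex(ordered).ffill()  # NaN forward-filled from 'Jan'
```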

To sort the index in alphabetical order, we can use .sort_index(); pass ascending = False for reverse order.

To sort the dataframe using the values of a certain column, we can use .sort_values('colname')
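A quick sketch of both sorts (invented data):

```python
import pandas as pd

df = pd.DataFrame({'temp': [49.0, 32.1, 75.4]}, index=['Oct', 'Jan', 'Jul'])

by_index = df.sort_index()            # index in alphabetical order
by_index_desc = df.sort_index(ascending=False)
by_value = df.sort_values('temp')     # rows ordered by the 'temp' column
```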

Arithmetic with Series & DataFrames

Scalar Multiplication

import pandas as pd
weather = pd.read_csv('file.csv',
                      index_col = 'Date', parse_dates = True)
weather.loc['2013-7-1':'2013-7-7', 'Precipitation'] * 2.54 #broadcasting: the multiplication is applied to all elements in the dataframe

Divide()

Suppose we want the max and min temperature columns each divided by the mean temperature column:

week1_range = weather.loc['2013-07-01':'2013-07-07',
                          ['Min TemperatureF', 'Max TemperatureF']]
week1_mean = weather.loc['2013-07-01':'2013-07-07',
                         'Mean TemperatureF']

Here, we cannot directly divide week1_range by week1_mean: the plain / operator would align week1_mean's index against week1_range's columns, not its rows. Instead, we use .divide() with an explicit axis.

week1_range.divide(week1_mean, axis = 'rows')

This will broadcast the series week1_mean values across each row to produce the desired ratios.
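Since the course's weather file isn't available here, a self-contained sketch with invented temperatures shows the same broadcasting:

```python
import pandas as pd

idx = pd.to_datetime(['2013-07-01', '2013-07-02'])
week1_range = pd.DataFrame({'Min TemperatureF': [66, 66],
                            'Max TemperatureF': [79, 84]}, index=idx)
week1_mean = pd.Series([72, 74], index=idx)

# Each row of week1_range is divided by that row's mean temperature
ratios = week1_range.divide(week1_mean, axis='rows')
```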

To compute the percentage change along a time series, we subtract the previous day's value from the current day's value and divide by the previous day's value. The .pct_change() method does precisely this computation for us.

week1_mean.pct_change() * 100 # *100 for percent value.
# The first row will be NaN since there is no previous entry.

Add()

How do arithmetic operations work between distinct Series or DataFrames with non-aligned indexes? When we add two pandas Series, the index of the sum is the union of the indices of the two operands. The operation is carried out for rows with common index values; index values that appear in only one of the two operands produce NaN in the result.

bronze + silver
bronze.add(silver) #same as above
bronze.add(silver, fill_value = 0) #this will avoid the appearance of NaNs
bronze.add(silver, fill_value = 0).add(gold, fill_value = 0) #chain the method to add more
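A self-contained version with invented medal counts makes the NaN behavior visible:

```python
import pandas as pd

# Invented totals; 'GER' appears only in bronze, 'ITA' only in silver
bronze = pd.Series({'USA': 1052, 'URS': 584, 'GER': 454})
silver = pd.Series({'USA': 1195, 'URS': 627, 'ITA': 394})

plain = bronze + silver                    # non-shared labels become NaN
filled = bronze.add(silver, fill_value=0)  # missing side treated as 0
```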

Tips:
To replace a certain string in the column name:

#replace 'F' with 'C'
temps_c.columns = temps_c.columns.str.replace('F', 'C')

Multiply()

In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance.

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

# Import pandas
import pandas as pd

# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv('sp500.csv', parse_dates = True, index_col = 'Date')

# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv('exchange.csv', parse_dates = True, index_col = 'Date')

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]

# Print the head of dollars
print(dollars.head())

# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis = 'rows')

# Print the head of pounds
print(pounds.head())


Concatenating Data

Appending and Concatenating Series

We can also stack Series on top of one another by appending and concatenating, using .append() and pd.concat().

.append()

  • .append() is a Series and DataFrame method
  • syntax: s1.append(s2). Stacks rows of s2 below s1
  • Stacks rows without adjusting index values by default.
    • To discard the old index when appending, we can chain .reset_index(drop = True) after appending,

concat()

  • concat() is a pandas module function, and accepts a list or sequence of several Series or DataFrames to concatenate.
  • syntax: pd.concat([s1, s2, s3])
  • While the .append() method can only stack vertically (or row-wise), the function concat() is more flexible, and can concatenate both vertically and horizontally.
    • axis = 'rows' stacks vertically, axis = 'columns' stacks horizontally
  • concat() does not adjust index values by default.
    • To discard the old index when concatenating, we can specify the argument ignore_index = True in the function.

When stacking multiple Series, pd.concat() is in fact equivalent to chaining method calls to .append():
result1 = pd.concat([s1, s2, s3]) gives the same result as result2 = s1.append(s2).append(s3)
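A minimal sketch of vertical stacking (using pd.concat(), since Series.append() was removed in pandas 2.0):

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])
s3 = pd.Series([4, 5], index=['e', 'f'])

# Stack all three vertically; original index labels are kept
result = pd.concat([s1, s2, s3])

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([s1, s2, s3], ignore_index=True)
```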

Append then concat

# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis = 'rows')

Appending and Concatenating DataFrames

df1.append(df2)

  • By default, the dataframes are stacked row-wise (vertically).
    • If the two dataframes have identical index names and column names, then the appended result would also display identical index and column names.
    • If the two dataframes have different index and column names:
      • If an index label exists in both dataframes, the result contains two rows with that label: one with the values from df1 and one with the values from df2. The columns are unioned into one table, and NaNs fill the values that have no counterpart in the other dataframe.

pd.concat([df1, df2])

  • By default, the dataframes are stacked row-wise (vertically).
    • If the two dataframes have identical index names and column names, then the appended result would also display identical index and column names.
  • pd.concat([df1, df2], axis = 1) or pd.concat([df1, df2], axis = 'columns') stacks the dataframes horizontally, with df2's columns to the right of df1's.
    • If an index label exists in both dataframes, its row gets populated with values from both dataframes when concatenating.
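A tiny sketch (invented frames) showing both directions:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'y': [3, 4]}, index=['b', 'c'])

# Row-wise: label 'b' appears twice; each frame's missing column is NaN
stacked = pd.concat([df1, df2])

# Column-wise: indexes are aligned, so row 'b' gets values from both
side_by_side = pd.concat([df1, df2], axis='columns')
```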

Example: Reading multiple files to build a DataFrame.
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You’ll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

import pandas as pd

medals = []
medal_types = ['bronze', 'silver', 'gold']
for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal

    # Create list of column names: columns
    columns = ['Country', medal]

    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header = 0, index_col = 'Country', names = columns)

    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis = 'columns')

# Print medals
print(medals)

The expression "%s_top5.csv" % medal evaluates as a string with the value of medal replacing %s in the format string.

Concatenation, Keys & MultiIndexes

In order to differentiate data from different dataframes that share the same column names and index:

  • we can use keys to create a multilevel index. The order of the list of keys should match the order of the list of dataframes when concatenating.

    # Concatenating vertically
    rain1314 = pd.concat([rain2013, rain2014], keys = [2013, 2014], axis = 0)

    # Slicing the outer-most index
    print(rain1314.loc[2014])
  • or we can concat the columns to the right of the dataframe with the argument axis = 1 or axis = 'columns'. To avoid repeated column indices, again we need to specify keys to create a multi-level column index.

    # Concatenating horizontally
    rain1314 = pd.concat([rain2013, rain2014], keys = [2013, 2014], axis = 1)

    # Slicing the outer-most column index
    print(rain1314[2014])

    # Slicing the inner column index
    idx = pd.IndexSlice
    print(rain1314.loc[:, idx[:, 'Jan']])

or use a dictionary instead. In that case, the dictionary keys are automatically used as the keys when building the multi-index on the columns.

rain_dict = {2013:rain2013, 2014:rain2014}
rain1314 = pd.concat(rain_dict, axis = 1)

Another example:

# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = {}

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
print(sales) #outer index = month, inner index = company

# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])

Outer & Inner Joins

We can stack dataframes vertically using append(), and stack dataframes either vertically or horizontally using pd.concat(). pd.concat() is also able to align dataframes cleverly with respect to their indexes.

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
import pandas as pd
A = np.arange(8).reshape(2, 4) + 0.1
B = np.arange(6).reshape(2, 3) + 0.2
C = np.arange(12).reshape(3, 4) + 0.3

# Since A and B have same number of rows, we can stack them horizontally together
np.hstack([B, A]) #B on the left, A on the right
np.concatenate([B, A], axis = 1) #same as above

# Since A and C have same number of columns, we can stack them vertically
np.vstack([A, C])
np.concatenate([A, C], axis = 0)

A ValueError exception is raised when the arrays have different sizes along the concatenation axis.

Joining tables involves meaningfully gluing indexed rows together.
Note: we don’t need to specify the join-on column here, since concatenation refers to the index directly

  • Outer join preserves the indices in the original tables filling null values for missing rows.
    • Union of index sets (all labels, no repetition)
    • Missing fields filled with NaN
  • Inner join has only index labels common to both tables
    • Intersection of index sets
      pd.concat([population, unemployment], axis = 1, join = 'outer')
      pd.concat([population, unemployment],
                keys = ['pop', 'unemploy'], axis = 1, join = 'inner')

      # Resampling and concatenating
      # Resample and tidy china: china_annual
      china_annual = china.resample('A').pct_change(10).dropna()

      # Resample and tidy us: us_annual
      us_annual = us.resample('A').pct_change(10).dropna()

      # Concatenate china_annual and us_annual: gdp
      gdp = pd.concat([china_annual, us_annual], join = 'inner', axis = 1)

      # Resample gdp and print
      print(gdp.resample('10A').last())

Merging Data

Very often, we need to combine DataFrames either along multiple columns or along columns other than the index, where merging will be used.

The merge() function extends concat() with the ability to align rows using multiple columns.

Inner join

  • Merge all columns that occur in both dataframes: pd.merge(population, cities). It performs an inner join, which glues together only rows that match in the joining column of BOTH dataframes.

  • Merge on a particular column or columns that occur in both dataframes: pd.merge(bronze, gold, on = ['NOC', 'country']).
    We can further tailor the column names with suffixes = ['_bronze', '_gold'] to replace the suffixed _x and _y

  • When the columns to join on have different labels: pd.merge(counties, cities, left_on = 'CITY NAME', right_on = 'City'). This way, both columns used to join on will be retained.
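A runnable sketch of the left_on/right_on case (table contents invented for illustration):

```python
import pandas as pd

counties = pd.DataFrame({'CITY NAME': ['Austin', 'Dallas'],
                         'county': ['Travis', 'Dallas County']})
cities = pd.DataFrame({'City': ['Austin', 'Houston'],
                       'state': ['TX', 'TX']})

# Inner join on differently-labeled key columns; both key columns are retained
merged = pd.merge(counties, cities, left_on='CITY NAME', right_on='City')
```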

Left join & Right join

A left join keeps all rows of the left dataframe in the merged dataframe.

  • For rows in the left dataframe with matches in the right dataframe, non-joining columns of right dataframe are appended to left dataframe.
  • For rows in the left dataframe with no matches in the right dataframe, non-joining columns are filled with nulls.

And vice versa for right join.

Outer join

Outer join is a union of all rows from the left and right dataframes.
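The row-count differences among the join types can be seen on a toy pair of tables (invented zip codes):

```python
import pandas as pd

population = pd.DataFrame({'Zip': [16855, 15681], 'pop': [282, 5241]})
unemployment = pd.DataFrame({'Zip': [15681, 18657], 'rate': [0.1, 0.2]})

inner = pd.merge(population, unemployment, on='Zip', how='inner')  # matching rows only
left = pd.merge(population, unemployment, on='Zip', how='left')    # all population rows
outer = pd.merge(population, unemployment, on='Zip', how='outer')  # union of rows
```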

Besides using pd.merge(), we can also use pandas built-in method .join() to join datasets.

# By default, .join() performs a left join on the index; the joined result's index order matches the left dataframe's index
population.join(unemployment)

# It can also perform a right join; the joined result's index order matches the right dataframe's index
population.join(unemployment, how = 'right')

# Inner join
population.join(unemployment, how = 'inner')

# Outer join, which sorts the combined index
population.join(unemployment, how = 'outer')

Which merging/joining method should we use?

The simpler the better.

  • To stack two Series or DataFrames vertically: df1.append(df2)
  • To stack many horizontally or vertically, or perform simple inner/outer joins on indexes: pd.concat([df1, df2])
  • To perform simple left/right/inner/outer joins on indexes: df1.join(df2)
  • To perform many joins on multiple columns: pd.merge(df1, df2)

Ordered merges

We often want to merge dataframes whose columns have natural orderings, like date-time columns.

merge_ordered()

pd.merge_ordered() can join two datasets with respect to their original order. The merged dataframe has rows sorted lexicographically according to the column ordering in the input dataframes. By default, it performs an outer join.

pd.merge_ordered(hardware, software, on = ['Date', 'Company'], suffixes = ['_hardware', '_software'], fill_method = 'ffill')

To distinguish data from different origins, we can specify suffixes in the arguments.

merge_ordered() can also perform forward-filling for missing values in the merged dataframe. Note: ffill is not that useful for missing values at the beginning of the dataframe
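A self-contained sketch (invented sales dates, joining on 'Date' alone for brevity):

```python
import pandas as pd

hardware = pd.DataFrame({'Date': pd.to_datetime(['2015-02-02']),
                         'Units': [4]})
software = pd.DataFrame({'Date': pd.to_datetime(['2015-01-05', '2015-03-09']),
                         'Units': [5, 7]})

merged = pd.merge_ordered(hardware, software, on='Date',
                          suffixes=['_hardware', '_software'],
                          fill_method='ffill')
# Rows come out sorted by Date; ffill propagates the last seen value forward,
# but the very first row's hardware value stays NaN (nothing precedes it)
```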

merge_asof()

Similar to pd.merge_ordered(), the pd.merge_asof() function also merges values in order using the on column, but for each row in the left DataFrame, only the last row from the right DataFrame whose 'on' column value is less than or equal to the left value is kept.

This function can be used to align disparate datetime frequencies without having to resample first.

Here, you’ll merge monthly oil prices (US dollars) into a full automobile fuel efficiency dataset. The oil and automobile DataFrames have been pre-loaded as oil and auto. The first 5 rows of each have been printed in the IPython Shell for you to explore.

These datasets will align such that the first price of the year will be broadcast into the rows of the automobiles DataFrame. This is considered correct since by the start of any given year, most automobiles for that year will have already been manufactured.
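A sketch with invented monthly oil prices and car data shows the "most recent value at or before" behavior:

```python
import pandas as pd

# Both frames must be sorted on the 'on' column
oil = pd.DataFrame({'Date': pd.to_datetime(['2015-01-31', '2015-02-28']),
                    'Price': [47.6, 49.8]})
auto = pd.DataFrame({'Date': pd.to_datetime(['2015-02-15', '2015-03-10']),
                     'mpg': [27.5, 31.0]})

# Each auto row picks up the latest oil price at or before its Date
merged = pd.merge_asof(auto, oil, on='Date')
```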


Case Study: Medals in the Summer Olympics

You have a sequence of files summer_1896.csv, summer_1900.csv, …, summer_2008.csv, one for each Olympic edition (year).

You will build up a dictionary medals_dict with the Olympic editions (years) as keys and DataFrames as values.

The dictionary is built up inside a loop over the year of each Olympic edition (from the Index of editions).

Once the dictionary of DataFrames is built up, you will combine the DataFrames using pd.concat().

# Import pandas
import pandas as pd

# Create empty dictionary: medals_dict
medals_dict = {}

for year in editions['Edition']:

    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)

    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)

    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]

    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year

# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index = True) #ignore_index resets the index from 0

# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())

Counting medals by country/edition in a pivot table

# Construct the pivot table: medal_counts
medal_counts = medals.pivot_table(index = 'Edition',
                                  columns = 'NOC',
                                  values = 'Athlete',
                                  aggfunc = 'count')

Computing fraction of medals per Olympic edition and the percentage change in fraction of medals won

# Set Index of editions: totals
totals = editions.set_index('Edition')

# Reassign totals['Grand Total']: totals
totals = totals['Grand Total']

# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis = 'rows')

# Print first & last 5 rows of fractions
print(fractions.head())
print(fractions.tail())


Expanding Windows

http://pandas.pydata.org/pandas-docs/stable/computation.html#expanding-windows

A common alternative to rolling statistics is to use an expanding window, which yields the value of the statistic with all the data available up to that point in time.

These follow a similar interface to .rolling, with the .expanding method returning an Expanding object.

As these calculations are a special case of rolling statistics, they are implemented in pandas such that the following two calls are equivalent:

df.rolling(window = len(df), min_periods = 1).mean()[:5]
df.expanding(min_periods = 1).mean()[:5]


To see if there is a host country advantage, you first want to see how the fraction of medals won changes from edition to edition.

The expanding mean provides a way to see this down each column. It is the value of the mean with all the data available up to that point in time.

# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()

# Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change() * 100

# Reset the index of fractions_change: fractions_change
fractions_change = fractions_change.reset_index()

# Print first & last 5 rows of fractions_change
print(fractions_change.head())
print(fractions_change.tail())

Reshaping for analysis

# Import pandas
import pandas as pd

# Reshape fractions_change: reshaped
reshaped = pd.melt(fractions_change,
                   id_vars = 'Edition',
                   value_name = 'Change')

# Print reshaped.shape and fractions_change.shape
print(reshaped.shape, fractions_change.shape)

# Extract rows from reshaped where 'NOC' == 'CHN': chn
chn = reshaped[reshaped.NOC == 'CHN']

# Print last 5 rows of chn with .tail()
print(chn.tail())

Visualization

# Import pandas
import pandas as pd

# Merge reshaped and hosts: merged
merged = pd.merge(reshaped, hosts, how = 'inner')

# Print first 5 rows of merged
print(merged.head())

# Set Index of merged and sort it: influence
influence = merged.set_index('Edition').sort_index()

# Print first 5 rows of influence
print(influence.head())

# Import pyplot
import matplotlib.pyplot as plt

# Extract influence['Change']: change
change = influence['Change']

# Make bar plot of change: ax
ax = change.plot(kind = 'bar')

# Customize the plot to improve readability
ax.set_ylabel("% Change of Host Country Medal Count")
ax.set_title("Is there a Host Country Advantage?")
ax.set_xticklabels(editions['City'])

# Display the plot
plt.show()