DataCamp course notes on merging datasets with pandas.
Preparing Data
Reading Multiple Files
pandas provides the following tools for loading in datasets:
- pd.read_csv() for CSV files
  - dataframe = pd.read_csv(filepath)
  - dozens of optional input parameters
- Other data import tools:
  - pd.read_excel()
  - pd.read_html()
  - pd.read_json()
To read multiple data files, we can use a for loop:

```python
import pandas as pd

filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))

dataframes[0]  # 'sales-jan-2015.csv'
dataframes[1]  # 'sales-feb-2015.csv'
```
Or, more simply, a list comprehension:

```python
filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = [pd.read_csv(f) for f in filenames]
```
Or use glob() to load files with similar names. glob() returns a list of all filenames matching the pattern in the current directory:

```python
from glob import glob

filenames = glob('sales*.csv')  # match any filename starting with 'sales' and ending with '.csv'
dataframes = [pd.read_csv(f) for f in filenames]
```
Another example (note that medal_types and the empty list medals must be initialized before the loop):

```python
medal_types = ['bronze', 'silver', 'gold']
medals = []
for medal in medal_types:
    file_name = "%s_top5.csv" % medal
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col='Country')
    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])

# Print medals in entirety
print(medals)
```
Reindexing DataFrames
The index is a privileged column in pandas, providing convenient access to Series or DataFrame rows.
'indexes' vs. 'indices':
- indices: many index labels within an index data structure
- indexes: many pandas index data structures
We can access the index directly through the .index attribute. To reindex a DataFrame, we can use .reindex():

```python
ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
w_mean3 = w_mean.reindex(w_max.index)
```
Note that we can also use another DataFrame's index to reindex the current DataFrame. Indices that do not exist in the current DataFrame produce rows of NaN, which can be dropped easily via .dropna(). This is normally the first step after merging DataFrames. Alternatively, we can forward-fill or backward-fill the NaNs by chaining .ffill() or .bfill() after the reindexing.

To sort the index in alphabetical order, we can use .sort_index() (or .sort_index(ascending=False) for reverse order). To sort the DataFrame by the values of a certain column, we can use .sort_values('colname').
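A minimal sketch of reindexing, filling, and sorting, on a made-up monthly series (the names and values are assumptions for illustration, not from the course data):

```python
import pandas as pd

# Hypothetical quarterly means indexed by month name
w_mean = pd.Series([10.0, 15.0, 25.0, 18.0], index=['Jan', 'Apr', 'Jul', 'Oct'])

# Reindexing with a label absent from the original introduces a NaN row
quarterly = w_mean.reindex(['Jan', 'Feb', 'Apr'])
print(quarterly.isna().sum())  # 1: only 'Feb' is missing

# Forward-fill propagates the last valid value into the gap
filled = w_mean.reindex(['Jan', 'Feb', 'Apr']).ffill()
print(filled['Feb'])  # 10.0, carried forward from 'Jan'

# Sort the index alphabetically
print(w_mean.sort_index().index.tolist())  # ['Apr', 'Jan', 'Jul', 'Oct']
```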
Arithmetic with Series & DataFrames
Scalar Multiplication

```python
import pandas as pd

weather = pd.read_csv('file.csv', index_col='Date', parse_dates=True)
weather.loc['2013-7-1':'2013-7-7', 'Precipitation'] * 2.54  # broadcasting: applied to every element
```
Divide()
Suppose we want the max and min temperature columns each divided by the mean temperature column:

```python
week1_range = weather.loc['2013-07-01':'2013-07-07',
                          ['Min TemperatureF', 'Max TemperatureF']]
week1_mean = weather.loc['2013-07-01':'2013-07-07',
                         'Mean TemperatureF']
```

Here, we cannot directly divide week1_range by week1_mean: the / operator aligns on column labels, which do not match. Instead, we use .divide() with an explicit axis:

```python
week1_range.divide(week1_mean, axis='rows')
```

This broadcasts the Series week1_mean across each row to produce the desired ratios.
To compute the percentage change along a time series, we subtract the previous day's value from the current day's value and divide by the previous day's value. The .pct_change() method does precisely this computation for us:

```python
week1_mean.pct_change() * 100  # *100 for percent value
# The first row will be NaN since there is no previous entry.
```
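A quick check of this definition on a made-up series (the values are an assumption for illustration), comparing .pct_change() against the shift-and-divide computation described above:

```python
import pandas as pd

# Hypothetical daily mean temperatures
week1_mean = pd.Series([68.0, 70.0, 69.0],
                       index=pd.date_range('2013-07-01', periods=3))

pct = week1_mean.pct_change() * 100

# The same computation by hand: subtract the shifted series, then divide by it
manual = (week1_mean - week1_mean.shift(1)) / week1_mean.shift(1) * 100

print(pct.round(6).equals(manual.round(6)))  # True; the first entry is NaN in both
```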
Add()
How do arithmetic operations work between distinct Series or DataFrames with non-aligned indexes? When we add two pandas Series, the index of the sum is the union of the row indices of the original two Series. The arithmetic is carried out only for rows with a common index value; any index present in just one of the two operands yields NaN in the result.

```python
bronze + silver
bronze.add(silver)                 # same as above
bronze.add(silver, fill_value=0)   # avoids the appearance of NaNs
bronze.add(silver, fill_value=0).add(gold, fill_value=0)  # chain the method to add more
```
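A small runnable sketch of this alignment behavior (the country labels and counts are made up for illustration):

```python
import pandas as pd

# Hypothetical medal counts with partially overlapping country labels
bronze = pd.Series({'USA': 1052, 'GER': 454})
silver = pd.Series({'USA': 1195, 'ITA': 394})

# Plain + aligns on the union of indexes; labels missing on one side become NaN
total = bronze + silver
print(total['USA'])  # 2247.0

# fill_value=0 treats the missing entries as 0 instead of propagating NaN
total_filled = bronze.add(silver, fill_value=0)
print(total_filled['GER'])  # 454.0
```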
Tips:
To replace a certain string in the column names:

```python
# Replace 'F' with 'C'
temps_c.columns = temps_c.columns.str.replace('F', 'C')
```
Multiply()
In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

```python
# Import pandas
import pandas as pd

# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv('sp500.csv', parse_dates=True, index_col='Date')

# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv('exchange.csv', parse_dates=True, index_col='Date')

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]
print(dollars.head())

# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')
print(pounds.head())
```
Concatenating Data
Appending and Concatenating Series
We can also stack Series on top of one another by appending and concatenating, using .append() and pd.concat().

.append()
- .append() is a Series and DataFrame method. Syntax: s1.append(s2) stacks the rows of s2 below s1.
- Stacks rows without adjusting index values by default.
- To discard the old index when appending, chain .reset_index(drop=True) after appending.
- (Note: .append() was deprecated and removed in pandas 2.0; pd.concat() covers the same use case.)

concat()
- concat() is a pandas module function, and accepts a list or sequence of several Series or DataFrames to concatenate. Syntax: pd.concat([s1, s2, s3]).
- While the .append() method can only stack vertically (row-wise), concat() is more flexible, and can concatenate both vertically and horizontally: axis='rows' stacks vertically, axis='columns' stacks horizontally.
- Concatenates without adjusting index values by default.
- To discard the old index, specify the argument ignore_index=True in the function.

When stacking multiple Series, pd.concat() is in fact equivalent to chaining method calls to .append():

```python
result1 = pd.concat([s1, s2, s3])
result2 = s1.append(s2).append(s3)  # same result
```
Append then concat:

```python
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')
```
Appending and Concatenating DataFrames
df1.append(df2)
- By default, the DataFrames are stacked row-wise (vertically).
- If the two DataFrames have identical index and column names, the appended result also displays identical index and column names.
- If the two DataFrames have different index and column names, the columns are unioned into one table. An index value that exists in both DataFrames yields two rows of that index, one showing the values from df1 and one from df2; entries with no value in the source DataFrame are filled with NaN.

pd.concat([df1, df2])
- By default, the DataFrames are stacked row-wise (vertically), with the same behavior as above.
- pd.concat([df1, df2], axis=1) or pd.concat([df1, df2], axis='columns') stacks the DataFrame columns horizontally on the right. An index value that exists in both DataFrames yields a single row populated with values from both DataFrames.
Example: Reading multiple files to build a DataFrame.
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

```python
import pandas as pd

medals = []
medal_types = ['bronze', 'silver', 'gold']
for medal in medal_types:
    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    # Create list of column names: columns
    columns = ['Country', medal]
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)
    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis='columns')
print(medals)
```

The expression "%s_top5.csv" % medal evaluates to a string with the value of medal replacing %s in the format string.
Concatenation, Keys & MultiIndexes
To differentiate data that comes from different DataFrames but shares the same column names and index, we can use the keys argument to create a multi-level index. The order of the list of keys should match the order of the list of DataFrames when concatenating.

```python
# Concatenating vertically
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)

# Slicing the outer-most index
print(rain1314.loc[2014])
```

Or we can concatenate the columns to the right of the DataFrame with the argument axis=1 or axis='columns'. To avoid repeated column indices, again we need to specify keys to create a multi-level column index.

```python
# Concatenating horizontally
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=1)

# Slicing the outer-most column index
print(rain1314[2014])

# Slicing the inner column index
idx = pd.IndexSlice
print(rain1314.loc[:, idx[:, 'Jan']])
```
Or use a dictionary instead. In that case, the dictionary keys are automatically treated as the values of keys in building a multi-level index on the columns.

```python
rain_dict = {2013: rain2013, 2014: rain2014}
rain1314 = pd.concat(rain_dict, axis=1)
```
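A minimal runnable sketch of the dictionary form (the rainfall values are made up for illustration):

```python
import pandas as pd

# Hypothetical rainfall tables with identical column names
rain2013 = pd.DataFrame({'Jan': [100.0], 'Feb': [90.0]})
rain2014 = pd.DataFrame({'Jan': [110.0], 'Feb': [95.0]})

# Dictionary keys become the outer level of the column MultiIndex
rain = pd.concat({2013: rain2013, 2014: rain2014}, axis=1)
print(rain.columns.nlevels)  # 2
print(rain[2014]['Jan'].iloc[0])  # 110.0
```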
Another example:

```python
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = {}

for month_name, month_data in month_list:
    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)
print(sales)  # outer index = month, inner index = company

# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])
```
Outer & Inner Joins
We can stack DataFrames vertically using .append(), and stack them either vertically or horizontally using pd.concat(). pd.concat() is also able to align DataFrames cleverly with respect to their indexes. Compare with plain NumPy stacking:

```python
import numpy as np
import pandas as pd

A = np.arange(8).reshape(2, 4) + 0.1
B = np.arange(6).reshape(2, 3) + 0.2
C = np.arange(12).reshape(3, 4) + 0.3

# Since A and B have the same number of rows, we can stack them horizontally
np.hstack([B, A])               # B on the left, A on the right
np.concatenate([B, A], axis=1)  # same as above

# Since A and C have the same number of columns, we can stack them vertically
np.vstack([A, C])
np.concatenate([A, C], axis=0)
```

A ValueError exception is raised when the arrays have different sizes along the concatenation axis.
Joining tables involves meaningfully gluing indexed rows together. Note: we don't need to specify a join-on column here, since concatenation refers to the index directly.
- Outer join preserves the indices in the original tables, filling null values for missing rows.
  - Union of index sets (all labels, no repetition)
  - Missing fields filled with NaN
- Inner join keeps only the index labels common to both tables.
  - Intersection of index sets
```python
pd.concat([population, unemployment], axis=1, join='outer')
pd.concat([population, unemployment],
          keys=['pop', 'unemploy'], axis=1, join='inner')

# Resampling and concatenating
# Resample and tidy china: china_annual
china_annual = china.resample('A').last().pct_change(10).dropna()

# Resample and tidy us: us_annual
us_annual = us.resample('A').last().pct_change(10).dropna()

# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], join='inner', axis=1)

# Resample gdp and print
print(gdp.resample('10A').last())
```
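The outer-versus-inner distinction can be demonstrated with two tiny made-up series (the labels and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical yearly series with one overlapping index label
population = pd.Series({'2010': 100, '2011': 110}, name='pop')
unemployment = pd.Series({'2011': 5.0, '2012': 6.0}, name='unemploy')

outer = pd.concat([population, unemployment], axis=1, join='outer')
inner = pd.concat([population, unemployment], axis=1, join='inner')

print(sorted(outer.index))  # union: ['2010', '2011', '2012']
print(list(inner.index))    # intersection: ['2011']
print(outer.loc['2010', 'unemploy'])  # NaN: no match in unemployment
```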
Merging Data
Very often, we need to combine DataFrames either on multiple columns or on columns other than the index, which is where merging comes in. The merge() function extends concat() with the ability to align rows using one or more columns.
Inner join
- Merge on all columns that occur in both DataFrames: pd.merge(population, cities). This performs an inner join, which glues together only the rows that match in the joining columns of BOTH DataFrames.
- Merge on a particular column or columns that occur in both DataFrames: pd.merge(bronze, gold, on=['NOC', 'country']). We can further tailor the column names with suffixes=['_bronze', '_gold'] to replace the default suffixes _x and _y.
- When the columns to join on have different labels: pd.merge(counties, cities, left_on='CITY NAME', right_on='City'). This way, both columns used to join on will be retained.
Left join & Right join
A left join keeps all rows of the left DataFrame in the merged DataFrame.
- For rows in the left DataFrame with matches in the right DataFrame, the non-joining columns of the right DataFrame are appended to the left DataFrame.
- For rows in the left DataFrame with no matches in the right DataFrame, the non-joining columns are filled with nulls.
And vice versa for a right join.
Outer join
Outer join is a union of all rows from the left and right dataframes.
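The `how` argument to pd.merge() selects among these behaviors; a small sketch with made-up branch data (names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical revenue and manager tables with partially overlapping branches
revenue = pd.DataFrame({'branch': ['A', 'B', 'C'], 'rev': [100, 200, 300]})
managers = pd.DataFrame({'branch': ['A', 'B', 'D'], 'manager': ['Ann', 'Bob', 'Dan']})

left = pd.merge(revenue, managers, on='branch', how='left')
outer = pd.merge(revenue, managers, on='branch', how='outer')

print(len(left))   # 3: every row of revenue is kept
print(len(outer))  # 4: union of branches A, B, C, D
# Branch 'C' has no manager, so its 'manager' entry is NaN in the left join
print(left.loc[left['branch'] == 'C', 'manager'].isna().all())  # True
```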
Besides pd.merge(), we can also use the built-in DataFrame method .join() to join datasets:

```python
# By default, .join() performs a left join on the index; the index order of
# the joined dataset matches the left DataFrame's index
population.join(unemployment)

# It can also perform a right join; the index order of the joined dataset
# matches the right DataFrame's index
population.join(unemployment, how='right')

# Inner join
population.join(unemployment, how='inner')

# Outer join; sorts the combined index
population.join(unemployment, how='outer')
```
Which merging/joining method should we use? The simpler the better:
- To stack two Series or DataFrames vertically: df1.append(df2)
- To stack many horizontally or vertically, or to perform simple inner/outer joins on indexes: pd.concat([df1, df2])
- To perform simple left/right/inner/outer joins on indexes: df1.join(df2)
- To perform joins on multiple columns: pd.merge(df1, df2)
Ordered merges
We often want to merge DataFrames whose columns have natural orderings, like date-time columns.
merge_ordered()
pd.merge_ordered() can join two datasets with respect to their original order. The merged DataFrame has rows sorted lexicographically according to the column ordering in the input DataFrames. By default, it performs an outer join.

```python
pd.merge_ordered(hardware, software, on=['Date', 'Company'],
                 suffixes=['_hardware', '_software'], fill_method='ffill')
```

To distinguish data from different origins, we can specify suffixes in the arguments. merge_ordered() can also perform forward-filling for missing values in the merged DataFrame via fill_method='ffill'. Note: ffill is not that useful for missing values at the beginning of the DataFrame.
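A self-contained sketch of merge_ordered() on made-up release logs (the dates and unit counts are assumptions for illustration):

```python
import pandas as pd

# Hypothetical release logs, each already sorted by date
hardware = pd.DataFrame({'Date': pd.to_datetime(['2015-01-01', '2015-03-01']),
                         'Units': [10, 20]})
software = pd.DataFrame({'Date': pd.to_datetime(['2015-02-01']),
                         'Units': [5]})

merged = pd.merge_ordered(hardware, software, on='Date',
                          suffixes=['_hardware', '_software'],
                          fill_method='ffill')

print(len(merged))  # 3: outer join on the union of dates
print(merged['Date'].is_monotonic_increasing)  # True: rows come out sorted
```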
merge_asof()
Similar to pd.merge_ordered(), the pd.merge_asof() function also merges values in order using the on column, but for each row in the left DataFrame, only the last row from the right DataFrame whose on value is less than or equal to the left value is kept.
This function can be used to align disparate datetime frequencies without having to resample first.
Here, you'll merge monthly oil prices (US dollars) into a full automobile fuel efficiency dataset. The oil and automobile DataFrames have been pre-loaded as oil and auto. The first 5 rows of each have been printed in the IPython Shell for you to explore.
These datasets will align such that the first price of the year is broadcast into the rows of the automobiles DataFrame. This is considered correct since, by the start of any given year, most automobiles for that year will have already been manufactured.
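The as-of behavior can be sketched with made-up oil prices and automobile records (all values are assumptions for illustration, not the course data):

```python
import pandas as pd

# Hypothetical monthly oil prices and mid-month automobile records
oil = pd.DataFrame({'Date': pd.to_datetime(['2015-01-01', '2015-02-01']),
                    'Price': [50.0, 55.0]})
auto = pd.DataFrame({'Date': pd.to_datetime(['2015-01-15', '2015-02-20']),
                     'mpg': [30, 35]})

# Both frames must already be sorted on the 'on' column; each auto row picks
# up the most recent oil price at or before its own date
merged = pd.merge_asof(auto, oil, on='Date')
print(merged['Price'].tolist())  # [50.0, 55.0]
```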
Case Study: Medals in the Summer Olympics
You have a sequence of files summer_1896.csv, summer_1900.csv, ..., summer_2008.csv, one for each Olympic edition (year).
You will build up a dictionary medals_dict with the Olympic editions (years) as keys and DataFrames as values. The dictionary is built up inside a loop over the year of each Olympic edition (from the Index of editions). Once the dictionary of DataFrames is built up, you will combine the DataFrames using pd.concat():

```python
# Import pandas
import pandas as pd

# Create empty dictionary: medals_dict
medals_dict = {}

for year in editions['Edition']:
    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]
    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year

# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)  # ignore_index resets the index from 0

# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())
```
Counting medals by country/edition in a pivot table:

```python
# Construct the pivot table: medal_counts
medal_counts = medals.pivot_table(index='Edition',
                                  columns='NOC',
                                  values='Athlete',
                                  aggfunc='count')
```
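The same pivot on a tiny made-up medal table (the editions, NOCs, and athlete names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical long-format medal rows
medals = pd.DataFrame({'Edition': [1896, 1896, 1900],
                       'NOC': ['USA', 'FRA', 'USA'],
                       'Athlete': ['a', 'b', 'c']})

counts = medals.pivot_table(index='Edition', columns='NOC',
                            values='Athlete', aggfunc='count')
print(counts.shape)  # (2, 2): editions as rows, NOCs as columns
print(counts.loc[1896, 'USA'])
# FRA won nothing in 1900, so that cell is NaN
print(pd.isna(counts.loc[1900, 'FRA']))  # True
```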
Computing the fraction of medals per Olympic edition and the percentage change in the fraction of medals won:

```python
# Set Index of editions: totals
totals = editions.set_index('Edition')

# Reassign totals['Grand Total']: totals
totals = totals['Grand Total']

# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis='rows')

# Print first & last 5 rows of fractions
print(fractions.head())
print(fractions.tail())
```
Expanding Windows
http://pandas.pydata.org/pandas-docs/stable/computation.html#expanding-windows
A common alternative to rolling statistics is to use an expanding window, which yields the value of the statistic with all the data available up to that point in time.
These follow an interface similar to .rolling, with the .expanding method returning an Expanding object.
As these calculations are a special case of rolling statistics, they are implemented in pandas such that the following two calls are equivalent:

```python
df.rolling(window=len(df), min_periods=1).mean()[:5]
df.expanding(min_periods=1).mean()[:5]
```
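The claimed equivalence can be verified on a small made-up series:

```python
import pandas as pd

# A small series to check the equivalence above
s = pd.Series([1.0, 2.0, 3.0, 4.0])

rolled = s.rolling(window=len(s), min_periods=1).mean()
expanded = s.expanding(min_periods=1).mean()

print(rolled.equals(expanded))  # True
print(expanded.tolist())  # [1.0, 1.5, 2.0, 2.5]: mean of all data seen so far
```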
To see if there is a host country advantage, you first want to see how the fraction of medals won changes from edition to edition.
The expanding mean provides a way to see this down each column. It is the value of the mean with all the data available up to that point in time.
```python
# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding(min_periods=1).mean()

# Compute the percentage change in the expanding mean: fractions_change
fractions_change = mean_fractions.pct_change() * 100
fractions_change = fractions_change.reset_index()
```
Reshaping for analysis:

```python
# Import pandas
import pandas as pd

# Reshape fractions_change: reshaped
reshaped = pd.melt(fractions_change,
                   id_vars='Edition',
                   value_name='Change')

# Print reshaped.shape and fractions_change.shape
print(reshaped.shape, fractions_change.shape)

# Extract rows from reshaped where 'NOC' == 'CHN': chn
chn = reshaped[reshaped.NOC == 'CHN']

# Print last 5 rows of chn with .tail()
print(chn.tail())
```
Visualization:

```python
# Import pandas
import pandas as pd

# Merge reshaped and hosts: merged
merged = pd.merge(reshaped, hosts, how='inner')
print(merged.head())

# Set Index of merged and sort it: influence
influence = merged.set_index('Edition').sort_index()
print(influence.head())

# Import pyplot
import matplotlib.pyplot as plt

# Extract influence['Change']: change
change = influence['Change']

# Make bar plot of change: ax
ax = change.plot(kind='bar')

# Customize the plot to improve readability
ax.set_ylabel("% Change of Host Country Medal Count")
ax.set_title("Is there a Host Country Advantage?")
ax.set_xticklabels(editions['City'])

# Display the plot
plt.show()
```