DataCamp course notes on pandas Foundations.
Data Ingestion & Inspection
Basic Concepts of Pandas DataFrames
A pandas DataFrame is a tabular data structure with columns and rows. Rows are labeled by a data structure called an index: a list of labels that permits fast lookups and powerful relational operations.
Columns are likewise labeled by an index. Each column is a specialized data structure called a pandas Series, which is a 1-d labeled NumPy array; the DataFrame itself is built on a 2-d NumPy array.
Each Series has its own head method and inherits its name attribute from its column. To extract the numerical values of a Series, use the .values attribute.
Assigning NAs to every third element/row in the last column
1 | import numpy as np |
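The truncated snippet above can be fleshed out as follows. This is a minimal sketch with a made-up DataFrame, since the course's own data is not shown; slicing with iloc[::3, -1] selects every third row of the last column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course's DataFrame (values are made up)
df = pd.DataFrame({'a': range(6),
                   'b': range(6),
                   'c': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})

# Every third row (0, 3, ...) of the last column gets NaN
df.iloc[::3, -1] = np.nan
print(df)
```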
Create DataFrames
Creating a DataFrame by combining lists
1 | # Zip the 2 lists together into one list of (key,value) tuples: zipped |
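A sketch of the zipped-lists idiom, using made-up key and value lists (the course's actual lists are not shown):

```python
import pandas as pd

list_keys = ['Country', 'Total']            # hypothetical column names
list_values = [['US', 'RU'], [1118, 973]]   # hypothetical column data

# Zip the 2 lists together into one list of (key, value) tuples
zipped = list(zip(list_keys, list_values))

# Build a dictionary from the tuples, then the DataFrame from the dictionary
data = dict(zipped)
df = pd.DataFrame(data)
print(df)
```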
Creating a DataFrame by converting a dictionary
1 | # Make a string with the value 'PA': state |
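A sketch of the dictionary idiom; note how the single string state is broadcast down every row (the city names here are hypothetical):

```python
import pandas as pd

cities = ['Manheim', 'Preston park', 'Biglerville']  # hypothetical values

# Make a string with the value 'PA': state
state = 'PA'

# A scalar in the dictionary is broadcast to the length of the list column
data = {'state': state, 'city': cities}
df = pd.DataFrame(data)
print(df)
```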
Importing and Exporting Data
1 | colnames = ['a', 'b', 'c', 'd'] |
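A runnable sketch of importing and exporting; an in-memory buffer stands in for a real CSV file on disk:

```python
import io
import pandas as pd

# Stand-in for a headerless CSV file
raw = io.StringIO("1,2,3,4\n5,6,7,8\n")

colnames = ['a', 'b', 'c', 'd']
df = pd.read_csv(raw, header=None, names=colnames)

# Export; index=False drops the 0-based row labels from the output
csv_text = df.to_csv(index=False)
print(csv_text)
```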
Plotting the data
1 | import matplotlib.pyplot as plt |
Exploratory Data Analysis
Iris data set: 150 observations, 4 features each
- Sepal length, width, Petal length and width.
- 3 Species: setosa, versicolor, virginica
Visual EDA
1 | import pandas as pd |
- Scatter plot:
kind = 'scatter'
- Boxplot:
kind = 'box'
- Histogram:
kind = 'hist'
- bins (integer): number of intervals or bins
- range (tuple): extrema of the bins (min, max)
- normed (boolean): whether to normalize to one (called density in newer matplotlib/pandas versions)
- cumulative (boolean): compute the Cumulative Distribution Function (CDF)
Note: There are three different DataFrame plot idioms:
iris.plot(kind = 'hist')
iris.plot.hist()
iris.hist()
Their syntax and results also differ from each other.
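The three idioms can be sketched on stand-in data (random numbers in place of the iris measurements, which are not included here); the Agg backend keeps the sketch runnable without a display:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for non-interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for two iris columns
iris = pd.DataFrame({'sepal_length': rng.normal(5.8, 0.8, 150),
                     'petal_length': rng.normal(3.7, 1.7, 150)})

# Idiom 1: all columns overlaid on a single Axes
ax = iris.plot(kind='hist', bins=20, alpha=0.5)

# Idiom 2: the plot accessor, equivalent to idiom 1
ax2 = iris.plot.hist(bins=20, alpha=0.5)

# Idiom 3: a separate subplot per column
axes = iris.hist(bins=20)

plt.close('all')
```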
Formatting the plots. Plotting PDF, CDF
1 | # This formats the plots such that they appear on separate rows |
Statistical EDA
.describe()
for numeric values: count, mean, std, min, 25%, median (50%), 75%, max
for categorical values:
- count, the number of non-null entries
- unique, the number of distinct values
- top, the most frequent category
- freq, the number of occurrences of top
.unique() returns the unique values in a column.
.mean(axis = 'columns') computes the mean across all columns per row.
.std(), .median()
.quantile(q), where q is in (0, 1). To calculate the inter-quartile range (IQR): q = [0.25, 0.75]. A boxplot also helps to understand the quartiles visually.
.min(), .max()
.count()
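A small sketch of the numeric summaries on made-up measurements:

```python
import pandas as pd

# Hypothetical measurements
df = pd.DataFrame({'sepal_length': [4.9, 5.0, 5.2, 5.8, 6.1, 6.4, 7.0, 7.9]})

stats = df.describe()                   # count, mean, std, min, 25%, 50%, 75%, max
iqr_bounds = df.quantile([0.25, 0.75])  # the two quartiles bounding the IQR
print(stats)
print(iqr_bounds)
```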
Filtering by species
1 | indices = iris['species'] == 'setosa' |
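The filtering idiom in full, on a hypothetical miniature of the iris table:

```python
import pandas as pd

# Hypothetical miniature of the iris data
iris = pd.DataFrame({'species': ['setosa', 'versicolor', 'setosa', 'virginica'],
                     'petal_length': [1.4, 4.7, 1.3, 6.0]})

# Boolean Series: True where the species matches
indices = iris['species'] == 'setosa'

# Use the Boolean Series to select the matching rows
setosa = iris.loc[indices, :]
print(setosa)
```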
Time Series in Pandas
read_csv() can parse strings into datetime objects with the argument parse_dates = True. Dates are stored as yyyy-mm-dd hh:mm:ss. For time series data, it is also useful to set the date column as the index, so that datetime slicing can be performed easily.
1 | import pandas as pd |
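A runnable sketch; an in-memory buffer stands in for a CSV file with a date column:

```python
import io
import pandas as pd

raw = io.StringIO("Date,Temperature\n2015-02-01,3.0\n2015-02-02,4.5\n")

# parse_dates converts the strings; index_col makes the dates the index
df = pd.read_csv(raw, parse_dates=True, index_col='Date')

# Datetime slicing now works directly on the index
feb_first = df.loc['2015-02-01']
print(df.index.dtype)
```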
Resampling pandas time series
Resampling is a statistical method applied over different time intervals.
Downsampling reduces the datetime rows to a slower (lower) frequency.
Upsampling increases the datetime rows to a faster (higher) frequency.
List of the most frequently used resampling frequencies:
- 'min' or 'T': minute
- 'H': hour
- 'D': day
- 'B': business day
- 'W': week
- 'M': month
- 'Q': quarter
- 'A': year
Downsampling
1 | # Aggregation |
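A downsampling sketch with made-up hourly readings:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings over two days
hourly = pd.Series(np.arange(48.0),
                   index=pd.date_range('2015-02-01', periods=48, freq='h'))

# Downsample to daily frequency, aggregating each day with the mean
daily_mean = hourly.resample('D').mean()
print(daily_mean)
```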
Upsampling
The example below upsamples daily records to 4-hour interval records.
1 | two_days = sales.loc['2015-2-4':'2015-2-5', 'Units'] |
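A sketch of the same idea with made-up daily values; ffill propagates the last known value forward into the new 4-hour slots:

```python
import pandas as pd

# Hypothetical daily unit counts for two days
two_days = pd.Series([10, 20], index=pd.to_datetime(['2015-02-04', '2015-02-05']))

# Upsample to a 4-hour frequency and forward-fill the gaps
four_hourly = two_days.resample('4h').ffill()
print(four_hourly)
```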
Rolling time window with .rolling(). For example, with a Series hourly_data, hourly_data.rolling(window=24).mean() computes a new value for each hourly point, based on a 24-hour window stretching back behind that point. The frequency of the output is unchanged: it is still hourly. Such an operation is useful for smoothing time series data.
1 | # Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed |
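A rolling-mean sketch on made-up hourly data; note that the first 23 outputs are NaN because no full 24-hour window exists yet:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings
hourly_data = pd.Series(np.arange(48.0),
                        index=pd.date_range('2010-08-01', periods=48, freq='h'))

# Trailing 24-hour mean; the output keeps the hourly frequency
smoothed = hourly_data.rolling(window=24).mean()
print(smoothed.tail())
```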
Manipulating pandas time series
String methods
.str.upper() converts the string to upper case.
.str.contains('a') searches for the keyword 'a' and returns a Boolean Series; .str.contains('a').sum() therefore counts the records that contain 'a'.
.str.strip() strips extra whitespace.
1 | times_tz_none = pd.to_datetime( la['Date (MM/DD/YYYY)'] + ' ' + la['Wheels-off Time'] ) |
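A sketch of the string methods and the concatenation idiom on hypothetical flight records:

```python
import pandas as pd

# Hypothetical flight records
la = pd.DataFrame({'Date (MM/DD/YYYY)': ['02/04/2015', '02/05/2015'],
                   'Wheels-off Time': ['08:30', '14:05'],
                   'Airline': [' united ', 'american']})

# Elementwise string operations through the .str accessor
upper = la['Airline'].str.strip().str.upper()
has_a = la['Airline'].str.contains('a').sum()  # number of records containing 'a'

# Concatenate the date and time strings, then parse into datetimes
times_tz_none = pd.to_datetime(la['Date (MM/DD/YYYY)'] + ' ' + la['Wheels-off Time'])
print(times_tz_none)
```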
Datetime methods
.dt.hour extracts the hour of the datetime value as a Series.
.dt.tz_localize('US/Central') and .dt.tz_convert('US/Eastern') handle time zones: first localize, then convert.
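A time-zone sketch on made-up naive timestamps:

```python
import pandas as pd

# Hypothetical naive datetimes (no time zone attached yet)
times = pd.Series(pd.to_datetime(['2015-02-04 08:30', '2015-02-04 14:05']))

# First attach a zone to the naive values, then convert to another zone
central = times.dt.tz_localize('US/Central')
eastern = central.dt.tz_convert('US/Eastern')
print(eastern)
```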
Visualizing pandas time series
Style format string:
- color: (k: black), (b: blue), (g: green), (r: red), (c: cyan)
- marker: (.: dot), (o: circle), (*: star), (s: square), (+: plus)
- line type: (-: solid), (:: dotted), (--: dashed)
Area plot: kind = 'area'
To plot each column separately: subplots = True
Before plotting time series data, make sure to set the index to datetime objects.
1 | # Convert the 'Date' column into a collection of datetime objects: df.Date |
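A sketch of the full plotting workflow on made-up data; the style string 'k.-' means black, dot markers, solid line:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for non-interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical readings
df = pd.DataFrame({'Date': ['2015-02-01', '2015-02-02', '2015-02-03'],
                   'Temperature': [3.0, 4.5, 2.2]})

# Convert the column to datetimes and make it the index before plotting
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

ax = df['Temperature'].plot(style='k.-')
plt.savefig('temperature.png')  # plt.show() in an interactive session
```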
Case Study
1 | import pandas as pd |
pandas has been imported in the workspace as pd, and the file NOAA_QCLCD_2011_hourly_13904.txt has been parsed and loaded into a DataFrame df. The comma-separated string of column names, column_labels, and the list of columns to drop, list_to_drop, have also been loaded for you.
1 | # Split on the comma to create a list: column_labels_list |
Convert the index into datetime objects
1 | # Convert the date column to string: df_dropped['date'] |
Convert temperature columns into numeric values
1 | # Convert the dry_bulb_faren column to numeric values: df_clean['dry_bulb_faren'] |
Now that the data is clean, we can start EDA.
Statistical EDA:
1 | # Print the median of the dry_bulb_faren column |
.reset_index() resets the index back to the default 0-based integer indexing.
1 | # Downsample df_clean by day and aggregate by mean: daily_mean_2011 |
Sunny or cloudy?
1 | # Select days that are sunny: sunny |
Correlation between visibility and temperature:
1 | # Import matplotlib.pyplot as plt |
Daily hours of clear sky
1 | # Create a Boolean Series for sunny days: sunny |
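The daily-hours idea can be sketched as follows, with made-up hourly sky-condition codes; summing a Boolean Series per day counts the True (clear) hours:

```python
import pandas as pd

# Hypothetical hourly sky-condition codes for two days
idx = pd.date_range('2011-01-01', periods=48, freq='h')
sky = pd.Series(['CLR', 'OVC'] * 12 + ['CLR'] * 24, index=idx)

# Create a Boolean Series for sunny (clear-sky) hours
sunny = sky == 'CLR'

# Sum of True values per day = hours of clear sky that day
sunny_hours = sunny.resample('D').sum()
print(sunny_hours)
```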
Heat or humidity
Dew point is a measure of relative humidity based on pressure and temperature. A dew point above 65 is considered uncomfortable, as is a temperature above 90.
In this exercise, you will explore the maximum temperature and dew point of each month. The columns of interest are ‘dew_point_faren’ and ‘dry_bulb_faren’. After resampling them appropriately to get the maximum temperature and dew point in each month, generate a histogram of these values as subplots. Uncomfortably, you will notice that the maximum dew point is above 65 every month!
1 | # Resample dew_point_faren and dry_bulb_faren by Month, aggregating the maximum values: monthly_max |
Probability of high temperatures
We already know that 2011 was hotter than the climate normals for the previous thirty years. In this final exercise, you will compare the maximum temperature in August 2011 against that of the August 2010 climate normals. More specifically, you will use a CDF plot to determine the probability of the 2011 daily maximum temperature in August being above the 2010 climate normal value. To do this, you will leverage the data manipulation, filtering, resampling, and visualization skills you have acquired throughout this course.
The two DataFrames df_clean and df_climate are available in the workspace. Your job is to select the maximum temperature in August from df_climate, and then the maximum daily temperatures in August 2011. You will then filter out the days in August 2011 that were above the August 2010 maximum, and use this to construct a CDF plot.
Once you’ve generated the CDF, notice how it shows that there was a 50% probability of the 2011 daily maximum temperature in August being 5 degrees above the 2010 climate normal value!
1 | # Extract the maximum temperature in August 2010 from df_climate: august_max |
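The shape of the exercise can be sketched with stand-in numbers, since df_clean and df_climate are not shown here; the threshold and the 2011 daily maxima below are made up:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for non-interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the real data
august_max = 95.0                               # 2010 climate-normal maximum
august_2011 = pd.Series(rng.normal(97, 4, 31),  # 2011 daily maxima (made up)
                        index=pd.date_range('2011-08-01', periods=31, freq='D'))

# Filter to the 2011 days that exceeded the climate normal
august_2011_high = august_2011.loc[august_2011 > august_max]

# CDF: a normalized, cumulative histogram of those hot days
ax = august_2011_high.plot(kind='hist', bins=25, density=True, cumulative=True)
plt.savefig('cdf.png')
```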