Statistical Thinking in Python

Learn how to think and analyze statistically with Python (Datacamp).

Exploratory Data Analysis (EDA)

Graphical EDA

We usually use matplotlib.pyplot and seaborn to draw easy graphs to help our understanding of the dataset.

Key Takeaways:

  • Use Seaborn style to make the graph look prettier.
  • Always attach labels to the axes.

Histogram

Key takeaways:

  • Number of bins is the square root of number of data points.
  • 2 drawbacks:
    • Bining bias: the same dat amay be interpreted differently depending on choice of bins
    • We are not plotting all of the data, since they are swept into bins and we lost their actual values.
    • A Bee swarm plot can help with the above drawbacks.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Using seaborn to set default style
sns.set()

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = int(np.sqrt(n_data))

# plot
_ = plt.hist(versicolor_petal_length, bins = n_bins) # a numpy array

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()

Bee Swarm Plot

Key Takeaways:

  • Organize your data into pandas dataframe before drawing the plot.
  • You need to specify x and y, and also the dataframe you are using when drawing the plot.
  • A good way to distingush the distribution of data from different categories.
  • Limitation: When there are way too many data points, it tend to obsculate the data
    • We can solve this by Empirical Cumulative Distribution Function (ECDF)
1
2
3
4
5
6
7
8
9
# Create bee swarm plot with Seaborn's default settings
_ = sns.swarmplot(x = 'species', y = 'petal length (cm)', data = df)

# Label the axes
_ = plt.xlabel('species')
_ = plt.ylabel('petal length (cm)')

# Show the plot
plt.show()

Empirical Cumulative Distribution Function (ECDF)

Basics:

  • X-value is the quantity you are measuring.
    • The x-values are the sorted data. Use the np.sort() function to perform the sorting.
  • Y-value is the fraction of data points that have a value smaller than the corresponding x-value.
    • The y data of the ECDF go from 1/n to 1 in equally spaced increments. You can construct this using np.arange(). Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1. Be sure to divide this by n.
  • It can not only show all the data, but also give you a compleyte picture of how data are distributed.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# Define the ECDF function to be reused in the future
def ecdf(data):
"""Compute ECDF for a one-dimensional array of measurements."""
# Number of data points: n
n = len(data)

# x-data for the ECDF: x
x = np.sort(data)

# y-data for the ECDF: y
y = np.arange(1, n + 1) / n

return x, y

# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot, only the dots in line shape but without the line
_ = plt.plot(x_set, y_set, marker = '.', linestyle = 'none')
_ = plt.plot(x_vers, y_vers, marker = '.', linestyle = 'none')
_ = plt.plot(x_virg, y_virg, marker = '.', linestyle = 'none')

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

Scatter Plot

When you want to compare two variables, we can use scatter plot.

1
2
3
4
5
6
7
8
9
# Make a scatter plot
_ = plt.plot(versicolor_petal_length, versicolor_petal_width, marker = '.', linestyle = 'none')

# Label the axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('petal width (cm)')

# Show the result
plt.show()

Quantitative EDA

Mean & Median

  • Mean:
    • Easy to compute
    • Heavily influenced by outliers
    • np.mean()
  • Median:
    • The middle value of a data set
    • Immune to outliers
    • np.median()

Percentiles

  • Percentiles
    • np.percentile(, [25, 50. 75]) we need to enter the percentiles that we want into this function
    • Boxplot is helpful in visualizing it.
      • Middle line is the median (50th percentile), and the edge of the box is the 25th and 75th percentile. The length of the box is IQR (Inter Quartile Range).
      • The streched out line is either located at 1.5 IQR or the extent of the data.
      • The data points that fall outside the two lines are outliers, since it is common practice to regard all the data points that fall 2 IQR away from median as outliers.
1
2
3
4
5
6
7
8
9
# Create box plot with Seaborn's default settings
_ = sns.boxplot(x = 'species', y = 'petal length (cm)', data = df)

# Label the axes
_ = plt.xlabel('species')
_ = plt.ylabel('petal length (cm)')

# Show the plot
plt.show()

You can also plot the percentiles directly onto ECDF with red markers.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)

# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Overlay percentiles as red diamonds.
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
linestyle='none')

Variance and Standard Deviation

  • Variance
    • The mean squared distance of the data from their mean, or informally, a measure of the spread of data.

Calculation of variance:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)

# Square the differences: diff_sq
diff_sq = differences**2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)

# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)

# Print the results (they are the same)
print(variance_explicit, variance_np)

  • Standard Deviation
    • The square root of variance

Calculation of Standard Deviation:

1
2
3
4
5
6
7
8
# Compute the variance: variance
variance = np.var(versicolor_petal_length)

# Print the square root of the variance
print(np.sqrt(variance))

# Print the standard deviation
print(np.std(versicolor_petal_length))

Covariance and Pearson Correlation Coefficient

  • Covariance

    • A measure of how two quantities vary together
    • $covariance = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \overline{x})(y_i - \overline{y})$
    • For most of the data points, if their differences to the two quantities’ means are all in the same direction (either all greater than the mean or all smaller than the mean) , then the covariance is postive, and thus the two quantities are positively or negatively correlated.
    • To avoid the measure having any units, we use the Person Correlation Coefficient
    • Covariance Matrix:

      • The covariance may be computed using the Numpy function np.cov(). For example, we have two sets of data x and y, np.cov(x, y) returns a 2D array where entries [0,1] and [1,0] are the covariances. Entry [0,0] is the variance of the data in x, and entry [1,1] is the variance of the data in y. This 2D output array is called the covariance matrix, since it organizes the self- and covariance.

        | | 0 | 1 |
        |— |—|—|
        |0 |variance of x |covariance |
        |1 |covariance |variance of y|

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      # Compute the covariance matrix: covariance_matrix
      covariance_matrix = np.cov(versicolor_petal_length, versicolor_petal_width)

      # Print covariance matrix
      print(covariance_matrix)

      # Extract covariance of length and width of petals: petal_cov
      petal_cov = covariance_matrix[0,1] #[1,0] is the same

      # Print the length/width covariance
      print(petal_cov)
  • Person Correlation Coefficient

    • $\rho = \text{Person Correlation Coefficient} = \frac{\text{covariance}}{(\text{std of x})(\text{std of y})} = \frac{\text{variability due to codependence}}{\text{independent variability}}$
    • Range from -1 (complete negative correlation) to 1 (complete positive correlation). 0 means there is no correlation at all between these two variables.
    • It is computed using the np.corrcoef() function. Like np.cov(), it takes two arrays as arguments and returns a 2D array. Entries [0,0] and [1,1] are necessarily equal to 1, and the value we are after is entry [0,1].
    • array([[1. , 0.78666809], [0.78666809, 1. ]])
1
2
3
4
5
6
7
8
9
10
11
12
13
def pearson_r(x, y):
"""Compute Pearson correlation coefficient between two arrays."""
# Compute correlation matrix: corr_mat
corr_mat = np.corrcoef(x, y)

# Return entry [0,1]
return corr_mat[0,1]

# Compute Pearson correlation coefficient for I. versicolor: r
r = pearson_r(versicolor_petal_length, versicolor_petal_width)

# Print the result
print(r)

Probablistic Logic

Probability provides a measure of uncertainty. Data are almost never exactly the same when acquired again, and probability allows us to say how much we expect them to vary.

Statistical inference involves taking your data to probabilistic conclusions about what you would expect if you took even more data, and you can make decisions based on these conclusions.

Discrete Variables

Hacker Statistics
Hacker statistics uses simulated repeated measurements to compute probabilities. The logic is as simple as follows:

  1. Determine how to simulate data
  2. Simulate many many times
  3. Probability is approximately fraction of trails with the outcome of interest.

Bernoulli Trial

Bernoulli trial is an experiment that has two options, “success” (True) and “failure” (False).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Define a Bernoulli Trial function
def perform_bernoulli_trials(n, p):
"""Perform n Bernoulli trials with success probability p
and return number of successes."""
# Initialize number of successes: n_success
n_success = 0

# Perform trials
for i in range(n):
# Choose random number between zero and one: random_number
random_number = np.random.random()

# If less than p, it's a success so add one to n_success
if random_number < p:
n_success += 1

return n_success

EXAMPLE
Follwing is an example using bernoulli trials:
Let’s say a bank made 100 mortgage loans. It is possible that anywhere between 0 and 100 of the loans will be defaulted upon. You would like to know the probability of getting a given number of defaults, given that the probability of a default is p = 0.05. To investigate this, you will do a simulation. You will perform 100 Bernoulli trials using the perform_bernoulli_trials() function you wrote and record how many defaults we get. Here, a success is a default. (Remember that the word “success” just means that the Bernoulli trial evaluates to True, i.e., did the loan recipient default?) You will do this for another 100 Bernoulli trials. And again and again until we have tried it 1000 times. Then, you will plot a histogram describing the probability of the number of defaults.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Seed random number generator
np.random.seed(42)

# Initialize the number of defaults: n_defaults
n_defaults = np.empty(1000)

# Compute the number of defaults
for i in range(1000):
n_defaults[i] = perform_bernoulli_trials(100, 0.05)


# Plot the histogram with default number of bins; label your axes
_ = plt.hist(n_defaults, normed = True)
# Include the normed = True keyword argument
# so that the height of the bars of the histogram indicate the probability.

_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('probability')

# Show the plot
plt.show()

The above code performs the same way as np.random.binomial(n, p, size = x), just broke it down into more detailde steps. x is the number of for loops.

If interest rates are such that the bank will lose money if 10 or more of its loans are defaulted upon, what is the probability that the bank will lose money?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Compute ECDF: x, y
x, y = ecdf(n_defaults)

# Plot the ECDF with labeled axes
_ = plt.plot(x, y, marker = '.', linestyle = 'none')
_ = plt.xlabel('number of defaults out of 100')
_ = plt.ylabel('CDF')

# Show the plot
plt.show()

# Compute the number of 100-loan simulations with 10 or more defaults: n_lose_money
n_lose_money = np.sum(n_defaults >= 10)

# Compute and print probability of losing money
print('Probability of losing money =', n_lose_money / len(n_defaults))

Binomial Distribution

Probability Mass Function (PMF) is the set of probabilities of discrete outcomes.

When you are rolling a dice, the outcome is discrete, with the same probability, and therefore it has a discrete uniform PMF.

Probability Distribution is a mathematical description of outcomes.

Binomial Distribution: the number $r$ of sucesses in $n$ Bernoulli trials with probability $p$ of success, is Binomially distributed. np.random.binomial(n, p, size = x)

Poisson Distribution

Poisson process: the timing of the next event is completely independent of when the previous event happened.

  • Examples:
    • Natural births in a given hospital
    • Hit on a website during a given hour
    • Metror strikes
    • Molecular collisions in a gas
    • Aviation incidents
    • Buses in Poissonville

The number of arrivals of a Poisson process in a given amount of time is Poisson distributed.

Poisson distribution:

  • The number $r$ of arrivals of a Poisson process in a given time interval with average rate of $\lambda$ arrivals per interval is Poisson distributed.
    • One parameter: $\lambda$, the average times of arrival in a given time.
    • Example:
      • The number $r$ of hits on a website in one hour with an average hit rate of 6 hits per hour is Poisson distributed. np.random.poisson(6, size = 100000)
  • It is the limit of the Binomial distribution for low probability of success and large number of trials. (rare events)
    • The logic behind: This makes sense if you think about the stories. Say we do a Bernoulli trial every minute for an hour, each with a success probability of 0.1. We would do 60 trials, and the number of successes is Binomially distributed, and we would expect to get about 6 successes. This is just like the Poisson story we discussed in the video, where we get on average 6 hits on a website per hour. So, the Poisson distribution with arrival rate equal to $np$ approximates a Binomial distribution for $n$ Bernoulli trials with probability $p$ of success (with $n$ large and $p$ small). Importantly, the Poisson distribution is often simpler to work with because it has only one parameter instead of two for the Binomial distribution.

The code below emprically proved that Poisson distribution is the limit of Binomial distribution for low probability of success and large number of trials.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Draw 10,000 samples out of Poisson distribution: samples_poisson
samples_poisson = np.random.poisson(10, size = 10000)

# Print the mean and standard deviation
print('Poisson: ', np.mean(samples_poisson),
np.std(samples_poisson))

# Specify values of n and p to consider for Binomial: n, p
n = [20, 100, 1000]
p = [0.5, 0.1, 0.01] # so np is always 10

# Draw 10,000 samples for each n,p pair: samples_binomial
for i in range(3):
samples_binomial = np.random.binomial(n[i], p[i], size = 10000)

# Print results
print('n =', n[i], 'Binom:', np.mean(samples_binomial),
np.std(samples_binomial))

Continuous Variables

Probability Density Function (PDF)

  • Continuous analog to the Probability Mass Function (PMF).
  • Mathematical description of the relative likelihood of observing a value of continuous variable.
  • Can be visualized by setting normed = True when ploting histogram, since the total area under the curve should be 1.
  • The probability of observing a single value dos not make sense, since there is an infinity of numbers. Therefore, the area under the PDF give probabilities. For example, the probability of observing a value greater than x value is $p$, which is $p*100$% of the total area under the PDF.
  • Therefore, we are actually calculating the cumulative distribution function, or CDF, of the distribution curve.

Normal (Gaussian) Distribution

Normal Distribution describes a continuous variable whose PDF has a single symmetric peak.

  • Two true parameters of the distribution: (Make sure not to confuse with the mean and std calculated from the data)
    • Mean, determines where the center of the peak is.
    • Standard deviation, a measure of how wide the oeaj is, or how spread out the data are.

We can check whether a dataset is similar to a normal distribution by comparing its empirical CDF with the theoretical normal CDF (generated by np.random.normal() using directly the data’s mean and std).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)

# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu, sigma, size = 10000)

# Get the CDF of the samples and of the data
x, y = ecdf(belmont_no_outliers)
x_theor, y_theor = ecdf(samples)


# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()

The code below shows the normal curves of distributions with the same mean but different std. The higher the std, the flatter and shorter of the curve.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = np.random.normal(20, 1, size = 100000)
samples_std3 = np.random.normal(20, 3, size = 100000)
samples_std10 = np.random.normal(20, 10, size = 100000)

# Make histograms
_ = plt.hist(samples_std1, bins = 100, normed = True, histtype = 'step')
_ = plt.hist(samples_std3, bins = 100, normed = True, histtype = 'step')
_ = plt.hist(samples_std10, bins = 100, normed = True, histtype = 'step')

# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()

# Generate CDFs
x_std1, y_std1 = ecdf(samples_std1)
x_std3, y_std3 = ecdf(samples_std3)
x_std10, y_std10 = ecdf(samples_std10)

# Plot CDFs
_ = plt.plot(x_std1, y_std1, marker = '.', linestyle = 'none')
_ = plt.plot(x_std3, y_std3, marker = '.', linestyle = 'none')
_ = plt.plot(x_std10, y_std10, marker = '.', linestyle = 'none')

# Make a legend and show the plot
_ = plt.margins(0.02) # so that there can be a margin near the plot frame
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right')
plt.show()

  • Key Things to Keep in Mind
    • Do not easily assume normality! Use histogram and CDF to help you make a better judgement.
    • Normal distribution has very light tail, which means the probability of heing >4std away from the mean is very small. Therefore, when you are modeling data as normally distributed, outliers are extremely unlikely. But real life data often have extreme values.

Exponential Distribution

The Exponential Distribution: the waiting time between arrivals of a Poisson process is exponentially distributed. In other words, it describes the wating time between rare events.

  • One parameter: the average waiting time.

EXAMPLE

In earlier exercises, we looked at the rare event of no-hitters in Major League Baseball. Hitting the cycle is another rare baseball event. When a batter hits the cycle, he gets all four kinds of hits, a single, double, triple, and home run, in a single game. Like no-hitters, this can be modeled as a Poisson process, so the time between hits of the cycle are also Exponentially distributed.

How long must we wait to see both a no-hitter and then a batter hit the cycle? The idea is that we have to wait some time for the no-hitter, and then after the no-hitter, we have to wait for hitting the cycle. Stated another way, what is the total waiting time for the arrival of two different Poisson processes? The total waiting time is the time waited for the no-hitter, plus the time waited for the hitting the cycle.

Now, you will write a function to sample out of the distribution described by this story.

1
2
3
4
5
6
7
8
9
def successive_poisson(tau1, tau2, size=1):
"""Compute time for arrival of 2 successive Poisson processes."""
# Draw samples out of first exponential distribution: t1
t1 = np.random.exponential(tau1, size)

# Draw samples out of second exponential distribution: t2
t2 = np.random.exponential(tau2, size)

return t1 + t2

Now, you’ll use your sampling function to compute the waiting time to observe a no-hitter and hitting of the cycle. The mean waiting time for a no-hitter is 764 games, and the mean waiting time for hitting the cycle is 715 games.

1
2
3
4
5
6
7
8
9
10
11
12
# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, size = 100000)

# Make the histogram
_ = plt.hist(waiting_times, bins = 100, normed = True, histtype = 'step')

# Label axes
_ = plt.xlabel('total waiting time (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Parameters

Optimal parameters are values that bring the model in closest agreement with the data.
They are only optimal for the model you chose for your data. And thus if your model is wrong, the optimal parameters are not really meaningful.

Generate theoretical data follows exponential distribution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Seed random number generator
np.random.seed(42)

# Compute mean no-hitter time: tau
tau = np.mean(nohitter_times)

# Draw out of an exponential distribution with parameter tau: inter_nohitter_time
inter_nohitter_time = np.random.exponential(tau, 100000)

# Plot the PDF and label axes
_ = plt.hist(inter_nohitter_time,
bins = 50, histtype = 'step', normed = True)
_ = plt.xlabel('Games between no-hitters')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

Overlay theoretical CDF with empirical CDF

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Create an ECDF from real data: x, y
x, y = ecdf(nohitter_times)

# Create a CDF from theoretical samples: x_theor, y_theor
x_theor, y_theor = ecdf(inter_nohitter_time)

# Overlay the plots
plt.plot(x_theor, y_theor)
plt.plot(x, y, marker='.', linestyle='none')

# Margins and axis labels
plt.margins(0.02) # so that there can be a margin near the plot frame
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')

# Show the plot
plt.show()

Linear Regression by Least Sqaures

Often, we can use a linear function to describe the pattern of the data.

  • Two parameters:
    • Slope: how steep the line is
    • Intercept: where the line crosses the y-axis
      We want to choose the slope and intercept that data points collectively lie as close as possible to the line

Residual is the vertical distance (not the perpendicular) distance between the data point and the line. It is negative when the point lies below the line. Each data point has a residual associated with it.

Least squares is the process of finding the parameters for which the sum of the squares of the residuals is minimal. There are many algorithms that can be used to perform this process. Here, we will use the Numpy function np.polyfit(x, y, polynomial_degree), which performs least squares analysis with polynomial functions. Simply set the polynomial degree to 1.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility, 1)

# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')

# Make theoretical line to plot
x = np.array([0, 100])
y = a * x + b

# Add regression line to your plot
_ = plt.plot(x, y)

# Draw the plot
plt.show()

The details behind np.polyfit() function is as follows: it tries to find the minimal residual by testing various combinations.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)

# Initialize sum of square of residuals: rss
rss = np.empty_like() #returns a new array with the same shape and type as a given array (in this case, a_vals).

# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals): # enumerate will return (0, a), (1, b), which means add index to value, and wrap them in tuple
rss[i] = np.sum((fertility - a * illiteracy - b)**2)

# Plot the RSS
plt.plot(a_val, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')

plt.show()

Bootstrap

Bootstrapping is the use of resampled data to perform statistical inference. It resamples data with replacement to the same size as the original data.

  • Bootstrap Sample: A resampled array pf data
  • Bootstrap Replicate: A statistic computed from a bootstrap sample.
  • We can use np.random.choice([list], size = len([list])) to get a bootstrap sample, and then compute a bootstrap replicate.

Bootstrap Confidence Interval

Confidence Interval: If we repeat the measurement over and over again, $p$% of the observed values should lie within the $p$% confidence interval.

  • 95% CI: np.percentiles(, [2.5, 97.5])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def bootstrap_replicate_1d(data, func):
return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
"""Draw bootstrap replicates."""

# Initialize array of replicates: bs_replicates
bs_replicates = np.empty(size)

# Generate replicates
for i in range(size):
bs_replicates[i] = bootstrap_replicate_1d(data, func)

return bs_replicates

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, size = 10000)

# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)

# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std) # proved to be the same as the original std

# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, normed=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show() # showed the distribution of the bootstrap replicates of the mean is Normal

Pairs Bootstrap

Non-parametric Inference: Make no assumptions about the model or probability distribution underlying the data.

But for fitting a linear regression model to existing data, we already have two parameters. Thus, here we want to perform a parametric estimate by using pairs bootstrap.

  • Resample data in pairs. (Since there are two variables).
  • Compute slope and interceptfrom resampled data.
  • Each slope and intercept is a bootstrap replicate.
  • Compute confidence intervals from percentiles of bootstrap replicates.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def draw_bs_pairs_linreg(x, y, size=1):
"""Perform pairs bootstrap for linear regression."""

# Set up array of indices to sample from: inds
inds = np.arange(len(x))

# Initialize replicates: bs_slope_reps, bs_intercept_reps
bs_slope_reps = np.empty(size)
bs_intercept_reps = np.empty(size)

# Generate replicates
for i in range(size):
bs_inds = np.random.choice(inds, size = len(x))
bs_x, bs_y = x[bs_inds], y[bs_inds]
bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

return bs_slope_reps, bs_intercept_reps

Using the function you just wrote, perform pairs bootstrap to plot a histogram describing the estimate of the slope from the illiteracy/fertility data. Also report the 95% confidence interval of the slope. The data is available to you in the NumPy arrays illiteracy and fertility.

As a reminder, draw_bs_pairs_linreg() has a function signature of draw_bs_pairs_linreg(x, y, size=1), and it returns two values: bs_slope_reps and bs_intercept_reps.

1
2
3
4
5
6
7
8
9
10
11
# Generate replicates of slope and intercept using pairs bootstrap
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy, fertility, size = 1000)

# Compute and print 95% CI for slope
print(np.percentile(bs_slope_reps, [2.5, 97.5]))

# Plot the histogram
_ = plt.hist(bs_slope_reps, bins=50, normed=True)
_ = plt.xlabel('slope')
_ = plt.ylabel('PDF')
plt.show()

A nice way to visualize the variability we might expect in a linear regression is to plot the line you would get from each bootstrap replicate of the slope and intercept.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Generate array of x-values for bootstrap lines: x
x = np.array([0, 100])

# Plot the bootstrap lines
for i in range(100):
_ = plt.plot(x,
bs_slope_reps[i] * x + bs_intercept_reps[i],
linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker = '.', linestyle = 'none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()

Hypothesis Testing

How to assess how reasonable it is that our observed data are actually described by the model? This is when hypothesis testing come into play.

Hypothesis Testing: Assessment of how reasonable the observed data are assuming the hypothesis is true. The hypothesis we are testing are called Null Hypothesis.

To simulate a null hypothesis, where we assume two quantities are identically distributed, we use Permutation to randomly reorder entries in an array.

  • Put two sets of data, A (a number of data points)and B (b number of data points), together, and randomly shuffle them, ignoring their origins.
  • Then label the first a shuffled data points as A, the rest as B.
  • Then we draw the two ecdf, and repeat the above process again and again to generate multiple permutation samples.
  • We can compare the new ecdf with the original one to see whether the two overlaps. If not, then the two distributions are not identical.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# Write a function to resample the data with permutation
def permutation_sample(data1, data2):
"""Generate a permutation sample from two data sets."""

# Concatenate the data sets: data
data = np.concatenate((data1, data2)) #needs to be put in a tuple

# Permute the concatenated array: permuted_data
permuted_data = np.random.permutation(data)

# Split the permuted array into two: perm_sample_1, perm_sample_2
perm_sample_1 = permuted_data[:len(data1)]
perm_sample_2 = permuted_data[len(data1):]

return perm_sample_1, perm_sample_2

# Visualize Permutated sample distribution
for _ in range(50):
# Generate permutation samples
perm_sample_1, perm_sample_2 = permutation_sample(rain_june, rain_november)

# Compute ECDFs
x_1, y_1 = ecdf(perm_sample_1)
x_2, y_2 = ecdf(perm_sample_2)

# Plot ECDFs of permutation sample
_ = plt.plot(x_1, y_1, marker='.', linestyle='none',
color='red', alpha=0.02)
_ = plt.plot(x_2, y_2, marker='.', linestyle='none',
color='blue', alpha=0.02)

# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_june)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')

# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()

Notice that the permutation samples ECDFs overlap and give a purple haze. None of the ECDFs from the permutation samples overlap with the observed data, suggesting that the hypothesis is not commensurate with the data. July and November rainfall are not identically distributed.

Test Statistic

Test Statistic:

  • is a single number that can be computed from observed data and from data you simulate under the null hypothesis.
  • It serves as a basis of comparison between what the hypotheiss predicts and what we actually observed.
  • You should choose a test statistic that is pertinent to the question you are trying to answer. e.g. Whether the two states’ vote distribution are identical? If the are identical, their mean should be the same. Therefore, we can choose the difference in means as our test statsitic.
  • If combined with permutation techniques, the calculated statistics can be referred to as permutation replicates.

If we generated enough permutation sample, we can draw a histogram of the permutaion replicates. Here we need to introduce the idea of p-value.

p-value:

  • is the probability of observing a test statistic equally or more extreme than the one you observed, given that the null hypothesis is true.
  • It is NOT the probability that the null hypothesis is true.
  • If it is small, then we say that the data are statistically significant to prove that the null hypothesis is not true.
  • Also, statistical significance $\neq$ practical significance.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
def draw_perm_reps(data_1, data_2, func, size=1):
"""Generate multiple permutation replicates."""

# Initialize array of replicates: perm_replicates
perm_replicates = np.empty(size)

for i in range(size):
# Generate permutation sample
perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)

# Compute the test statistic
perm_replicates[i] = func(perm_sample_1, perm_sample_2)

return perm_replicates

def diff_of_means(data_1, data_2):
"""Difference in means of two arrays."""

# The difference of means of data_1, data_2: diff
diff = np.mean(data_1) - np.mean(data_2)

return diff

# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)

# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
diff_of_means, size=10000)

# Compute p-value: p
p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates)

# Print the result
print('p-value =', p)

The p-value tells you that there is about a 0.6% chance that you would get the difference of means observed in the experiment if frogs were exactly the same. A p-value below 0.01 is typically said to be “statistically significant,” but: warning! warning! warning! You have computed a p-value; it is a number. I encourage you not to distill it to a yes-or-no phrase. p = 0.006 and p = 0.000000006 are both said to be “statistically significant,” but they are definitely not the same!

Summary of hypothesis testing pipeline

  • Clearly state the null hypothesis (do EDA BEFORE forming Null hypothesis!)
  • Define your test statistic
  • Generate many sets of simulated data assuming the null hypothesis is true
  • Compute the test statistic for each simulated data set
  • The p-value is the fraction of your simulated data sets for which the test statistic is at leas as extreme as for the real data

EXAMPLE: A one-sample bootstrap hypothesis test
Another juvenile frog was studied, Frog C, and you want to see if Frog B and Frog C have similar impact forces. Unfortunately, you do not have Frog C’s impact forces available, but you know they have a mean of 0.55 N. Because you don’t have the original data, you cannot do a permutation test, and you cannot assess the hypothesis that the forces from Frog B and Frog C come from the same distribution. You will therefore test another, less restrictive hypothesis: The mean strike force of Frog B is equal to that of Frog C.

To set up the bootstrap hypothesis test, you will take the mean as our test statistic. Remember, your goal is to calculate the probability of getting a mean impact force less than or equal to what was observed for Frog B if the hypothesis that the true mean of Frog B’s impact forces is equal to that of Frog C is true. You first translate all of the data of Frog B such that the mean is 0.55 N. This involves adding the mean force of Frog C and subtracting the mean force of Frog B from each measurement of Frog B. This leaves other properties of Frog B’s distribution, such as the variance, unchanged.

1
2
3
4
5
6
7
8
9
10
11
12
# Make an array of translated impact forces: translated_force_b
force_c_mean = 0.55
translated_force_b = force_b - np.mean(force_b) + force_c_mean

# Take bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)

# Compute fraction of replicates that are less than the observed Frog B force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / 10000

# Print the p-value
print('p = ', p)

The low p-value suggests that the null hypothesis that Frog B and Frog C have the same mean impact force is false.

EXAMPLE: A two-sample bootstrap hypothesis test for difference of means
We now want to test the hypothesis that Frog A and Frog B have the same mean impact force, but not necessarily the same distribution, which is also impossible with a permutation test.

To do the two-sample bootstrap test, we shift both arrays to have the same mean, since we are simulating the hypothesis that their means are, in fact, equal. We then draw bootstrap samples out of the shifted arrays and compute the difference in means. This constitutes a bootstrap replicate, and we generate many of them. The p-value is the fraction of replicates with a difference in means greater than or equal to what was observed.

The objects forces_concat and empirical_diff_means are already in your namespace.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Compute mean of all forces: mean_force
mean_force = np.mean(forces_concat)

# Generate shifted arrays
force_a_shifted = force_a - np.mean(force_a) + mean_force
force_b_shifted = force_b - np.mean(force_b) + mean_force

# Compute 10,000 bootstrap replicates from shifted arrays
bs_replicates_a = draw_bs_reps(force_a_shifted, np.mean, 10000)
bs_replicates_b = draw_bs_reps(force_b_shifted, np.mean, 10000)

# Get replicates of difference of means: bs_replicates
bs_replicates = bs_replicates_a - bs_replicates_b

# Compute and print p-value: p
p = np.sum(bs_replicates >= (np.mean(force_a) - np.mean(force_b))) / len(bs_replicates)
print('p-value =', p) #0.0043

You got a similar result as when you did the permutation test. Nonetheless, remember that it is important to carefully think about what question you want to ask. Are you only interested in the mean impact force, or in the distribution of impact forces?

A/B Testing

Often, we want to know whether the difference of the CTR of the two web designs are entirely due to random chance, or statistically significant. Therefore, we want to check what is the probability of observing a difference equally or more extreme than the one you observed, given that we assume the difference is entirely due to chance, i.e, the p-value.

We can use permutation here to simulate as if the redesign had no effect on the CTR.

A/B Testing

  • is often used by organizations to see if a change in strategy gives difference, hopefully better, results.
  • The null hypothesis of an A/B test is often: the test statistic is impervious to the change. A low p-value implies that the change in strategy lead to a change in performance.

EXAMPLE: The vote for the Civil Rights Act in 1964

The Civil Rights Act of 1964 was one of the most important pieces of legislation ever passed in the USA. Excluding “present” and “abstain” votes, 153 House Democrats and 136 Republicans voted yea. However, 91 Democrats and 35 Republicans voted nay. Did party affiliation make a difference in the vote?

To answer this question, you will evaluate the hypothesis that the party of a House member has no bearing on his or her vote. You will use the fraction of Democrats voting in favor as your test statistic and evaluate the probability of observing a fraction of Democrats voting in favor at least as small as the observed fraction of 153/244. (That’s right, at least as small as. In 1964, it was the Democrats who were less progressive on civil rights issues.) To do this, permute the party labels of the House voters and then arbitrarily divide them into “Democrats” and “Republicans” and compute the fraction of Democrats voting yea.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)

def frac_yea_dems(dems, reps): #Two inputs are required to use the below draw_perm_reps() function, but the second is not used.
"""Compute fraction of Democrat yea votes."""
frac = np.sum(dems) / len(dems)
return frac

# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(dems, reps, frac_yea_dems, 10000)

# Compute and print p-value: p
p = np.sum(perm_replicates <= 153/244) / len(perm_replicates)
print('p-value =', p) #p-value = 0.0002

This small p-value suggests that party identity had a lot to do with the voting. Importantly, the South had a higher fraction of Democrat representatives, and consequently also a more racist bias.

Test of Correlation

How can we know whether the correlation is true, or only happens by chance? We can do a hypothesis test.

  • Null hypothesis: the two variables are completely uncorrelated.
  • Simulate data assuming null hypothesis is true. (shuffle the x and y pairs entirely to cut any possible inferred correlation)
  • Use Pearson correlation, $\rho$, as test statsitic.
  • Compute p-value as fraction of replicates that have $\rho$ at least as large as observed.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Compute observed correlation: r_obs (using pearson_r() that was wrote before)
r_obs = pearson_r(illiteracy, fertility)

# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10000)

# Draw replicates
for i in range(10000):
# Permute illiteracy measurments: illiteracy_permuted
illiteracy_permuted = np.random.permutation(illiteracy)

# Compute Pearson correlation
perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)

# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p) #p-val = 0.0

If there are no replicates that is at least as large as observed, it does not mean that the p-value is zero, but means that it needs an enormous amount of replicates to get just one that is extreme enough as specified. Therefore, the p-value is extremely small.

Case Study

EDA

EDA of beak depth
For your first foray into the Darwin finch data, you will study how the beak depth (the distance, top to bottom, of a closed beak) of the finch species Geospiza scandens has changed over time. The Grants have noticed some changes of beak geometry depending on the types of seeds available on the island, and they also noticed that there was some interbreeding with another major species on Daphne Major, Geospiza fortis. These effects can lead to changes in the species over time.

In the next few problems, you will look at the beak depth of G. scandens on Daphne Major in 1975 and in 2012. To start with, let’s plot all of the beak depth measurements in 1975 and 2012 in a bee swarm plot.

The data are stored in a pandas DataFrame called df with columns 'year' and 'beak_depth'. The units of beak depth are millimeters (mm).

1
2
3
4
5
6
7
8
9
# Create bee swarm plot
_ = sns.swarmplot(x = 'year', y = 'beak_depth', data = df)

# Label the axes
_ = plt.xlabel('year')
_ = plt.ylabel('beak depth (mm)')

# Show the plot
plt.show()

ECDFs of beak depths
While bee swarm plots are useful, we found that ECDFs are often even better when doing EDA. Plot the ECDFs for the 1975 and 2012 beak depth measurements on the same plot.

For your convenience, the beak depths for the respective years has been stored in the NumPy arrays bd_1975 and bd_2012.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Compute ECDFs
x_1975, y_1975 = ecdf(bd_1975)
x_2012, y_2012 = ecdf(bd_2012)

# Plot the ECDFs
_ = plt.plot(x_1975, y_1975, marker='.', linestyle='none')
_ = plt.plot(x_2012, y_2012, marker='.', linestyle='none')

# Set margins
plt.margins(0.02)

# Add axis labels and legend
_ = plt.xlabel('beak depth (mm)')
_ = plt.ylabel('ECDF')
_ = plt.legend(('1975', '2012'), loc='lower right')

# Show the plot
plt.show()

Parameter estimates of beak depths
Estimate the difference of the mean beak depth of the G. scandens samples from 1975 and 2012 and report a 95% confidence interval.

Since in this exercise you will use the draw_bs_reps() function you wrote in chapter 2, it may be helpful to refer back to it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Compute the difference of the sample means: mean_diff
mean_diff = np.mean(bd_2012) - np.mean(bd_1975)

# Get bootstrap replicates of means
bs_replicates_1975 = draw_bs_reps(bd_1975, np.mean, 10000)
bs_replicates_2012 = draw_bs_reps(bd_2012, np.mean, 10000)

# Compute samples of difference of means: bs_diff_replicates
bs_diff_replicates = bs_replicates_2012 - bs_replicates_1975

# Compute 95% confidence interval: conf_int
conf_int = np.percentile(bs_diff_replicates, [2.5, 97.5])

# Print the results
print('difference of means =', mean_diff, 'mm')
print('95% confidence interval =', conf_int, 'mm')

Hypothesis test: Are beaks deeper in 2012?
Your plot of the ECDF and determination of the confidence interval make it pretty clear that the beaks of G. scandens on Daphne Major have gotten deeper. But is it possible that this effect is just due to random chance? In other words, what is the probability that we would get the observed difference in mean beak depth if the means were the same?

Be careful! The hypothesis we are testing is not that the beak depths come from the same distribution. For that we could use a permutation test. The hypothesis is that the means are equal. To perform this hypothesis test, we need to shift the two data sets so that they have the same mean and then use bootstrap sampling to compute the difference of means.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# Compute mean of combined data set: combined_mean
combined_mean = np.mean(np.concatenate((bd_1975, bd_2012)))

# Shift the samples
bd_1975_shifted = bd_1975 - np.mean(bd_1975) + combined_mean
bd_2012_shifted = bd_2012 - np.mean(bd_2012) + combined_mean

# Get bootstrap replicates of shifted data sets
bs_replicates_1975 = draw_bs_reps(bd_1975_shifted, np.mean, 10000)
bs_replicates_2012 = draw_bs_reps(bd_2012_shifted, np.mean, 10000)

# Compute replicates of difference of means: bs_diff_replicates
bs_diff_replicates = bs_replicates_2012 - bs_replicates_1975

# Compute the p-value
p = np.sum(bs_diff_replicates >= np.mean(bd_2012) - np.mean(bd_1975)) / len(bs_diff_replicates)

# Print p-value
print('p =', p)

EDA of beak length and depth
The beak length data are stored as bl_1975 and bl_2012, again with units of millimeters (mm). You still have the beak depth data stored in bd_1975 and bd_2012. Make scatter plots of beak depth (y-axis) versus beak length (x-axis) for the 1975 and 2012 specimens.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Make scatter plot of 1975 data
_ = plt.plot(bl_1975, bd_1975, marker='.',
linestyle='None', color = 'blue', alpha = 0.5)

# Make scatter plot of 2012 data
_ = plt.plot(bl_2012, bd_2012, marker='.',
linestyle='None', color = 'red', alpha = 0.5)

# Label axes and make legend
_ = plt.xlabel('beak length (mm)')
_ = plt.ylabel('beak depth (mm)')
_ = plt.legend(('1975', '2012'), loc='upper left')

# Show the plot
plt.show()

Linear regressions

Perform a linear regression for both the 1975 and 2012 data. Then, perform pairs bootstrap estimates for the regression parameters. Report 95% confidence intervals on the slope and intercept of the regression line.

You will use the draw_bs_pairs_linreg() function you wrote back in chapter 2.

As a reminder, its call signature is draw_bs_pairs_linreg(x, y, size=1), and it returns bs_slope_reps and bs_intercept_reps. The beak length data are stored as bl_1975 and bl_2012, and the beak depth data is stored in bd_1975 and bd_2012.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
# Compute the linear regressions
slope_1975, intercept_1975 = np.polyfit(bl_1975, bd_1975, 1)
slope_2012, intercept_2012 = np.polyfit(bl_2012, bd_2012, 1)

# Perform pairs bootstrap for the linear regressions
bs_slope_reps_1975, bs_intercept_reps_1975 = \
draw_bs_pairs_linreg(bl_1975, bd_1975, 1000)
bs_slope_reps_2012, bs_intercept_reps_2012 = \
draw_bs_pairs_linreg(bl_2012, bd_2012, 1000)

# Compute confidence intervals of slopes
slope_conf_int_1975 = np.percentile(bs_slope_reps_1975, [2.5, 97.5])
slope_conf_int_2012 = np.percentile(bs_slope_reps_2012, [2.5, 97.5])
intercept_conf_int_1975 = np.percentile(bs_intercept_reps_1975, [2.5, 97.5])
intercept_conf_int_2012 = np.percentile(bs_intercept_reps_2012, [2.5, 97.5])

# Print the results
print('1975: slope =', slope_1975,
'conf int =', slope_conf_int_1975)
print('1975: intercept =', intercept_1975,
'conf int =', intercept_conf_int_1975)
print('2012: slope =', slope_2012,
'conf int =', slope_conf_int_2012)
print('2012: intercept =', intercept_2012,
'conf int =', intercept_conf_int_2012)

# Make scatter plot of 1975 data
_ = plt.plot(bl_1975, bd_1975, marker='.',
linestyle='none', color='blue', alpha=0.5)

# Make scatter plot of 2012 data
_ = plt.plot(bl_2012, bd_2012, marker='.',
linestyle='none', color='red', alpha=0.5)

# Label axes and make legend
_ = plt.xlabel('beak length (mm)')
_ = plt.ylabel('beak depth (mm)')
_ = plt.legend(('1975', '2012'), loc='upper left')

# Generate x-values for bootstrap lines: x
x = np.array([10, 17])

# Plot the bootstrap lines
for i in range(100):
plt.plot(x, bs_slope_reps_1975[i] * x + bs_intercept_reps_1975[i],
linewidth=0.5, alpha=0.2, color='blue')
plt.plot(x, bs_slope_reps_2012[i] * x + bs_intercept_reps_2012[i],
linewidth=0.5, alpha=0.2, color='red')

# Draw the plot again
plt.show()

The linear regressions showed interesting information about the beak geometry. The slope was the same in 1975 and 2012, suggesting that for every millimeter gained in beak length, the birds gained about half a millimeter in depth in both years. However, if we are interested in the shape of the beak, we want to compare the ratio of beak length to beak depth. Let’s make that comparison.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Compute length-to-depth ratios
ratio_1975 = bl_1975 / bd_1975
ratio_2012 = bl_2012 / bd_2012

# Compute means
mean_ratio_1975 = np.mean(ratio_1975)
mean_ratio_2012 = np.mean(ratio_2012)

# Generate bootstrap replicates of the means
bs_replicates_1975 = draw_bs_reps(ratio_1975, np.mean, 10000)
bs_replicates_2012 = draw_bs_reps(ratio_2012, np.mean, 10000)

# Compute the 99% confidence intervals
conf_int_1975 = np.percentile(bs_replicates_1975, [0.5, 99.5])
conf_int_2012 = np.percentile(bs_replicates_2012, [0.5, 99.5])

# Print the results
print('1975: mean ratio =', mean_ratio_1975,
'conf int =', conf_int_1975)
print('2012: mean ratio =', mean_ratio_2012,
'conf int =', conf_int_2012)

To determine whether this is a real effect or just due to noise, we need to further compute the p-value.

EDA of heritability

The array bd_parent_scandens contains the average beak depth (in mm) of two parents of the species G. scandens. The array bd_offspring_scandens contains the average beak depth of the offspring of the respective parents. The arrays bd_parent_fortis and bd_offspring_fortis contain the same information about measurements from G. fortis birds.

Make a scatter plot of the average offspring beak depth (y-axis) versus average parental beak depth (x-axis) for both species. Use the alpha=0.5 keyword argument to help you see overlapping points.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Make scatter plots
_ = plt.plot(bd_parent_fortis, bd_offspring_fortis,
marker='.', linestyle='none', color='blue', alpha=0.5)
_ = plt.plot(bd_parent_scandens, bd_offspring_scandens,
marker='.', linestyle='none', color='red', alpha=0.5)

# Label axes
_ = plt.xlabel('parental beak depth (mm)')
_ = plt.ylabel('offspring beak depth (mm)')

# Add legend
_ = plt.legend(('G. fortis', 'G. scandens'), loc='lower right')

# Show plot
plt.show()

It appears as though there is a stronger correlation in G. fortis than in G. scandens. This suggests that beak depth is more strongly inherited in G. fortis. We’ll quantify this correlation next.

In an effort to quantify the correlation between offspring and parent beak depths, we would like to compute statistics, such as the Pearson correlation coefficient, between parents and offspring. To get confidence intervals on this, we need to do a pairs bootstrap.

You have already written a function to do pairs bootstrap to get estimates for parameters derived from linear regression. Your task in this exercise is to make a new function with call signature draw_bs_pairs(x, y, func, size=1) that performs pairs bootstrap and computes a single statistic, Pearson correlation on pairs samples defined.

Compute the Pearson correlation coefficient between parental and offspring beak depths for G. scandens. Do the same for G. fortis. Then, use the function you wrote in the last exercise to compute a 95% confidence interval using pairs bootstrap.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
def draw_bs_pairs(x, y, func, size=1):
"""Perform pairs bootstrap for a single statistic."""

# Set up array of indices to sample from: inds
inds = np.arange(len(x))

# Initialize replicates: bs_replicates
bs_replicates = np.empty(size)

# Generate replicates
for i in range(size):
bs_inds = np.random.choice(inds, size = len(inds))
bs_x, bs_y = x[bs_inds], y[bs_inds]
bs_replicates[i] = func(bs_x, bs_y)

return bs_replicates

# Compute the Pearson correlation coefficients
r_scandens = pearson_r(bd_parent_scandens, bd_offspring_scandens)
r_fortis = pearson_r(bd_parent_fortis, bd_offspring_fortis)

# Acquire 1000 bootstrap replicates of Pearson r
bs_replicates_scandens = draw_bs_pairs(bd_parent_scandens, bd_offspring_scandens, pearson_r, 1000)

bs_replicates_fortis = draw_bs_pairs(bd_parent_fortis, bd_offspring_fortis, pearson_r, 1000)


# Compute 95% confidence intervals
conf_int_scandens = np.percentile(bs_replicates_scandens, [2.5, 97.5])
conf_int_fortis = np.percentile(bs_replicates_fortis, [2.5, 97.5])

# Print results
print('G. scandens:', r_scandens, conf_int_scandens)
print('G. fortis:', r_fortis, conf_int_fortis)

# G. scandens: 0.4117063629401258 [0.26564228 0.54388972]
# G. fortis: 0.7283412395518487 [0.6694112 0.77840616]

It is clear from the confidence intervals that beak depth of the offspring of G. fortis parents is more strongly correlated with their offspring than their G. scandens counterparts.

Remember that the Pearson correlation coefficient is the ratio of the covariance to the geometric mean of the variances of the two data sets. This is a measure of the correlation between parents and offspring, but might not be the best estimate of heritability. If we stop and think, it makes more sense to define heritability as the ratio of the covariance between parent and offspring to the variance of the parents alone. In this exercise, you will estimate the heritability and perform a pairs bootstrap calculation to get the 95% confidence interval.

This exercise highlights a very important point. Statistical inference (and data analysis in general) is not a plug-n-chug enterprise. You need to think carefully about the questions you are seeking to answer with your data and analyze them appropriately. If you are interested in how heritable traits are, the quantity we defined as the heritability is more apt than the off-the-shelf statistic, the Pearson correlation coefficient.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def heritability(parents, offspring):
"""Compute the heritability from parent and offspring samples."""
covariance_matrix = np.cov(parents, offspring)
return covariance_matrix[0, 1] / covariance_matrix[0, 0]

# Compute the heritability
heritability_scandens = heritability(bd_parent_scandens, bd_offspring_scandens)
heritability_fortis = heritability(bd_parent_fortis, bd_offspring_fortis)

# Acquire 1000 bootstrap replicates of heritability
replicates_scandens = draw_bs_pairs(
bd_parent_scandens, bd_offspring_scandens, heritability, size=1000)

replicates_fortis = draw_bs_pairs(
bd_parent_fortis, bd_offspring_fortis, heritability, size=1000)


# Compute 95% confidence intervals
conf_int_scandens = np.percentile(replicates_scandens, [2.5, 97.5])
conf_int_fortis = np.percentile(replicates_fortis, [2.5, 97.5])

# Print results
print('G. scandens:', heritability_scandens, conf_int_scandens)
print('G. fortis:', heritability_fortis, conf_int_fortis)

Here again, we see that G. fortis has stronger heritability than G. scandens. This suggests that the traits of G. fortis may be strongly incorporated into G. scandens by introgressive hybridization.

Is beak depth heritable at all in G. scandens?
The heritability of beak depth in G. scandens seems low. It could be that this observed heritability was just achieved by chance and beak depth is actually not really heritable in the species. You will test that hypothesis here. To do this, you will do a pairs permutation test.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Initialize array of replicates: perm_replicates
perm_replicates = np.empty(10000)

# Draw replicates
for i in range(10000):
# Permute parent beak depths
bd_parent_permuted = np.random.permutation(bd_parent_scandens)
perm_replicates[i] = heritability(bd_parent_permuted, bd_offspring_scandens)


# Compute p-value: p
p = np.sum(perm_replicates >= heritability_scandens) / len(perm_replicates)

# Print the p-value
print('p-val =', p) #p-val = 0.0

You get a p-value of zero, which means that none of the 10,000 permutation pairs replicates you drew had a heritability high enough to match that which was observed. This strongly suggests that beak depth is heritable in G. scandens, just not as much as in G. fortis. If you like, you can plot a histogram of the heritability replicates to get a feel for how extreme of a value of heritability you might expect by chance.