Udacity - A/B Testing

Udacity course notes on A/B Testing provided by Google.

Lesson 1: Overview of A/B Testing

Definition

A/B testing is a general methodology used online when you want to test out a new product or feature. You take two sets of users and

  • show the control set your existing product or feature, and
  • show the experiment set the new version.
    Then you determine which version of your feature or product is better by comparing how the two sets perform.
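A minimal sketch of how users might be split into the two sets. The notes do not say how the split is made; hashing the user ID, the function name `assign_group`, and the experiment label `homepage-test` are all assumptions for illustration.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "homepage-test") -> str:
    """Deterministically place a user in the 'control' or 'experiment' set.

    Assumption: a hash of the (hypothetical) experiment label plus the user ID
    stands in for a random coin flip, so the same user always sees the same
    version on every visit.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "experiment"


# Example: split a batch of made-up user IDs and count each set.
groups = [assign_group(f"user-{i}") for i in range(1000)]
print("control:", groups.count("control"))
print("experiment:", groups.count("experiment"))
```

Hashing rather than flipping a coin on every visit keeps the assignment stable, so a returning user is not switched between versions partway through the test.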

When do we use A/B Testing

  • New features
  • Additions to the UI
  • Different website appearance
    Sometimes these tests may go a little bit too far: Google tested 41 different shades of blue.

Or less user-visible changes:

  • Content ranking changes: e.g. whether to first show a news article or an encouragement to add new contacts (LinkedIn)
  • New ranking algorithms for a movie recommendation site
  • Backend changes that affect page load time, the results users see, etc.

When A/B testing is not as useful

A/B testing is useful for helping you to climb to the peak of the current mountain, but if you want to figure out whether you want to be on this mountain or another mountain, A/B testing isn’t so useful.

That is to say, A/B testing is not as useful when testing out new experiences.

Reasons:

  1. Change aversion: Users may react negatively to a change simply because they are used to the old version.
  2. Novelty effect: Users may try out anything new simply because it is new.
  3. It cannot tell you whether you are missing something: It can tell you whether A is better than B, but it cannot tell you if there should be a C that outperforms both A and B.

In a new experience, there are two kinds of issues:

  1. What is your baseline for comparison?
  2. How long does it take for your users to adapt to the new experience, so that you can capture the plateaued experience and make a robust decision? For example, on websites whose customers do not come back often, it is difficult to observe the effect on referrals or returning customers in the short run.

There are things you want to know that are difficult to test with short-term A/B testing.

Examples of when we should not use A/B testing

  • An online shopping company wants to know if its website is complete and carries all the products that customers want to see.

We can try to carry a specific additional product and use A/B testing to check the effect, but if the users don’t buy it, you still won’t know if there is a different product that you are missing. Instead, we can simply ask users whether they feel anything is missing.

  • Add premium service which needs users to upgrade, log in and do a bunch of things.

Users have to opt into a premium service, so you cannot randomly assign people to one group or another. However, you can still gather useful information through an A/B test. For example, you can see how many users read about the new features, or how many choose to upgrade if you make the choice available. But you cannot fully test the change.

  • A website selling cars wants to know whether a change will increase repeat customers or referrals

Since people don’t buy cars that often, it would take too long to see if you get repeat customers, and you don’t necessarily even have data about whether customers are recommending the site to their friends.

  • Changing the branding, including main logo (tricky)

Although it seems that you can easily apply A/B testing to this issue, changing the branding of your company can be surprisingly emotional for your users. They might need some time to get used to the new logo, so you wouldn’t want to make a decision based on the short window of data collected in an A/B test.

Other techniques (more details in lesson 3)

When A/B testing is not that useful, we can use other techniques to complement the A/B test.

  • Analyze the user activity logs on your website observationally or retrospectively to see whether a hypothesis can be developed about what is causing changes in user behavior. You can then design a prospective analysis to test what you discovered, and use A/B testing to compare the results of the different analyses.

  • Deep and qualitative insights (vs. broad and quantitative results from A/B tests): these tell you which mountain to climb before you think about how to reach the peak using A/B testing, e.g. user experience research, focus groups, surveys, human evaluation.

History of A/B Testing

The concept of A/B testing originated in agriculture, where people divided their land into sections and tested the performance of different crops and nurturing methods. Many other fields have also been using A/B testing for a long time, such as clinical trials in medicine.

The key thing in A/B testing is to get consistent responses from your control and experiment groups, so that you can structure the experiment to decide whether there is a significant behavior change in the experiment group compared with the control group.

The difference between these tests and online A/B testing is that online data is large in quantity but low in resolution. That is to say, the volume is big but you don’t know much about each data point. In traditional user experience research or hypothesis tests, you may have a limited number of participants but rich information about each of them.

Another difference is that for online A/B testing, the goal is really about deciding whether this new product or feature is something that users will like. Therefore, when designing the test, you have to make it robust and able to give repeatable results, so that you can make good decisions about your new product or feature.

An A/B Test Example for This Lesson

Background Overview

Audacity, an online education company that specializes in online finance courses, is testing features that can increase student engagement.

Typical user flow at Audacity: (Customer Funnel)

  1. Homepage visit: Largest group of people
  2. Exploring the site: click into the courses
  3. Create an account: become a member
  4. Complete: make a purchase, complete a class, finish a series of classes, or share the site on their blog, etc.

The largest number of events happens at the top of the funnel; as you go down, fewer and fewer people reach each level.

In practice, users don’t strictly follow the funnel from top to bottom. There is a lot of back-and-forth between the different states, and repeat visitors may skip intermediate steps (discussed in more detail in lesson 3).
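As a rough illustration of how the funnel narrows, the sketch below computes the share of users who move from each stage to the next. It treats the funnel as strictly ordered, which is a simplification, and the counts are made up for illustration; they are not Audacity data.

```python
# Hypothetical event counts per funnel stage (illustrative numbers only).
funnel = [
    ("Homepage visit", 10000),
    ("Exploring the site", 4000),
    ("Create an account", 800),
    ("Complete", 200),
]


def step_conversion(stages):
    """Return (from_stage, to_stage, rate) for each consecutive pair of stages."""
    return [
        (upper, lower, lower_count / upper_count)
        for (upper, upper_count), (lower, lower_count) in zip(stages, stages[1:])
    ]


for upper, lower, rate in step_conversion(funnel):
    print(f"{upper} -> {lower}: {rate:.1%}")
```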

Analysis

Let’s now consider an experimental change to Audacity’s homepage.

Defining Hypothesis

Preliminary hypothesis: Changing the ‘start now’ button from orange to pink will increase how many students explore Audacity’s courses. (move on to the 2nd step of the above funnel)

We need to refine our hypothesis with a reasonable and quantifiable metric.

Choosing a metric

  1. Total number of courses completed. (takes too long)
    This is what Audacity ultimately cares about. However, given that it can take students weeks or months to finish courses, using this metric would simply take too much time to be practical.

  2. Number of clicks
    The assumption here is that if more people click on the button, and thus move on to exploring the site, then eventually some of them will create an account and go on to complete a course. In other words, increasing the rate at which users progress down the funnel at one level will have a positive impact on the end of the funnel as well.

    • Problem: If the two versions are viewed by different numbers of users, raw click counts are not comparable, so the count alone cannot show which version performs better.
  3. $\frac{\text{Number of clicks}}{\text{Number of page views}}$ or Click-through-rate (CTR)

  4. $\frac{\text{Number of unique visitors who click}}{\text{Number of unique visitors to page}}$ or Click-through-probability
    • What’s the difference between 3 and 4?
      Consider the case where two people, A and B, visit the website. A leaves without clicking the button, while B clicks the button 5 times impatiently because the next page takes a long time to load.
      $$\text{Click-through-rate (CTR)} = \frac{0 + 5}{2} = 2.5$$
      $$\text{Click-through-probability} = \frac{0 + 1}{2} = 0.5$$ Half of the users who visited the page clicked.
    • When to use rate and when to use probability?
      Generally speaking, you use a rate to measure the usability of the site, and a probability when you want to measure the total impact.
      For example, if you want to measure the usability of a particular button, you should use a rate, because users have a variety of different places on the page that they can choose to click on, and the rate simply tells you how often they actually find the button.
      If you want to know how often users went to the second-level page on your site, you use a probability, because you only care about the unique visitors who move on to the next level.
    • How to compute the rate or probability?
      You need to communicate with the engineers and ask them to modify the website so that you can capture the data you need.
      To calculate the rate, you need to capture every page view and every click on the button, and then divide the total number of clicks by the total number of page views.
      To calculate the probability, you have to match each page view with all of its child clicks, so that you count at most one child click per page view.
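The two metrics can be sketched in code using the A and B example above. The event-log format (a list of `(visitor_id, event_type)` pairs) and the function names are assumptions for illustration; a real site would capture these events in its own logging system. The probability here is computed per unique visitor, following the formula in item 4, and the sketch assumes at least one page view is present.

```python
from collections import Counter

# Hypothetical event log: visitor A views the page and leaves;
# visitor B views the page once and clicks the button five times.
events = [("A", "view"), ("B", "view")] + [("B", "click")] * 5


def click_through_rate(events):
    """Total clicks divided by total page views."""
    counts = Counter(kind for _, kind in events)
    return counts["click"] / counts["view"]


def click_through_probability(events):
    """Unique visitors who clicked divided by unique visitors who viewed."""
    viewers = {visitor for visitor, kind in events if kind == "view"}
    clickers = {visitor for visitor, kind in events if kind == "click"}
    # Count each visitor at most once, however many times they clicked.
    return len(clickers & viewers) / len(viewers)


print(click_through_rate(events))         # 2.5
print(click_through_probability(events))  # 0.5
```

The rate can exceed 1 because repeated clicks all count, while the probability is capped at 1 because each visitor counts once.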

Updated Hypothesis: Changing the ‘start now’ button from orange to pink will increase the click-through-probability of the button.
We assume here that an increase in this metric will ultimately increase our final business metric, which is total courses completed.

Review the Statistics

Since:

  1. the event has two exclusive outcomes: click and no click,
  2. the events are independent (one click does not tell you anything about whether there is a click for the next visitor),
  3. the events follow an identical distribution (the probability of success is the same for each event),
    and we are looking for a probability based on repeated events, the binomial distribution fits this case.

Next, we should use the characteristics of binomial distribution to flag the results that should surprise us.
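One common way to use those characteristics is sketched below: estimate the click probability, compute its binomial standard error, and build an approximate 95% interval with the normal approximation. The function name, the counts, and the choice of z = 1.96 are assumptions for illustration, and the approximation is only reasonable when the sample is large enough.

```python
import math


def binomial_summary(clicks: int, visitors: int, z: float = 1.96):
    """Estimate a click probability with a normal-approximation interval.

    Assumption: each visitor is one independent trial with the same click
    probability, so the number of clicks is binomial.
    """
    p_hat = clicks / visitors                         # estimated probability
    se = math.sqrt(p_hat * (1 - p_hat) / visitors)    # binomial standard error
    margin = z * se                                   # half-width of interval
    return p_hat, se, (p_hat - margin, p_hat + margin)


# Hypothetical result: 100 of 1000 unique visitors clicked.
p_hat, se, (low, high) = binomial_summary(clicks=100, visitors=1000)
print(f"p_hat={p_hat:.3f}, se={se:.4f}, interval=({low:.4f}, {high:.4f})")
```

A result from a repeat of the same experiment that falls well outside this interval would be one of the surprising results worth flagging.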