4 Ways to Improve Statistical Power

In A/B testing, you usually have to balance statistical power against how long the test runs. Learn how allocation, effect size, CUPED, and binarization can help you.

In A/B testing, you usually have to balance statistical power against how long the test runs. You want a robust test that can detect an effect if one exists, which means you need a lot of users. That makes the test longer in order to obtain sufficient statistical power. But you also need a short test so that the company can move quickly, introduce new features, and optimize existing ones.
Fortunately, test duration isn't the only lever for gaining the power you desire. In this article, I will show a few ways analysts can achieve the power they need without making the test too long. But before getting down to business, a little theory ('cause sharing is caring).
Statistical Power: Importance and Contributing Factors
Statistical inference, specifically hypothesis testing, is how we compare different versions of our product. This method considers two possible situations: either the new version differs from the old one, or the two are the same. We start by assuming both versions are the same and only change this view if the data suggest otherwise.
However, mistakes can happen. We may conclude there is a difference when there isn't, or we may miss a difference that does exist. The second type of error is called a Type II error, and it is tied to the concept of statistical power. Statistical power is the probability of NOT making a Type II error, i.e., how likely we are to detect a real difference between the versions if one exists. High power matters because low power means we are less likely to detect a true effect between the versions.
Several factors influence power. To build intuition, consider the two scenarios shown below. Each graph shows the revenue distribution of the two versions. In which scenario do you think the power is higher? Where are we more likely to detect a difference between the versions?
The key intuition about power lies in the overlap between the distributions: the more separated they are, the easier it is to detect a difference. So while both scenarios show version 2's revenue surpassing version 1's, Scenario B exhibits higher power to detect the difference between the two versions. The extent of overlap between the distributions depends on two main parameters (a short simulation after the list illustrates both):
- Variance: Variance reflects the spread of the dependent variable. Users naturally differ from one another, and that creates variance. As the variance increases, the overlap between the versions' distributions grows and power decreases.
- Effect size: Effect size is the difference between the means of the dependent variable's distributions. As the effect size increases and the gap between the distribution means widens, the overlap decreases and power increases.
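To make this concrete, here is a minimal Monte Carlo sketch (my own illustration, not taken from any real experiment): it simulates revenue for two versions as normal distributions and estimates power as the share of simulated tests that reach significance. The baseline mean of 100, the group size, and the effect/variance values are all arbitrary assumptions.

```python
# Monte Carlo estimate of power under assumed normal revenue and a
# two-sample t-test at alpha = 0.05. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimated_power(effect, sigma, n=500, alpha=0.05, n_sims=2000):
    """Fraction of simulated tests that detect a true difference `effect`."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(100.0, sigma, n)
        treatment = rng.normal(100.0 + effect, sigma, n)
        _, p_value = stats.ttest_ind(control, treatment)
        rejections += p_value < alpha
    return rejections / n_sims

# Larger effect or smaller variance -> less overlap -> higher power.
print(estimated_power(effect=2.0, sigma=20.0))  # small effect, high variance: low power
print(estimated_power(effect=5.0, sigma=20.0))  # bigger effect: much higher power
print(estimated_power(effect=2.0, sigma=10.0))  # smaller variance: higher power too
```

Playing with `effect` and `sigma` reproduces the intuition from the two scenarios above: power rises when the effect grows or the variance shrinks.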
So how can you maintain the power level you want without increasing sample sizes or extending your tests? Keep reading.
Allocation
When planning your A/B test, how you split users between the control and treatment groups can significantly affect the test's statistical power. If you divide users evenly between control and treatment (e.g., 50/50), you maximize the number of data points in each group within a given time period. This balance makes it easier to detect differences between the groups, because both have enough users to provide reliable data. If, instead, you assign users unevenly (e.g., 90/10), the smaller group may not accumulate enough data to show a significant effect within the required time frame, reducing statistical power.
To illustrate: if a test requires 115K users at a 50%/50% split to reach 80% power, switching to a 90%/10% split would require about 320K users, and would therefore extend the test's runtime to achieve the same 80% power level.
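For the curious, these figures can be reproduced with the standard normal-approximation sample-size formula for comparing two proportions. The sketch below assumes a 10% baseline conversion rate and a 5% relative MDE (i.e., detecting 10% vs. 10.5%); those assumptions are mine, chosen because they yield numbers matching the example above.

```python
# Total sample size for a two-proportion test under unequal allocation,
# using the normal approximation with a pooled (averaged) proportion.
from scipy.stats import norm

def total_sample_size(p1, p2, w1, alpha=0.05, power=0.80):
    """Total users needed when a fraction w1 goes to one group, 1 - w1 to the other."""
    w2 = 1.0 - w1
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    p_bar = (p1 + p2) / 2  # simple-average pooled proportion (approximation)
    return z**2 * p_bar * (1 - p_bar) * (1 / w1 + 1 / w2) / (p2 - p1) ** 2

print(total_sample_size(0.10, 0.105, w1=0.5))  # ~115.5K users at 50/50
print(total_sample_size(0.10, 0.105, w1=0.9))  # ~321K users at 90/10
```

The `(1/w1 + 1/w2)` term is minimized at w1 = 0.5, which is exactly why the equal split is the most statistically efficient.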
However, allocation decisions should not completely ignore business needs. Two main situations favor an unequal allocation:
- There are concerns that the new version could seriously hurt the company's performance. In such cases, starting with an unequal split, such as 90%/10%, and later switching to an equal split is recommended.
- During one-time events, such as Black Friday, when you want most users to receive the treatment. For example, treating 90% of the population while leaving 10% untreated still allows you to learn about the effect size.
Therefore, group allocation decisions should weigh both the statistical advantages and the business objectives, bearing in mind that an equal allocation yields the most powerful test and the best chance of detecting improvements.
Effect Size
The power of a test is intricately tied to its Minimum Detectable Effect (MDE): if the test is designed to detect small effects, the probability of finding them is small (resulting in low power). Consequently, to maintain adequate power, data analysts must compensate for a small MDE by increasing the test duration.
This trade-off between MDE and test duration plays an important role in determining the sample size required to achieve a given level of power. While many analysts understand that a larger MDE requires a smaller sample size and a shorter runtime (and vice versa), they often fail to recognize the non-linear nature of this relationship.
Why is this important? The non-linear relationship implies that any increase in MDE yields a disproportionately large saving in sample size. Let's put the math aside for a second and look at the following example: if the baseline conversion rate in our test is 10%, a 5% MDE would require 115.5K users. In contrast, a 10% MDE would require only 29.5K users. In other words, doubling the MDE cut the sample size almost 4-fold! In your face, linearity.
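Here is the same back-of-the-envelope formula as before, looped over several MDEs to make the non-linearity visible. The 10% baseline rate and the 50/50 split are assumptions carried over from the example.

```python
# Required total sample size vs. relative MDE, 10% baseline conversion,
# 50/50 split, alpha = 0.05, power = 0.80 (normal approximation).
from scipy.stats import norm

def total_sample_size(p1, relative_mde, alpha=0.05, power=0.80):
    p2 = p1 * (1 + relative_mde)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    p_bar = (p1 + p2) / 2
    return z**2 * p_bar * (1 - p_bar) * 4 / (p2 - p1) ** 2

for mde in (0.05, 0.10, 0.20):
    print(f"MDE {mde:.0%}: {total_sample_size(0.10, mde):,.0f} users")
# MDE 5%: ~115.5K; MDE 10%: ~29.5K; MDE 20%: ~7.7K.
# Each doubling of the MDE cuts the sample size roughly 4x, not 2x.
```

The quadratic term in the denominator, `(p2 - p1) ** 2`, is the source of the non-linearity: sample size falls with the square of the effect you aim to detect.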
In practice, this matters whenever you're short on time. AKA always. In such cases, I suggest clients consider amplifying the treatment being tested, such as offering users a higher bonus. This naturally increases the MDE because the expected effect is larger, thereby significantly reducing the required test duration for the same power level. While such decisions must align with business objectives, when they do, they offer a direct and effective way to keep the test well powered, even under tight time constraints.
Variance Reduction (CUPED)
One of the factors with the greatest impact on power is the variance of the Key Performance Indicator (KPI). The larger the variance, the longer the test must run to achieve a pre-defined power level. Therefore, if we can reduce the variance, we can achieve the required power with a shorter test.
One way to reduce variance is CUPED (Controlled-experiment Using Pre-Experiment Data). The idea behind this approach is to use pre-test data to strip out explainable variance and isolate the treatment's effect. For a little intuition, imagine a situation (not particularly realistic…) where the new variant causes each user to spend exactly 10% more than they did before. Say we have three users who have spent 100, 10, and 1 dollars so far. With the new variant, these users will spend 110, 11, and 1.1 dollars. The idea of using past data is to subtract each user's historical spend from their current spend, leaving the difference between the two: 10, 1, and 0.1. We don't need detailed calculations to see that the variance of the current data is much higher than that of the difference data. If you insist, though: we reduced the variance by a factor of 121 just by using data we had already collected!
In the example above, we simply subtracted each user's past data from their current data. The actual implementation of CUPED is a bit more involved and accounts for the correlation between current and past data. Either way, the idea is the same: by using historical data, we can reduce the variance between users and isolate the variation caused by the treatment.
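For illustration, here is a minimal sketch of the standard CUPED adjustment, where the coefficient theta is the covariance of the in-test metric with its pre-test counterpart, divided by the pre-test variance. The synthetic spend distributions and the 1.1 multiplier are assumptions for the demo; in practice, the covariate would be each user's actual pre-test spend.

```python
# CUPED: adjust the in-test metric using each user's pre-test value.
# The adjusted metric keeps the same mean but has lower variance.
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre-test spend and a correlated in-test spend (assumed shapes).
pre_spend = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)
in_spend = 1.1 * pre_spend + rng.normal(0.0, 5.0, size=10_000)

# theta = cov(pre, in) / var(pre): the variance-minimizing coefficient.
theta = np.cov(pre_spend, in_spend)[0, 1] / np.var(pre_spend, ddof=1)

# CUPED-adjusted metric: subtract the part explained by pre-test behavior.
cuped = in_spend - theta * (pre_spend - pre_spend.mean())

print(f"variance before CUPED: {in_spend.var():,.0f}")
print(f"variance after CUPED:  {cuped.var():,.0f}")
```

In a real test you would compute `theta` from the pooled data and apply the adjustment within each group before running the usual significance test.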
To use CUPED, you need historical data on each user, and each user must be identifiable in the new test. These requirements are not always met, but in my experience they are common in certain companies and industries, e.g., sports, SaaS, etc. In such cases, CUPED can be extremely valuable for both test planning and data analysis. Here, at least, studying history really can create a better future.
Binarization
KPIs broadly fall into two categories: continuous and binary. Each type has its own merits. The benefit of continuous KPIs is the depth of information they provide. Unlike binary KPIs, which give a simple yes or no, continuous KPIs carry quantitative detail. A clear illustration of the difference is the comparison between "paying user" and "revenue": paying user yields a binary outcome, paid or not, while revenue captures the actual amount spent.
But what about the benefit of a binary KPI? Although it holds less information, its limited range leads to lower variance. And if you've been following along this far, you know that lower variance usually means higher statistical power. So a test planned on a binary KPI requires fewer users to reach the same level of power. This can be very valuable when the test is under time constraints.
So which is superior, a binary or a continuous KPI? Well, it's complicated. If the company faces time constraints during testing, planning the test around a binary KPI can provide an effective solution. However, the main concern is whether the binary KPI answers the business question satisfactorily. In some cases, a company may decide that a new version is better if it improves the share of paying users; in others, it may prefer to base version decisions on broader data, such as revenue. Binarizing a continuous variable can therefore help us manage the test's time constraints, but it must be used thoughtfully (see the simulation sketch below).
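As a rough illustration (my own simulation, with assumed numbers), the sketch below compares the power of a t-test on heavy-tailed revenue against a test on the binarized "paying user" flag, when the treatment's entire effect is a 2-percentage-point lift in conversion. Under these assumptions, the binary KPI detects the effect far more often.

```python
# Power comparison: revenue (continuous, heavy-tailed) vs. payer flag (binary).
# Assumed setup: ~10% payers in control, ~12% in treatment; a payer's spend
# is lognormal and unaffected by the treatment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def one_test(n=5_000, lift=0.02):
    pay_c = rng.random(n) < 0.10
    pay_t = rng.random(n) < 0.10 + lift
    rev_c = np.where(pay_c, rng.lognormal(3.0, 1.0, n), 0.0)
    rev_t = np.where(pay_t, rng.lognormal(3.0, 1.0, n), 0.0)
    _, p_rev = stats.ttest_ind(rev_c, rev_t)  # continuous KPI: revenue
    # t-test on the 0/1 flags closely approximates a two-proportion z-test
    _, p_pay = stats.ttest_ind(pay_c.astype(float), pay_t.astype(float))
    return p_rev < 0.05, p_pay < 0.05

results = np.array([one_test() for _ in range(1_000)])
print("power, revenue (continuous):", results[:, 0].mean())
print("power, paying user (binary):", results[:, 1].mean())
```

Note the caveat baked into the assumptions: here the treatment only moves conversion, so the binary KPI loses nothing. If the treatment instead changed how much payers spend, binarizing would hide exactly the effect you care about.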
Conclusions
In this article, we explored several simple yet powerful ways to improve power without extending the test's duration. By understanding the role of key parameters such as allocation, MDE, variance, and the choice of KPI, data analysts can apply targeted strategies to increase the effectiveness of their testing efforts. This, in turn, enables faster experimentation and deeper insights into their product.