
A Tale of Two Variances: Why NumPy and Pandas Give Different Answers

Suppose you are analyzing a small dataset:

\[X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]\]

You want to calculate summary statistics to get an idea of the distribution of this data, so you use numpy to calculate the mean and variance.

import numpy as np

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
mean = np.mean(X)
var = np.var(X)

print(f"Mean={mean:.2f}, Variance={var:.2f}")

The output looks like this:

Mean=10.00, Variance=10.60

Good! Now you have an idea of the distribution of your data. However, a colleague comes along and tells you that they computed the summary statistics on the same dataset using the following code:

import pandas as pd

X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])
mean = X.mean()
var = X.var()

print(f"Mean={mean:.2f}, Variance={var:.2f}")

Their result looks like this:

Mean=10.00, Variance=11.78

The means are the same, but the variances are different! What gives?

This discrepancy arises because numpy and pandas use different default formulas to calculate the variance of an array. This article will define the two variances mathematically, explain why they differ, and show how to use either formula in different numerical libraries.


Two Definitions

There are two standard ways of calculating variance, each intended for a different purpose. Which one applies depends on whether you are computing the variance of an entire population (the whole group you are studying) or of just a sample (a subset of that population for which you have data).

The population variance, \(\sigma^2\), is defined as:

\[\sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}\]

While the sample variance, \(s^2\), is defined as:

\[s^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}\]

(Note: \(x_i\) represents each data point in your dataset, \(N\) represents the total number of data points in the population, \(n\) represents the number of data points in the sample, \(\mu\) is the population mean, and \(\bar x\) is the sample mean.)

Note the two key differences between these formulas:

  1. In the numerator, \(\sigma^2\) is calculated using the population mean, \(\mu\), while \(s^2\) is calculated using the sample mean, \(\bar x\).
  2. In the denominator, \(\sigma^2\) divides by the total population size \(N\), while \(s^2\) divides by the sample size minus one, \(n-1\).

It should be noted that the difference between these two definitions is most significant for small sample sizes. As \(n\) grows, the difference between dividing by \(n\) and dividing by \(n-1\) becomes less and less significant.
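As a rough illustration, both formulas can be worked out by hand on the dataset from the introduction. This sketch treats the same ten values first as a full population and then as a sample, so the two means coincide and only the denominator changes. The variable names are chosen here for illustration only.

```python
# Compute both variance definitions by hand for the article's dataset.
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
n = len(X)
mean = sum(X) / n

# Sum of squared deviations from the mean (the shared numerator).
sum_sq = sum((x - mean) ** 2 for x in X)

population_variance = sum_sq / n       # divide by N
sample_variance = sum_sq / (n - 1)     # divide by n - 1

print(f"Population variance = {population_variance:.2f}")  # 10.60
print(f"Sample variance     = {sample_variance:.2f}")      # 11.78

# The two results differ by the factor n / (n - 1),
# which approaches 1 as the number of data points grows.
for size in (10, 100, 1000):
    print(size, round(size / (size - 1), 4))
```

The two printed variances match the numpy and pandas outputs shown at the start of the article.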


Why Are They Different?

When calculating the population variance, it is assumed that you have all the data. You know the exact center (the population mean \(\mu\)) and exactly how far each point is from that center. Dividing by the total number of data points \(N\) gives the true, exact average of those squared differences.

However, when calculating the sample variance, it is not assumed that you have all the data, so you do not have the true population mean \(\mu\). Instead, you only have an estimate of \(\mu\), which is the sample mean \(\bar x\). It turns out that using the sample mean in place of the true population mean tends to underestimate the true population variance.

This happens because the sample mean is calculated directly from the sample data, which means that it sits at the exact mathematical center of that particular sample. As a result, the squared distances from your data points to their own sample mean can never add up to more than their squared distances to the true population mean, and they usually add up to less, resulting in an artificially small sum of squared differences.

To correct for this underestimation, we apply what is called Bessel's correction (named after the German mathematician Friedrich Wilhelm Bessel), where we divide not by \(n\) but by the slightly smaller \(n-1\). This corrects the bias, since dividing by a smaller number makes the final variance slightly larger.
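The bias can be checked exactly on a small case. The sketch below assumes, purely for illustration, that the ten values from the introduction are the complete population. It then enumerates every possible sample of size 3 drawn with replacement and averages the two variance estimates over all of them. The helper `variance` is a hypothetical function written for this example.

```python
from itertools import product

# Assumption for this example only: these ten values are the whole population,
# so the true population variance is known exactly.
population = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
N = len(population)
mu = sum(population) / N
true_variance = sum((x - mu) ** 2 for x in population) / N


def variance(sample, ddof):
    """Variance of `sample` about its own mean, dividing by n - ddof."""
    n = len(sample)
    x_bar = sum(sample) / n
    return sum((x - x_bar) ** 2 for x in sample) / (n - ddof)


# Every possible ordered sample of size 3, drawn with replacement.
samples = list(product(population, repeat=3))

avg_divide_by_n = sum(variance(s, 0) for s in samples) / len(samples)
avg_bessel = sum(variance(s, 1) for s in samples) / len(samples)

print(f"True population variance:     {true_variance:.4f}")
print(f"Average when dividing by n:   {avg_divide_by_n:.4f}")
print(f"Average when dividing by n-1: {avg_bessel:.4f}")
```

Averaged over all possible samples, dividing by \(n\) falls short of the true variance by the factor \(\frac{n-1}{n}\), while dividing by \(n-1\) recovers it.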

Degrees of Freedom

So why divide by \(n-1\) and not \(n-2\) or \(n-3\) or any other correction that also increases the final variance? That comes down to a concept called degrees of freedom.

Degrees of freedom refers to the number of independent values in a calculation that are free to vary. For example, imagine you have a set of 3 numbers, \((x_1, x_2, x_3)\). You don't know what the values are, but you do know that their sample mean is \(\bar x = 10\).

  • The first number \(x_1\) could be anything (say 8)
  • The second number \(x_2\) could be anything (say 15)
  • Because the mean must be 10, \(x_3\) is not free to vary and must be the one number that makes \(\bar x = 10\), which in this case is 7.

So in this example, even though there are 3 numbers, there are only two degrees of freedom, since fixing the sample mean removes the ability of one of them to vary freely.

In the context of variance, before doing any calculations, we start with \(n\) degrees of freedom (corresponding to our \(n\) data points). Calculating the sample mean \(\bar x\) effectively uses up one degree of freedom, so by the time the sample variance is calculated, there are \(n-1\) degrees of freedom left to work with, which is why \(n-1\) appears in the denominator.
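The three-number example can be written out as a short sketch. It shows that once the mean is fixed the last value is forced, and that the deviations from the sample mean always sum to zero, so only \(n-1\) of them can vary independently. The variable names are illustrative.

```python
# With the sample mean fixed, the last value is determined by the others.
n = 3
x_bar = 10
x1, x2 = 8, 15                # free choices
x3 = n * x_bar - (x1 + x2)    # forced by the constraint on the mean
print(x3)  # 7

# Equivalent view: deviations from the sample mean always sum to zero,
# so knowing any n - 1 of them determines the last one.
deviations = [x - x_bar for x in (x1, x2, x3)]
print(deviations, sum(deviations))
```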


Library Defaults and How to Align Them

Now that we understand the math, we can finally solve the mystery from the beginning of the article! numpy and pandas gave different results because they default to different variance formulas.

Many numerical libraries control this using a parameter called ddof, which stands for Delta Degrees of Freedom. This represents the value subtracted from the total number of observations in the denominator.

  • Setting ddof=0 divides by \(n\), calculating the population variance.
  • Setting ddof=1 divides by \(n-1\), calculating the sample variance.

The same applies when calculating the standard deviation, which is simply the square root of the variance.
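To make the role of ddof concrete, here is a minimal pure-Python sketch of the idea. The functions `variance` and `std` are hypothetical helpers written for this example and are not any library's actual implementation. They only mirror the denominator convention described above.

```python
import math


def variance(data, ddof=0):
    """Hypothetical helper: sum of squared deviations divided by n - ddof."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)


def std(data, ddof=0):
    """Standard deviation as the square root of the matching variance."""
    return math.sqrt(variance(data, ddof))


X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
print(f"ddof=0: var={variance(X, 0):.2f}, std={std(X, 0):.2f}")
print(f"ddof=1: var={variance(X, 1):.2f}, std={std(X, 1):.2f}")
```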

Here's a breakdown of how various popular libraries handle this default and how you can override it:

numpy

By default, numpy assumes you are calculating the population variance (ddof=0). If you are working with a sample and need to apply Bessel's correction, you must explicitly pass ddof=1.

import numpy as np
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]          

# Sample variance and standard deviation
np.var(X, ddof=1)
np.std(X, ddof=1)

# Population variance and standard deviation (Default)
np.var(X)
np.std(X)

pandas

By default, pandas takes the opposite approach. It assumes that your data is a sample and calculates the sample variance (ddof=1). To calculate the population variance instead, you must pass ddof=0.

import pandas as pd
X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])

# Sample variance and standard deviation (Default)
X.var()
X.std()          

# Population variance and standard deviation 
X.var(ddof=0)
X.std(ddof=0)
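One way to avoid the mismatch from the introduction is to pass ddof explicitly in both libraries instead of relying on either default. The following sketch checks that numpy and pandas then agree on the article's dataset. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
arr = np.array(data)
series = pd.Series(data)

# Passing ddof explicitly makes the two libraries use the same formula.
sample_np = np.var(arr, ddof=1)
sample_pd = series.var(ddof=1)
population_np = np.var(arr, ddof=0)
population_pd = series.var(ddof=0)

print(np.isclose(sample_np, sample_pd))          # True
print(np.isclose(population_np, population_pd))  # True
```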

Built-in Python statistics Module

The Python standard library does not use a ddof parameter. Instead, it provides explicitly named functions so there is no ambiguity about which formula is used.

import statistics
X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]

# Sample variance and standard deviation
statistics.variance(X)
statistics.stdev(X)  

# Population variance and standard deviation
statistics.pvariance(X)
statistics.pstdev(X)

R

In R, the standard var() and sd() functions calculate the sample variance and sample standard deviation by default. Unlike the Python libraries, R has no built-in argument for easily switching to the population formulas. To calculate the population variance, you must multiply the sample variance by the factor \(\frac{n-1}{n}\).

X <- c(15, 8, 13, 7, 7, 12, 15, 6, 8, 9)
n <- length(X)

# Sample variance and standard deviation (Default)
var(X)
sd(X)

# Population variance and standard deviation
var(X) * ((n - 1) / n)
sd(X) * sqrt((n - 1) / n)
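The same rescaling can be cross-checked in Python. This sketch uses the standard library's statistics module as a stand-in, applies the \(\frac{n-1}{n}\) factor to the sample statistics, and compares the results with the dedicated population functions.

```python
import math
import statistics

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
n = len(X)

# Rescale the sample statistics to obtain the population versions.
pop_var_from_sample = statistics.variance(X) * (n - 1) / n
pop_std_from_sample = statistics.stdev(X) * math.sqrt((n - 1) / n)

# Compare against the dedicated population functions.
print(math.isclose(pop_var_from_sample, statistics.pvariance(X)))  # True
print(math.isclose(pop_std_from_sample, statistics.pstdev(X)))     # True
```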

Conclusion

This article explored a frustrating but often overlooked feature of various statistical programming languages and libraries: they use different default definitions of variance and standard deviation. An example was given where, for the same input array, numpy and pandas return different variance values by default.

This comes down to the difference between how variance should be calculated for the entire population being studied versus how it should be calculated from just a sample of that population, with different libraries making different choices about defaults. Finally, it was shown that although each library has its own default, all of them can be used to calculate both types of variance by using either a ddof argument, a slightly different function, or a simple mathematical transformation.

Thanks for reading!
