statistics — Statistical Calculations

Purpose: Implementations of common statistical calculations.

The statistics module implements many common statistical formulas for efficient calculations using Python’s various numerical types (int, float, Decimal, and Fraction).

Averages

Three forms of average are supported: the mean, the median, and the mode. Calculate the arithmetic mean with mean().

statistics_mean.py
from statistics import *

data = [1, 2, 2, 5, 10, 12]

print('{:0.2f}'.format(mean(data)))

For float input the return value is always a float. For integer input it is a float unless the mean is a whole number, in which case an int is returned. For Decimal and Fraction input data, the result is of the same type as the inputs.

$ python3 statistics_mean.py

5.33
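As a quick check of how the result type follows the input type, the following sketch passes int, Fraction, and Decimal values to mean(). The Fraction and Decimal data values are made up for illustration and are not part of the original example.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Integer input whose mean is not a whole number yields a float.
int_result = mean([1, 2, 2, 5, 10, 12])

# Fraction input yields an exact Fraction.
fraction_result = mean([Fraction(1, 2), Fraction(3, 4)])

# Decimal input yields a Decimal.
decimal_result = mean([Decimal('1.5'), Decimal('2.5')])

for result in (int_result, fraction_result, decimal_result):
    print(type(result).__name__, result)
```

Keeping Fraction and Decimal results in their own type avoids the rounding that a conversion to float would introduce.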

Calculate the most common data point in a data set using mode().

statistics_mode.py
from statistics import *

data = [1, 2, 2, 5, 10, 12]

print(mode(data))

The return value is always a member of the input data set. Because mode() treats the input as a set of discrete values and counts how often each one occurs, the inputs do not actually need to be numerical values.

$ python3 statistics_mode.py

2
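To illustrate the point about non-numerical input, this short sketch applies mode() to a list of strings. The color names are arbitrary sample data chosen for the illustration.

```python
from statistics import mode

# Any hashable values can be counted, not just numbers.
colors = ['red', 'blue', 'red', 'green', 'red', 'blue']

most_common = mode(colors)
print(most_common)
```

Here 'red' occurs three times, more often than any other value, so it is the value reported.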

There are four variations for calculating the median, or middle, value. The first three are straightforward versions of the usual algorithm, with different solutions for handling data sets with an even number of elements.

statistics_median.py
from statistics import *

data = [1, 2, 2, 5, 10, 12]

print('median     : {:0.2f}'.format(median(data)))
print('low        : {:0.2f}'.format(median_low(data)))
print('high       : {:0.2f}'.format(median_high(data)))

median() finds the center value, and if the data set has an even number of values it averages the two middle items. median_low() always returns a value from the input data set, using the lower of the two middle items for data sets with an even number of items. median_high() similarly returns the higher of the two middle items.

$ python3 statistics_median.py

median     : 3.50
low        : 2.00
high       : 5.00
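The three functions only differ when the number of items is even. As a small sketch using made-up data with an odd number of items, all three return the single middle value:

```python
from statistics import median, median_high, median_low

# Five items, so there is exactly one middle value.
odd_data = [1, 2, 5, 10, 12]

mid = median(odd_data)
low = median_low(odd_data)
high = median_high(odd_data)

print('median :', mid)
print('low    :', low)
print('high   :', high)
```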

The fourth version of the median calculation, median_grouped(), treats the inputs as continuous data. It calculates the 50th percentile (the median) by first finding the median range using the provided interval width, and then interpolating within that range using the position of the actual value(s) from the data set that fall in that range.

statistics_median_grouped.py
from statistics import *

data = [10, 20, 30, 40]

print('1: {:0.2f}'.format(median_grouped(data, interval=1)))
print('2: {:0.2f}'.format(median_grouped(data, interval=2)))
print('3: {:0.2f}'.format(median_grouped(data, interval=3)))

As the interval width increases, the median computed for the same data set changes.

$ python3 statistics_median_grouped.py

1: 29.50
2: 29.00
3: 28.50
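The interpolation can be sketched by hand to show why the result moves with the interval. The helper below, grouped_median_sketch(), is a hypothetical name written for this illustration, and it assumes the usual textbook formula for the median of grouped data: lower class limit plus interval times (n/2 minus the count below the class) divided by the count in the class. It is not the library's source code, although it reproduces the values shown above.

```python
from statistics import median_grouped


def grouped_median_sketch(data, interval=1):
    """Hypothetical hand-written version of the grouped median."""
    data = sorted(data)
    n = len(data)
    # The middle value is treated as the midpoint of the median class.
    x = data[n // 2]
    lower_limit = x - interval / 2
    # Number of values below the median class, and within it.
    below = sum(1 for value in data if value < x)
    within = data.count(x)
    return lower_limit + interval * (n / 2 - below) / within


data = [10, 20, 30, 40]

for width in (1, 2, 3):
    by_hand = grouped_median_sketch(data, interval=width)
    from_library = median_grouped(data, interval=width)
    print('{}: {:0.2f} {:0.2f}'.format(width, by_hand, from_library))
```

For this data set the middle value is 30 and no other value shares its class, so the result is simply 30 minus half the interval width.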

Variance

Statistics uses two values to express how dispersed a set of values is relative to the mean. The variance is the average of the squared differences between each value and the mean, and the standard deviation is the square root of the variance. Taking the square root is useful because it expresses the standard deviation in the same units as the input data. Large values for variance or standard deviation indicate that a set of data is dispersed, while small values indicate that the data is clustered closer to the mean.
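The definitions can be checked by hand on a small data set. This sketch reuses the data from the averages examples, computes the population variance and standard deviation directly from the definition, and compares them with pvariance() and pstdev(). The variable names are illustrative only.

```python
import math
from statistics import mean, pstdev, pvariance

data = [1, 2, 2, 5, 10, 12]

center = mean(data)

# Average of the squared differences from the mean.
manual_variance = sum((x - center) ** 2 for x in data) / len(data)

# The square root brings the value back to the units of the data.
manual_stdev = math.sqrt(manual_variance)

print('manual : {:0.4f} {:0.4f}'.format(manual_variance, manual_stdev))
print('module : {:0.4f} {:0.4f}'.format(pvariance(data), pstdev(data)))
```

The two lines should agree up to floating point rounding.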

statistics_variance.py
from statistics import *
import subprocess


def get_line_lengths():
    cmd = 'wc -l ../[a-z]*/*.py'
    out = subprocess.check_output(
        cmd, shell=True).decode('utf-8')
    for line in out.splitlines():
        parts = line.split()
        if parts[1].strip().lower() == 'total':
            break
        nlines = int(parts[0].strip())
        if not nlines:
            continue  # skip empty files
        yield (nlines, parts[1].strip())


data = list(get_line_lengths())

lengths = [d[0] for d in data]
sample = lengths[::2]

print('Basic statistics:')
print('  count     : {:3d}'.format(len(lengths)))
print('  min       : {:6.2f}'.format(min(lengths)))
print('  max       : {:6.2f}'.format(max(lengths)))
print('  mean      : {:6.2f}'.format(mean(lengths)))

print('\nPopulation variance:')
print('  pstdev    : {:6.2f}'.format(pstdev(lengths)))
print('  pvariance : {:6.2f}'.format(pvariance(lengths)))

print('\nEstimated variance for sample:')
print('  count     : {:3d}'.format(len(sample)))
print('  stdev     : {:6.2f}'.format(stdev(sample)))
print('  variance  : {:6.2f}'.format(variance(sample)))

Python includes two sets of functions for computing variance and standard deviation, depending on whether the data set represents the entire population or a sample of the population. This example uses wc to count the number of lines in the input files for all of the example programs, then uses pvariance() and pstdev() to compute the variance and standard deviation for the entire population. It then uses variance() and stdev() to compute the sample variance and standard deviation for a subset made up of the length of every second file found.

$ python3 statistics_variance.py

Basic statistics:
  count     : 1282
  min       :   4.00
  max       : 228.00
  mean      :  27.79

Population variance:
  pstdev    :  17.86
  pvariance : 319.04

Estimated variance for sample:
  count     : 641
  stdev     :  16.94
  variance  : 286.99
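The example above depends on wc and on the layout of the example directories, so its numbers will differ from system to system. As a portable sketch of the difference between the two sets of functions, the code below applies both to the same small, made-up data set. The sample functions divide the sum of squared differences by n - 1 instead of n, so the sample variance equals the population variance scaled by n / (n - 1).

```python
import math
from statistics import pvariance, variance

data = [1, 2, 2, 5, 10, 12]
n = len(data)

population = pvariance(data)  # divides by n
sample = variance(data)       # divides by n - 1

# Rescaling the population value should reproduce the sample value.
rescaled = population * n / (n - 1)

print('pvariance : {:0.4f}'.format(population))
print('variance  : {:0.4f}'.format(sample))
print('rescaled  : {:0.4f}'.format(rescaled))
```

For the same data the sample estimate is always the larger of the two, and the gap shrinks as the number of items grows.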

See also