# statistics — Statistical Calculations

Purpose: Implementations of common statistical calculations.

The `statistics` module implements many common statistical
formulas for efficient calculations using Python’s various numerical
types (`int`, `float`, `Decimal`, and `Fraction`).

## Averages

Three forms of “average” are supported: the mean, the median,
and the mode. Calculate the arithmetic mean with `mean()`.

```
from statistics import *
data = [1, 2, 2, 5, 10, 12]
print('{:0.2f}'.format(mean(data)))
```

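The return value for float input is always a float. For integer
input the result is usually a float as well, although recent Python
versions return an `int` when the mean is a whole number. For
`Decimal` and `Fraction` input data, the result is of the same type
as the inputs.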

```
$ python3 statistics_mean.py
5.33
```
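As a minimal sketch of how the input type carries through to the result, the following snippet calls `mean()` on float, `Decimal`, and `Fraction` data and prints the type of each result. The sample values and variable names are arbitrary and are not part of the original example programs.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Arbitrary illustrative inputs, one list per numeric type.
float_result = mean([1.5, 2.5, 4.0])
decimal_result = mean([Decimal('1.5'), Decimal('2.5')])
fraction_result = mean([Fraction(1, 3), Fraction(2, 3)])

# Print the type name alongside each value.
for result in (float_result, decimal_result, fraction_result):
    print('{:8s} {!r}'.format(type(result).__name__, result))
```

Keeping `Decimal` and `Fraction` results in their own type avoids an accidental conversion to binary floating point.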

Calculate the most common data point in a data set using `mode()`.

```
from statistics import *
data = [1, 2, 2, 5, 10, 12]
print(mode(data))
```

The return value is always a member of the input data set. Because
`mode()` treats the input as a set of discrete values and counts
the recurrences, the inputs do not actually need to be numerical
values.

```
$ python3 statistics_mode.py
2
```
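To illustrate the point about non-numerical input, here is a short sketch that passes a list of strings to `mode()`. The `colors` list is made up for this illustration and has a single most common value, so the result does not depend on how a given Python version handles ties.

```python
from statistics import mode

# Hypothetical sample data: 'red' appears most often.
colors = ['red', 'blue', 'red', 'green', 'red', 'blue']

most_common = mode(colors)
print(most_common)  # prints: red
```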

There are four variations for calculating the median, or middle, value. The first three are straightforward versions of the usual algorithm, with different solutions for handling data sets with an even number of elements.

```
from statistics import *
data = [1, 2, 2, 5, 10, 12]
print('median : {:0.2f}'.format(median(data)))
print('low : {:0.2f}'.format(median_low(data)))
print('high : {:0.2f}'.format(median_high(data)))
```

`median()` finds the center value, and if the data set has an even
number of values it averages the two middle items. `median_low()`
always returns a value from the input data set, using the lower of the
two middle items for data sets with an even number of
items. `median_high()` similarly returns the higher of the two
middle items.

```
$ python3 statistics_median.py
median : 3.50
low : 2.00
high : 5.00
```
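The three functions only differ when the data set has an even number of items. As a quick check, this sketch runs them on an odd-length list (an arbitrary variation of the data above, with one `2` removed), where a single middle item exists and all three return it.

```python
from statistics import median, median_high, median_low

# Odd number of items, so there is exactly one middle value.
odd_data = [1, 2, 5, 10, 12]

results = (
    median(odd_data),
    median_low(odd_data),
    median_high(odd_data),
)
print(results)  # prints: (5, 5, 5)

# With an even number of items, median() may return a value
# that does not appear in the data set at all.
even_data = [1, 2, 2, 5, 10, 12]
print(median(even_data) in even_data)  # prints: False
```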

The fourth version of the median calculation, `median_grouped()`,
treats the inputs as continuous data. It calculates the 50th percentile
median by first finding the median range using the provided interval
width and then interpolating within that range using the position of
the actual value(s) from the data set that fall in that range.

```
from statistics import *
data = [10, 20, 30, 40]
print('1: {:0.2f}'.format(median_grouped(data, interval=1)))
print('2: {:0.2f}'.format(median_grouped(data, interval=2)))
print('3: {:0.2f}'.format(median_grouped(data, interval=3)))
```

As the interval width increases, the median computed for the same data set changes.

```
$ python3 statistics_median_grouped.py
1: 29.50
2: 29.00
3: 28.50
```
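The interpolation can be sketched with the usual grouped-median formula, `L + interval * (n/2 - cf) / f`, where `L` is the lower limit of the median interval, `cf` is the number of values below the middle value, and `f` is the number of values equal to it. The helper `grouped_median_sketch()` below is a hypothetical name written for this illustration. It is an assumption about how the calculation works, not the library's code, and it is only checked against `median_grouped()` for the sample data above.

```python
from statistics import median_grouped


def grouped_median_sketch(values, interval=1):
    """Illustrative grouped median; not the library implementation."""
    ordered = sorted(values)
    n = len(ordered)
    x = ordered[n // 2]                # value at the midpoint
    lower = x - interval / 2           # lower limit of the median interval
    below = sum(1 for v in ordered if v < x)   # cumulative frequency
    count = ordered.count(x)           # frequency within the interval
    return lower + interval * (n / 2 - below) / count


data = [10, 20, 30, 40]
for width in (1, 2, 3):
    print('{}: {:0.2f} {:0.2f}'.format(
        width,
        grouped_median_sketch(data, width),
        median_grouped(data, interval=width),
    ))
```

For this data set the middle value is 30 and exactly half of the values fall below it, so the result is simply `30 - interval / 2`, which is why the output shrinks by 0.5 for each unit added to the interval.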

## Variance

Statistics uses two values to express how dispersed a set of values is
relative to the mean. The *variance* is the average of the squared
differences between each value and the mean, and the *standard
deviation* is the square root of the variance (which is useful because
taking the square root allows the standard deviation to be expressed
in the same units as the input data). Large values for variance or
standard deviation indicate that a set of data is dispersed, while
small values indicate that the data is clustered closer to the mean.

```
from statistics import *
import subprocess


def get_line_lengths():
    # Use wc to count the lines in each example program.
    cmd = 'wc -l ../[a-z]*/*.py'
    out = subprocess.check_output(
        cmd, shell=True).decode('utf-8')
    for line in out.splitlines():
        parts = line.split()
        # Stop at the summary line that wc prints last.
        if parts[1].strip().lower() == 'total':
            break
        nlines = int(parts[0].strip())
        if not nlines:
            continue  # skip empty files
        yield (nlines, parts[1].strip())


data = list(get_line_lengths())
lengths = [d[0] for d in data]
sample = lengths[::2]  # lengths of every second file

print('Basic statistics:')
print(' count : {:3d}'.format(len(lengths)))
print(' min : {:6.2f}'.format(min(lengths)))
print(' max : {:6.2f}'.format(max(lengths)))
print(' mean : {:6.2f}'.format(mean(lengths)))

print('\nPopulation variance:')
print(' pstdev : {:6.2f}'.format(pstdev(lengths)))
print(' pvariance : {:6.2f}'.format(pvariance(lengths)))

print('\nEstimated variance for sample:')
print(' count : {:3d}'.format(len(sample)))
print(' stdev : {:6.2f}'.format(stdev(sample)))
print(' variance : {:6.2f}'.format(variance(sample)))
```

Python includes two sets of functions for computing variance and
standard deviation, depending on whether the data set represents the
entire population or a sample of the population. This example uses
`wc` to count the number of lines in the input files for all of the
example programs and then uses `pvariance()` and `pstdev()` to
compute the variance and standard deviation for the entire population
before using `variance()` and `stdev()` to compute the sample
variance and standard deviation for a subset created by using the
length of every second file found.

```
$ python3 statistics_variance.py
Basic statistics:
count : 959
min : 4.00
max : 228.00
mean : 28.62
Population variance:
pstdev : 18.52
pvariance : 342.95
Estimated variance for sample:
count : 480
stdev : 21.09
variance : 444.61
```
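The example above depends on `wc` and on the layout of the example directory, so its numbers cannot be reproduced elsewhere. As a self-contained sketch of the difference between the two sets of functions, the snippet below computes both variances by hand for a small, arbitrary data set and compares them with the library results. It assumes the standard definitions: the population variance divides the sum of squared differences by `N`, while the sample variance divides by `N - 1`.

```python
from math import isclose, sqrt
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary illustrative values

mu = mean(data)
squared_diffs = sum((x - mu) ** 2 for x in data)

# Population variance: divide by the number of values.
population_var = squared_diffs / len(data)
# Sample variance: divide by one less than the number of values.
sample_var = squared_diffs / (len(data) - 1)

print('population:', population_var, pvariance(data))
print('sample    :', sample_var, variance(data))
print('pstdev    :', sqrt(population_var), pstdev(data))
print('stdev     :', sqrt(sample_var), stdev(data))
```

Because the sample version divides by a smaller number, it yields a larger estimate than the population version for the same data.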

See also

- Standard library documentation for statistics
- mathtips.com: Median for Discrete and Continuous Frequency Type Data (grouped data) – Discussion of median for continuous data
- **PEP 450** – Adding A Statistics Module To The Standard Library