Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-36018: Add the NormalDist class to the statistics module #11973

Merged
merged 16 commits into from
Feb 23, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
195 changes: 195 additions & 0 deletions Doc/library/statistics.rst
Original file line number Diff line number Diff line change
Expand Up @@ -467,6 +467,201 @@ A single exception is defined:

Subclass of :exc:`ValueError` for statistics-related exceptions.


:class:`NormalDist` objects
===========================

A :class:`NormalDist` is a a composite class that treats the mean and standard
deviation of data measurements as a single entity. It is a tool for creating
and manipulating normal distributions of a random variable.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics, including simulations and hypothesis testing.

.. class:: NormalDist(mu=0.0, sigma=1.0)

Returns a new *NormalDist* object where *mu* represents the `arithmetic
mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of data and *sigma*
represents the `standard deviation
<https://en.wikipedia.org/wiki/Standard_deviation>`_ of the data.

If *sigma* is negative, raises :exc:`StatisticsError`.

.. attribute:: mu

The mean of a normal distribution.

.. attribute:: sigma

The standard deviation of a normal distribution.

.. attribute:: variance

A read-only property representing the `variance
<https://en.wikipedia.org/wiki/Variance>`_ of a normal
distribution. Equal to the square of the standard deviation.

.. classmethod:: NormalDist.from_samples(data)

Class method that makes a normal distribution instance
from sample data. The *data* can be any :term:`iterable`
and should consist of values that can be converted to type
:class:`float`.

If *data* does not contain at least two elements, raises
:exc:`StatisticsError` because it takes at least one point to estimate
a central value and at least two points to estimate dispersion.

.. method:: NormalDist.samples(n, seed=None)

Generates *n* random samples for a given mean and standard deviation.
Returns a :class:`list` of :class:`float` values.

If *seed* is given, creates a new instance of the underlying random
number generator. This is useful for creating reproducible results,
even in a multi-threading context.

.. method:: NormalDist.pdf(x)

Using a `probability density function (pdf)
<https://en.wikipedia.org/wiki/Probability_density_function>`_,
compute the relative likelihood that a random sample *X* will be near
the given value *x*. Mathematically, it is the ratio ``P(x <= X <
x+dx) / dx``.

Note the relative likelihood of *x* can be greater than `1.0`. The
probability for a specific point on a continuous distribution is `0.0`,
so the :func:`pdf` is used instead. It gives the probability of a
sample occurring in a narrow range around *x* and then dividing that
probability by the width of the range (hence the word "density").

.. method:: NormalDist.cdf(x)

Using a `cumulative distribution function (cdf)
<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
compute the probability that a random sample *X* will be less than or
equal to *x*. Mathematically, it is written ``P(X <= x)``.

Instances of :class:`NormalDist` support addition, subtraction,
multiplication and division by a constant. These operations
are used for translation and scaling. For example:

.. doctest::

>>> temperature_february = NormalDist(5, 2.5) # Celsius
>>> temperature_february * (9/5) + 32 # Fahrenheit
NormalDist(mu=41.0, sigma=4.5)

Dividing a constant by an instance of :class:`NormalDist` is not supported.

Since normal distributions arise from additive effects of independent
variables, it is possible to `add and subtract two normally distributed
random variables
<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
represented as instances of :class:`NormalDist`. For example:

.. doctest::

>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
>>> drug_effects = NormalDist(0.4, 0.15)
>>> combined = birth_weights + drug_effects
>>> f'mu={combined.mu :.1f} sigma={combined.sigma :.1f}'
'mu=3.1 sigma=0.5'

.. versionadded:: 3.8


:class:`NormalDist` Examples and Recipes
----------------------------------------

A :class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
are normally distributed with a mean of 1060 and standard deviation of 192,
determine the percentage of students with scores between 1100 and 1200:

.. doctest::

>>> sat = NormalDist(1060, 195)
>>> fraction = sat.cdf(1200) - sat.cdf(1100)
>>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
'18.2% score between 1100 and 1200'

To estimate the distribution for a model than isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
model:

.. doctest::

>>> n = 100_000
>>> X = NormalDist(350, 15).samples(n)
>>> Y = NormalDist(47, 17).samples(n)
>>> Z = NormalDist(62, 6).samples(n)
>>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
>>> NormalDist.from_samples(model_simulation) # doctest: +SKIP
NormalDist(mu=267.6516398754636, sigma=101.357284306067)

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example with a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge
is to guess a person's gender from measurements of normally distributed
features including height, weight, and foot size.

The `prior probability <https://en.wikipedia.org/wiki/Prior_probability>`_ of
being male or female is 50%:

.. doctest::

>>> prior_male = 0.5
>>> prior_female = 0.5

We also have a training dataset with measurements for eight people. These
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

We observe a new person whose feature measurements are known but whose gender
is unknown:

.. doctest::

>>> ht = 6.0 # height
>>> wt = 130 # weight
>>> fs = 8 # foot size

The posterior is the product of the prior times each likelihood of a
feature measurement given the gender:

.. doctest::

>>> posterior_male = (prior_male * height_male.pdf(ht) *
... weight_male.pdf(wt) * foot_size_male.pdf(fs))

>>> posterior_female = (prior_female * height_female.pdf(ht) *
... weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction is awarded to the largest posterior -- this is known as
the `maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

>>> 'male' if posterior_male > posterior_female else 'female'
'female'


..
# This modelines must appear within the last ten lines of the file.
kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
26 changes: 26 additions & 0 deletions Doc/whatsnew/3.8.rst
Original file line number Diff line number Diff line change
Expand Up @@ -262,6 +262,32 @@ Added :func:`statistics.fmean` as a faster, floating point variant of
:func:`statistics.mean()`. (Contributed by Raymond Hettinger and
Steven D'Aprano in :issue:`35904`.)

Added :class:`statistics.NormalDist`, a tool for creating
and manipulating normal distributions of a random variable.
(Contributed by Raymond Hettinger in :issue:`36018`.)

::

>>> temperature_feb = NormalDist.from_samples([4, 12, -3, 2, 7, 14])
>>> temperature_feb
NormalDist(mu=6.0, sigma=6.356099432828281)

>>> temperature_feb.cdf(3) # Chance of being under 3 degrees
0.3184678262814532
>>> # Relative chance of being 7 degrees versus 10 degrees
>>> temperature_feb.pdf(7) / temperature_feb.pdf(10)
1.2039930378537762

>>> el_nino = NormalDist(4, 2.5)
>>> temperature_feb += el_nino # Add in a climate effect
>>> temperature_feb
NormalDist(mu=10.0, sigma=6.830080526611674)

>>> temperature_feb * (9/5) + 32 # Convert to Fahrenheit
NormalDist(mu=50.0, sigma=12.294144947901014)
>>> temperature_feb.samples(3) # Generate random samples
[7.672102882379219, 12.000027119750287, 4.647488369766392]


tokenize
--------
Expand Down
Loading