Normal distribution



         


The normal distribution is an extremely important probability distribution in many fields. It is also called the Gaussian distribution, especially in physics and engineering. It is actually a family of distributions of the same general form, differing only in their location and scale parameters: the mean and standard deviation. The standard normal distribution is the normal distribution with a mean of zero and a standard deviation of one. Because the graph of its probability density resembles a bell, it is often called the bell curve.

History

The normal distribution was first introduced by de Moivre in an article in 1733 (reprinted in the second edition of his The Doctrine of Chances, 1738) in the context of approximating certain binomial distributions for large n. His result was extended by Laplace in his book Analytical Theory of Probabilities (1812), and is now called the Theorem of de Moivre-Laplace.

Laplace used the normal distribution in the analysis of errors of experiments. The important method of least squares was introduced by Legendre in 1805. Gauss, who claimed to have used the method since 1794, justified it rigorously in 1809 by assuming a normal distribution of the errors.

The name "bell curve" goes back to Jouffret who used the term "bell surface" in 1872 for a bivariate normal with independent components. The name "normal distribution" was coined independently by Charles S. Peirce, Francis Galton and Wilhelm Lexis around 1875 [Stigler]. This terminology is unfortunate, since it reflects and encourages the fallacy that many or all probability distributions are 'Normal'. (See the discussion of "occurrence" below).

That the distribution is called the normal or Gaussian distribution is an instance of Stigler's law of eponymy: "No scientific discovery is named after its original discoverer".

Specification of the normal distribution

There are various ways to specify a random variable. The most visual is the probability density function (plot at the top), which describes the relative likelihood of each value of the random variable. The cumulative distribution function is a conceptually cleaner way to specify the same information, but to the untrained eye its plot is much less informative (see below). Equivalent ways to specify the normal distribution are: the moments, the cumulants, the characteristic function, the moment-generating function, and the cumulant-generating function. Some of these are very useful for theoretical work, but not intuitive. See probability distribution for a discussion.

All of the cumulants of the normal distribution are zero, except the first two.

Probability density function

The probability density function of the normal distribution with mean μ and standard deviation σ (equivalently, variance σ²) is an example of a Gaussian function,

<math>f(x) = {1 \over \sigma\sqrt{2\pi} }\,e^{-{(x-\mu )^2 / 2\sigma^2}}<math>

(See also exponential function and pi.) If a random variable X has this distribution, we write X ~ N(μ, σ²). If μ = 0 and σ = 1, the distribution is called the standard normal distribution, with formula

<math>f(x) = {1 \over \sqrt{2\pi} }\,e^{-{x^2 / 2}}<math>

The picture at the top of this article gives the graph of the probability density function of the normal distribution with μ = 0 and several values of σ.

For all normal distributions, the density function is symmetric about its mean value. About 68% of the area under the curve is within one standard deviation of the mean, about 95.4% within two standard deviations, and about 99.7% within three standard deviations. The inflection points of the curve occur one standard deviation away from the mean.
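
The areas quoted above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names `normal_pdf` and `area_within` are made up for illustration, and the midpoint rule with 10,000 steps is simply an assumed, convenient way to approximate the integral of the density.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area_within(k, mu=0.0, sigma=1.0, steps=10_000):
    """Midpoint-rule estimate of the area under the density
    on the interval [mu - k*sigma, mu + k*sigma]."""
    left = mu - k * sigma
    width = 2.0 * k * sigma / steps
    return width * sum(
        normal_pdf(left + (i + 0.5) * width, mu, sigma) for i in range(steps)
    )

# The fractions do not depend on the (arbitrarily chosen) mean and scale.
for k in (1, 2, 3):
    print(k, round(area_within(k, mu=5.0, sigma=2.0), 4))
```

Because the result depends only on the number of standard deviations k, any choice of μ and σ > 0 should give the same three fractions.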

Cumulative distribution function

The cumulative distribution function (hereafter cdf) is defined as the probability that a variable X has a value less than or equal to x, and it is expressed in terms of the density function as

<math>\Pr(X \le x) = \int_{-\infty}^x \frac{1}{\sigma\sqrt{2\pi}} e^{-(u-\mu)^2/(2\sigma^2)}\,du<math>

The standard normal cdf, conventionally denoted <math>\Phi<math>, is just the general cdf evaluated with <math>\mu=0<math> and <math>\sigma=1<math>,

<math>\Phi(z) = \int_{-\infty}^z {1 \over \sqrt{2\pi} }\,e^{-{x^2 / 2}}\,dx<math>

The standard normal cdf can be expressed in terms of a special function called the error function, as

<math>\Phi(z) = \frac{1}{2} \left(1+\operatorname{erf}\,\frac{z}{\sqrt{2}}\right)<math>

The following graph shows the cumulative distribution function for values of z from -4 to +4:

On this graph, we see the probability that a standard normal variable has a value less than 0.25 is approximately equal to 0.60.
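
The error-function expression for Φ is easy to evaluate directly. The sketch below assumes only Python's standard `math.erf`; the name `standard_normal_cdf` is chosen here for illustration.

```python
import math

def standard_normal_cdf(z):
    """Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The value read off the graph above: Pr(Z <= 0.25) is roughly 0.60.
print(round(standard_normal_cdf(0.25), 4))  # 0.5987
```

By symmetry of the density, `standard_normal_cdf(-z)` should equal `1 - standard_normal_cdf(z)` up to rounding error.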

Generating functions

Moment generating function

The moment generating function is defined as the expected value of <math>e^{tX}<math>. For a normal distribution, the moment generating function is

<math>M_X(t)=E\left[e^{tX}\right]=e^{\mu t+\sigma^2 t^2/2}<math>


Characteristic function

The characteristic function is defined as the expected value of <math>e^{itX}<math>. For a normal distribution, it can be shown the characteristic function is

<math>\phi_X(t)=E\left[e^{itX}\right]=\int_{-\infty}^{\infty} \frac{1} {\sigma\sqrt{2\pi}}\,e^{-{(x-\mu )^2 / 2\sigma^2}}\,e^{itx}\,dx = e^{i\mu t-\sigma^2 t^2/2}<math>

as can be seen by completing the square in the exponent.
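
As a sanity check on the closed form, the defining integral can be approximated numerically and compared with it. This is only a sketch under stated assumptions: the names `char_fn_closed` and `char_fn_numeric` are hypothetical, the integral is truncated to ten standard deviations either side of the mean, and a plain midpoint rule is used.

```python
import cmath
import math

def char_fn_closed(t, mu, sigma):
    """Closed form exp(i*mu*t - sigma^2 * t^2 / 2)."""
    return cmath.exp(1j * mu * t - 0.5 * sigma ** 2 * t ** 2)

def char_fn_numeric(t, mu, sigma, half_width=10.0, steps=20_000):
    """Midpoint-rule estimate of E[exp(itX)] for X ~ N(mu, sigma^2),
    integrating over mu +/- half_width * sigma (the tails beyond
    that range are assumed negligible)."""
    left = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0j
    for i in range(steps):
        x = left + (i + 0.5) * h
        density = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (
            sigma * math.sqrt(2.0 * math.pi)
        )
        total += density * cmath.exp(1j * t * x)
    return total * h

# Illustrative parameter values; any real t, mu and sigma > 0 could be used.
t, mu, sigma = 0.7, 1.5, 2.0
print(abs(char_fn_numeric(t, mu, sigma) - char_fn_closed(t, mu, sigma)))
```

The printed difference should be very close to zero if the formula above is right.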

Properties

  1. If X ~ N(μ, σ²) and a and b are real numbers, then aX + b ~ N(aμ + b, (aσ)²).
  2. If X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²), and X₁ and X₂ are independent, then X₁ + X₂ ~ N(μ₁ + μ₂, σ₁² + σ₂²).
  3. If X₁, ..., Xₙ are independent standard normal variables, then X₁² + ... + Xₙ² has a chi-squared distribution with n degrees of freedom.
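
Properties 1 and 2 can be illustrated with the `NormalDist` class from Python's standard `statistics` module (Python 3.8 or later), whose arithmetic operators follow the same rules. The parameter values below are arbitrary, and addition of two `NormalDist` objects treats them as independent, matching the assumption in Property 2.

```python
from statistics import NormalDist

# Property 1: an affine transformation aX + b of a normal variable.
X = NormalDist(mu=1.0, sigma=2.0)
a, b = -3.0, 4.0
Y = a * X + b
print(Y.mean, Y.stdev)      # mean a*mu + b, standard deviation |a|*sigma

# Property 2: the sum of two independent normal variables.
X1 = NormalDist(mu=1.0, sigma=3.0)
X2 = NormalDist(mu=2.0, sigma=4.0)
S = X1 + X2
print(S.mean, S.variance)   # means add, variances add
```

Note that the standard deviation of aX + b is |a|σ, so a negative a flips the sign of the variable but not of the scale.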

Standardizing normal random variables

As a consequence of Property 1, it is possible to relate all normal random variables to the standard normal.

If X is a normal random variable with mean μ and variance σ², then

<math> Z = \frac{X - \mu}{\sigma} <math>

is a standard normal random variable: Z~N(0,1). An important consequence is that the cdf of a general normal distribution is therefore

<math>\Pr(X \le x) = \Phi\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{2} \left(1+\mbox{erf}\,\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)<math>

Conversely, if Z is a standard normal random variable, then

<math>X=\sigma Z+\mu \,<math>

is a normal random variable with mean μ and variance σ².

The standard normal distribution has been tabulated, and the other normal distributions are simple transformations of the standard one. Therefore, one can use tabulated values of the cdf of the standard normal distribution to find values of the cdf of a general normal distribution.
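
The table-lookup procedure described above amounts to standardizing first and then evaluating the standard normal cdf. A small sketch, using `statistics.NormalDist` from the Python standard library in place of a printed table; the values of μ, σ and x are illustrative assumptions only.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0   # illustrative parameters
x = 130.0

# Step 1: standardize.
z = (x - mu) / sigma

# Step 2: look up the standard normal cdf at z.
via_standard = NormalDist().cdf(z)

# The same probability computed directly from the general distribution.
direct = NormalDist(mu, sigma).cdf(x)

print(z, round(via_standard, 4), round(direct, 4))

# The converse transformation X = sigma * Z + mu recovers x from z.
print(sigma * z + mu)
```

Both routes should agree to within floating-point rounding, which is the content of the relation Pr(X ≤ x) = Φ((x − μ)/σ).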

Generating normal random variables

For computer simulations, it is often useful to generate values that have a normal distribution. There are several methods; the most basic is to invert the standard normal cdf. More efficient methods are also known. One such method is the Box-Muller transform. The Box-Muller transform takes two uniformly distributed values as input and maps them to two normally distributed values. This requires generating values from a uniform distribution, for which many methods are known. See also random number generators.

The Box-Muller transform is a consequence of Property 3 and the fact that the chi-squared distribution with two degrees of freedom is an exponential distribution (which is easy to generate).
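
A minimal sketch of the basic (trigonometric) form of the Box-Muller transform follows. The function names `box_muller` and `normal_samples` are made up for this example, and Python's built-in `random` module is assumed as the source of uniform values. The squared radius −2 ln U₁ is the exponentially distributed quantity mentioned above, and the angle 2πU₂ is uniform on the circle.

```python
import math
import random

def box_muller(rng=random):
    """Map two independent uniform values to two independent
    standard normal values."""
    u1 = 1.0 - rng.random()   # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def normal_samples(n, mu=0.0, sigma=1.0, seed=None):
    """Draw n values from N(mu, sigma^2) using X = sigma * Z + mu."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        z0, z1 = box_muller(rng)
        out.extend((mu + sigma * z0, mu + sigma * z1))
    return out[:n]

samples = normal_samples(20_000, mu=3.0, sigma=2.0, seed=1)
mean = sum(samples) / len(samples)
sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(round(mean, 2), round(sd, 2))
```

The sample mean and standard deviation should come out close to the requested 3 and 2. In practice one would normally rely on a library routine such as `random.gauss` rather than hand-written code.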

The central limit theorem

The normal distribution has the very important property that under certain conditions, the distribution of a sum of a large number of independent variables is approximately normal. This is the so-called central limit theorem.

The practical importance of the central limit theorem is that the normal distribution can be used as an approximation to some other distributions.

Whether these approximations are sufficiently accurate depends on the purpose for which they are needed, and the rate of convergence to the normal distribution. It is typically the case that such approximations are less accurate in the tails of the distribution.
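
The approximation can be explored by simulation. The sketch below is an illustration under assumed choices: it sums 30 independent Uniform(0, 1) values (each with mean 1/2 and variance 1/12), standardizes the sum, and compares how often the result falls within one and within three standard deviations against the normal figures of about 0.6827 and 0.9973. The helper name, the number of terms and the seed are all arbitrary.

```python
import random

def standardized_sum_of_uniforms(n_terms, rng):
    """Sum n_terms Uniform(0, 1) draws, subtract the mean n_terms / 2
    and divide by the standard deviation sqrt(n_terms / 12)."""
    total = sum(rng.random() for _ in range(n_terms))
    return (total - n_terms / 2.0) / (n_terms / 12.0) ** 0.5

rng = random.Random(0)
draws = [standardized_sum_of_uniforms(30, rng) for _ in range(20_000)]

within_one = sum(abs(d) <= 1.0 for d in draws) / len(draws)
within_three = sum(abs(d) <= 3.0 for d in draws) / len(draws)
print(within_one, within_three)
```

Changing the number of terms or replacing the uniform distribution with a more skewed one gives a feel for the rate of convergence, including in the tails.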

Infinite divisibility

The normal distributions are infinitely divisible probability distributions.

Occurrence

Approximately normal distributions occur in many situations, as a result of the central limit theorem. When there is reason to suspect the presence of a large number of small effects acting additively and independently, it is reasonable to assume that observations will be normal. There are statistical methods to empirically test that assumption.

Effects can also act as multiplicative (rather than additive) modifications. In that case, the assumption of normality is not justified, and it is the logarithm of the variable of interest that is normally distributed. The distribution of the directly observed variable is then called log-normal.

Finally, if there is a single external influence which has a large effect on the variable under consideration, the assumption of normality is not justified either. This is true even if, when the external variable is held constant, the resulting conditional distributions are indeed normal. The full distribution will be a superposition of normal variables, which is not in general normal. This is related to the theory of errors (see below).

The sections below discuss situations where approximate normality is sometimes assumed.

Of relevance to biology and economics is the fact that complex systems tend to display power laws rather than normality.

Photon counts

Light intensity from a single source varies with time, and is usually assumed to be normally distributed. However, quantum mechanics interprets measurements of light intensity as photon counting. Ordinary light sources, which produce light by thermal emission, should follow a Poisson distribution or Bose-Einstein distribution on very short time scales. On longer time scales (longer than the coherence time), the addition of independent variables yields an approximately normal distribution. The intensity of laser light, which is a quantum phenomenon, has an exactly normal distribution.

Measurement errors

Repeated measurements of the same quantity are expected to yield results which are clustered around a particular value. If all major sources of errors have been taken into account, it is assumed that the remaining error must be the result of a large number of very small additive effects, and hence normal. Deviations from normality are interpreted as indications of systematic errors which have not been taken into account. Note that this is the central assumption of the mathematical theory of errors.

Financial variables

Owing to the compounding nature of interest and inflation, financial indicators such as interest rates, stock values, or commodity prices make good examples of multiplicative behaviour. As such, they should not be expected to be normal, but lognormal.

Benoît Mandelbrot, the popularizer of fractals, has claimed that even the assumption of lognormality is flawed, and advocates the use of log-Levy distributions.

It is accepted that financial indicators deviate from lognormality. The distribution of price changes on short time scales is observed to have "heavy tails", so that very small or very large price changes are more likely to occur than a lognormal model would predict. Deviation from lognormality indicates that the assumption of independence of the multiplicative influences is flawed.

Lifetime

Other examples of variables that are not normally distributed include the lifetimes of humans or mechanical devices. Examples of distributions used in this connection are the exponential distribution (memoryless) and the Weibull distribution. In general, there is no reason to expect such lifetimes to be normally distributed.

Test scores

IQ scores and other ability scores are often assumed to be approximately normally distributed. For most IQ tests, the mean is 100 and the standard deviation is 15.

Criticisms: test scores are discrete variables associated with the number of correct or incorrect answers, and as such they are related to the binomial distribution. Moreover, raw IQ test scores are customarily 'massaged' to force the distribution of IQ scores to be normal. Finally, there is no widely accepted model of intelligence, and the link to IQ scores, let alone a relationship between influences on intelligence and additive variations of IQ, is subject to debate.

Maximum likelihood estimation of parameters

Suppose

<math>X_1,\dots,X_n<math>

are independent and identically distributed, and are normally distributed with expectation μ and variance σ². In the language of statisticians, the observed values of these random variables make up a "sample from a normally distributed population." It is desired to estimate the "population mean" μ and the "population standard deviation" σ, based on observed values of this sample. The joint probability density function of these random variables is

<math>f(x_1,\dots,x_n) \propto \sigma^{-n} \prod_{i=1}^n \exp\left({-1 \over 2} \left({x_i-\mu \over \sigma}\right)^2\right)<math>

(Nota bene: Here the proportionality symbol <math>\propto<math> means proportional as a function of <math>\mu<math> and <math>\sigma<math>, not proportional as a function of <math>x_1,\dots,x_n<math>. That may be considered one of the differences between the statistician's point of view and the probabilist's point of view. The reason why this is important will appear below.)

As a function of μ and σ this is the likelihood function

<math>L(\mu,\sigma) \propto \sigma^{-n} \exp\left({-\sum_{i=1}^n (x_i-\mu)^2 \over 2\sigma^2}\right).<math>

In the method of maximum likelihood, the values of μ and σ that maximize the likelihood function are taken to be estimates of the population parameters μ and σ.

Usually in maximizing a function of two variables one might consider partial derivatives. But here we will exploit the fact that the value of μ that maximizes the likelihood function with σ fixed does not depend on σ. Therefore, we can find that value of μ, then substitute it for μ in the likelihood function, and finally find the value of σ that maximizes the resulting expression.

It is evident that the likelihood function is a decreasing function of the sum

<math>\sum_{i=1}^n (x_i-\mu)^2.<math>

So we want the value of μ that minimizes this sum. Let

<math>\overline{x}=(x_1+\cdots+x_n)/n<math>

be the "sample mean". Observe that

<math>\sum_{i=1}^n (x_i-\mu)^2=\sum_{i=1}^n((x_i-\overline{x})+(\overline{x}-\mu))^2<math>

<math>=\sum_{i=1}^n(x_i-\overline{x})^2 + 2\sum_{i=1}^n (x_i-\overline{x})(\overline{x}-\mu) + \sum_{i=1}^n (\overline{x}-\mu)^2<math>

<math>=\sum_{i=1}^n(x_i-\overline{x})^2 + 0 + n(\overline{x}-\mu)^2.<math>

Only the last term depends on μ and it is minimized by

<math>\hat{\mu}=\overline{x}.<math>

That is the maximum-likelihood estimate of μ. Substituting that for μ in the sum above makes the last term vanish. Consequently, when we substitute that estimate for μ in the likelihood function, we get

<math>L(\hat{\mu},\sigma) \propto \sigma^{-n} \exp\left({-\sum_{i=1}^n (x_i-\overline{x})^2 \over 2\sigma^2}\right).<math>

It is conventional to denote the "loglikelihood function", i.e., the logarithm of the likelihood function, by a lower-case <math>\ell<math>, and we have

<math>\ell(\hat{\mu},\sigma)=[\mathrm{constant}]-n\log(\sigma)-{\sum_{i=1}^n(x_i-\overline{x})^2 \over 2\sigma^2}<math>

and then

<math>{\partial \over \partial\sigma}\ell(\hat{\mu},\sigma) ={-n \over \sigma} +{\sum_{i=1}^n (x_i-\overline{x})^2 \over \sigma^3} ={-n \over \sigma^3}\left(\sigma^2-{1 \over n}\sum_{i=1}^n (x_i-\overline{x})^2 \right).<math>

This derivative is positive, zero, or negative according as σ² is between 0 and

<math>{1 \over n}\sum_{i=1}^n(x_i-\overline{x})^2,<math>

or equal to that quantity, or greater than that quantity.

Consequently this average of squares of residuals is the maximum-likelihood estimate of σ², and its square root is the maximum-likelihood estimate of σ.
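
The two estimates derived above are straightforward to compute. The sketch below uses a small made-up data set and hypothetical function names (`normal_mle`, `log_likelihood`); the log-likelihood is written up to the additive constant, as in the derivation. Note that the sum of squared residuals is divided by n, not by n − 1.

```python
import math

def normal_mle(xs):
    """Maximum-likelihood estimates (mu_hat, sigma_hat) for a sample
    assumed to come from a normal distribution."""
    n = len(xs)
    mu_hat = sum(xs) / n                                   # sample mean
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n       # divide by n
    return mu_hat, math.sqrt(var_hat)

def log_likelihood(xs, mu, sigma):
    """Log-likelihood up to an additive constant:
    -n*log(sigma) - sum((x - mu)^2) / (2*sigma^2)."""
    n = len(xs)
    return -n * math.log(sigma) - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma ** 2)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up observations
mu_hat, sigma_hat = normal_mle(data)
print(mu_hat, sigma_hat)  # 5.0 2.0

# Nearby parameter values should give a lower log-likelihood.
best = log_likelihood(data, mu_hat, sigma_hat)
print(best > log_likelihood(data, mu_hat + 0.1, sigma_hat))
print(best > log_likelihood(data, mu_hat, sigma_hat * 1.1))
```

Comparing the log-likelihood at the estimates with its value at slightly perturbed parameters is a quick, informal check that the maximum has been located.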

Surprising generalization

The derivation of the maximum-likelihood estimator of the covariance matrix of a multivariate normal distribution is perhaps surprisingly subtle and elegant. It involves the spectral theorem and the reason why it can be better to view a scalar as the trace of a 1×1 matrix than as a mere scalar. See estimation of covariance matrices.

This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License.