ParaMonte Fortran 2.0.0
Parallel Monte Carlo and Machine Learning Library
This module contains classes and procedures for computing the properties related to the covariance matrices of a random sample.
More...
Data Types | |
interface | getVar |
Generate and return the variance of an input sample of type complex or real of shape (nsam), (ndim, nsam), or (nsam, ndim), where ndim is the number of data dimensions (the number of data attributes) and nsam is the number of data points. More... | |
interface | getVarCorrection |
Generate and return the bias correction factor for the computation of the variance of a (weighted) sample. More... | |
interface | getVarMerged |
Generate and return the (weighted) merged variance of a complex or real sample resulting from the merger of two separate (weighted) samples \(A\) and \(B\). More... | |
interface | setVar |
Return the variance of an input (weighted) sample of type complex or real of shape (nsam), (ndim, nsam), or (nsam, ndim), where ndim is the number of data dimensions (the number of data attributes) and nsam is the number of data points. More... | |
interface | setVarMean |
Return the (weighted) sample variance and mean of an input time series of nsam observations, or of an input sample of nsam observations with ndim attributes, optionally weighted by the input weight, optionally also returning sum(weight). More... | |
interface | setVarMeanMerged |
Return the (weighted) merged variance and mean of a complex or real sample resulting from the merger of two separate (weighted) samples \(A\) and \(B\). More... | |
interface | setVarMerged |
Return the (weighted) merged variance of a complex or real sample resulting from the merger of two separate (weighted) samples \(A\) and \(B\). More... | |
Variables | |
character(*, SK), parameter | MODULE_NAME = "@pm_sampleVar" |
Variance is the squared deviation from the mean of a random variable.
The variance is also often defined as the square of the standard deviation.
Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.
It is the second central moment of a distribution, and the covariance of the random variable with itself.
It is frequently represented by \(\Sigma\), \(\sigma^2\), \(s^2\), \(\up{Var}(X)\), or \(\mathbb{V}(X)\).
An advantage of variance as a measure of dispersion is that it is more amenable to algebraic manipulation than other measures of dispersion such as the expected absolute deviation.
For example, the variance of a sum of uncorrelated random variables is equal to the sum of their variances.
A disadvantage of the variance for practical applications is that, unlike the standard deviation, its units differ from those of the random variable, which is why the standard deviation is more commonly reported as a measure of dispersion once the calculation is complete.
There are two distinct concepts that are both called variance.
One, as discussed above, is part of a theoretical probability distribution and is defined by an equation.
The other variance is a characteristic of a set of observations.
When variance is calculated from observations, those observations are typically measured from a real world system.
If all possible observations of the system are present then the calculated variance is called the population variance.
Normally, however, only a subset is available, and the variance calculated from this is called the sample variance.
The variance calculated from a sample is considered an estimate of the full population variance.
There are multiple ways to calculate an estimate of the population variance, as discussed in the section below.
The two kinds of variance are closely related.
To see how, consider that a theoretical probability distribution can be used as a generator of hypothetical observations.
If an infinite number of observations are generated using a distribution, then the sample variance calculated from that infinite set will match the value calculated using the distribution's equation for variance.
The term variance was apparently first introduced by Ronald Fisher in his 1918 paper The Correlation Between Relatives on the Supposition of Mendelian Inheritance.
The variance of a random variable \(X\) is the expected value of the squared deviation from the mean of \(X\), \(\mu= \up{E}[X]\):
\begin{equation} \up{Var}(X) = \up{E}\left[(X - \mu)^2\right] ~. \end{equation}
This definition encompasses random variables that are generated by processes that are discrete, continuous, neither, or mixed.
The variance can also be thought of as the covariance of a random variable with itself:
\begin{equation} \up{Var}(X) = \up{Cov}(X, X) ~. \end{equation}
The variance is also equivalent to the second cumulant of the probability distribution that generates \(X\).
The variance is typically designated as \(\up{Var}(X)\), or sometimes as \(V(X)\) or \(\mathbb{V}(X)\), or symbolically as \(\sigma_X^2\) or simply \(\sigma^2\).
The expression for the variance can be expanded as follows:
\begin{equation} \begin{aligned} \up{Var}(X) &= \up{E} \left[ (X - \up{E}[X])^2\right] \\ &= \up{E} \left[ X^2 - 2X\up{E}[X] + \up{E} [X]^{2} \right] \\ &= \up{E} \left[ X^2 \right] - 2\up{E}[X]\up{E}[X] + \up{E}[X]^{2} \\ &= \up{E} \left[ X^2 \right] - \up{E}[X]^{2} \end{aligned} \end{equation}
In other words, the variance of \(X\) is equal to the mean of the square of \(X\) minus the square of the mean of \(X\). However, this formulation should never be used for computations in floating-point arithmetic, because it suffers from catastrophic cancellation when the two terms of the subtraction are similar in magnitude.
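The cancellation can be seen concretely in a small sketch (plain Python for illustration, not the ParaMonte Fortran interfaces): for a sample with a huge mean and unit variance, the naive identity \(\up{E}[X^2] - \up{E}[X]^2\) loses all significant digits, while the two-pass formula does not.

```python
# Comparing the numerically unstable textbook identity Var(X) = E[X^2] - E[X]^2
# against the stable two-pass formula. With a mean of 1e9 and unit variance,
# the squared values are ~1e18, where float64 spacing (~128) swallows the
# variance signal (~1) entirely.
import numpy as np

rng = np.random.default_rng(12345)
x = 1.0e9 + rng.standard_normal(100_000)  # huge mean, unit variance

mean = x.mean()
var_naive = (x * x).mean() - mean**2    # E[X^2] - E[X]^2: cancellation-prone
var_twopass = ((x - mean) ** 2).mean()  # stable two-pass (biased) variance

# var_twopass is close to 1; var_naive can differ wildly and can even be negative.
print(var_naive, var_twopass)
```

The two-pass result is accurate to the statistical sampling error, whereas the naive result is dominated by rounding noise far larger than the true variance.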
Real-world observations, such as measurements of yesterday's rainfall throughout the day, typically cannot be complete sets of all possible observations that could be made.
As such, the variance calculated from the finite set will in general not match the variance that would have been calculated from the full population of possible observations.
This means that one estimates the mean and variance from a limited set of observations by using an estimator equation.
The estimator is a function of the sample of \(n\) observations drawn without observational bias from the whole population of potential observations.
The simplest estimators for population mean and population variance are simply the mean and variance of the sample, the sample mean and (uncorrected) sample variance.
These are consistent estimators (they converge to the correct value as the number of samples increases), but they can be improved.
Estimating the population variance by taking the sample variance is close to optimal in general, but can be improved in two ways.
The sample variance is computed as an average of squared deviations about the (sample) mean, by dividing by \(n\).
However, using values other than \(n\) improves the estimator in various ways.
Four common values for the denominator are \(n\), \(n − 1\), \(n + 1\), and \(n − 1.5\):
Firstly, if the true population mean is unknown, then the sample variance (which uses the sample mean in place of the true mean) is a biased estimator:
It underestimates the variance by a factor of \((n − 1) / n\).
Correcting by this factor (dividing by \(n − 1\) instead of \(n\)) is called the Bessel correction.
The resulting estimator is unbiased, and is called the corrected sample variance or unbiased sample variance.
For example, when \(n = 1\) the variance of a single observation about the sample mean (itself) is obviously zero regardless of the population variance.
If the mean is determined in some other way than from the same samples used to estimate the variance then this bias does not arise and the variance can safely be estimated as that of the samples about the (independently known) mean.
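The effect of the Bessel correction can be demonstrated with a self-contained sketch (plain Python illustrating the formulas above, not the ParaMonte getVar/setVar interfaces):

```python
# The uncorrected sample variance (dividing by n) underestimates the
# population variance by a factor of (n - 1)/n when the mean is estimated
# from the same sample; dividing by n - 1 (the Bessel correction) removes
# this bias.
def var_biased(x):
    n = len(x)
    m = sum(x) / n
    return sum((v - m) ** 2 for v in x) / n

def var_unbiased(x):
    n = len(x)
    m = sum(x) / n
    return sum((v - m) ** 2 for v in x) / (n - 1)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)

# The two estimators differ exactly by the factor n / (n - 1).
assert abs(var_unbiased(sample) - var_biased(sample) * n / (n - 1)) < 1e-12
print(var_biased(sample), var_unbiased(sample))  # 4.0 and 32/7
```

For this sample the mean is 5, the sum of squared deviations is 32, and the biased and unbiased estimates are \(32/8 = 4\) and \(32/7 \approx 4.57\), respectively.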
In many practical situations, the true variance of a population is not known a priori and must be computed.
When dealing with extremely large populations, it is not possible to count every object in the population, so the computation must be performed on a sample of the population.
This is generally referred to as sample variance or empirical variance.
Sample variance can also be applied to the estimation of the variance of a continuous distribution from a sample of that distribution.
We take a sample with replacement of \(n\) values \(X_1, \ldots, X_n\) from the population and estimate the variance on the basis of this sample.
Directly taking the variance of the sample data gives the average of the squared deviations:
\begin{equation} \tilde{\sigma}_X^2 = \frac{1}{n} \sum_{i = 1}^{n} \left( X_i - \overline{X} \right)^2 = \left( \frac{1}{n} \sum_{i = 1}^{n} X_i^2 \right) - \overline{X}^2 = \frac{1}{n^2} \sum_{i, j ~:~ i < j} \left( X_i - X_j \right)^2 ~. \end{equation}
Here, \(\overline{X}\) denotes the sample mean:
\begin{equation} \overline{X} = \frac{1}{n} \sum_{i = 1}^n X_i ~. \end{equation}
Since the \(X_i\) are selected randomly, both \(\overline{X}\) and \(\tilde{\sigma}_X^2\) are random variables.
Their expected values can be evaluated by averaging over the ensemble of all possible samples \(X_i\) of size \(n\) from the population.
For \(\tilde{\sigma}_X^2\) this gives:
\begin{equation} \begin{aligned} \up{E}[\tilde{\sigma}_X^2] &= \up{E} \left[ \frac{1}{n} \sum_{i = 1}^n \left( X_i - \frac{1}{n} \sum_{j = 1}^n X_{j} \right)^2 \right] \\ &= \frac{1}{n} \sum_{i = 1}^n \up{E} \left[ X_i^2 - \frac{2}{n} X_i \sum_{j = 1}^n X_j + \frac{1}{n^2} \sum_{j = 1}^n X_j \sum_{k = 1}^n X_k \right] \\ &= \frac{1}{n} \sum_{i = 1}^n \left( \frac{n - 2}{n} \up{E} \left[ X_i^2 \right] - \frac{2}{n} \sum_{j \neq i} \up{E} \left[ X_i X_j \right] + \frac{1}{n^2} \sum_{j = 1}^n \sum_{k \neq j}^n \up{E} \left[ X_j X_k \right] + \frac{1}{n^2} \sum_{j = 1}^n \up{E}\left[ X_j^2 \right] \right) \\ &= \frac{1}{n} \sum_{i = 1}^n \left[ \frac{n - 2}{n} \left( \sigma^2 + \mu^2 \right) - \frac{2}{n}(n - 1)\mu^2 + \frac{1}{n^2} n (n - 1) \mu^2 + \frac{1}{n} \left(\sigma^2 + \mu^2 \right)\right] \\ &= \frac{n - 1}{n} \sigma^2 ~. \end{aligned} \end{equation}
Hence \(\tilde{\sigma}_X^2\) gives an estimate of the population variance that is biased by a factor of \(\frac{n - 1}{n}\).
For this reason, \(\tilde{\sigma}_X^2\) is referred to as the biased sample variance.
The bias-correction factor in this case is \(\xi = \frac{n}{n - 1}\).
Correcting for this bias yields the unbiased sample variance, denoted \(s^2\):
\begin{equation} s^{2} = \frac{n}{n-1} \tilde{\sigma}_X^2 = \frac{n}{n - 1} \left[ \frac{1}{n} \sum_{i = 1}^n \left(X_i - \overline{X} \right)^2\right] = \frac{1}{n-1}\sum_{i = 1}^n\left(X_i - \overline{X} \right)^2 ~. \end{equation}
Either estimator may be simply referred to as the sample variance when the version can be determined by context.
The same proof is also applicable for samples taken from a continuous probability distribution.
The use of the term \(n − 1\) is called the Bessel correction, and it is also used in sample covariance and the sample standard deviation (the square root of variance).
The square root is a concave function and thus introduces a negative, distribution-dependent bias (by Jensen's inequality); consequently, the corrected sample standard deviation is biased.
The unbiased estimation of standard deviation is a technically involved problem, though for the normal distribution using the term \(n − 1.5\) yields an almost unbiased estimator.
Given a weighted sample of \(n\) observations \(X_{1:n}\) with a weighted sample mean \(\hat\mu_w\), the biased variance of the weighted sample is computed as,
\begin{equation} \hat\sigma_w^2 = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i = 1}^n w_i (X_i - \hat\mu_w)^2 ~, \end{equation}
where \(n\) (nsam) is the number of observations in the sample, \(w_i\) are the weights of the individual data points, and \(\hat\mu_w\) is the weighted mean of the sample.
When the sample size is small, the above equation yields a biased estimate of the variance.
There is no unique generic equation for the unbiased variance of a weighted sample.
However, depending on the type of weights involved, a few popular definitions exist.
Frequency weights represent the number of duplications of each observation in the sample whose population variance is to be estimated. For frequency weights, the unbiased weighted sample variance is,
\begin{equation} \hat\sigma_w^2 = \frac{\xi}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i (X_i - \hat\mu_w)^2 = \frac{1}{\left( \sum_{i=1}^{n} w_i \right) - 1} \sum_{i=1}^{n} w_i (X_i - \hat\mu_w)^2 ~, \end{equation}
where the bias correction factor \(\xi\) is,
\begin{equation} \xi = \frac{\sum_{i=1}^{n} w_i}{\left( \sum_{i=1}^{n} w_i \right) - 1} ~. \end{equation}
Reliability weights, by contrast, represent the relative importance of each observation in the sample. For reliability weights, the unbiased weighted sample variance is,
\begin{equation} \hat\sigma_w^2 = \frac{\xi}{\sum_{i=1}^{n}w_i} \sum_{i=1}^{n} w_i \left(X_i - \hat\mu_w\right)^2 = \frac{\sum_{i=1}^{n} w_i}{\left(\sum_{i=1}^{n}w_i\right)^2 - \left(\sum_{i=1}^{n}w_i^2\right)} \sum_{i=1}^{n} w_i \left(X_i - \hat\mu_w\right)^2 ~, \end{equation}
where the bias correction factor \(\xi\) is,
\begin{equation} \xi = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\left(\sum_{i=1}^{n}w_i\right)^2 - \left(\sum_{i=1}^{n}w_i^2\right)} ~. \end{equation}
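The frequency-weight and reliability-weight correction factors can be exercised with a short sketch (plain Python illustrating the equations, not the ParaMonte interfaces). With frequency weights, the corrected variance must equal the unbiased variance of the sample with each observation duplicated according to its weight:

```python
# Biased weighted variance plus the two common bias-correction factors:
# frequency weights use xi = W / (W - 1), reliability weights use
# xi = W^2 / (W^2 - sum(w_i^2)), where W = sum(w_i).
def weighted_var(x, w):
    W = sum(w)
    mu = sum(wi * xi for wi, xi in zip(w, x)) / W
    biased = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, x)) / W
    xi_freq = W / (W - 1)
    xi_rel = W**2 / (W**2 - sum(wi**2 for wi in w))
    return biased, xi_freq * biased, xi_rel * biased

x = [1.0, 2.0, 4.0]
w = [2, 1, 1]  # frequency weights: x is equivalent to [1, 1, 2, 4]

biased, freq_corrected, rel_corrected = weighted_var(x, w)

# The frequency-corrected value equals the ordinary unbiased variance of
# the expanded sample [1, 1, 2, 4].
expanded = [1.0, 1.0, 2.0, 4.0]
m = sum(expanded) / len(expanded)
unbiased_expanded = sum((v - m) ** 2 for v in expanded) / (len(expanded) - 1)
print(biased, freq_corrected, unbiased_expanded)  # 1.5, 2.0, 2.0
```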
Consider a sample \(X_{1:n}\) of size \(n\) with weights \(w_{1:n}\).
The weighted mean of this sample can be expressed as,
\begin{equation} \large \hat\mu_{1:n} = \frac{1}{w_{1:n}} \sum_{i = 1}^{n} w_i X_i ~, \end{equation}
where \(w_{1:n} = \sum_{i=1}^{n} w_i\), and the weighted biased variance of the sample can be expressed as,
\begin{equation} \large \tilde{\sigma}_{1:n}^2 = \frac{\hat\sigma_{1:n}^2}{\xi_{1:n}} = \frac{1}{w_{1:n}} \sum_{i = 1}^{n} w_i (X_i - \hat\mu_{1:n})^2 = \frac{1}{w_{1:n}} \sum_{i = 1}^{n} w_i X_i^2 - \hat\mu_{1:n}^2 ~, \end{equation}
where \(\xi_{1:n}\) is the appropriate bias-correction factor (which can be one).
This implies,
\begin{equation} \large \sum_{i = 1}^{n} w_i X_i^2 = w_{1:n} \left(\frac{\hat\sigma_{1:n}^2}{\xi_{1:n}} + \hat\mu_{1:n}^2\right) ~. \end{equation}
The above can be used to express the variance of the sample \(X_{1:n+m}\) resulting from the merger of two separate samples \(X_{1:n}\) and \(X_{n+1:n+m}\) as,
\begin{equation} \large \frac{\hat\sigma_{1:n+m}^2}{\xi_{1:n+m}} = \frac{1}{w_{1:n+m}} \left( w_{1:n} \left( \frac{\hat\sigma_{1:n}^2}{\xi_{1:n}} + \hat\mu_{1:n}^2 \right) + w_{n+1:n+m} \left(\frac{\hat\sigma_{n+1:n+m}^2}{\xi_{n+1:n+m}} + \hat\mu_{n+1:n+m}^2 \right) - w_{1:n+m} \hat\mu_{1:n+m}^2 \right) ~. \end{equation}
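The merging identity can be verified numerically. The following is a minimal sketch (plain Python, not the ParaMonte getVarMerged/setVarMerged interfaces) that merges the biased (\(\xi = 1\)) variances of two weighted samples and compares the result against a direct computation over the concatenated sample:

```python
# Merge the (biased) weighted variances of samples A and B using
# var = (W_A (var_A + mu_A^2) + W_B (var_B + mu_B^2) - W mu^2) / W,
# where W = W_A + W_B and mu is the merged weighted mean.
def wmean_wvar(x, w):
    W = sum(w)
    mu = sum(wi * xi for wi, xi in zip(w, x)) / W
    var = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, x)) / W
    return W, mu, var

def merge_var(Wa, mua, vara, Wb, mub, varb):
    W = Wa + Wb
    mu = (Wa * mua + Wb * mub) / W
    var = (Wa * (vara + mua**2) + Wb * (varb + mub**2) - W * mu**2) / W
    return W, mu, var

xa, wa = [1.0, 3.0, 5.0], [1.0, 2.0, 1.0]
xb, wb = [2.0, 8.0], [3.0, 1.0]

Wa, mua, vara = wmean_wvar(xa, wa)
Wb, mub, varb = wmean_wvar(xb, wb)
W, mu, var_merged = merge_var(Wa, mua, vara, Wb, mub, varb)

# Direct computation over the concatenated sample must agree.
_, mu_direct, var_direct = wmean_wvar(xa + xb, wa + wb)
print(var_merged, var_direct)  # identical up to roundoff
```

Note that the merged form subtracts two similar quantities, so in extreme cases it inherits the cancellation sensitivity discussed earlier for the \(\up{E}[X^2] - \up{E}[X]^2\) identity.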
As discussed above, the uncorrected sample variance is a biased estimate of the population variance, and the bias worsens when nsam is small.
Given a 2D sample, if the variance is to be computed for the entire sample (as opposed to computing it along a particular dimension), simply pass reshape(sample, shape = size(sample)) to the appropriate getVar interface.
Although the generic interfaces of this module could be extended to accept weight arguments of type integer or real of various kinds, such extensions do not appear to add any tangible benefits beyond making the interface more flexible for the end user. With integer (frequency) weights, the sum(weight) involved in the computation may lead to an integer overflow if the individual weights are too large, unless the weights are converted to real before summation, which adds an extra layer of unnecessary type coercion. Moreover, whenever an integer is multiplied with a real, the integer value must first be converted to a real of the same kind as the real value and then multiplied, and such a conversion to real will have to happen a second time when the weights are multiplied with the data values. Integer-real conversion followed by multiplication is generally not faster than pure real multiplication on modern hardware (see, e.g., this thread), particularly when all real values in the computation are of the same kind, as is the case here; the cost of the conversion is typically more than that of integer addition, but less than that of real multiplication. For these reasons, the interfaces of this module do not accept weight arguments of different type and kind beyond the provided integer and real numbers.
If \(x\) is a scalar complex-valued random variable, with values in \(\mathbb{C}\), then its variance is \(\up{E} \left[(x-\mu )(x-\mu )^{*}\right]\), where \(x^{*}\) is the complex conjugate of \(x\).
This variance is a real scalar.
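As a quick illustration (a plain Python sketch, not the ParaMonte complex-kind interfaces), each term \((x - \mu)(x - \mu)^{*}\) has an exactly zero imaginary part, so the resulting variance is a real, non-negative scalar:

```python
# Variance of a complex sample: E[(x - mu)(x - mu)*].
# Each product d * conj(d) = |d|^2 has imaginary part exactly zero,
# so the average is real and non-negative.
def cvar(z):
    n = len(z)
    mu = sum(z) / n
    return sum((v - mu) * (v - mu).conjugate() for v in z) / n

z = [1 + 2j, 3 - 1j, -2 + 0.5j, 0 + 0j]
v = cvar(z)
print(v)  # purely real: (4.421875+0j)
```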
If \(X\) is a vector-valued random variable, with values in \(\mathbb{R}^{n}\), and thought of as a column vector, then a natural generalization of variance is \(\up{E}\left[(X-\mu)(X-\mu)^{\up{T}}\right]\), where \(\mu = \up{E}(X)\) and \(X^{\up{T}}\) is the transpose of \(X\), and so is a row vector.
The result is a positive semi-definite square matrix, commonly referred to as the variance-covariance matrix (or simply as the covariance matrix).
If \(X\) is a vector- and complex-valued random variable, with values in \(\mathbb {C}^{n}\), then the covariance matrix is \(\up{E} \left[(X-\mu)(X-\mu)^{\dagger}\right]\), where \(X^{\dagger}\) is the conjugate transpose of \(X\).
This matrix is also positive semi-definite and square.
Another generalization of variance for vector-valued random variables \(X\), which results in a scalar value rather than in a matrix, is the generalized variance \(\det(C)\), the determinant of the covariance matrix.
The generalized variance can be shown to be related to the multidimensional scatter of points around their mean.
A different generalization is obtained by considering the variance of the Euclidean distance between the random variable and its mean.
This results in \(\up{E} \left[(X-\mu)^{\up {T}}(X-\mu)\right] = \up{tr}(C)\), which is the trace of the covariance matrix.
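These matrix generalizations are straightforward to compute. Below is a sketch using Python/NumPy for illustration (not the ParaMonte covariance interfaces), forming the sample covariance matrix of a 3-dimensional sample, its determinant (the generalized variance), and its trace:

```python
# Covariance matrix C of a vector-valued sample, its determinant
# det(C) (the generalized variance), and its trace tr(C) (the total
# variance about the mean).
import numpy as np

rng = np.random.default_rng(0)
# Correlate a standard-normal sample via a lower-triangular mixing matrix
# with det = 2 * 1 * 0.5 = 1, so det(C) should be near det(A)^2 = 1.
A = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.3, 0.5]])
X = rng.standard_normal((10_000, 3)) @ A

C = np.cov(X, rowvar=False)   # 3x3 variance-covariance matrix (ddof = 1)
gen_var = np.linalg.det(C)    # scalar generalized variance
total_var = np.trace(C)       # trace = sum of per-component variances

# tr(C) equals the (n - 1)-normalized mean squared Euclidean distance
# of the sample from its mean; C is positive semi-definite.
mu = X.mean(axis=0)
assert np.isclose(total_var, ((X - mu) ** 2).sum() / (len(X) - 1))
assert np.all(np.linalg.eigvalsh(C) >= 0)
print(gen_var, total_var)
```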
Due to unresolved gfortran bugs with parameterized derived types (PDTs), and to retain gfortran support, separate generic interfaces were instead developed for the different sample weight types. Once the gfortran PDT bugs are resolved, the getVar generic interface can be extended to serve as a high-level wrapper for the weight-specific generic interfaces in this module.
Final Remarks ⛓
If you believe this algorithm or its documentation can be improved, we appreciate your contribution and help to edit this page's documentation and source file on GitHub.
For details on the naming abbreviations, see this page.
For details on the naming conventions, see this page.
This software is distributed under the MIT license with additional terms outlined below.
This software is available to the public under a highly permissive license.
Help us justify its continued development and maintenance by acknowledging its benefit to society, distributing it, and contributing to it.
character(*, SK), parameter pm_sampleVar::MODULE_NAME = "@pm_sampleVar"
Definition at line 403 of file pm_sampleVar.F90.