notes-someStuffYouMightLikeToKnow-stuffChStats

Chapter: probability and statistics

pdf: like a histogram

joint, conditional, marginal probabilities

random variables vs ordinary variables

'statistic' is defined as "a function of a sample where the function itself is independent of the sample's distribution; that is, the function can be stated before realization of the data. The term statistic is used both for the function and for the value of the function on a given sample." -- [1]

Simpson's paradox

some other 'paradoxes': " For instance, naive students of probability may expect the average of a product to equal the product of the averages but quickly learn to guard against such expectations, given a few counterexamples. Likewise, students expect an association measured in a mixture distribution to equal a weighted average of the individual associations. They are surprised, therefore, when ratios of sums, (a+b)/(c+d), are found to be ordered differently than individual ratios, a/c and b/d. " -- http://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf
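These ratio-of-sums reversals are Simpson's paradox in miniature. A quick check with the often-quoted kidney-stone treatment numbers (Charig et al. 1986):

```python
# Treatment A wins in each subgroup, yet loses on the pooled ratio of sums.
a_small, a_small_n = 81, 87      # A successes/trials, small stones
b_small, b_small_n = 234, 270    # B successes/trials, small stones
a_large, a_large_n = 192, 263    # A successes/trials, large stones
b_large, b_large_n = 55, 80      # B successes/trials, large stones

# Individual ratios: A beats B in both subgroups.
assert a_small / a_small_n > b_small / b_small_n
assert a_large / a_large_n > b_large / b_large_n

# Ratios of sums: B beats A overall.
a_total = (a_small + a_large) / (a_small_n + a_large_n)  # 273/350 ≈ 0.78
b_total = (b_small + b_large) / (b_small_n + b_large_n)  # 289/350 ≈ 0.83
assert b_total > a_total
```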

also, the Base Rate Fallacy, eg:

" A group of policemen have breathalyzers displaying false drunkenness in 5% of the cases in which the driver is sober. However, the breathalyzers never fail to detect a truly drunk person. One in a thousand drivers is driving drunk. Suppose the policemen then stop a driver at random, and force the driver to take a breathalyzer test. It indicates that the driver is drunk. We assume you don't know anything else about him or her. How high is the probability he or she really is drunk?

Many would answer as high as 0.95, but the correct probability is about 0.02. " [6]
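Sanity-checking that number with Bayes' rule (variable names are mine):

```python
# P(drunk) = 0.001, P(positive | sober) = 0.05, P(positive | drunk) = 1.
p_drunk = 0.001
p_pos_given_drunk = 1.0
p_pos_given_sober = 0.05

# Total probability of a positive test, then Bayes' rule.
p_pos = p_pos_given_drunk * p_drunk + p_pos_given_sober * (1 - p_drunk)
p_drunk_given_pos = p_pos_given_drunk * p_drunk / p_pos

print(round(p_drunk_given_pos, 3))  # prints 0.02
```

The 0.95 intuition ignores the base rate: sober drivers are so common that their 5% false positives swamp the few true positives.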

algebra of expectations

https://en.wikipedia.org/wiki/Copula_%28probability_theory%29

robust statistics "are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, such as estimating location, scale and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from parametric distributions. For example, robust methods work well for mixtures of two normal distributions with different standard-deviations, for example, one and three; under this model, non-robust methods like a t-test work badly.". Eg "The median is a robust measure of central tendency, while the mean is not." "The median absolute deviation and interquartile range are robust measures of statistical dispersion, while the standard deviation and range are not." [7]
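A minimal illustration of the median's robustness, using made-up measurements:

```python
import statistics

# Median vs mean under one gross outlier: the median barely moves while
# the mean is dragged far from the bulk of the data.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
dirty = clean + [1000.0]  # one corrupted measurement

assert abs(statistics.mean(clean) - 10.0) < 1e-9
assert abs(statistics.mean(dirty) - 175.0) < 1e-9   # dragged far from ~10
assert abs(statistics.median(dirty) - 10.05) < 1e-9  # still near the bulk
```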

https://en.wikipedia.org/wiki/Pivotal_quantity and https://en.wikipedia.org/wiki/Ancillary_statistic

todo

can't really summarize here because it involves a lot of thinking about the translation between the real world and math (word problems)

bayes rule https://arbital.com/p/62c/?startPath

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/

max entropy distributions

If you are modeling something and you have a list of constraints that a probability distribution should satisfy, but you know nothing else, then a reasonable thing to do is to find the maximum entropy probability distribution satisfying those constraints, and use that one.

See table 4.10 of this for a list of probability distributions, and for what lists of constraints each one is the maximum entropy distribution.
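A small numeric check of the fixed-mean-and-variance case, where the Gaussian is the maximum entropy distribution; the closed-form differential entropies below (all unit variance) should put it on top:

```python
import math

# Differential entropies of three unit-variance distributions.
sigma2 = 1.0
h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma2)
h_uniform = math.log(math.sqrt(12 * sigma2))          # width chosen so var = sigma2
h_laplace = 1 + math.log(2 * math.sqrt(sigma2 / 2))   # scale chosen so var = sigma2

# The maximum entropy property: Gaussian beats the others at equal variance.
assert h_gaussian > h_uniform and h_gaussian > h_laplace
```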

nonparametric regression

http://fisher.osu.edu/~schroeder.9/AMIS900/ch6.pdf

linear models

http://fisher.osu.edu/~schroeder.9/AMIS900/ch3.pdf

loss functions

http://fisher.osu.edu/~schroeder.9/AMIS900/ch4.pdf

iid

abbreviation for 'independent and identically distributed'; applies to a set of random variables (for example, the heights of 10 people are represented by 10 random variables); means that (a) each of these random variables has the same distribution, and (b) these random variables are statistically independent of each other.

misc topics in probability

fuzzy set; credibility; http://www.gimac.uma.es/ipmu08/proceedings/papers/127-Li.pdf explains credibility motivation

fat tail, long tail, long tail vs fat tail

https://en.wikipedia.org/wiki/Heavy-tailed_distribution

Some distributions

See https://en.wikipedia.org/wiki/Probability_distribution#Common_probability_distributions

Uniform

The simplest distribution.

Gaussian

Synonym: "normal distribution"

This is the go-to distribution for most situations (although some people argue that this can be very deceiving, due to its lack of heavy tails). It is good for real-valued quantities that can be positive or negative and that 'grow linearly' [8].

Minimum and maximum values are infinite.

Stable: sums of independent Gaussians are again Gaussian.

Central limit theorem: sums of many iid random variables (with finite variance) are approximately Gaussian.
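A sketch of the central limit theorem by simulation (n = 30 uniforms per sum and 20,000 trials are arbitrary choices):

```python
import random

# Sums of n iid Uniform(0,1) draws have mean n/2 and variance n/12;
# the standardized sums should look roughly standard normal
# (e.g. ~68% of them within one standard deviation of 0).
random.seed(0)
n, trials = 30, 20000
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]
z = [(s - n / 2) / (n / 12) ** 0.5 for s in sums]

within_1sd = sum(abs(v) < 1 for v in z) / trials
print(round(within_1sd, 2))  # close to 0.68, as for a Gaussian
```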

For real-valued things which 'grow exponentially'

Lognormal

The distribution of a random variable whose log has a Gaussian distribution.

Minimum value is 0, maximum value infinite.

"the skewness and the kurtosis of a standard LogNormal distribution are equal, respectively, to 6 and 114" Lectura/Archivos curso Riesgo Operativo/moscadelli 2004.pdf

Considered a medium-tail distribution. [9].
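A quick simulation of the exp-of-a-Gaussian definition (sample size is arbitrary):

```python
import math
import random

# Exponentiating standard normal draws gives lognormal values:
# always positive, and the logs are normal again.
random.seed(1)
xs = [math.exp(random.gauss(0.0, 1.0)) for _ in range(10000)]

assert min(xs) > 0  # support is (0, inf)

logs = [math.log(x) for x in xs]
mean_log = sum(logs) / len(logs)
assert abs(mean_log) < 0.05  # near 0, the mean of the underlying Gaussian
```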

Pareto

The distribution of a random variable whose log has an exponential distribution.

"the prototypical power law distribution" [10]

Minimum value is positive, maximum value infinite.

Considered a heavy-tail distribution.

A continuous analog of the Zeta distribution. [11]

Generalized Pareto distributions: see the Generalised Pareto Distribution (GPD) under Extreme value theory below.

For binary-valued (or n-valued) things

Bernoulli distribution: a (possibly biased) coinflip (generalization to more than 2 options: 'categorical distribution')

Binomial distribution: the count of heads over a fixed number of coinflips. (generalization to more than 2 options: 'multinomial distribution'; the counts of each type of outcome over a fixed number of trials)

Geometric distribution: over many coinflips, the count of tails before the first head

Negative binomial distribution: (generalization of the geometric distribution) over many coinflips, the count of tails before a given count of heads is reached

Hypergeometric distribution: number of items that meet some condition given a fixed number of total draws from a fixed population, sampling without replacement (generalization to more than 2 options: 'Multivariate hypergeometric distribution')

Beta-binomial distribution: like a binomial distribution but the probability of heads is not fixed, but random, and follows the Beta distribution
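A simulation sketch tying a few of these together (p = 0.3, the helper names, and the sample sizes are my choices):

```python
import random

random.seed(2)
p, trials = 0.3, 50000

def bernoulli():
    """One (biased) coinflip: 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

# Binomial(10, p): count of heads over 10 coinflips; mean should be 10*p.
binom_draws = [sum(bernoulli() for _ in range(10)) for _ in range(trials)]
assert abs(sum(binom_draws) / trials - 10 * p) < 0.05

def tails_before_heads(r):
    """Count of tails before the r-th head (negative binomial; r=1 is geometric)."""
    tails = heads = 0
    while heads < r:
        if bernoulli():
            heads += 1
        else:
            tails += 1
    return tails

# Geometric mean is (1-p)/p; negative binomial mean is r*(1-p)/p.
geom_draws = [tails_before_heads(1) for _ in range(trials)]
assert abs(sum(geom_draws) / trials - (1 - p) / p) < 0.05

negbin_draws = [tails_before_heads(3) for _ in range(trials)]
assert abs(sum(negbin_draws) / trials - 3 * (1 - p) / p) < 0.1
```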

For Poisson processes (events that occur with a given rate)

Poisson

count of events in a given period of time

Exponential

time before next event occurs

not to be confused with the concept of an "exponential family" of distributions!

Considered a medium-tail distribution. [12].

Gamma

time before next k events occur

also, see below for use as a conjugate prior

Considered a medium-tail distribution. [13].
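All three Poisson-process distributions can be checked from one simulation built on exponential gaps (the rate and sample sizes are arbitrary):

```python
import random

# One Poisson process, three distributions: exponential inter-arrival
# gaps, Poisson counts per unit time, Gamma waiting time to the k-th event.
random.seed(3)
rate, trials = 2.0, 20000

def count_in_unit_interval():
    """Count events in [0, 1] by accumulating exponential gaps."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(rate)
        if t > 1.0:
            return n
        n += 1

# Counts per unit time: Poisson with mean rate * 1 = 2.
counts = [count_in_unit_interval() for _ in range(trials)]
assert abs(sum(counts) / trials - rate) < 0.05

# Waiting time until k = 3 events: Gamma with mean k / rate = 1.5.
waits = [sum(random.expovariate(rate) for _ in range(3)) for _ in range(trials)]
assert abs(sum(waits) / trials - 3 / rate) < 0.05
```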

For sums of squares of Gaussians

Chi-squared

sum of squares of Gaussians, eg sample variances

Student's t

the ratio of a Gaussian to the square root of a (scaled) Chi-squared; "useful for inference regarding the mean of normally distributed samples with unknown variance (see Student's t-test)" [14]

F ("F-distribution")

ratio of two chi squared random variables; "useful e.g. for inferences that involve comparing variances or involving R-squared (the squared correlation coefficient)" [15]
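These three can be built directly from Gaussian draws, following the definitions above (df = 5 and the tolerances are my choices):

```python
import random

random.seed(4)
df, trials = 5, 30000

def chi2():
    """One chi-squared(df) draw: sum of df squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

# Chi-squared: E[chi2_k] = k.
chi2_draws = [chi2() for _ in range(trials)]
assert abs(sum(chi2_draws) / trials - df) < 0.1

# Student's t: Gaussian over sqrt(chi2/df); symmetric around 0.
t_draws = [random.gauss(0, 1) / (chi2() / df) ** 0.5 for _ in range(trials)]
assert abs(sum(t_draws) / trials) < 0.05

# F: ratio of two independent scaled chi-squareds; E[F] = df/(df-2) for df > 2.
f_draws = [(chi2() / df) / (chi2() / df) for _ in range(trials)]
assert abs(sum(f_draws) / trials - df / (df - 2)) < 0.15
```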

Useful as conjugate prior distributions in Bayesian inference

https://en.wikipedia.org/wiki/Probability_distribution#Common_probability_distributions

Beta

https://en.wikipedia.org/wiki/Beta_distribution

"for a single probability (real number between 0 and 1); conjugate to the Bernoulli distribution and binomial distribution" [16]
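The conjugate update is just addition; a sketch with a uniform prior (the numbers are made up):

```python
# Beta-Bernoulli conjugacy: with a Beta(a, b) prior on the heads
# probability, observing h heads and t tails gives posterior Beta(a+h, b+t).
a, b = 1, 1          # Beta(1, 1) = uniform prior
heads, tails = 7, 3  # observed flips
a_post, b_post = a + heads, b + tails

# Posterior mean 8/12 = 2/3: between the data mean 0.7 and the prior mean 0.5.
print(a_post / (a_post + b_post))
```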

Dirichlet

a vector of probabilities that must sum to 1; generalization of beta

conjugate to the categorical distribution and multinomial distribution

Gamma

see above (under Poisson processes)

conjugate to a "non-negative scaling parameter; conjugate to the rate parameter of a Poisson distribution or exponential distribution, the precision (inverse variance) of a normal distribution, etc." [17]

A generalization of the gamma distribution is the Wishart distribution, which is the conjugate prior of the covariance matrix of a multivariate normal distribution (a symmetric non-negative definite matrix). [18]
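Same additive pattern for the Gamma-Poisson case (the prior and data are made up):

```python
# Gamma-Poisson conjugacy: a Gamma(alpha, beta) prior on a Poisson rate,
# updated on observed counts, stays Gamma: alpha gains the sum of the
# counts, beta gains the number of observations.
alpha, beta = 2.0, 1.0    # prior: mean rate alpha/beta = 2
counts = [3, 5, 4, 6, 2]  # observed Poisson counts

alpha_post = alpha + sum(counts)  # 2 + 20 = 22
beta_post = beta + len(counts)    # 1 + 5 = 6
print(alpha_post / beta_post)     # posterior mean rate ≈ 3.67
```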

Extreme value theory

"EVT is applied to real data in two related ways. The first approach (see Reiss and Thomas, 2001, p. 14 ff) deals with the maximum (or minimum) values the variable takes in successive periods, for example months or years.

These observations constitute the extreme events, also called block (or per-period) maxima. At the heart of this approach is the “three-types theorem” (Fisher and Tippet, 1928), which states that there are only three types of distributions which can arise as limiting distributions of extreme values in random samples: the Weibull type, the Gumbel type and the Frechet type. This result is very important, since the asymptotic distribution of the maxima always belongs to one of these three distributions, regardless of the original one. Therefore the majority of the distributions used in finance and actuarial sciences can be divided into these three classes, according to their tail-heaviness:

The Weibull, Gumbel and Frechet distributions can be represented in a single three parameter model, known as the Generalised Extreme Value distribution (GEV)

... The parameters μ and σ correspond to location and scale; the third parameter, ξ , called the shape index, indicates the thickness of the tail of the distribution. The larger the shape index, the thicker the tail. ...

The second approach to EVT (see Reiss and Thomas, 2001, p. 23 ff) is the Peaks Over Threshold (POT) method, tailored for the analysis of data bigger than preset high thresholds. The severity component of the POT method is based on a distribution (Generalised Pareto Distribution - GPD), whose cumulative function is usually expressed as the following two parameter distribution ... It is possible to extend the family of the GPD distributions by adding a location parameter μ.

The interpretation of ξ in the GPD is the same as in the GEV, since all the relevant information on the tail of the original (unknown) overall distribution is embedded in this parameter 18 : when ξ < 0 the GPD is known as the Pareto “Type II” distribution, when ξ = 0 the GPD corresponds to the Exponential distribution. The case when ξ > 0 is probably the most important for operational risk data, because the GPD takes the form of the ordinary Pareto distribution with tail index α = 1/ ξ and indicates the presence of heavy-tail data 19 ; in this particular case there is a direct relationship between ξ and the finiteness of the moments of the distribution:

E(X^k) = ∞ if k ≥ 1/ξ

For instance, if ξ ≥ 0.5 the GPD has an infinite variance, if ξ ≥ 1 there is no finite moment, not even the mean. This property has a direct consequence for data analysis: in fact the (heavier or lighter) behaviour of data in the tail can be easily directly detected from the estimate of the shape parameter.

"

-- [19]
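The moment condition quoted above is easy to encode (the helper name is mine):

```python
def gpd_moment_is_finite(k, xi):
    """For shape xi > 0, the k-th GPD moment is finite iff k < 1/xi;
    for xi <= 0 the tail is light enough that all moments are finite."""
    if xi <= 0:
        return True
    return k < 1 / xi

# The two examples from the text:
assert not gpd_moment_is_finite(2, 0.5)  # xi >= 0.5: infinite variance
assert not gpd_moment_is_finite(1, 1.0)  # xi >= 1: not even a finite mean
assert gpd_moment_is_finite(1, 0.5)      # but the mean still exists at xi = 0.5
```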

Generalized extreme value distribution (GEV)

Generalizes Gumbel, Fréchet and Weibull families.

"the GEV distribution is the only possible limit distribution of properly normalized maxima of a sequence of independent and identically distributed random variables... Note that a limit distribution need not exist: this requires regularity conditions on the tail of the distribution" (see also [20])