notes-someStuffYouMightLikeToKnow-stuffChStats

Chapter: probability and statistics

pdf: like a histogram

joint, conditional, marginal probabilities

random variables vs ordinary variables

'statistic' is defined as "a function of a sample where the function itself is independent of the sample's distribution; that is, the function can be stated before realization of the data. The term statistic is used both for the function and for the value of the function on a given sample." -- [1]

Simpson's paradox

some other 'paradoxes': " For instance, naive students of probability may expect the average of a product to equal the product of the averages but quickly learn to guard against such expectations, given a few counterexamples. Likewise, students expect an association measured in a mixture distribution to equal a weighted average of the individual associations. They are surprised, therefore, when ratios of sums, (a+b)/(c+d), are found to be ordered differently than individual ratios, a/c and b/d. " -- http://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf

also, the Base Rate Fallacy, eg:

" A group of policemen have breathalyzers displaying false drunkenness in 5% of the cases in which the driver is sober. However, the breathalyzers never fail to detect a truly drunk person. One in a thousand drivers is driving drunk. Suppose the policemen then stop a driver at random, and force the driver to take a breathalyzer test. It indicates that the driver is drunk. We assume you don't know anything else about him or her. How high is the probability he or she really is drunk?

Many would answer as high as 0.95, but the correct probability is about 0.02. " [6]
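A quick check of that number (a minimal sketch in Python; the probabilities are just the ones given in the quote):

p_drunk = 0.001                  # base rate: 1 in 1000 drivers is drunk
p_pos_given_drunk = 1.0          # the breathalyzer never misses a drunk driver
p_pos_given_sober = 0.05         # 5% false positives on sober drivers

p_pos = p_pos_given_drunk * p_drunk + p_pos_given_sober * (1 - p_drunk)
p_drunk_given_pos = (p_pos_given_drunk * p_drunk) / p_pos
print(p_drunk_given_pos)         # ~0.0196, ie about 0.02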

algebra of expectations

https://en.wikipedia.org/wiki/Copula_%28probability_theory%29

robust statistics "are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, such as estimating location, scale and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from parametric distributions. For example, robust methods work well for mixtures of two normal distributions with different standard-deviations, for example, one and three; under this model, non-robust methods like a t-test work badly.". Eg "The median is a robust measure of central tendency, while the mean is not." "The median absolute deviation and interquartile range are robust measures of statistical dispersion, while the standard deviation and range are not." [7]

https://en.wikipedia.org/wiki/Pivotal_quantity and https://en.wikipedia.org/wiki/Ancillary_statistic

todo

can't really summarize here because it involves a lot of thinking about the translation between the real world and math (word problems)

bayes rule https://arbital.com/p/62c/?startPath

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/

max entropy distributions

If you are modeling something and you have a list of constraints that a probability distribution should satisfy, but otherwise you have no other information, then a reasonable thing to do is to find the maximum entropy probability distribution for that list of constraints, and use that one. (For example, on a bounded interval with no further constraints the maximum entropy distribution is the uniform; on [0, infinity) with a given mean it is the exponential; on the whole real line with a given mean and variance it is the Gaussian.)

See table 4.10 of this for a list of probability distributions, and for what lists of constraints each one is the maximum entropy distribution.

nonparametric regression

http://fisher.osu.edu/~schroeder.9/AMIS900/ch6.pdf

linear models

http://fisher.osu.edu/~schroeder.9/AMIS900/ch3.pdf

loss functions

http://fisher.osu.edu/~schroeder.9/AMIS900/ch4.pdf

iid

abbreviation for 'independent identically distributed'; applies to a set of random variables (for example, the heights of 10 people are represented by 10 random variables); means that (a) each of these random variables has the same distribution, and (b) these random variables are statistically independent of each other.

misc topics in probability

fuzzy set; credibility; http://www.gimac.uma.es/ipmu08/proceedings/papers/127-Li.pdf explains credibility motivation

fat tail, long tail, long tail vs fat tail

https://en.wikipedia.org/wiki/Heavy-tailed_distribution

Some distributions

See https://en.wikipedia.org/wiki/Probability_distribution#Common_probability_distributions

Uniform

The simplest distribution.

Gaussian

Synonym: "normal distribution"

This is the go-to distribution for most situations (although some people argue that this can be very deceiving, due to its lack of heavy tails). It is good for real-valued quantities that can be positive or negative and that 'grow linearly' [8].

Minimum and maximum values are infinite.

Stable.

Central limit theorem: sums (or means) of many iid random variables with finite variance are approximately Gaussian (see the sketch below)
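A minimal simulation of this (a sketch, assuming numpy is available): sums of iid uniform draws are approximately Gaussian, with mean n*mu and standard deviation sqrt(n)*sigma.

import numpy as np

rng = np.random.default_rng(0)
# each row is the sum of 30 iid uniform(0,1) draws; by the CLT these sums are ~Gaussian
sums = rng.uniform(size=(100000, 30)).sum(axis=1)
print(sums.mean())  # close to 30*0.5 = 15
print(sums.std())   # close to sqrt(30*(1/12)) ~ 1.58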

For real-valued things which 'grow exponentially'

Lognormal

The distribution of a random variable whose log has a Gaussian distribution.

Minimum value is 0, maximum value infinite.

"the skewness and the kurtosis of a standard LogNormal? distribution are equal, respectively, to 6 and 114" Lectura/Archivos curso Riesgo Operativo/moscadelli 2004.pdf

Considered a medium-tail distribution. [9].

Pareto

The distribution of a random variable whose log has an exponential distribution.

"the prototypical power law distribution" [10]

Minimum value is positive, maximum value infinite.

Considered a heavy-tail distribution.

A continuous analog of the Zeta distribution. [11]

Generalized Pareto distributions:

For binary-valued (or n-valued) things

Bernoulli distribution: a (possibly biased) coinflip (generalization to more than 2 options: 'categorical distribution')

Binomial distribution: the count of heads over a fixed number of coinflips. (generalization to more than 2 options: 'multinomial distribution'; the counts of each type of outcome over a fixed number of trials)

Geometric distribution: over many coinflips, the count of tails before the first head

Negative binomial distribution: (generalization of the geometric distribution) over many coinflips, the count of tails before a given count of heads is reached

Hypergeometric distribution: number of items that meet some condition given a fixed number of total draws from a fixed population, sampling without replacement (generalization to more than 2 options: 'Multivariate hypergeometric distribution')

Beta-binomial distribution: like a binomial distribution but the probability of heads is not fixed, but random, and follows the Beta distribution

For Poisson processes (events that occur with a given rate)

Poisson

count of events in a given period of time

Exponential

time before next event occurs

not to be confused with the concept of an "exponential family" of distributions!

Considered a medium-tail distribution. [12].

Gamma

time before next k events occur

also, see below for use as a conjugate prior

Considered a medium-tail distribution. [13].
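A small simulation tying these three together (a sketch, assuming only numpy): exponential inter-arrival times with rate 'rate' give Poisson(rate) counts per unit time, and the waiting time for k events is Gamma(k, 1/rate).

import numpy as np

rng = np.random.default_rng(0)
rate = 3.0  # events per unit time

# waiting time until the 5th event = sum of 5 exponential gaps = Gamma(shape=5, scale=1/rate)
gaps = rng.exponential(scale=1 / rate, size=(100000, 5))
print(gaps.sum(axis=1).mean())   # ~ 5/rate

# counts of events per unit time are Poisson(rate): mean ~ variance ~ rate
arrivals = np.cumsum(rng.exponential(scale=1 / rate, size=300000))
counts, _ = np.histogram(arrivals, bins=np.arange(0, int(arrivals[-1])))
print(counts.mean(), counts.var())  # both ~ rate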

For sums of squares of Gaussians

Chi-squared

sum of squares of Gaussians, eg sample variances

Student's t

the ratio of a Gaussian to the square root of a (scaled) Chi-squared; "useful for inference regarding the mean of normally distributed samples with unknown variance (see Student's t-test)" [14]

F ("F-distribution")

ratio of two chi squared random variables; "useful e.g. for inferences that involve comparing variances or involving R-squared (the squared correlation coefficient)" [15]

Useful as conjugate prior distributions in Bayesian inference

https://en.wikipedia.org/wiki/Probability_distribution#Common_probability_distributions

Beta

https://en.wikipedia.org/wiki/Beta_distribution

"for a single probability (real number between 0 and 1); conjugate to the Bernoulli distribution and binomial distribution" [16]

Dirichlet

a vector of probabilities that must sum to 1; generalization of beta

conjugate to the categorical distribution and multinomial distribution

Gamma

see above (under Poisson processes)

conjugate to a "non-negative scaling parameter; conjugate to the rate parameter of a Poisson distribution or exponential distribution, the precision (inverse variance) of a normal distribution, etc." [17]

A generalization of the gamma distribution is the wishart distribution, which is the conjugate prior of the covariance matrix of a multivariate normal distribution (a symmetric non-negative definite matrix). [18]

Extreme value theory

"EVT is applied to real data in two related ways. The first approach (see Reiss and Thomas, 2001, p. 14 ff) deals with the maximum (or minimum) values the variable takes in successive periods, for example months or years.

These observations constitute the extreme events, also called block (or per-period) maxima. At the heart of this approach is the “three-types theorem” (Fisher and Tippet, 1928), which states that there are only three types of distributions which can arise as limiting distributions of extreme values in random samples: the Weibull type, the Gumbel type and the Frechet type. This result is very important, since the asymptotic distribution of the maxima always belongs to one of these three distributions, regardless of the original one. Therefore the majority of the distributions used in finance and actuarial sciences can be divided into these three classes, according to their tail-heaviness:

The Weibull, Gumbel and Frechet distributions can be represented in a single three parameter model, known as the Generalised Extreme Value distribution (GEV)

... The parameters μ and σ correspond to location and scale; the third parameter, ξ , called the shape index, indicates the thickness of the tail of the distribution. The larger the shape index, the thicker the tail. ...

The second approach to EVT (see Reiss and Thomas, 2001, p. 23 ff) is the Peaks Over Threshold (POT) method, tailored for the analysis of data bigger than preset high thresholds. The severity component of the POT method is based on a distribution (Generalised Pareto Distribution - GPD), whose cumulative function is usually expressed as the following two parameter distribution ... It is possible to extend the family of the GPD distributions by adding a location parameter μ.

The interpretation of ξ in the GPD is the same as in the GEV, since all the relevant information on the tail of the original (unknown) overall distribution is embedded in this parameter: when ξ < 0 the GPD is known as the Pareto “Type II” distribution, when ξ = 0 the GPD corresponds to the Exponential distribution. The case when ξ > 0 is probably the most important for operational risk data, because the GPD takes the form of the ordinary Pareto distribution with tail index α = 1/ξ and indicates the presence of heavy-tail data; in this particular case there is a direct relationship between ξ and the finiteness of the moments of the distribution:

E(X^k) = ∞ if k >= 1/ξ

For instance, if ξ ≥ 0.5 the GPD has an infinite variance, if ξ ≥ 1 there is no finite moment, not even the mean. This property has a direct consequence for data analysis: in fact the (heavier or lighter) behaviour of data in the tail can be easily directly detected from the estimate of the shape parameter.

"

-- [19]

Generalized extreme value distribution (GEV)

Generalizes Gumbel, Fréchet and Weibull families.

"the GEV distribution is the only possible limit distribution of properly normalized maxima of a sequence of independent and identically distributed random variables...Note that a limit distribution need not exist: this requires regularity conditions on the tail of the distribution" (see also [20] )

"The GEV distribution is widely used in the treatment of "tail risks" in fields ranging from insurance to finance. In the latter case, it has been considered as a means of assessing various financial risks via metrics such as Value at Risk.[2] However, the resulting shape parameters have been found to lie in the range leading to undefined means and variances, which poses a threat to reliable data analysis.[3]" [21]

Links:

Gumbel

https://en.wikipedia.org/wiki/Gumbel_distribution

"the distribution of the maximum (or the minimum) of a number of samples..if the distribution of the underlying sample data is of the normal or exponential type"

"the maximum value (or last order statistic) in a sample of a random variable following an exponential distribution approaches the Gumbel distribution closer with increasing sample size."

"In number theory, the Gumbel distribution approximates the number of terms in a partition of an integer[7] as well as the trend-adjusted sizes of record prime gaps and record gaps between prime constellations.[8]"

"Gumbel is unlimited" [22]

"also known as the log-Weibull distribution"

Considered a medium-tail distribution. [23].

Weibull

" If the quantity X is a "time-to-failure", the Weibull distribution gives a distribution for which the failure rate is proportional to a power of time. The shape parameter, k, is that power plus one, and so this parameter can be interpreted directly as follows:

    A value of k < 1 indicates that the failure rate decreases over time. This happens if there is significant "infant mortality", or defective items failing early and the failure rate decreasing over time as the defective items are weeded out of the population.
    A value of k = 1 indicates that the failure rate is constant over time. This might suggest random external events are causing mortality, or failure.
    A value of k > 1 indicates that the failure rate increases with time. This happens if there is an "aging" process, or parts that are more likely to fail as time goes on."

used in survival analysis, but also in:

"the reversed Weibull has an upper limit." [25]

Considered a light-tail distribution [26].

Links:

Frechet

"Fréchet has a lower limit" [27]

Other/todo

Cauchy

Stable.

Levy

Stable.

Zeta

A discrete analog of the Pareto distribution. [28]. A normalization of the Zipf distribution [29] (is there a separate Zipf distribution, or just a 'Zipf's law'?).

Yule-Simon

Its tail realizes Zipf's law.

some classes of probability distributions

stable distributions

"In probability theory, a distribution is said to be stable (or a random variable is said to be stable) if a linear combination of two independent copies of a random sample has the same distribution, up to location and scale parameters."

All stable distributions are heavy-tailed except for the Gaussian distribution. The Gaussian distribution is the only stable distribution with finite variance. "Stable distributions form a four-parameter family of continuous probability distributions parametrized by location and scale parameters μ and c, respectively, and two shape parameters β and α, roughly corresponding to measures of asymmetry and concentration, respectively (see the figures)" [30]. "The probability density function for a general stable distribution cannot be written analytically", but "the general characteristic function can be".

Some examples of stable distributions are:

Links:

exponential family

not to be confused with the exponential distribution!

" Exponential families have a large number of properties that make them extremely useful for statistical analysis. In many cases, it can be shown that, except in a few exceptional cases, only exponential families have these properties. Examples:

    Exponential families have sufficient statistics that can summarize arbitrary amounts of independent identically distributed data using a fixed number of values.
    Exponential families have conjugate priors, an important property in Bayesian statistics.
    The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form (provided that the normalizing factor of the exponential-family distribution can itself be written in closed form). Note that these distributions are often not themselves exponential families. Common examples of non-exponential families arising from exponential ones are the Student's t-distribution, beta-binomial distribution and Dirichlet-multinomial distribution.
    In the mean-field approximation in variational Bayes (used for approximating the posterior distribution in large Bayesian networks), the best approximating posterior distribution of an exponential-family node (a node is a random variable in the context of Bayesian networks) with a conjugate prior is in the same family as the node.[citation needed]" -- https://en.wikipedia.org/wiki/Exponential_family

" It is critical, when considering the examples in this section, to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary is critical in determining whether a "distribution" is or is not an exponential family.

The normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, von Mises and von Mises-Fisher distributions are all exponential families.

Some distributions are exponential families only if some of their parameters are held fixed. The family of Pareto distributions with a fixed minimum bound xm form an exponential family. The families of binomial and multinomial distributions with fixed number of trials n but unknown probability parameter(s) are exponential families. The family of negative binomial distributions with fixed number of failures (a.k.a. stopping-time parameter) r is an exponential family. However, when any of the above-mentioned fixed parameters are allowed to vary, the resulting family is not an exponential family. " -- https://en.wikipedia.org/wiki/Exponential_family

Pearson family

Heavy-tails

https://en.wikipedia.org/wiki/Heavy-tailed_distribution

subsets:

Elliptical family

statistical concepts

populations vs samples drawn from that population

An example of a population is 'all people on Earth'. An example of a sample drawn from that population is "10 people who signed up for my study".

Usually you can only get data on samples, not on the entire population. But usually what you're interested in is attributes of the population, not the sample.

Sometimes the best way to estimate something about the population is to see what that something is in the sample (the mean typically works like this; to estimate the population mean, calculate the mean on the sample, and that's your estimate). But often the best way to estimate something about the population is to use a DIFFERENT calculation to estimate it from the sample than you would use to compute it directly on the whole population, if you had access to the whole population. For example, the formula for the standard deviation, if computed directly on the population, is different from the best formula to ESTIMATE the true population standard deviation from a sample.

Of course, you can always compute the result of applying the population formula unchanged to the sample, even though that's not the best thing to do if your goal is to estimate what the same formula would be if applied to the entire population. We use wording like 'standard deviation of the sample' or 'Uncorrected sample standard deviation' for this.

rejecting the null hypothesis

A lot of statistical tests must be interpreted within a framework of 'rejecting the null hypothesis'. The general picture is that, if you would like to prove that X is probable, what you instead do is show that, given an assumption of not-X, the probability of observing the data that you actually observed is extremely low. not-X is called the 'null hypothesis'.

In more detail, you imagine that if you didn't have any evidence, you would believe that the null hypothesis is true; now if you have some evidence, what you do is you compute, under the assumption that the null hypothesis actually were true, the probability of seeing the data that you actually saw (the "p-value"). If this probability is extremely low (below the "p-value significance threshold"), you 'reject the null hypothesis' and begin believing that the null hypothesis is false.

Note that if you do NOT reject the null hypothesis, you can't really draw any conclusions from your data; it's possible that the null hypothesis actually is false, but you just didn't have enough data to see that. "Absence of evidence is not evidence of absence".

Some suggest that rather than accepting/rejecting based on p-value thresholds, a more informative way of looking at a result is by looking at the confidence interval on effect size [32].

'type 1 errors' (often the same as a false positive; defined as when "the null hypothesis ... is true, but is rejected") and 'type 2 errors' (often the same as a false negative; defined as when "the null hypothesis is false, but erroneously fails to be rejected"). 'alpha' is the probability of a type I error, and is equal to the significance threshold. 'beta' is the probability of a type II error. 1 - beta is the "power".

In order to compute the needed sample size, you must first determine 4 things: alpha, beta (or power), the population standard deviation, the minimum detectable effect size.

The 'effect size' for comparing the mean of two samples of the same variance is defined as:

(u0 - u1)/stddev

Cohen classified effect sizes as small, moderate, and large (0.2, 0.5, and 0.8 for two-group comparisons) [33]
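A sketch of a sample-size calculation for a two-sample t-test, assuming the statsmodels package is available (the numbers are just the rules of thumb from this section):

from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,            # (u0 - u1)/stddev; Cohen's 'moderate'
    alpha=0.05,                 # significance threshold
    power=0.8,                  # 1 - beta
    alternative='two-sided',
)
print(n_per_group)              # roughly 64 per group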

rule of thumb for alpha

.05

"Why 5%? Sir Ronald Fisher suggested this as an appropriate threshold level. However, he meant that if the p-value from an initial experiment were less than .05 then the REAL research should begin. This has been corrupted to such an extent that at the first sign of a p-value under .05 the researchers race to publish the result!" -- [34]

rule of thumb for power (and for beta)

.8 power (.2 beta)

"Note that this assumes that the risk of a Type II error can be four times as great as the risk of a Type I error. Why 80%? According to Streiner and Norman, this was because “Jacob Cohen [who wrote the landmark textbook on Statistical Power Analysis] surveyed the literature and found that the average power was barely 50%. His hope was that, eventually, both α and β would be .05 for all studies, so he took β = .20 as a compromise and thought that over the years, people would adopt more stringent levels. It never happened." -- [35]

(note: elsewhere, Cohen wrote that a more informative way of looking at a result is by looking at the confidence interval on effect size, and that a confidence interval of 80% is good for the purposes of typical psychological research)

one-tailed and two-tailed

"Two-tailed versus one-tailed tests:

In inference that investigates whether there is a difference between two groups, there are two approaches to formulating the alternative hypothesis. Either you know the direction of the difference in advance of doing the study or you do not. A one-tailed test specifies the direction of the difference in advance. A two-tailed test does not specify the direction of the difference. For sample size estimation stick to two-tailed alternative hypotheses! For example, when comparing a new therapy to a standard therapy one might be convinced that the new therapy could only be better! But, examples abound where a new therapy under study was actually worse. Think about the case when Coca Cola introduced “New Coke” expecting it to improve sales. The huge negative public outcry was completely unexpected by Coca Cola." -- [36]

fiddly philosophical details, alternatives, controversies

Now, the null hypothesis discussion above is kind of an oversimplification. Do you really "believe" that the null hypothesis is true, and switch your belief when the p-value threshold is breached? Probably not. In fact, what you set out to do was prove that X was probable; but all the 'reject the null hypothesis' analyses are good for is showing that GIVEN not-X, the DATA is IMprobable. In other words, the null hypothesis stuff ultimately is only making a statement about the probability of the data given some model; it doesn't actually make any statement about the probability of the model itself. If you actually want to model beliefs and the point at which (or degree to which) they 'switch', you should instead use Bayesian inference, which tells you, given your prior degree of belief in something, and given some new evidence, what your new degree of belief in that thing should be.

So why use the 'reject the null hypothesis' stuff? The 'reject the null hypothesis' stuff is convenient in that it doesn't require you to specify a degree of prior belief, which means there is one less thing for critics to disagree with each other on. There are probably some other philosophical advantages that i'm not aware of.

This stuff came out of work called 'significance testing' (also called Fisher significance testing), which is more focused on 'rejecting the null hypothesis', and 'hypothesis testing' (also called Neyman-Pearson hypothesis testing), which also explicitly brings in consideration of an 'alternative hypothesis'. Note that although hypothesis testing was developed after significance testing, Fisher was aware of the methods of hypothesis testing and rejected them, at least for the purposes of science [37]. Note that "The terminology is inconsistent. Hypothesis testing can mean any mixture of two formulations that both changed with time. Any discussion of significance testing vs hypothesis testing is doubly vulnerable to confusion....The dispute over formulations is unresolved. Science primarily uses Fisher's (slightly modified) formulation as taught in introductory statistics. Statisticians study Neyman–Pearson theory in graduate school. Mathematicians are proud of uniting the formulations. Philosophers consider them separately. Learned opinions deem the formulations variously competitive (Fisher vs Neyman), incompatible[27] or complementary.[33] The dispute has become more complex since Bayesian inference has achieved respectability." [38].

todo: what if you used the Bayesian approach to calculate the lowest prior probability such that after seeing this evidence your confidence in the desired conclusion would jump to above 50%? Would that end up being equivalent to the 'reject the null hypothesis' stuff?

In addition, the phrasing 'statistically significant', which means the condition when an observed p-value is less than some significance level, is rather misleading. When a p-value does not meet the significance level (is larger than the threshold), we fail to reject the null hypothesis, but 'absence of evidence is not evidence of absence'; that is, if something is 'statistically insignificant' that doesn't mean it doesn't exist, nor that it is not of practical importance (practical significance). Similarly, something that is 'statistically significant' does not imply that the effect size is large enough to be of practical significance (eg if taking X vitamin every day reliably caused an increase in IQ of 0.001 points on average). One thing about significance testing is that, in the real world, there are tiny correlations between all sorts of almost unrelated things. So the null hypothesis will almost always be rejected, with a large enough sample size (but with tiny effect sizes!).

Links:

some statistical measures

read this great article: http://debrouwere.org/2017/02/01/unlearning-descriptive-statistics/

(note: imo he's selling the mean a little short; in cases where the tails matter a lot, the median ignores the tails but the mean captures at least some information from them)

Measure of central tendency

mean (average)

sum(data)/len(data)

If we compute the sample mean, how different is it likely to be from the actual population mean? This question is often answered using a measure called the "standard error of the mean" (SEM).

The standard error is "...the standard deviation of the sample-mean's estimate of a population mean. (It can also be viewed as the standard deviation of the error in the sample mean with respect to the true mean.."

The formula is:

stddev/sqrt(n)

Note that for samples, this contains an estimate of the standard deviation from a sample. So you should use the 'corrected sample standard deviation' formula for stddev (see below), so for practical purposes the formula is really:

sqrt(1/(n-1)*sum((x - u)^2))/sqrt(n)

A confidence interval of 95% on the mean of a normal distribution is about +-1.96 standard errors, so it is useful to report +- 2 standard errors. However, according to [39], for small samples ([40] uses the rule of thumb of <100), you should use 2.78 rather than 1.96 (because in the derivation, the t distribution rather than a normal distribution should be used when the variance is itself estimated from a small sample; the t distribution looks like the normal distribution for large sample sizes but is more heavy-tailed (leptokurtic) for small sample sizes; note that the t critical value depends on the degrees of freedom, and 2.78 is the value for 4 degrees of freedom, ie n = 5; i don't really understand exactly what the t distribution is being used for here, i'm just repeating what i read)

(if 2 standard errors are a 95% confidence interval, how much confidence is in +- 1 standard error? i think it follows the https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule , so 68.27% confidence.)

(i often see the rule of thumb of +-1.96 standard errors, or 95% confidence, but [41] suggests using 80% confidence; [42] suggests thinking in terms of the rounded fraction of surprises, eg 1 standard deviation or standard error (1 sigma) means being surprised about 1 in 3 times, 1.5 sigma means being surprised about 1 in 7 times, 2 sigma -> 1/22, 2.5 sigma -> 1/81, 3 sigma -> 1/370, 3.5 sigma -> 1/2149, 4 sigma -> 1/15787, 4.5 sigma -> 1/147160; an 80% confidence interval is close to 1.5 sigma, which is 1/7; a bigger table is at [43]).

(if you know the distribution ahead of time, you can use distribution-specific formulas with even better corrections)

(if the samples have serial correlation, use a correction https://en.wikipedia.org/wiki/Standard_error#Correction_for_correlation_in_the_sample )

(if the sample is a large proportion of the total population, use a correction https://en.wikipedia.org/wiki/Standard_error#Correction_for_finite_population )

As a rule of thumb a sample is 'small' here when n < 20 (but with n = 6 the bias of the uncorrected formula, stddev/sqrt(n), is already only 5%) [44]. The U.S. National Center for Health Statistics typically requires at least 30 observations for an estimate to be reported. A rule of thumb for a 'good' standard error, for cases where the random variable cannot have a mean of zero, is that the standard error should be less than about 1/3 of the mean. (The ratio of the standard deviation to the mean is called the 'coefficient of variation'; the reciprocal of this is sometimes called the "signal to noise ratio (SNR)", but there is another definition of SNR (power of signal divided by power of noise) and i am not sure if they are equivalent; it seems to me that only in the special case where the signal is constant (the signal has a fixed mean and a zero standard deviation, and the noise has a zero mean and a nonzero standard deviation) is the SNR equal to the reciprocal of the coefficient of variation; some other related quantities are the Sharpe ratio and the Information Ratio from finance.) A rule of thumb for p-value thresholds is 5%. More rules of thumb for sample size: http://stats.stackexchange.com/questions/2541/is-there-a-reference-that-suggest-using-30-as-a-large-enough-sample-size ; [45] ; [46] ; [47] table 3 ; [48].
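A sketch putting the above together (assuming numpy and scipy are available; the data is made up): the standard error of the mean and a t-based 95% confidence interval, where the t critical value replaces 1.96 for small n.

import numpy as np
from scipy import stats

x = np.array([4.1, 5.0, 3.8, 4.7, 5.2, 4.4])      # made-up small sample
n = len(x)
mean = x.mean()
sem = x.std(ddof=1) / np.sqrt(n)                   # corrected sample stddev / sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)              # ~2.57 for n = 6; approaches 1.96 as n grows
print(mean, sem)
print((mean - t_crit * sem, mean + t_crit * sem))  # 95% confidence interval for the mean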

Links:

median

midhinge

average of the first and third quartiles

https://en.wikipedia.org/wiki/Midhinge

trimean

(quartile_1 + 2*median + quartile_3)/4 = average([median, midhinge])

" Like the median and the midhinge, but unlike the sample mean, it is a statistically resistant L-estimator with a breakdown point of 25%.This beneficial property has been described as follows: "An advantage of the trimean as a measure of the center (of a distribution) is that it combines the median's emphasis on center values with the midhinge's attention to the extremes." --  Herbert F. Weisberg, Central Tendency and Variability " -- https://en.wikipedia.org/wiki/Trimean

comparison re: minimization

" The measures of statistical dispersion derived from absolute deviation characterize various measures of central tendency as minimizing dispersion: The median is the measure of central tendency most associated with the absolute deviation. Some location parameters can be compared as follows:

    L2 norm statistics: the mean minimizes the mean squared error
    L1 norm statistics: the median minimizes average absolute deviation,
    L∞ norm statistics: the mid-range minimizes the maximum absolute deviation
    trimmed L∞ norm statistics: for example, the midhinge (average of first and third quartiles) which minimizes the median absolute deviation of the whole distribution, also minimizes the maximum absolute deviation of the distribution after the top and bottom 25% have been trimmed off.."

-- https://en.wikipedia.org/wiki/Average_absolute_deviation#Minimization

similarly, see http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/

Measures of statistical dispersion

https://en.wikipedia.org/wiki/Statistical_dispersion

Measures of deviation: mad, stddev, stderr, and related

mean absolute deviation (mad)

The 'mean absolute deviation around the mean' means the average (over all of the data points) of the absolute value of the difference between the data point, and the mean of all of the data.

Can also compute 'mean absolute deviation around the median', etc.

https://en.wikipedia.org/wiki/Average_absolute_deviation

standard deviation (stddev)

a more tractable stand-in for mean absolute deviation.

The formula is

sqrt(mean((x - u)^2))

where 'mean' is over all data points, where x is a data point, and where u is the mean.

Could also be called RMS deviation (root-mean-squared deviation).

The square of the standard deviation is called the variance.

For technical reasons, there is a correction that should usually be made, which all together gives us the following formula for 'corrected sample standard deviation', which is usually how you should estimate the population stddev from a sample (note: you can actually do a little better by using different formulas for different distributions, if you know the distribution you are working with):

sqrt(1/(n-1)*sum((x - u)^2))

(NOTE to self: the measured standard deviation does NOT need to be further divided by sqrt(n) to correct for sample size; if you thought of that, you're thinking of the formula for standard error)
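In numpy this is just the ddof argument (a minimal sketch; the data is made up):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(x.std(ddof=0))  # 2.0   : uncorrected (population formula, divides by n)
print(x.std(ddof=1))  # ~2.14 : corrected sample standard deviation (divides by n-1)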

The Gaussian distribution has only two parameters: the mean and the stddev.

Comparing stddev with mean absolute deviation ('MAD'):

note:

Links:

higher moment-related measures

skewness

"Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. " [53]

adjusted Fisher-Pearson coefficient of skewness:

sqrt(N*(N-1))/(N-2) * (sum_x (x - u)^3/N)/stddev^3

the skewness of a Gaussian is 0

A robust statistic for skewness is the https://en.wikipedia.org/wiki/Medcouple

kurtosis

"Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case. " [54]

sum_x ((x - u)^4/N)/stddev^4

excess kurtosis:

kurtosis - 3

(because the kurtosis of a Gaussian is 3)

note: "Many sources use the term kurtosis when they are actually computing "excess kurtosis", so it may not always be clear." [55]

some initial steps to exploring data

for individual variables:

for 2 variables:

for time series:

for many variables:

for many time series:

See also https://en.wikipedia.org/wiki/Exploratory_data_analysis

Non-deviation measures of statistical dispersion

interquartile range (IQR)

scale-free measures of effect size

Section 2.4 of "Statistical inference: A commentary for the social and behavioral sciences" "lists some scale-free measures of effect size, as alternatives to the confidence interval. These include the standardized difference, the proportion misclassified, and two estimates of “the proportion of variance accounted for”: the squared correlation r^2, and the \omega^2 statistic or omega squared. However, something like the absolute difference in two means can be easier to interpret correctly than the correlation or others." -- [57]

some statistical tests

power analysis rules of thumb

Reasonable values of alpha and power are often .05 and .8 [58].

is a coin fair?

You toss a coin n times and see h heads and t tails (h + t = n). Is the coin fair?

The null hypothesis is that the coin is fair

classic method: under the null, the probability of exactly h heads is nchoosek(n, h)*.5^h*(1 - .5)^t; the (one-tailed) p-value is the probability of a result at least as extreme, eg for h > t, p = sum over k from h to n of nchoosek(n, k)*.5^n

reject the null hypothesis (conclude that the coin is unfair) iff p < the p-value significance threshold, eg iff p < 0.05

alternative direct simulation method that doesn't require a closed-form solution for p(h,t): "in general, computing the sampling distribution is hard; simulating the sampling distribution is easy"

if h > t: draw a sample from the null hypothesis distribution (eg flip a fair coin n times); record whether or not, in the sample, sample_h >= h; repeat many times.

if h <= t: draw a sample from the null hypothesis distribution (eg flip a fair coin n times); record whether or not, in the sample, sample_t >= t; repeat many times.

in either case: compute the proportion of recorded 'True's and compare it to your p-value threshold. Iff the proportion of recorded 'True's is smaller than your p-value threshold (i think?), reject the null hypothesis (conclude the coin is unfair).

in Python, if n = 30, h = 22:

import numpy as np

n = 30            # number of coin flips
h = 22            # observed number of heads
n_trials = 10000  # number of simulated experiments

M = 0
for i in range(n_trials):
    flips = np.random.randint(2, size=n)  # one simulated experiment with a fair coin
    if flips.sum() >= h:
        M += 1
p = M / n_trials  # estimated p-value: the fraction of simulations with at least h heads
print(p)          # compare to the significance threshold, eg 0.05

(from https://speakerdeck.com/jakevdp/statistics-for-hackers slides near 18)

Welch's t-test

Remember, make sure your samples are representative.

Welch's t-test (unequal variances t-test) is "used to test the hypothesis that two populations have equal means" without assuming equal standard deviations (Student's t-test assumes equal standard deviations). The test statistic is:

(mean_1 - mean_2) / sqrt(stddev_1^2/sample_size_1 + stddev_2^2/sample_size_2)

Compute the test statistic, then compare it to the t-distribution "to test the null hypothesis that the two population means are equal (using a two-tailed test), or the alternative hypothesis that one of the population means is greater than or equal to the other (using a one-tailed test).". To use the t-distribution, you'll need a degrees of freedom parameter, 'v'. The formula for that in this case (the Welch–Satterthwaite approximation) is:

v = (stddev_1^2/sample_size_1 + stddev_2^2/sample_size_2)^2 / (stddev_1^4/(sample_size_1^2*(sample_size_1 - 1)) + stddev_2^4/(sample_size_2^2*(sample_size_2 - 1)))

Find a table for the t-distribution; the table will ask for the degrees of freedom ('df'), the p-value you are shooting for (a rule of thumb is .05), and whether you want a one-tailed or two-tailed test; it will give you a t-distribution number (your 'critical value'). Compare that critical value to the test statistic; if the test statistic is bigger, then the null hypothesis is rejected (the means were significantly different). If you are unsure whether you have equal variances, do not pre-test for equal variances and then choose between Student's t-test and Welch's t-test; just use Welch's t-test immediately (https://en.wikipedia.org/wiki/Welch's_t_test#Advantages_and_limitations) [59].
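In practice you'd usually let a library do the arithmetic; a sketch with scipy (equal_var=False selects Welch's test rather than Student's; the data below is made up):

import numpy as np
from scipy import stats

a = np.array([27.5, 21.0, 19.0, 23.6, 17.0, 17.9, 16.9, 20.1, 21.9, 22.6])
b = np.array([27.1, 22.0, 20.8, 23.4, 23.4, 23.5, 25.8, 22.0, 24.8, 20.2])

t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
print(t_stat, p_value)  # reject the null of equal means iff p_value < your threshold, eg 0.05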

Links:

Non-parametric alternative to Welch's t-test via shuffling

Remember, this is for when the null hypothesis is that the two populations are equal. Remember, make sure your samples are representative.

Shuffle both samples (ie combine both samples, then randomly separate the lumped-together data points into two pseudo-samples of the same sizes as the actual two samples; so eg the new pseudo-sample #1 has the same sample size as the actual sample #1, but (probably) is a mixture of data points from both sample #1 and sample #2, and similarly for the new pseudo-sample #2). Calculate the mean of each of the 2 pseudo-samples, and record the difference between the means. Reshuffle and repeat many times. Now you have sampled many times from the distribution of differences between means under shufflings of this dataset. Now look at the actual difference between the actual sample means, and see what proportion of the recorded pseudo-sample differences is at least as large. If the actual difference is larger than, eg, 95% of the shuffled differences (ie the proportion of shuffled differences at least as large as the actual one is below your p-value threshold, eg 0.05), reject the null hypothesis (conclude that the means differ); otherwise, fail to reject (ie conclude that you do not have enough evidence to say that the two populations have unequal means). A sketch of this is given after the source link below.

(from https://speakerdeck.com/jakevdp/statistics-for-hackers )
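A sketch of the shuffle test described above (assuming only numpy; two-sided version, using the absolute difference in means):

import numpy as np

rng = np.random.default_rng(0)

def shuffle_test_pvalue(sample1, sample2, n_shuffles=10000):
    observed = abs(sample1.mean() - sample2.mean())
    pooled = np.concatenate([sample1, sample2])
    n1 = len(sample1)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                                  # reshuffle the lumped-together data
        diff = abs(pooled[:n1].mean() - pooled[n1:].mean())  # difference of pseudo-sample means
        if diff >= observed:
            count += 1
    return count / n_shuffles  # proportion of shufflings at least as extreme as the actual difference

# reject the null of equal means iff the returned p-value < your threshold, eg 0.05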

Are these two samples from the same distribution?

Kolmogorov-Smirnov test (K-s)

Anderson-Darling test (A-D)

The Anderson-Darling test is "much more sensitive to the tails of data" than Kolmogorov-Smirnov [60]

Mann-Whitney

todo

Wilcoxon

todo

Kruskal-Wallis

todo

Test of Equal Proportions

https://stat.ethz.ch/R-manual/R-patched/library/stats/html/prop.test.html

eg http://blogs.wsj.com/digits/2015/01/06/googles-cloud-loses-following-among-cios-survey-finds/; a survey was conducted with 112 people, asking the popularity of various vendors. The survey was repeated the next year. Vendor X scored 12% on the first year's survey, and 7% on the second year's survey. [61] shows that, even using slightly better numbers than these, the p-value is 36%, that is, there is not enough evidence to say with much certainty that the 12%-7% drop wasn't just random fluctuation.
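The linked prop.test is R's version; a rough Python equivalent is statsmodels' proportions_ztest (a sketch, assuming statsmodels is available; the counts below are just illustrative, not the survey's actual numbers):

from statsmodels.stats.proportion import proportions_ztest

count = [13, 8]      # hypothetical "chose vendor X" counts in year 1 and year 2
nobs = [112, 112]    # survey sample sizes
z_stat, p_value = proportions_ztest(count, nobs)
print(p_value)       # a large p-value: the drop could easily be random fluctuation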


stats heuristics

In a Gaussian distribution, the ratio of standard deviation to expected absolute deviation is sqrt(pi/2), which is close to 1.25. In a Student's t distribution, it is pi/2, which is close to 1.6 -- http://papers.ssrn.com/sol3/papers.cfm?abstract_id=970480 (i wonder what the derivation is? probably see Geary, R. C. (1935). The ratio of the mean deviation to the standard deviation as a test of normality. Biometrika, 27(3/4), 310–332.)

https://en.wikipedia.org/wiki/Rule_of_succession : if there is a binary event like flipping a coin, and you've observed n of them, and heads came up h times, then the Rule of Succession estimate for the bias of the coin (the probability that it comes up heads) is (h+1)/(n+2). (this is like a Bayesian prior of 1 virtual head and 1 virtual tail).

There is a heuristic (what is its name?!? perhaps the 'Copernican principle', or the 'Lindy effect') that suggests that if you want to estimate how long some situation will persist/endure, guess that you are now about halfway through its lifetime; that is, that it will continue on for about as long as it has lasted up until this point, for a total duration of twice as long as it has lasted so far.

rules of thumb for survey sample sizes, from Managing Survey Sample Size:

from Sampling and Sample Size:

Bootstrapping (bootstrap resampling)

You have a some data, but you don't know what distribution it comes from. You would like to sample from the distribution that the data comes from, so that you can compare something computed on the actual sample to that same number computed on a bunch of pseudo-samples drawn from the distribution; but you can't do that because you don't know the distribution.

So instead of drawing samples from the actual distribution, you fake it by drawing a sample by sampling with replacement from the actual data (the drawn sample will be the same size as the actual data sample). We say that "the data estimates its own distribution".

from https://speakerdeck.com/jakevdp/statistics-for-hackers (around slide 87)

don't use for rank-based statistics (eg. maximum value)

don't use if n < ~20

as usual, remember that the original sample must be representative

https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=98

bootstrapping for standard error

You can estimate the standard error by bootstrapping; each bootstrapped pseudo-sample gives you a new mean; do this many times and then compute the standard deviation of all of these means to get an estimate of the standard error.
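A minimal sketch (assuming only numpy; the data is made up) comparing the bootstrap standard error of the mean with the usual formula:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=3.0, size=50)   # stand-in for the actual sample

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()  # one bootstrap pseudo-sample's mean
    for _ in range(10000)
])
print(boot_means.std())                        # bootstrap estimate of the standard error
print(data.std(ddof=1) / np.sqrt(len(data)))   # the usual stddev/sqrt(n) formula, for comparison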

bootstrapping for other things

can also use to eg. get std error-like things for eg. linear regression slope and intercept

https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=97

cross-validation for model order selection

as described in https://speakerdeck.com/jakevdp/statistics-for-hackers?slide=113

Randomly split the data into subsets. For each subset, find the best model. Evaluate each model on the data it was not trained on. Now the graph of RMS error vs. model degree should have a global minimum (beyond which additional model degree leads to overfitting). See the sketch below.
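A sketch of the idea with polynomial degree as the model order (assuming only numpy; the data is made up):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy made-up data

idx = rng.permutation(x.size)
train, test = idx[:40], idx[40:]    # random split into two subsets

for degree in range(1, 10):
    coeffs = np.polyfit(x[train], y[train], degree)    # fit on one subset
    resid = y[test] - np.polyval(coeffs, x[test])      # evaluate on the held-out subset
    print(degree, np.sqrt(np.mean(resid ** 2)))        # held-out RMS error; rises again once overfitting starts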

cointegration (timeseries)

auto- and cross-correlation (timeseries)

sum (integral) of autocorrelation (area under the autocorrelation function)

?: http://www.researchgate.net/publication/223411367_A_note_on_the_sum_of_the_sample_autocorrelation_function

power spectral density (timeseries)

IQR

interquartile range

circular statistics

hill estimator

gini coefficient

https://en.wikipedia.org/wiki/Gini_coefficient https://en.wikipedia.org/wiki/Gini_coefficient#Generalized_inequality_indices https://en.wikipedia.org/wiki/Generalized_entropy_index https://en.wikipedia.org/wiki/Income_inequality_metrics

"order statistics" and L-estimators and M-estimators

"Order statistics" is a fancy phrase meaning the values of the sample, in sorted order. Usually "the first order statistic" is the min, the "second order statistic" is the second-smallest value in the sample, etc.

An L-estimator is an estimator which is a linear combination of order statistics. The median (and the other quantiles or percentiles) are L-estimators, as is the inter-quartile range (IQR). Many (but not all) L-estimators are more robust (technically, "statistically resistant") than non-L-estimator analogs (eg the median is more robust than the mean). (An M-estimator, by contrast, is an estimator obtained by minimizing a sum of functions of the data; maximum-likelihood estimators are the motivating special case.)

https://en.wikipedia.org/wiki/L-estimator https://en.wikipedia.org/wiki/Order_statistic

Elicitation of estimated distributions

Three-point estimation

https://en.m.wikipedia.org/wiki/Three-point_estimation

One way to elicit a distribution from a person is to ask them for the best-case estimate ('a'), most likely estimate ('m'), and the worst-case estimate ('b').

Then we estimate the mean and standard deviation as:

    E = (a + 4m + b) / 6
    SD = (b − a) / 6 

Alternately, depending on how far in the 'tails' you think the person's a and b are, you could use "m" instead of "4m", and "3" instead of "6".
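A trivial sketch of the formulas above (pure Python; the a/m/b numbers are made up):

def three_point_estimate(a, m, b):
    mean = (a + 4 * m + b) / 6   # PERT-style weighted mean
    sd = (b - a) / 6
    return mean, sd

print(three_point_estimate(2, 5, 14))  # eg a task estimated at best 2, likely 5, worst 14 days -> (6.0, 2.0)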

so a list of some summary statistics

for some code, see 'bsstatnums' in bshanks_scipy_misc/__init__.py, at eg https://sourceforge.net/p/neurodata/bshanks_scipy_misc/ci/2c939fc721a87d3c67a31980236a2c6895b32ffd/tree/__init__.py

and for time series data:

relationships between variables

correlation

alternatives to Pearson correlation

https://en.wikipedia.org/wiki/Distance_correlation

regression

http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/

getting a sense for some statistics

correlation: http://guessthecorrelation.com/

doesn't belong here

since the median is appropriate sometimes (when you want insensitivity to outliers) and the mean other times (when you want sensitivity to tails), if you really want a single number, how about adjusting the trimean by adding in a mean?

(25%ile + 50%ile + mean + 75%ile) / 4

Is there a name for this sort of thing? Until i find it, maybe i'll call this the 'aggravated trimean'. I use 'aggravated' because, to those who want to use the trimean because it is robust, the addition of the mean will be aggravating.

In the example in Table 2 of http://onlinestatbook.com/lms/summarizing_distributions/comparing_measures.html , the aggravated trimean would be 962.75.

also, (mean - median)/((mean + median)/2) should be good for something.. in the case of Table 2 of the above webpage, (1183 - 500)/((500+1183)/2.) = .81, which shows you at a glance that there is a lot of (right) skew, right? (cf Pearson's second skewness coefficient, 3*(mean - median)/stddev.) Is there a name for this sort of thing?

exponential distribution examples

" Imagine a room full of 100 people with 100 dollars each. With every tick of the clock, every person with money gives a dollar to one randomly chosen other person. After some time progresses, how will the money be distributed? ... this simple simulation arrives at a stationary distribution ((of wealth, if sorted)) with a skewed, exponential shape. This is due to the boundary at zero wealth which, we imagine, people don’t consider when they think about the problem quickly. "

[62]

two comments from https://news.ycombinator.com/item?id=14729400 :

jldugger 6 hours ago [-]

IMO, this reveals more about human intuition regarding randomness than whatever financial point it purports to make. It's still a useful point though.

Reminds me of the load balancing literature. There, the explicit goal is to evenly divide the burden across your fleet of servers, and having wide distributions is a problem on both ends: you're paying for servers to sit idle, and some are over burdened and giving customers a bad experience (high pageload times).

By way of illustration, I took the code and made a simple modification to it, implementing power of 2 random choice [63]

Here's the video result: https://www.youtube.com/watch?v=94Vc7gf3ONY Much tighter distribution, though you need to be able to identify the size of people's bank accounts. In this model, it's very rare for anyone to give the richest anything, unless you magically choose two people randomly tied for richest.

reply

my note: in this context, by 'power of 2 random choice' e means that in each cycle, each person randomly selected two people and then gave the poorer one $1

scottmsul 8 hours ago [-]

This kind of simulation results in an exponential distribution, which is fairly equal compared all things considered. In an exponential distribution, most people have roughly the same order of magnitude of wealth (10s to 100s of dollars). In real life, the bottom 99% follow an exponential distribution pretty closely, while the 1% follow a pareto distribution, which is WAY More unequal. The transition is very sharp too, and has been studied in econophysics models.

Brief introduction to econophysics for the mathematically inclined: https://arxiv.org/abs/0709.3662

reply

---

 brchr 11 hours ago [-]

In the 1960s, the biostatistician Marvin Zelen proposed using something very much like the Pólya urn for clinical trials, calling it the "play the winner" rule [1]. This has had a major effect in causing a rethinking of the traditional randomized controlled trial, and these ideas are still making their way through the medical community today [2].

[1] https://www.jstor.org/stable/2283724

[2] https://www.fda.gov/downloads/MedicalDevices/DeviceRegulatio...

reply

andy_wrote 10 hours ago [-]

Interesting - just perusing those links, it sounds like a multi-armed bandit problem, in which you reason that if something has worked out before, you should tilt your bets more in that direction. In the context of the urn model, you'd return more balls of the same color for every successful draw. In the context of medicine, you can balance between proving or disproving a treatment effect and actually supplying that treatment to the test subjects who need them.

Relatedly, there's a Bayesian interpretation to overweighting successful past draws. A model where you return one extra ball of the same color to the urn gets you a Dirichlet-multinomial distribution, which is a die-roll distribution where the weights to each face are not known for sure, but are given a probability distribution and revised with observed evidence. In other words: here's an n-sided die, I don't know its weightings, but as I observe outcomes I'll update my beliefs that the sides that come up are more favorably weighted. The number of balls in the urn you start with correspond to your priors; only 1 ball of each color means a very weak belief that it's a fair die, 1000 balls of each color means a strong belief, unequal numbers mean that you start off believing it's weighted.

---

kalman filters:

http://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/

http://htmlpreview.github.io/?https://github.com/aguaviva/KalmanFilter/blob/master/KalmanFilter.html

particle filters:

https://www.youtube.com/watch?v=aUkBa1zMKv4

---

Markov Chain Monte Carlo MCMC

https://jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-the-bullshit/

---

a CDF is a percentile divided by 100. E.g. if a student is in the 99th percentile of their class, that means that 99% of the class has a score less than or equal to this student; so, if the students are sorted by score and then assigned integer identifiers in increasing order of score, cdf(this student's identifier) = .99. (note however that this explanation conflates the sample with the population, as the CDF traditionally refers to the population and the percentile traditionally refers to the sample; see https://en.wikipedia.org/wiki/Percentile#Definitions ).

---

"On the topic of the Monty Hall problem, what helped me "believe" it more was if you change it to 1,000,000 doors, still with only 1 car, and the rest goats. You choose 1 door. The host then opens up 999,998 other doors, which all contain goats. So there are 2 doors left. Your door, and the only other door the host didn't open. Do you feel at a gut level that you should switch? " -- will_pseudonym

---

Simpson's paradox

" Imagine there is a disease that kills 50% of the people who get it.

    There is a pill you can take if you get the disease. Of the people who take the pills, 80% die.
    It looks like the pills are killing people. But they aren't, they are helpful.

This is Simpson's paradox.

    What is really happening is that half the people with the disease have mild cases and half have severe cases. A patient with a mild case will get better on their own. But everyone with a severe case dies unless they receive treatment.
    People with mild cases don't bother to take the pills, because they are going to get better anyway.
    Only people with severe cases take the pills. 80% of them die, but without the pills they all would have died."

-- https://blog.plover.com/2021/10/02/

---

Links

toread:

---

some books someone recommended once: 1. Inference - Rohatgi 2. Inference - Stapleton

---