# Statistical Inference (and What is Wrong With Classical Statistics)

## Scope

This page concerns statistical inference as described by the most prominent and mainstream school of thought, variously described as ‘classical statistics’, ‘conventional statistics’, ‘frequentist statistics’, ‘orthodox statistics’ or ‘sampling theory’. Oddly, statistical inference itself, the drawing of conclusions from data, is never actually defined within the paradigm.

The practice of statistical inference as described here includes estimation, both point estimation and interval estimation using confidence intervals, and significance tests, i.e. testing a null hypothesis and calculating p-values.

The important point is that all of these methods involve pretending that our sample came from an imaginary experiment that involved considering all possible samples of the same size from the population.

## History

The first formal significance test (Arbuthnott, 1710) correctly demonstrated that the observed excess of male births was statistically significant, but erroneously concluded that this was due to Divine Providence (intelligent design rather than chance). Modern hypothesis testing is an anonymous hybrid of the tests proposed by Ronald Fisher (1922, 1925) on the one hand, and by Jerzy Neyman and Egon Pearson (1933) on the other. Since Berkson (1938), people have questioned the use of hypothesis testing in the sciences. For a historical account of significance testing, see Huberty (1993).

## The frequentist interpretation of probability is very limited

A frequentist subscribes to the long-run relative frequency interpretation of probability: the probability of an outcome is defined as the limiting frequency with which that outcome appears in a long series of similar trials. Dice, coins and shuffled playing cards can be used to generate random variables; these have frequency distributions, and so the frequency definition of probability can be applied to them. Unfortunately, the frequency interpretation can only be used in cases such as these. The Bayesian interpretation of probability, by contrast, can be used in any situation.
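The long-run interpretation can be made concrete with a short simulation. The sketch below (with an arbitrary seed and toss counts chosen purely for illustration) estimates the probability of heads for a fair coin as the relative frequency over ever longer runs:

```python
import random

# A sketch of the long-run relative frequency interpretation: the
# frequentist probability of "heads" is the limit of the observed
# proportion of heads as the number of tosses grows. The seed and
# toss counts are arbitrary choices for illustration.
random.seed(42)

def relative_frequency(n_tosses):
    """Proportion of heads in n_tosses simulated fair-coin tosses."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

As the number of tosses grows, the observed proportion settles towards 0.5; but note that the definition only makes sense for repeatable events of exactly this kind.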

## The nature of the null hypothesis test

Why should we choose between just two hypotheses, and why can't we put a probability on a hypothesis? A typical null hypothesis, that two population means are equal, is daft: they will almost never be exactly equal. What does it mean to accept and reject a hypothesis? If a significance level is used to decide whether a null hypothesis is true or not, note that the level, such as 0.05, is totally arbitrary (the level effectively acts as a prior, but classical statisticians fail to appreciate this).
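The point that two population means are almost never exactly equal has a practical consequence: with a large enough sample, any nonzero difference, however trivial, will eventually be declared ‘significant’. A minimal sketch, using a standard normal-theory two-sample z-test and purely illustrative numbers (a true difference of 0.01 with common standard deviation 1):

```python
import math

# Illustrative sketch: a negligible true difference in means becomes
# "statistically significant" once the per-group sample size n is
# large enough. All numbers are assumptions chosen for illustration.
def two_sample_p(diff, sigma, n):
    """Two-sided p-value for an observed difference in means `diff`,
    per-group size n, known common standard deviation sigma."""
    se = sigma * math.sqrt(2.0 / n)        # standard error of the difference
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2.0))   # two-sided normal tail probability

for n in (100, 10_000, 1_000_000):
    print(n, two_sample_p(0.01, 1.0, n))
```

With n = 100 per group the p-value is far above 0.05, while with n = 1,000,000 it is vanishingly small: ‘significance’ here measures sample size as much as it measures anything about the hypothesis.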

## Prior information is ignored

Almost all prior information is ignored and no opportunity is given to incorporate what we already know.

## Assumptions are swept under the carpet

The subjective elements of classical statistics, such as the choice of null hypothesis, the determination of the outcome space, the choice of significance level and the dependence of significance tests on the stopping rule, are all swept under the carpet. Bayesian methods put them where we can see them: in the prior.

With little loss of generality, let us consider a simple problem of inference. Assume that we have a large population with known mean and one sample. All of this makes up our evidence, E. Our hypothesis, H, is that the sample came from a different population (one with a different mean).

The frequentist theory of probability is only capable of dealing with random variables which generate a frequency distribution ‘in the long run’. We have one fixed population and one fixed sample. There is nothing random about this problem and the experiment is conducted only once, so there is no ‘long run’. So, versed in frequentist probability, what is our hapless orthodox statistician to do?

We pretend that the experiment was not conducted once, but an infinite number of times (that is, we consider all possible samples of the same size). Incredibly, all samples are considered equal, that is, our actual sample is not given any privileges over any other (imaginary) sample. We assume that each sample mean includes an ‘error’, which is independently and normally distributed about zero. Optimistically, we now claim that our sample was ‘random’. Voila! The sample mean now becomes our random variable, which we call our ‘statistic’. We can now apply the frequentist interpretation of probability.

We are now able to determine the (frequentist) probability of a (randomly chosen) sample mean having a value at least as extreme as our original sample mean. Note that we are implicitly assuming that the null hypothesis is true, so that the expected value of a sample mean equals the known population mean. This probability is our p-value which, incredibly, is assumed to apply to the original problem.
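The imaginary experiment can be sketched directly as a Monte Carlo simulation. All of the numbers below (population mean 100, standard deviation 15, sample size 30, observed sample mean 106) are illustrative assumptions, not taken from the text:

```python
import random
import statistics

# Sketch of the "imaginary experiment": pretend the study was repeated
# endlessly, drawing samples of the same size from the null population,
# and count how often a sample mean at least as extreme as ours turns up.
random.seed(1)
POP_MEAN, POP_SD, N = 100.0, 15.0, 30   # assumed null population and sample size
observed_mean = 106.0                    # assumed result of the one real sample

def simulated_p_value(n_replications=20_000):
    """Two-sided Monte Carlo p-value under the null hypothesis."""
    extreme = 0
    for _ in range(n_replications):
        sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
        if abs(statistics.fmean(sample) - POP_MEAN) >= abs(observed_mean - POP_MEAN):
            extreme += 1
    return extreme / n_replications

print(simulated_p_value())
```

The returned proportion is precisely the frequentist p-value: the fraction of imaginary samples whose mean lies at least as far from the population mean as the one we actually observed.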

A method similar to that outlined above is common to all Fisher-Neyman-Pearson inference. The p-value also suffers from being an incoherent measure of support, in the sense that we can reject a hypothesis that is a superset of a second hypothesis without rejecting the second. P-values are not just irrelevant but dangerous, because they are often misunderstood as probabilities about the hypothesis given the data (which would be far more intuitive). As the prominent Bayesian Harold Jeffreys observed, ‘What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred’ (Jeffreys, 1961).
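The distinction between P(data | hypothesis), which the p-value addresses, and P(hypothesis | data), which readers usually want, can be illustrated with Bayes' theorem in the simplest possible case of two point hypotheses about a binomial success rate. Everything here (the two candidate rates, the data, and the 50/50 prior) is an illustrative assumption:

```python
from math import comb

# Sketch contrasting likelihoods with the posterior probability of a
# hypothesis. All numbers are assumptions chosen for illustration.
def likelihood(k, n, theta):
    """P(k successes in n trials | success rate theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

k, n = 60, 100      # assumed observed data: 60 successes in 100 trials
h0, h1 = 0.5, 0.7   # the two competing success rates
prior_h0 = 0.5      # assumed prior probability of H0

post_h0 = (prior_h0 * likelihood(k, n, h0)) / (
    prior_h0 * likelihood(k, n, h0) + (1 - prior_h0) * likelihood(k, n, h1)
)
print(post_h0)  # P(H0 | data): a probability about the hypothesis itself
```

Unlike a p-value, the result is a direct probability statement about the hypothesis, and the prior's role is explicit rather than hidden.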

In summary:

• accepting and rejecting hypotheses is ill-defined;
• prior information is ignored;
• assumptions are swept under the carpet.

## Important Publications

• ANDERSON, D.R., K.P. BURNHAM and W.L. THOMPSON, 2000. Null hypothesis testing: Problems, prevalence, and an alternative, The Journal of wildlife management 64, 912-923. [Cited by 340] (56.98/year)
Abstract: "This paper presents a review and critique of statistical null hypothesis testing in ecological studies in general, and wildlife studies in particular, and describes an alternative. Our review of Ecology and the Journal of Wildlife Management found the use of null hypothesis testing to be pervasive. The estimated number of P-values appearing within articles of Ecology exceeded 8,000 in 1991 and has exceeded 3,000 in each year since 1984, whereas the estimated number of P-values in the Journal of Wildlife Management exceeded 8,000 in 1997 and has exceeded 3,000 in each year since 1994. We estimated that 47% (SE = 3.9%) of the P-values in the Journal of Wildlife Management lacked estimates of means or effect sizes or even the sign of the difference in means or other parameters. We find that null hypothesis testing is uninformative when no estimates of means or effect size and their precision are given. Contrary to common dogma, tests of statistical null hypotheses have relatively little utility in science and are not a fundamental aspect of the scientific method. We recommend their use be reduced in favor of more informative approaches. Towards this objective, we describe a relatively new paradigm of data analysis based on Kullback-Leibler information. This paradigm is an extension of likelihood theory and, when used correctly, avoids many of the fundamental limitations and common misuses of null hypothesis testing. Information-theoretic methods focus on providing a strength of evidence for an a priori set of alternative hypotheses, rather than a statistical test of a null hypothesis. This paradigm allows the following types of evidence for the alternative hypotheses: the rank of each hypothesis, expressed as a model; an estimate of the formal likelihood of each model, given the data; a measure of precision that incorporates model selection uncertainty; and simple methods to allow the use of the set of alternative models in making, formal inference. 
We provide an example of the information-theoretic approach using data on the effect of lead on survival in spectacled eider ducks (Somateria fischeri). Regardless of the analysis paradigm used, we strongly recommend inferences based on a priori considerations be clearly separated from those resulting from some form of data dredging."

• WILKINSON, Leland and the Task Force on Statistical Inference, 1999. Statistical methods in psychology journals: Guidelines and explanations, American Psychologist Volume 54(8), August 1999, p 594-604. [Cited by 358] (51.38/year)
"Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual p value or, better still, a confidence interval. Never use the unfortunate expression “accept the null hypothesis.” Always provide some effect-size estimate when reporting a p value. Cohen (1994) has written on this subject in this journal. All psychologists would benefit from reading his insightful article."
Part of Conclusions: "Some had hoped that this task force would vote to recommend an outright ban on the use of significance tests in psychology journals. Although this might eliminate some abuses, the committee thought that there were enough counterexamples (e.g., Abelson, 1997) to justify forbearance. Furthermore, the committee believed that the problems raised in its charge went beyond the simple question of whether to ban significance tests."

• COHEN, J., 1994. The earth is round (p<. 05), American Psychologist. [Cited by 515] (43.03/year)
"After 4 decades of severe criticism, the ritual of null hypothesis significance testing—mechanical dichotomous decisions around a sacred .05 criterion—still persists. This article reviews the problems with this practice, including near universal misinterpretation of p as the probability that H0 is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects H0 one thereby affirms the theory that led to the test. Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods are suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication."

• EFRON, B., 2004. Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis. Journal of the American Statistical Association Vol. 99. [Cited by 69] (35.08/year)
Abstract: "Current scientific techniques in genomics and image processing routinely produce hypothesis testing problems with hundreds or thousands of cases to consider simultaneously. This poses new difficulties for the statistician, but also opens new opportunities. In particular it allows empirical estimation of an appropriate null hypothesis. The empirical null may be considerably more dispersed than the usual theoretical null distribution that would be used for any one case considered separately. An empirical Bayes analysis plan for this situation is developed, using a local version of the false discovery rate to examine the inference issues. Two genomics problems are used as examples to show the importance of correctly choosing the null hypothesis."

• JOHNSON, Douglas H., 1999. The insignificance of statistical significance testing, Journal of Wildlife Management 63(3):763-772. [Cited by 216] (30.99/year)
Abstract: "Despite their wide use in scientific journals such as The Journal of Wildlife Management, statistical hypothesis tests add very little value to the products of research. Indeed, they frequently confuse the interpretation of data. This paper describes how statistical hypothesis tests are often viewed, and then contrasts that interpretation with the correct one. I discuss the arbitrariness of P-values, conclusions that the null hypothesis is true, power analysis, and distinctions between statistical and biological significance. Statistical hypothesis testing, in which the null hypothesis about the properties of a population is almost always known a priori to be false, is contrasted with scientific hypothesis testing, which examines a credible null hypothesis about phenomena in nature. More meaningful alternatives are briefly outlined, including estimation and confidence intervals for determining the importance of factors, decision theory for guiding actions in the face of uncertainty, and Bayesian approaches to hypothesis testing and other statistical practices."
Conclusions: "Editors of scientific journals, along with the referees they rely on, are really the arbiters of scientific practice. They need to understand how statistical methods can be used to reach sound conclusions from data that have been gathered. It is not sufficient to insist that authors use statistical methods—the methods must be appropriate to the application. The most common and flagrant misuse of statistics, in my view, is the testing of hypotheses, especially the vast majority of them known beforehand to be false.
With the hundreds of articles already published that decry the use of statistical hypothesis testing, I was somewhat hesitant about writing another. It contains nothing new. But still, reading The Journal of Wildlife Management makes me realize that the message has not really reached the audience of wildlife biologists. Our work is important, so we should use the best tools we have available. Rarely, however, is that tool statistical hypothesis testing."

• KILLEEN, P.R., 2005. An Alternative to Null-Hypothesis Significance Tests, Psychological Science, Volume 16, Number 5, May 2005, pp. 345-353(9). [Cited by 29] (29.95/year)
Abstract: "The statistic prep estimates the probability of replicating an effect. It captures traditional publication criteria for signal-to-noise ratio, while avoiding parametric inference and the resulting Bayesian dilemma. In concert with effect size and replication intervals, prep provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference."

• IOANNIDIS, J.P., 2005. Why most published research findings are false. PLoS Med. [Cited by 29] (29.65/year)
Summary: "There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research."

• SCHMIDT, F.L., 1996. Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers, Psychological Methods 1(2), 115-129. [Cited by 280] (28.09/year)
Abstract: "Data analysis methods in psychology still emphasize statistical significance testing, despite numerous articles demonstrating its severe deficiencies. It is now possible to use meta-analysis to show that reliance on significance testing retards the development of cumulative knowledge. But reform of teaching and practice will also require that researchers learn that the benefits that they believe flow from use of significance testing are illusory. Teachers must revamp their courses to bring students to understand that (a) reliance on significance testing retards the growth of cumulative research knowledge; (b) benefits widely believed to flow from significance testing do not in fact exist; and (c) significance testing methods must be replaced with point estimates and confidence intervals in individual studies and with meta-analyses in the integration of multiple studies. This reform is essential to the future progress of cumulative knowledge in psychological research."

• NEWCOMBE, R.G., 1998. Two-sided confidence intervals for the single proportion: comparison of seven methods, Statistics in Medicine, Volume 17, Issue 8, Pages 857 - 872. [Cited by 221] (27.74/year)
Abstract: "Simple interval estimate methods for proportions exhibit poor coverage and can produce evidently inappropriate intervals. Criteria appropriate to the evaluation of various proposed methods include: closeness of the achieved coverage probability to its nominal value; whether intervals are located too close to or too distant from the middle of the scale; expected interval width; avoidance of aberrations such as limits outside [0,1] or zero width intervals; and ease of use, whether by tables, software or formulae. Seven methods for the single proportion are evaluated on 96,000 parameter space points. Intervals based on tail areas and the simpler score methods are recommended for use. In each case, methods are available that aim to align either the minimum or the mean coverage with the nominal 1- ."

• FARRIS, J.S., et al., 1995. Constructing a Significance Test for Incongruence. Systematic Biology, Vol. 44, No. 4. (Dec., 1995), pp. 570-572. [Cited by 296] (26.99/year)

• WESTFALL, P.H. and S.S. YOUNG, 1993. Resampling-based multiple testing: examples and methods for p-value adjustment. John Wiley & Sons. [Cited by 346] (26.67/year)

• GIGERENZER, G., S. KRAUSS and O. VITOUCH, 2004. The null ritual: What you always wanted to know about significance testing but were afraid to ask. The SAGE handbook of quantitative methodology for the social …. [Cited by 44] (22.37/year)

• KLEIN, D.F., 2005. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. American Journal of Psychiatry. [Cited by 21] (21.69/year)

• KLAYMAN, J. and Y.W. HA, 1987. Confirmation, Disconfirmation, and Information in Hypothesis Testing, Psychological Review, 94, 211-228. [Cited by 408] (21.51/year)
Abstract: "Strategies for hypothesis testing in scientific investigation and everyday reasoning have interested both psychologists and philosophers. A number of these scholars stress the importance of disconfirmation in reasoning and suggest that people are instead prone to a general deleterious “confirmation bias.” In particular, it is suggested that people tend to test those cases that have the best chance of verifying current beliefs rather than those that have the best chance of falsifying them. We show, however, that many phenomena labeled “confirmation bias” are better understood in terms of a general positive test strategy. With this strategy, there is a tendency to test cases that are expected (or known) to have the property of interest rather than those expected (or known) to lack that property. This strategy is not equivalent to confirmation bias in the first sense; we show that the positive test strategy can be a very good heuristic for determining the truth or falsity of a hypothesis under realistic conditions. It can, however, lead to systematic errors or inefficiencies. The appropriateness of human hypothesis-testing strategies and prescriptions about optimal strategies must be understood in terms of the interaction between the strategy and the task at hand."

• DAVIES, R.B., 1977. Hypothesis testing when a nuisance parameter is present only under the alternative, Biometrika, Vol. 64, No. 2. (Aug., 1977), pp. 247-254. [Cited by 376] (19.82/year)
Abstract: "Suppose that the distribution of a random variable representing the outcome of an experiment depends on two parameters ξ and θ and that we wish to test the hypothesis ξ = 0 against the alternative ξ > 0. If the distribution does not depend on θ when ξ = 0, standard asymptotic methods such as likelihood ratio testing or C(α) testing are not directly applicable. However, these methods may, under appropriate conditions, be used to reduce the problem to one involving inference from a Gaussian process. This simplified problem is examined and a test which may be derived as a likelihood ratio test or from the union-intersection principle is introduced. Approximate expressions for the significance level and power are obtained."

• CHOW, S.L., 2000. Précis of Statistical significance: Rationale, validity, and utility, Behavioral and Brain Sciences 1998 Apr;21(2):169-94; discussion 194-239. [Cited by 111] (18.60/year)
Abstract: "The null-hypothesis significance-test procedure (NHSTP) is defended in the context of the theory-corroboration experiment, as well as the following contrasts: (a) substantive hypotheses versus statistical hypotheses, (b) theory corroboration versus statistical hypothesis testing, (c) theoretical inference versus statistical decision, (d) experiments versus nonexperimental studies, and (e) theory corroboration versus treatment assessment. The null hypothesis can be true because it is the hypothesis that errors are randomly distributed in data. Moreover, the null hypothesis is never used as a categorical proposition. Statistical significance means only that chance influences can be excluded as an explanation of data; it does not identify the nonchance factor responsible. The experimental conclusion is drawn with the inductive principle underlying the experimental design. A chain of deductive arguments gives rise to the theoretical conclusion via the experimental conclusion. The anomalous relationship between statistical significance and the effect size often used to criticize NHSTP is more apparent than real. The absolute size of the effect is not an index of evidential support for the substantive hypothesis. Nor is the effect size, by itself, informative as to the practical importance of the research result. Being a conditional probability, statistical power cannot be the a priori probability of statistical significance. The validity of statistical power is debatable because statistical significance is determined with a single sampling distribution of the test statistic based on H0, whereas it takes two distributions to represent statistical power or effect size. Sample size should not be determined in the mechanical manner envisaged in power analysis. It is inappropriate to criticize NHSTP for nonstatistical reasons. 
At the same time, neither effect size, nor confidence interval estimate, nor posterior probability can be used to exclude chance as an explanation of data. Neither can any of them fulfill the nonstatistical functions expected of them by critics."

• NICKERSON, R.S., 2000. Null hypothesis significance testing: A review of an old and continuing controversy, Psychological Methods. [Cited by 111] (18.60/year)
Abstract: "Null hypothesis significance testing (NHST) is arguably the most widely used approach to hypothesis evaluation among behavioral and social scientists. It is also very controversial. A major concern expressed by critics is that such testing is misunderstood by many of those who use it. Several other objections to its use have also been raised. In this article the author reviews and comments on the claimed misunderstandings as well as on other criticisms of the approach, and he notes arguments that have been advanced in support of NHST. Alternatives and supplements to NHST are considered, as are several related recommendations regarding the interpretation of experimental data. The concluding opinion is that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data."

• GOODMAN, S.N., 1999. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. [Cited by 121] (17.35/year)
Abstract: "An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain “error rates,” without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used—the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings."

• HARLOW, L.L., S.A. MULAIK and J.H. STEIGER, 1997. What If There Were No Significance Tests?. erlbaum.com. [Cited by 151] (16.84/year)

• THOMPSON, B., 2002. What future quantitative social science research could look like: Confidence intervals for effect sizes, Educational Researcher v31 n3 p25-32 Apr 2002. [Cited by 66] (16.63/year)

• WILCOX, R.R., 1997. Introduction to robust estimation and hypothesis testing. Academic Press San Diego, CA. [Cited by 145] (16.17/year)

• BERKSON, J., 2003. Tests of significance considered as evidence. International Journal of Epidemiology. [Cited by 48] (16.16/year)

• GARDNER, M.J. and D.G. ALTMAN, 1986. Confidence intervals rather than P values: estimation rather than hypothesis testing.. Br Med J (Clin Res Ed). [Cited by 306] (15.32/year)
"Overemphasis on hypothesis testing--and the use of P values to dichotomise significant or non-significant results--has detracted from more useful approaches to interpreting study results, such as estimation and confidence intervals. In medical studies investigators are usually interested in determining the size of difference of a measured outcome between groups, rather than a simple indication of whether or not it is statistically significant. Confidence intervals present a range of values, on the basis of the sample data, in which the population value for such a difference may lie. Some methods of calculating confidence intervals for means and differences between means are given, with similar information for proportions. The paper also gives suggestions for graphical display. Confidence intervals, if appropriate to the type of study, should be used for major findings in both the main text of a paper and its abstract."

• BRAUMOELLER, B.F., 2004. Hypothesis Testing and Multiplicative Interaction Terms. International Organization. [Cited by 28] (14.23/year)
Abstract: "When a statistical equation incorporates a multiplicative term in an attempt to model interaction effects, the statistical significance of the lower-order coefficients is largely useless for the typical purposes of hypothesis testing. This fact remains largely unappreciated in political science, however. This brief article explains this point, provides examples, and offers some suggestions for more meaningful interpretation."

• ROSENTHAL, R., 1979. The “file drawer problem” and tolerance for null results. Psychological Bulletin 86, 638-641. [Cited by 380] (14.09/year)

• HANSEN, B.E., 1997. Approximate Asymptotic P Values for Structural-Change Tests. Journal of Business & Economic Statistics. [Cited by 124] (13.82/year)

• TRAFIMOW, D., 2003. Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes's …. Psychological Review. [Cited by 41] (13.81/year)
Abstract: "Because the probability of obtaining an experimental finding given that the null hypothesis is true [p(F\H0)] is not the same as the probability that the null hypothesis is true given a finding [p(H0\F)], calculating the former probability does not justify conclusions about the latter one. As the standard null-hypothesis significance-testing procedure does just that, it is logically invalid (J. Cohen, 1994). Theoretically, Bayes's theorem yields p(H0\F), but in practice, researchers rarely know the correct values for 2 of the variables in the theorem. Nevertheless, by considering a wide range of possible values for the unknown variables, it is possible to calculate a range of theoretical values for p(H0\F) and to draw conclusions about both hypothesis testing and theory evaluation."

• MASSON, M.E.J. and G.R. LOFTUS, 2003. Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology. [Cited by 40] (13.48/year)

• CUMMING, G. and S. FINCH, 2001. A Primer on the Understanding, Use, and Calculation of Confidence Intervals That Are Based on …. Educational and Psychological Measurement. [Cited by 63] (12.68/year)

• LENHARD, J., 2006. Models and Statistical Inference: The Controversy between Fisher and Neyman-Pearson, The British Journal for the Philosophy of Science 57(1):69-91. [Cited by 3] (12.55/year)

• WRIGHT, S.P., 1992. Adjusted P-Values for Simultaneous Inference. Biometrics. [Cited by 163] (11.66/year)

• KOCH, K.R., 1988. Parameter estimation and hypothesis testing in linear models. Springer-Verlag New York, Inc. New York, NY, USA. [Cited by 195] (10.85/year)

• TVERSKY, A. and D. KAHNEMAN, 1971. Belief in the law of small numbers. Psychological Bulletin, 76, 105-110. [Cited by 371] (10.61/year)

• THOMPSON, Bruce, 1996. AERA Editorial Policies regarding Statistical Significance Testing: Three Suggested Reforms, Educational Researcher, Vol. 25, No. 2. (Mar., 1996), pp. 26-30. [Cited by 105] (10.53/year)
Abstract: "The present comment reviews practices revolving around tests of statistical significance. First, the logic of statistical significance testing is presented in an accessible manner; many people who use statistical tests might not place such a premium on the tests if these individuals understood what the tests really do, and what the tests do not do. Second, the etiology of decades of misuse of statistical tests is briefly explored; we must understand the bad implicit logic of persons who misuse statistical tests if we are to have any hope of persuading them to alter their practices. Third, three revised editorial policies that would improve conventional practice are highlighted."

• BELIA, S., et al., 2005. Researchers misunderstand confidence intervals and standard error bars. Psychological Methods. [Cited by 10] (10.33/year)

• STEPHENS, P.A., et al., 2005. Information theory and hypothesis testing: a call for pluralism. Journal of Applied Ecology. [Cited by 10] (10.33/year)

• THEILER, J. and D. PRICHARD, 1996. Constrained-realization Monte-Carlo method for hypothesis testing. Physica D. [Cited by 100] (10.03/year)
Abstract: "We compare two theoretically distinct approaches to generating artificial (or ``surrogate'') data for testing hypotheses about a given data set. The first and more straightforward approach is to fit a single ``best'' model to the original data, and then to generate surrogate data sets that are ``typical realizations'' of that model. The second approach concentrates not on the model but directly on the original data; it attempts to constrain the surrogate data sets so that they exactly agree with the original data for a specified set of sample statistics. Examples of these two approaches are provided for two simple cases: a test for deviations from a gaussian distribution, and a test for serial dependence in a time series. Additionally, we consider tests for nonlinearity in time series based on a Fourier transform (FT) method and on more conventional autoregressive moving-average (ARMA) fits to the data. The comparative performance of hypothesis testing schemes based on these two approaches is found to depend on whether or not the discriminating statistic is pivotal. A statistic is ``pivotal'' if its distribution is the same for all processes consistent with the null hypothesis. The typical-realization method requires that the discriminating statistic satisfy this property. The constrained-realization approach, on the other hand, does not share this requirement, and can provide an accurate and powerful test without having to sacrifice flexibility in the choice of discriminating statistic."