Statistical misunderstandings

Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations

Excerpted from Greenland et al. (2016), Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations, European Journal of Epidemiology 31:337–350.

What P values, confidence intervals, and power calculations don't tell us

Common misinterpretations of single P values

  1. The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true.

Show answer

No! The P value assumes the test hypothesis is true---it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis. The P value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical model (including the test hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation.
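As a concrete (hypothetical) illustration of this point, the sketch below uses made-up numbers: it treats the null hypothesis and model as true and asks how often a mean difference at least as extreme as an assumed observed one would arise by chance alone. That frequency is the P value; nothing in the calculation gives a probability that the hypothesis itself is true.

```python
# Sketch with assumed numbers (n = 30 per group, observed difference = 0.6):
# the P value is the frequency, under the null model, of data at least as
# extreme as those observed -- not the probability that the null is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, observed_diff, sims = 30, 0.6, 100_000

# Simulate experiments in which the null hypothesis really holds
# (both groups drawn from the same standard normal distribution).
diffs = (rng.normal(size=(sims, n)).mean(axis=1)
         - rng.normal(size=(sims, n)).mean(axis=1))

p_sim = np.mean(np.abs(diffs) >= observed_diff)               # simulated P value
p_exact = 2 * stats.norm.sf(observed_diff / np.sqrt(2 / n))   # analytic P value
print(p_sim, p_exact)                                         # both roughly 0.02
```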



  2. The P value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the P value for the null hypothesis is 0.08, there is an 8 % probability that chance alone produced the association.

Show answer

No! This is a common variation of the first fallacy and it is just as false. To say that chance alone produced the observed association is logically equivalent to asserting that every assumption used to compute the P value is correct, including the null hypothesis. Thus to claim that the null P value is the probability that chance alone produced the observed association is completely backwards: The P value is a probability computed assuming chance was operating alone.

Note: One often sees "alone" dropped from this description (becoming "the P value for the null hypothesis is the probability that chance produced the observed association"), so that the statement is more ambiguous, but just as wrong.



  3. A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected.

Show answer

No! A small P value simply flags the data as being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; it may be small because there was a large random error or because some assumption other than the test hypothesis was violated (for example, the assumption that this P value was not selected for presentation because it was below 0.05). P ≤ 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed no more than 5 % of the time if only chance were creating the discrepancy (as opposed to a violation of the test hypothesis or a mistaken assumption).



  4. A nonsignificant test result (P > 0.05) means that the test hypothesis is true or should be accepted.

Show answer

No! A large P value only suggests that the data are not unusual if all the assumptions used to compute the P value (including the test hypothesis) were correct. The same data would also not be unusual under many other hypotheses. Furthermore, even if the test hypothesis is wrong, the P value may be large because it was inflated by a large random error or because of some other erroneous assumption (for example, the assumption that this P value was not selected for presentation because it was above 0.05). P > 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed more than 5 % of the time if only chance were creating the discrepancy.



  5. A large P value is evidence in favor of the test hypothesis.

Show answer

No! In fact, any P value less than 1 implies that the test hypothesis is not the hypothesis most compatible with the data, because any other hypothesis with a larger P value would be even more compatible with the data. A P value cannot be said to favor the test hypothesis except in relation to those hypotheses with smaller P values. Furthermore, a large P value often indicates only that the data are incapable of discriminating among many competing hypotheses (as would be seen immediately by examining the range of the confidence interval). For example, many authors will misinterpret P = 0.70 from a test of the null hypothesis as evidence for no effect, when in fact it indicates that, even though the null hypothesis is compatible with the data under the assumptions used to compute the P value, it is not the hypothesis most compatible with the data---that honor would belong to a hypothesis with P = 1. But even if P = 1, there will be many other hypotheses that are highly consistent with the data, so that a definitive conclusion of "no association" cannot be deduced from a P value, no matter how large.



  6. A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated.

Show answer

No! Observing P > 0.05 for the null hypothesis only means that the null is one among the many hypotheses that have P > 0.05. Thus, unless the point estimate (observed association) equals the null value exactly, it is a mistake to conclude from P > 0.05 that a study found "no association" or "no evidence" of an effect. If the null P value is less than 1 some association must be present in the data, and one must look at the point estimate to determine the effect size most compatible with the data under the assumed model.



  7. Statistical significance indicates a scientifically or substantively important relation has been detected.

Show answer

No! Especially when a study is large, very minor effects or small assumption violations can lead to statistically significant tests of the null hypothesis. Again, a small null P value simply flags the data as being unusual if all the assumptions used to compute it (including the null hypothesis) were correct; but the way the data are unusual might be of no clinical interest. One must look at the confidence interval to determine which effect sizes of scientific or other substantive (e.g., clinical) importance are relatively compatible with the data, given the model.



  8. Lack of statistical significance indicates that the effect size is small.

Show answer

No! Especially when a study is small, even large effects may be "drowned in noise" and thus fail to be detected as statistically significant by a statistical test. A large null P value simply flags the data as not being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; but the same data will also not be unusual under many other models and hypotheses besides the null. Again, one must look at the confidence interval to determine whether it includes effect sizes of importance.



  9. The P value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05 means that the observed association would occur only 5 % of the time under the test hypothesis.

Show answer

No! The P value refers not only to what we observed, but also observations more extreme than what we observed (where "extremity" is measured in a particular way). And again, the P value refers to a data frequency when all the assumptions used to compute it are correct. In addition to the test hypothesis, these assumptions include randomness in sampling, treatment assignment, loss, and missingness, as well as an assumption that the P value was not selected for presentation based on its size or some other aspect of the results.



  10. If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your "significant finding" is a false positive) is 5 %.

Show answer

No! To see why this description is false, suppose the test hypothesis is in fact true. Then, if you reject it, the chance you are in error is 100 %, not 5 %. The 5 % refers only to how often you would reject it, and therefore be in error, over very many uses of the test across different studies when the test hypothesis and all other assumptions used for the test are true. It does not refer to your single use of the test, which may have been thrown off by assumption violations as well as random errors. This is yet another version of misinterpretation #1.



  11. P = 0.05 and P ≤ 0.05 mean the same thing.

Show answer

No! This is like saying reported height = 2 m and reported height ≤ 2 m are the same thing: "height = 2 m" would include few people and those people would be considered tall, whereas "height ≤ 2 m" would include most people, including small children. Similarly, P = 0.05 would be considered a borderline result in terms of statistical significance, whereas P ≤ 0.05 lumps borderline results together with results very incompatible with the model (e.g., P = 0.0001), thus rendering its meaning vague, for no good purpose.



  12. P values are properly reported as inequalities (e.g., report "P < 0.02" when P = 0.015 or report "P > 0.05" when P = 0.06 or P = 0.70).

Show answer

No! This is bad practice because it makes it difficult or impossible for the reader to accurately interpret the statistical result. Only when the P value is very small (e.g., under 0.001) does an inequality become justifiable: There is little practical difference among very small P values when the assumptions used to compute P values are not known with enough certainty to justify such precision, and most methods for computing P values are not numerically accurate below a certain point.



  13. Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect significance.

Show answer

No! This misinterpretation is promoted when researchers state that they have or have not found "evidence of" a statistically significant effect. The effect being tested either exists or does not exist. "Statistical significance" is a dichotomous description of a P value (that it is below the chosen cut-off) and thus is a property of a result of a statistical test; it is not a property of the effect or population being studied.



  14. One should always use two-sided P values.

Show answer

No! Two-sided P values are designed to test hypotheses that the targeted effect measure equals a specific value (e.g., zero), and is neither above nor below this value. When, however, the test hypothesis of scientific or practical interest is a one-sided (dividing) hypothesis, a one-sided P value is appropriate. For example, consider the practical question of whether a new drug is at least as good as the standard drug for increasing survival time. This question is one-sided, so testing this hypothesis calls for a one-sided P value. Nonetheless, because two-sided P values are the usual default, it will be important to note when and why a one-sided P value is being used instead.
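A minimal sketch of the distinction, using an assumed standardized estimate: the same data yield different P values depending on whether the question is two-sided ("is the effect different from zero?") or one-sided ("is the new drug at least as good as the standard?").

```python
# Sketch with an assumed z statistic: one-sided vs. two-sided P values
# answer different questions about the same estimate.
from scipy import stats

z = 1.70                                   # hypothetical standardized estimate
p_two_sided = 2 * stats.norm.sf(abs(z))    # test hypothesis: effect = 0
p_one_sided = stats.norm.sf(z)             # test hypothesis: effect <= 0
print(round(p_two_sided, 3), round(p_one_sided, 3))   # ~0.089 vs ~0.045
```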



Common misinterpretations of P value comparisons and predictions

Some of the most severe distortions of the scientific literature produced by statistical testing involve erroneous comparison and synthesis of results from different studies or study subgroups. Among the worst are:

  15. When the same hypothesis is tested in different studies and none or a minority of the tests are statistically significant (all P > 0.05), the overall evidence supports the hypothesis.

Show answer

No! This belief is often used to claim that a literature supports no effect when the opposite is the case. It reflects a tendency of researchers to "overestimate the power of most research" [89]. In reality, every study could fail to reach statistical significance and yet when combined show a statistically significant association and persuasive evidence of an effect. For example, if there were five studies each with P = 0.10, none would be significant at the 0.05 level; but when these P values are combined using the Fisher formula [9], the overall P value would be 0.01. There are many real examples of persuasive evidence for important effects when few studies or even no study reported "statistically significant" associations [90, 91]. Thus, lack of statistical significance of individual studies should not be taken as implying that the totality of evidence supports no effect.
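The Fisher combination mentioned above can be checked directly; a minimal sketch (the explicit formula alongside scipy's built-in helper):

```python
# Five studies, each with P = 0.10: none is "significant" on its own, yet
# Fisher's method gives a combined P value of about 0.01.
import numpy as np
from scipy import stats

p_values = [0.10] * 5
chi2_stat = -2 * np.sum(np.log(p_values))                  # Fisher's statistic
p_combined = stats.chi2.sf(chi2_stat, df=2 * len(p_values))
print(round(p_combined, 3))                                # ~0.011

# The same combination via scipy's helper:
stat, p = stats.combine_pvalues(p_values, method="fisher")
print(round(p, 3))                                         # ~0.011
```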



  16. When the same hypothesis is tested in two different populations and the resulting P values are on opposite sides of 0.05, the results are conflicting.

Show answer

No! Statistical tests are sensitive to many differences between study populations that are irrelevant to whether their results are in agreement, such as the sizes of compared groups in each population. As a consequence, two studies may provide very different P values for the same test hypothesis and yet be in perfect agreement (e.g., may show identical observed associations). For example, suppose we had two randomized trials A and B of a treatment, identical except that trial A had a known standard error of 2 for the mean difference between treatment groups whereas trial B had a known standard error of 1 for the difference. If both trials observed a difference between treatment groups of exactly 3, the usual normal test would produce P = 0.13 in A but P = 0.003 in B. Despite their difference in P values, the test of the hypothesis of no difference in effect across studies would have P = 1, reflecting the perfect agreement of the observed mean differences from the studies. Differences between results must be evaluated directly, for example by estimating and testing those differences to produce a confidence interval and a P value comparing the results (often called analysis of heterogeneity, interaction, or modification).
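The trial A/B example can be reproduced in a few lines; a sketch using the numbers given above:

```python
# Trials A and B observe the same mean difference (3) but with different
# standard errors (2 vs. 1): their null P values differ (0.13 vs. 0.003),
# yet the direct test of "no difference between the trials" gives P = 1.
from math import sqrt
from scipy import stats

diff_a, se_a = 3.0, 2.0
diff_b, se_b = 3.0, 1.0

p_a = 2 * stats.norm.sf(diff_a / se_a)                     # ~0.13
p_b = 2 * stats.norm.sf(diff_b / se_b)                     # ~0.003

z_het = (diff_a - diff_b) / sqrt(se_a**2 + se_b**2)        # 0
p_het = 2 * stats.norm.sf(abs(z_het))                      # 1.0: perfect agreement
print(round(p_a, 2), round(p_b, 3), p_het)
```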



  17. When the same hypothesis is tested in two different populations and the same P values are obtained, the results are in agreement.

Show answer

No! Again, tests are sensitive to many differences between populations that are irrelevant to whether their results are in agreement. Two different studies may even exhibit identical P values for testing the same hypothesis yet also exhibit clearly different observed associations. For example, suppose randomized experiment A observed a mean difference between treatment groups of 3.00 with standard error 1.00, while B observed a mean difference of 12.00 with standard error 4.00. Then the standard normal test would produce P = 0.003 in both; yet the test of the hypothesis of no difference in effect across studies gives P = 0.03, reflecting the large difference (12.00 − 3.00 = 9.00) between the mean differences.



  18. If one observes a small P value, there is a good chance that the next study will produce a P value at least as small for the same hypothesis.

Show answer

No! This is false even under the ideal condition that both studies are independent and all assumptions including the test hypothesis are correct in both studies. In that case, if (say) one observes P = 0.03, the chance that the new study will show P ≤ 0.03 is only 3 %; thus the chance the new study will show a P value as small or smaller (the "replication probability") is exactly the observed P value! If on the other hand the small P value arose solely because the true effect exactly equaled its observed estimate, there would be a 50 % chance that a repeat experiment of identical design would have a larger P value [37]. In general, the size of the new P value will be extremely sensitive to the study size and the extent to which the test hypothesis or other assumptions are violated in the new study [86]; in particular, P may be very small or very large depending on whether the study and the violations are large or small.
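A sketch of the first point (under the ideal conditions described, with an assumed study size): when the test hypothesis and all other assumptions hold, the P value is uniformly distributed, so the chance that a replication yields P ≤ 0.03 is simply 0.03.

```python
# Simulate replications in which the null hypothesis is exactly true and
# count how often the new P value is at or below an "observed" P of 0.03.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, sims = 50, 200_000                        # assumed study size and number of replications

z = rng.normal(size=(sims, n)).mean(axis=1) * np.sqrt(n)   # z statistics under the null
p_new = 2 * stats.norm.sf(np.abs(z))

print(np.mean(p_new <= 0.03))                # ~0.03, i.e., the observed P value itself
```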



Common misinterpretations of confidence intervals

Most of the above misinterpretations translate into an analogous misinterpretation for confidence intervals. For example, another misinterpretation of P > 0.05 is that it means the test hypothesis has only a 5 % chance of being false, which in terms of a confidence interval becomes the common fallacy:

  19. The specific 95 % confidence interval presented by a study has a 95 % chance of containing the true effect size.

Show answer

No! A reported confidence interval is a range between two numbers. The frequency with which an observed interval (e.g., 0.72--2.88) contains the true effect is either 100 % if the true effect is within the interval or 0 % if not; the 95 % refers only to how often 95 % confidence intervals computed from very many studies would contain the true size if all the assumptions used to compute the intervals were correct. It is possible to compute an interval that can be interpreted as having 95 % probability of containing the true value; nonetheless, such computations require not only the assumptions used to compute the confidence interval, but also further assumptions about the size of effects in the model. These further assumptions are summarized in what is called a prior distribution, and the resulting intervals are usually called Bayesian posterior (or credible) intervals to distinguish them from confidence intervals [18].
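A small simulation (with an assumed true effect and study size) makes the frequency interpretation concrete: 95 % is the long-run proportion of computed intervals that cover the true value, not a probability statement about any single reported interval.

```python
# Repeat a study many times under a model in which all assumptions hold and
# count how often the computed 95 % interval covers the true effect.
import numpy as np

rng = np.random.default_rng(3)
true_effect, n, sims = 1.0, 40, 100_000      # assumed truth and study size

means = rng.normal(loc=true_effect, size=(sims, n)).mean(axis=1)
se = 1 / np.sqrt(n)                          # known-variance normal model, for simplicity
covered = (means - 1.96 * se <= true_effect) & (true_effect <= means + 1.96 * se)

print(covered.mean())   # ~0.95 across repetitions; any one interval either covers the truth or it does not
```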



  20. An effect size outside the 95 % confidence interval has been refuted (or excluded) by the data.

Show answer

No! As with the P value, the confidence interval is computed from many assumptions, the violation of which may have led to the results. Thus it is the combination of the data with the assumptions, along with the arbitrary 95 % criterion, that are needed to declare an effect size outside the interval is in some way incompatible with the observations. Even then, judgements as extreme as saying the effect size has been refuted or excluded will require even stronger conditions.



  21. If two confidence intervals overlap, the difference between two estimates or studies is not significant.

Show answer

No! The 95 % confidence intervals from two subgroups or studies may overlap substantially and yet the test for difference between them may still produce P < 0.05. Suppose for example, two 95 % confidence intervals for means from normal populations with known variances are (1.04, 4.96) and (4.16, 19.84); these intervals overlap, yet the test of the hypothesis of no difference in effect across studies gives P = 0.03. As with P values, comparison between groups requires statistics that directly test and estimate the differences across groups. It can, however, be noted that if the two 95 % confidence intervals fail to overlap, then when using the same assumptions used to compute the confidence intervals we will find P < 0.05 for the difference; and if one of the 95 % intervals contains the point estimate from the other group or study, we will find P > 0.05 for the difference.
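The interval example above can be verified directly (the two intervals correspond to means of 3 and 12 with standard errors 1 and 4); a sketch:

```python
# Two 95 % confidence intervals that overlap, yet the direct test of the
# difference between the two estimates gives P of about 0.03.
from math import sqrt
from scipy import stats

mean_a, se_a = 3.0, 1.0     # interval (1.04, 4.96)
mean_b, se_b = 12.0, 4.0    # interval (4.16, 19.84)

ci_a = (mean_a - 1.96 * se_a, mean_a + 1.96 * se_a)
ci_b = (mean_b - 1.96 * se_b, mean_b + 1.96 * se_b)
overlap = ci_a[1] > ci_b[0]                                # True: the intervals overlap

z = (mean_b - mean_a) / sqrt(se_a**2 + se_b**2)
p_diff = 2 * stats.norm.sf(z)                              # ~0.03
print(overlap, ci_a, ci_b, round(p_diff, 3))
```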



  22. An observed 95 % confidence interval predicts that 95 % of the estimates from future studies will fall inside the observed interval.

Show answer

No! This statement is wrong in several ways. Most importantly, under the model, 95 % is the frequency with which other unobserved intervals will contain the true effect, not how frequently the one interval being presented will contain future estimates. In fact, even under ideal conditions the chance that a future estimate will fall within the current interval will usually be much less than 95 %. For example, if two independent studies of the same quantity provide unbiased normal point estimates with the same standard errors, the chance that the 95 % confidence interval for the first study contains the point estimate from the second is 83 % (which is the chance that the difference between the two estimates is less than 1.96 standard errors). Again, an observed interval either does or does not contain the true effect; the 95 % refers only to how often 95 % confidence intervals computed from very many studies would contain the true effect if all the assumptions used to compute the intervals were correct.
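The 83 % figure follows from the fact that the difference between two independent estimates with equal standard error SE has standard error sqrt(2) × SE; a one-line check:

```python
# Probability that the second study's estimate falls inside the first study's
# 95 % interval, when both are unbiased normal estimates with equal SE.
from math import sqrt
from scipy import stats

p_inside = 2 * stats.norm.cdf(1.96 / sqrt(2)) - 1
print(round(p_inside, 3))    # ~0.834, well below 0.95
```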



  23. If one 95 % confidence interval includes the null value and another excludes that value, the interval excluding the null is the more precise one.

Show answer

No! When the model is correct, precision of statistical estimation is measured directly by confidence interval width (measured on the appropriate scale). It is not a matter of inclusion or exclusion of the null or any other value. Consider two 95 % confidence intervals for a difference in means, one with limits of 5 and 40, the other with limits of −5 and 10. The first interval excludes the null value of 0, but is 30 units wide. The second includes the null value, but is half as wide and therefore much more precise.



Common misinterpretations of power

The power of a test to detect a correct alternative hypothesis is the pre-study probability that the test will reject the test hypothesis (e.g., the probability that P will not exceed a pre-specified cut-off such as 0.05). (The corresponding pre-study probability of failing to reject the test hypothesis when the alternative is correct is one minus the power, also known as the Type II or beta error rate [84].) As with P values and confidence intervals, this probability is defined over repetitions of the same study design and so is a frequency probability. One source of reasonable alternative hypotheses is the set of effect sizes that were used to compute power in the study proposal. Pre-study power calculations do not, however, measure the compatibility of these alternatives with the data actually observed, while power calculated from the observed data is a direct (if obscure) transformation of the null P value and so provides no test of the alternatives. Thus, presentation of power does not obviate the need to provide interval estimates and direct tests of the alternatives.

For these reasons, many authors have condemned use of power to interpret estimates and statistical tests, arguing that (in contrast to confidence intervals) it distracts attention from direct comparisons of hypotheses and introduces new misinterpretations.
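As a concrete reference for the definition above, a minimal pre-study power calculation under the normal model (all numbers assumed) looks like this; note that it uses no observed data at all:

```python
# Pre-study power: the probability that a two-sided test at alpha = 0.05 will
# reject the null if the true effect equals a chosen alternative.
from math import sqrt
from scipy import stats

alternative = 0.5            # assumed true mean difference
sigma, n = 1.0, 85           # assumed SD and per-group sample size

se = sigma * sqrt(2 / n)
z_alt = alternative / se
power = stats.norm.sf(1.96 - z_alt) + stats.norm.cdf(-1.96 - z_alt)
print(round(power, 2))       # ~0.90; computed before any data are observed
```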

  24. If you accept the null hypothesis because the null P value exceeds 0.05 and the power of your test is 90 %, the chance you are in error (the chance that your finding is a false negative) is 10 %.

Show answer

No! If the null hypothesis is false and you accept it, the chance you are in error is 100 %, not 10 %. Conversely, if the null hypothesis is true and you accept it, the chance you are in error is 0 %. The 10 % refers only to how often you would be in error over very many uses of the test across different studies when the particular alternative used to compute power is correct and all other assumptions used for the test are correct in all the studies. It does not refer to your single use of the test or your error rate under any alternative effect size other than the one used to compute power.



It can be especially misleading to compare results for two hypotheses by presenting a test or P value for one and power for the other. For example, testing the null by seeing whether P ≤ 0.05 with a power less than 1 − 0.05 = 0.95 for the alternative (as done routinely) will bias the comparison in favor of the null because it entails a lower probability of incorrectly rejecting the null (0.05) than of incorrectly accepting the null when the alternative is correct. Thus, claims about relative support or evidence need to be based on direct and comparable measures of support or evidence for both hypotheses, otherwise mistakes like the following will occur:

  25. If the null P value exceeds 0.05 and the power of this test is 90 % at an alternative, the results support the null over the alternative.

Show answer

No! This claim seems intuitive to many, but counterexamples are easy to construct in which the null P value is between 0.05 and 0.10, and yet there are alternatives whose own P value exceeds 0.10 and for which the power is 0.90. Parallel results ensue for other accepted measures of compatibility, evidence, and support, indicating that the data show lower compatibility with and more evidence against the null than the alternative, despite the fact that the null P value is "not significant" at the 0.05 alpha level and the power against the alternative is "very high" [42].
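One such counterexample (with assumed numbers, not taken from the paper) can be constructed directly under the normal model:

```python
# Counterexample sketch: the null P value exceeds 0.05 and power at the
# alternative is 0.90, yet the alternative's own P value is larger than the
# null's, so the data are more compatible with the alternative than the null.
from scipy import stats

se = 1.0
alternative = 1.96 + stats.norm.ppf(0.90)    # ~3.24, chosen to give 90 % power
estimate = 1.80                              # assumed observed estimate

p_null = 2 * stats.norm.sf(abs(estimate - 0.0) / se)           # ~0.07 (> 0.05)
p_alt = 2 * stats.norm.sf(abs(estimate - alternative) / se)    # ~0.15 (> p_null)
power = stats.norm.sf(1.96 - alternative / se)                 # ~0.90
print(round(p_null, 2), round(p_alt, 2), round(power, 2))
```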


