INTRODUCTION

Context

The concept of statistical significance is widely employed in medical research, especially in clinical and pharmacological studies, and, at the same time, it is one of the most controversial, debated, and misunderstood topics since its original formulation1-3. In particular, it is often mistakenly believed that statistical testing can provide objective evidence about the real significance of phenomena (e.g. their existence or relevance). On the contrary, such a procedure is based on various hypotheses assumed to be true a priori and on choices conditioned by an ineliminable margin of subjectivity1-8. Although the ambiguous concept of ‘significance’ was discussed by previous authors (e.g. William Sealy Gosset, otherwise known as Student), it was Sir Ronald Fisher who made it particularly famous during the 1920s and 1930s9. After selecting an appropriate investigative methodology and assuming a priori that mere chance is the only phenomenon at play, a researcher can calculate the probability of obtaining an experimental result (the test statistic, e.g. the t of Student’s t-test) as or more extreme than that obtained in the experiment. This probability, referred to as the p-value, corresponds to the expected frequency of the statistical event within an infinite (very large) population of valid applications (i.e. where all background assumptions hold). In this regard, it is important to clarify some fundamental aspects. Firstly, Fisher’s approach involves establishing a series of hypotheses (the so-called statistical model) that are assumed to be perfectly met (true). These include underlying hypotheses (e.g. random sampling, normal distribution, linearity, etc.) and the null hypothesis of no effect or association. Moreover, as stressed by Greenland5, such background assumptions also involve human aspects (e.g. transparency, honesty, collaboration, competence, etc.). Once this is done, supposing a priori that no causal mechanism exists or that we are exclusively in the presence of a set of cofactors behaving randomly, it is a matter of assessing how ‘surprising’ (i.e. ‘statistically significant’) the obtained result is compared to the null hypothesis prediction (previously set as true). Fisher initially suggested, in the 1920s, a heuristic threshold of 5% to consider the outcome as unexpected (significant, p<0.05) or expected (non-significant, p≥0.05)10. This threshold was intended to work reasonably well in most real applications. He subsequently regretted his own proposal, emphasizing that the p-value should be used as a graded measure of the strength of evidence against the null hypothesis2,11.

Egon Pearson and Jerzy Neyman (1933), critics of the idea of statistically evaluating the true significance of a hypothesis (a valid point), proposed instead a novel decision-theoretical approach (rule of behavior)12. The conventional goal, strictly conditional on the same underlying assumptions described above, is to establish two contrasting simple hypotheses: the null hypothesis of an exactly zero effect and the alternative hypothesis of an effect that is not exactly zero. If the experimental test statistic (e.g. Student’s t) is more extreme than a predetermined critical value (e.g. tc=1.96 in a very large sample), then the null hypothesis is arbitrarily rejected in favor of the alternative hypothesis; otherwise, the null hypothesis cannot be rejected (nor, however, accepted, although Neyman and Pearson originally used that word). As specified by Neyman and Pearson themselves, the choice of this threshold is an open problem that, according to the later Neyman, must be grounded in the evaluation of costs, risks, and benefits (as well as the selection of the hypothesis to be examined)13.

Applicability in scientific investigations

As explained by Fisher in 1955, the Neyman-Pearson approach (NP) can be useful in well-defined, limited contexts (e.g. inference regarding the proper functioning of a population of light bulbs produced by a factory), but it is generally not recommendable in the scientific scenario14. In modern terms, Neyman-Pearson inference can be summarized as follows2,3: The critical region (e.g. |z| > z* = 1.96) can be defined in terms of decision p-values (e.g. p < α=0.05). Assuming the process is iterated in numerous equivalent applications (i.e. all background hypotheses are met in each of these), this rule amounts to committing type I errors or ‘false positives’ in α⋅100% of the applications where the null hypothesis holds (sometimes written as α%, i.e. the percentage version of α) and, if the power (1-β)⋅100% is also fixed, type II errors or ‘false negatives’ in β⋅100% of those where the alternative holds. In other words, the p-value is a mere decision-making index devoid of direct scientific meaning (i.e. if α=0.05, p=0.049 and p=0.001 are decisionally equivalent as they lead to the same decision). The so-called statistical confidence is based on the concept of coverage probability: only over numerous equivalent applications will (1-α)⋅100% (e.g. 95%) of the confidence intervals so constructed contain the population parameter. Thus, the first essential aspect is that such a framework never informs decisions on individual studies (e.g. it is incorrect to think that a 95% confidence interval has a 95% probability of containing the true value), since it is mathematically structured to operate only over large numbers of repetitions under ideal conditions2,3,7,11,12. In addition, as evidenced by the (re)current ‘replication crisis’, equivalent conditions cannot be guaranteed in practice due to sources of scientific uncertainty that are not only difficult to model (e.g. researchers’ attention, confounding factors, proper sampling, etc.) but are also often unknown1-7,15-17. This leads to decisions that are inconsistent with the predetermined goal (due to what are sometimes called ‘Type III errors’)16.
According to the recommendations of some of the leading global authorities in the field – including the American Statistical Association – and recent initiatives like the International Committee Against the Misuse of Statistical Significance (ICAMSS), the p-value should therefore be employed in a neo-Fisherian manner1,2,18-24. Specifically, the p-value is a continuous measure of the compatibility of the statistical result with the target hypothesis (e.g. the point null hypothesis of an exactly zero effect), whose interpretability in this sense is conditional on the background assumptions. The notion of ‘compatibility’ – traced back to Karl Pearson25 (father of Egon) in 1900 and indicating the degree of agreement of the data with the target hypothesis as evaluated by the chosen test – is a much more moderate expression than ‘support’ and is not conceived for making terminal decisions. Indeed, supporting a hypothesis means assigning greater plausibility to it compared with others; on the contrary, showing a certain degree of compatibility with a hypothesis does not exclude the presence of other hypotheses that are equally or even more consistent with the data (as evaluated by the chosen statistical model) or the scientific phenomenon. In purely statistical terms, p-values close to 1 indicate high compatibility, while p-values close to 0 indicate low compatibility.
Hence, confidence intervals become compatibility intervals: for instance, a 95% compatibility interval of the form (x, y) contains all hypotheses whose p-value is greater than 0.05, meaning they are more compatible with the data than hypotheses predicting effects ‘x’ and ‘y’ (as conditionally assessed by the statistical test)7,22,26,27.
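
To illustrate this relationship concretely, the short Python sketch below (not taken from the cited works; the point estimate and standard error are invented) checks that, under a normal approximation, the limits of a conventional 95% interval are precisely the hypotheses with p ≈ 0.05, while the point estimate itself has p = 1.

```python
from scipy import stats

estimate, se = 2.0, 1.0  # hypothetical point estimate and its standard error

def p_value(h0: float) -> float:
    """Two-sided p-value for the point hypothesis 'true effect = h0'."""
    z = (estimate - h0) / se
    return 2 * stats.norm.sf(abs(z))

# Conventional 95% interval: estimate +/- 1.96*SE
lo, hi = estimate - 1.96 * se, estimate + 1.96 * se

print(f"95% compatibility interval: ({lo:.2f}, {hi:.2f})")              # (0.04, 3.96)
print(f"p-values at its limits: {p_value(lo):.3f}, {p_value(hi):.3f}")  # both ~0.050
print(f"p-value at the point estimate: {p_value(estimate):.3f}")        # 1.000
print(f"p-value for a hypothesis inside the interval, h0=1: {p_value(1.0):.3f}")  # 0.317 (> 0.05)
```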

Common errors in public health

As extensively documented in the literature, there is a growing need to raise awareness within the medical community about the correct use of the aforementioned frequentist-inferential methods1. In light of the costs and risks linked to investigations in public health, it is essential to provide an overview of the most common errors and seek both short-term and long-term solutions. The first common flawed approach is the so-called null hypothesis significance testing (NHST), where only the point hypothesis of zero effect is considered and evaluated in dichotomous terms of ‘significance’ and ‘non-significance’24,28,29. Even in the utopian scenario where all background assumptions are perfectly met, a large p-value for the null hypothesis only indicates a high degree of compatibility of the latter with the data (as conditionally evaluated by the test) but does not in any way support such a hypothesis over others. An easy counterexample is as follows: Let (1–9) be an 80% compatibility interval associated with the best point estimate of a hazard ratio HR=3. The p-value for the point null hypothesis HR*=1 is thus equal to p=0.20 (as HR=1 is the lower limit of the 80% compatibility interval). Many would wrongly classify this outcome as ‘(statistically) non-significant’ only because the p-value for the null hypothesis is greater than 0.05; however, under the conditions described above, the data have exactly the same statistical compatibility with the decidedly non-null hypothesis HR*=9 (p=0.20, as HR=9 is the other limit of the 80% compatibility interval). But that is not all: the hypothesis most compatible with the data is not HR*=1 but HR*=3 (p=1, since HR=3 is the best point estimate). Thus, conditionally on the background assumptions, we can only conclude that statistical uncertainty is large, not that any significance is absent: indeed, this outcome is highly compatible with hypotheses of both small and large effect7,8.
The second issue, closely related to nullism (exclusive interest in the point null hypothesis), is the lack of distinction between large and small effect sizes. For instance, we could encounter situations where two (or more) hypotheses consistent with a low-magnitude phenomenon (e.g. HR*=1 and HR*=1.2) lead to quite different degrees of compatibility according to the adopted test (e.g. p=0.01 and p=0.05, respectively). If the best point estimate is consonant with a non-negligible effect (e.g. HR=2.4), such a scenario also signals high uncertainty7,8,20.
The third issue is that many authors often mix Neyman and Pearson’s rule of behavior with Fisher’s significance testing, even though these approaches are based on mathematically and epistemologically incompatible formulations3. Therefore, this article discusses a possible approach to mitigate such misunderstandings.
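
The arithmetic behind the hazard-ratio counterexample can be checked directly. The following Python sketch (not part of the cited works) assumes the usual normal approximation on the log hazard-ratio scale and recovers the standard error implied by the 80% interval; all names and numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

hr_hat = 3.0                          # best point estimate
ci80 = (1.0, 9.0)                     # 80% compatibility interval
z80 = stats.norm.ppf(0.90)            # ~1.2816, critical value for an 80% interval
se_log = (np.log(ci80[1]) - np.log(ci80[0])) / (2 * z80)   # implied SE of log(HR)

def p_value(hr_star: float) -> float:
    """Two-sided p-value for the point hypothesis HR = hr_star."""
    z = (np.log(hr_hat) - np.log(hr_star)) / se_log
    return 2 * stats.norm.sf(abs(z))

for hr_star in (1.0, 9.0, 3.0):
    print(f"HR* = {hr_star}: p = {p_value(hr_star):.2f}")
# HR* = 1.0: p = 0.20   (the 'null' is only one of many compatible hypotheses)
# HR* = 9.0: p = 0.20   (a strong effect is exactly as compatible with the data)
# HR* = 3.0: p = 1.00   (the point estimate is the most compatible hypothesis)
```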

METHODOLOGICAL APPROACH

Foundations

Human psychology – and thus all biases and inevitable sources of uncertainty that it carries with it – is an integral component of scientific investigations3,7,30,31. Since its earliest Bayesian formulations, modern statistics has been modeled on human perception, taking into account cognitive and even cultural aspects (e.g. Good 1952)32. In this regard, the reasons behind the vast success of NHST should be sought in university education and in cognitive distortions aimed at oversimplifying complex concepts7,18,24,33. As a remedy, Rafi and Greenland26 propose to explain the ambiguous and unclear concept of ‘statistical significance’ through familiar statistical phenomena such as flipping an unbiased (fair) coin. The so-called ‘surprisal’ (or ‘S-value’) thus represents, conditionally on the background assumptions, the number of consecutive heads one would need to obtain – by flipping an unbiased coin – to match the statistical surprise of the result calculated in the experiment (the test statistic). This approach, subsequently extended to statistical compatibility via surprisal intervals7,8, resolves some thorny issues not only regarding the interpretation of p-values but also their mathematical-statistical utilization. Indeed, the p-value has been widely adopted by neo-Fisherian statisticians as a graded/continuous measure of the refutational evidence against one or more hypotheses11,14,34. As recently demonstrated by Greenland3,5, such an interpretation is legitimate within this framework. However, this ‘divergence’ p-value (even if intended as a mere descriptive indicator of the discrepancy between observed data and the predictions of the statistical model) possesses counterintuitive properties. For instance, the difference in information content between p1=0.05 and p2=0.10 is larger than that between p3=0.95 and p4=1, despite 1 - 0.95 = 0.10 - 0.05 = 0.057. This occurs because the ratio p2/p1 is 2, while the ratio p4/p3 is less than 1.1 (i.e. the probabilities are quite different in the first pair and very similar in the second). If we compare p to the probability of obtaining S consecutive heads by flipping an unbiased coin (a phenomenon of which we have immediate perception), we get S = -log2(p) (from p = 0.5^S). Thus, the S-values in the four preceding cases are, respectively: S1=4.3, S2=3.3, S3=0.07, and S4=0. In other words, the first statistical result is as surprising as about 4 consecutive heads, the second is as surprising as about 3 consecutive heads, and the third and fourth are markedly less surprising than getting a single head when flipping an unbiased coin (compared to the model prediction). By doing so, the difference in information becomes evident (S1 - S2 = 1 bit, while S3 - S4 = 0.07 bits). Nevertheless, the current scenario is consistent with a widespread reluctance towards methodologies that are perceived as too innovative or complex. Accordingly, this study proposes and discusses a graded scale of statistical compatibility whose ranges are based on the information (surprisal) contained within.
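
As a quick illustration of this transformation, the short Python sketch below computes the S-values for the four p-values just mentioned; it is only a numerical restatement of S = -log2(p), not a method taken from the cited literature.

```python
import math

def s_value(p: float) -> float:
    """Surprisal (in bits) associated with a p-value: S = -log2(p)."""
    return -math.log2(p)

for label, p in [("p1", 0.05), ("p2", 0.10), ("p3", 0.95), ("p4", 1.00)]:
    print(f"{label} = {p:.2f}  ->  S = {s_value(p):.2f} bits")
# p1 = 0.05  ->  S = 4.32 bits   (about 4 consecutive heads)
# p2 = 0.10  ->  S = 3.32 bits   (about 3 consecutive heads)
# p3 = 0.95  ->  S = 0.07 bits   (less surprising than a single head)
# p4 = 1.00  ->  S = 0.00 bits
```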

Graded compatibility

Consistently with Fisherian indications, Muff et al.35 recently proposed a graded scale to read p-values as measures of evidence against a hypothesis. Although such an attempt has been subject to criticism, Amrhein and Greenland21 argue that the proposal of Muff et al.35 does ‘more good than harm’ since it contrasts the dichotomous, overconfident interpretation of statistical significance. Nevertheless, the term ‘evidence’ – which could be defined, according to the Oxford English Dictionary, as ‘facts or observations adduced in support of a conclusion or statement’ – transcends the actual epistemological capabilities of the p-value when drawing practical conclusions. Indeed, since a statistical hypothesis (SH) is just our attempt to represent an empirical hypothesis (EH) on a mathematical level, p-values could measure evidence against ‘SH’ but not ‘EH’: it all depends on our ability to select a proper SH based on EH. In this regard, we should also acknowledge that a real phenomenon might be too complex to be well represented by simple statistical hypotheses. For these reasons, a framework to elaborate a graduated scale of mere compatibility is proposed here (Table 1). This scale should be constructed based on two main aspects: 1) the information contained within the range (surprisal), and 2) the predetermined scientific objective. Specifically, the first point aims to calibrate the gradation of the scale according to the degree of surprise (incompatibility) of the results with respect to the fixed hypothesis (which does not necessarily have to be the null hypothesis of no effect), as conditionally assessed by the chosen test. The second point emphasizes that there is no absolute or unique way to evaluate a statistical result and that, as stated by Neyman13 in 1977, even the choice of the statistical hypothesis to investigate must be calibrated to the scientific objective.

Table 1

Compatibility ranges protocol. All the multiple thresholds should be established and published (with a digital object identifier, or DOI) before conducting the experiment

Range of p-values    Compatibility range    Surprisal range
α1 ≤ p ≤ 1           Marked                 Minimal
α2 ≤ p < α1          High                   Weak
α3 ≤ p < α2          Moderate               Marginal
α4 ≤ p < α3          Marginal               Moderate
α5 ≤ p < α4          Weak                   High
0 < p < α5           Minimal                Marked

The goal is to establish various thresholds – thus counteracting the dichotomous view – and, at the same time, to prevent the adoption of incorrect and misleading expressions such as ‘(non) significant’. Some heuristics for setting the various thresholds could be as follows: E1) Each subsequent threshold is half of the previous one (so that the S-value increases by 1 bit of information at each step). For example, α1=0.250, α2=0.125, α3=0.063, α4=0.031, and α5=0.016; and E2) The thresholds are consistent with common ones (e.g. α1=0.20, α2=0.10, α3=0.05, α4=0.01, and α5=0.001).
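
A minimal Python sketch of how these heuristics and the labels of Table 1 could be operationalized is given below; the function and variable names are illustrative and not part of the proposed protocol itself.

```python
import math

# E1 heuristic: each threshold is half of the previous one (one extra bit of surprisal per step)
e1_thresholds = [0.25 / 2**k for k in range(5)]   # 0.25, 0.125, 0.0625, 0.03125, 0.015625
print([round(-math.log2(a), 1) for a in e1_thresholds])   # [2.0, 3.0, 4.0, 5.0, 6.0] bits

# E2 heuristic: conventional thresholds
e2_thresholds = [0.20, 0.10, 0.05, 0.01, 0.001]

# Compatibility labels of Table 1, from highest to lowest p-value range
labels = ["marked", "high", "moderate", "marginal", "weak", "minimal"]

def compatibility(p: float, thresholds=e2_thresholds) -> str:
    """Return the Table 1 compatibility label for a p-value."""
    for alpha, label in zip(thresholds, labels):
        if p >= alpha:
            return label
    return labels[-1]                              # p below the smallest threshold

print(compatibility(0.20))    # 'marked'
print(compatibility(0.049))   # 'marginal'
print(compatibility(0.0005))  # 'minimal'
```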

However, it must be clear that, being a purely descriptive approach, the specification of these ranges aims only to simplify communication and limit overstatements. Moreover, the publication of a specific pre-study protocol lends much more weight – even in the eyes of editors and reviewers – to the issue of threshold selection and the evaluation of statistical compatibility in relation to the research scope.

Compatibility distributions and intervals

The selection of a specific target hypothesis concerning a single-point effect is generally insufficient to properly inform a scientific conclusion, since it does not allow us to evaluate the consistency of the experimental scenario with all relevant hypotheses. Recent literature proposes various ways to address this issue, such as representing the so-called ‘p-distributions’ or ‘S-distributions’ (i.e. ‘compatibility distributions’ and ‘surprisal distributions’, respectively) to observe the (in)compatibility of the data with the set of all possible target hypotheses20,26. In this regard, some authors propose adding pre-study protocols to divide such hypotheses into different groups based on the effect size36. A practical example of application is provided in the Supplementary file. Nonetheless, these modalities of presentation are confined within manuscripts – as they are hardly communicable in introductory or summary sections such as abstracts – and are difficult to implement when dealing with several outcomes within the same study. To solve this problem, a novel convention for reporting multiple compatibility and surprisal intervals can be adopted7,8. According to the E2 protocol (Table 1), we could choose three compatibility intervals associated with the thresholds α1=0.20 (80% CI), α3=0.05 (95% CI), and α4=0.01 (99% CI) as follows: 80|95|99% CI = (a–b|c–d|e–f). For instance, considering a calculated best point estimate of 10, if 80% CI = (6–14), 95% CI = (3–17), and 99% CI = (0–20), we can write 80|95|99% CI = (6–14|3–17|0–20). This tells us that all hypotheses that predict an effect between 6 and 14 are markedly compatible with the data (p>0.20, i.e. S<2.3). At the same time, all hypotheses between 3 and 17 are, at least, moderately compatible with the data (p>0.05, i.e. S<4.3). Finally, all hypotheses between 0 and 20 are, at least, marginally compatible with the data (p>0.01, i.e. S<6.6) or, equivalently, all hypotheses outside this range (i.e. those <0 or >20) are, at most, weakly compatible with the data (p<0.01, i.e. S>6.6).
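
The following Python sketch shows one way this reporting convention might be generated automatically, assuming a normal-theory estimate with a known standard error; the point estimate and standard error are invented for illustration, so the printed limits only approximate the rounded figures used in the example above.

```python
from scipy import stats

estimate, se = 10.0, 3.57                 # hypothetical point estimate and standard error
levels = (0.80, 0.95, 0.99)               # thresholds alpha1=0.20, alpha3=0.05, alpha4=0.01

def interval(level: float) -> tuple:
    """Normal-approximation compatibility interval at the given level."""
    z = stats.norm.ppf(0.5 + level / 2)   # ~1.2816, 1.96, 2.576
    return estimate - z * se, estimate + z * se

parts = [f"{lo:.0f}\u2013{hi:.0f}" for lo, hi in map(interval, levels)]
print("80|95|99% CI = (" + "|".join(parts) + ")")
# -> 80|95|99% CI = (5–15|3–17|1–19) with these assumed inputs
```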

DISCUSSION

We need descriptive approaches to reach causal inference

There are various non-descriptive methods for attempting to solve the problem of testing single-point hypotheses. Among these, the so-called ‘equivalence testing’ involves setting a target hypothesis in the form of a range and then adopting a dichotomous decisional rule of behavior. For example, when dealing with adverse events related to LDL cholesterol levels, a certain research group could define an effect as practically null when the average change falls between -5 and 5 mg/dL (range null hypothesis). However, as shown by Greenland3, this procedure does not escape the criticalities that permeate the standard Neyman-Pearson approach; rather, it introduces additional ones. Firstly, the dichotomous decision ‘rejection versus non-rejection’, to be made in individual studies, becomes strongly conditional on the choice of the initial null range (to be maintained in all studies). On this point, it is particularly complex to establish the width of this interval based on the scientific context and the associated costs, risks, and benefits – which could become clearer only over the course of various experiments – while taking into account biases and financial interests3,5,28,30. Secondly, the dichotomization of hypotheses creates regions of statistical equivalence that do not correspond to regions of scientific equivalence: for instance, an increase of 6 mg/dL is not scientifically equivalent to an increase of 30 mg/dL despite both point hypotheses belonging to the same side (h > 5) of the statistical alternative hypothesis (which takes the form h < -5 or h > 5). Thirdly, in many epidemiological situations such as the COVID-19 crisis, the whole scientific landscape is constantly changing (e.g. the occurrence of viral mutations can substantially alter the duration and symptoms of the disease), thus invalidating the equivalence request (e.g. the risk-benefit ratio can change drastically). This further underscores the necessity of engaging in critical thinking rather than relying on mere numerical criteria.
Additionally, a generally overlooked aspect is that scientific inference requires consistency not only among statistical studies (which should always include randomized experiments to provide causal evidence) but also among extra-statistical evidence (analysis of biochemical mechanisms, clinical and medical observations, etc.). When this multidisciplinary set of reasoned, epistemic evidence converges in the same direction, causal inference can be claimed1,2,7,8,15,17,23,30. However, since we are forced to make dichotomous final decisions, such as approving or rejecting drugs, it is important to establish guidelines that serve as a good compromise when drawing inferences. In valid repetitions, one should expect a therapeutic effect within an optimal pre-defined range in most cases, although the size of this effect must be assessed continuously or, at least, through a graduated scale; multiple ranges of effect (e.g. small, medium, large, etc.) could be defined in order to avoid dichotomization (Supplementary file). The expression ‘valid repetitions’ emphasizes the need to approach equivalence conditions as much as possible (indeed, minimizing sources of uncertainty remains fundamental) without the implausible expectation of achieving them perfectly. Concerning pharmacological development, there may be circumstances where the dose to administer must be lower, the treatment must be implemented for a shorter duration, or the initial clinical conditions are simply different.
In such situations, it is appropriate to recalibrate the descriptive protocol or, if possible, establish a pre-study protocol of protocols encompassing various optimal, graded ranges for the different scenarios. The ultimate goal is to properly inform the so-called ‘non-terminal decisions’ (e.g. these findings are consistent with the treatment’s effectiveness, which justifies further research), which still require a broader clinical assessment (e.g. the absence of severe side effects, sustainable invasiveness, etc.).
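
As a purely hypothetical illustration of such graded effect ranges (the actual protocol is described in the Supplementary file; the cut-offs and names below are invented), a compatibility interval can be read against several pre-defined ranges at once instead of a single equivalence band.

```python
# Hypothetical, pre-registered effect ranges in mg/dL (invented for illustration)
effect_ranges = {
    "practically null": (-5, 5),
    "small": (5, 15),
    "medium": (15, 30),
    "large": (30, float("inf")),
}

def compatible_ranges(ci_low: float, ci_high: float) -> list:
    """List every pre-defined effect range that overlaps the compatibility interval."""
    return [name for name, (lo, hi) in effect_ranges.items()
            if ci_high > lo and ci_low < hi]

# e.g. a 95% compatibility interval of (2, 18) mg/dL overlaps three ranges,
# signalling uncertainty rather than a single 'equivalent/non-equivalent' verdict
print(compatible_ranges(2, 18))   # ['practically null', 'small', 'medium']
```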

CONCLUSIONS

This article discusses the epistemological, scientific, and statistical reasons supporting the descriptive approach over theoretical-decisional frameworks in public health. In particular, the strong dependence of the latter on assumptions that are too often violated, such as the absence of sources of uncertainty – including variability and bias – makes them ill-suited to the medical field (as evidenced by the replication crisis). In this regard, the proposed protocol for the graded assessment of statistical compatibility aims to mitigate overstatement and bias, as well as to avoid the dichotomization of scientific results into ‘significant’ and ‘non-significant’ based on a mere numerical criterion. The adoption of multiple compatibility or surprisal intervals can serve as a compromise between completeness and conciseness. This or similar descriptive methods are recommended for scientific investigations in the soft sciences.