



STATISTICAL RESOURCE 

Year : 2021  Volume : 4  Issue : 4  Page : 756-762

To “P” or not to “P”, that is the question: A narrative review on P value
HS Darling
Department of Medical Oncology and Hemato-Oncology, Command Hospital Air Force, Bengaluru, Karnataka, India
Date of Submission  10-Sep-2021
Date of Decision  17-Nov-2021
Date of Acceptance  13-Dec-2021
Date of Web Publication  29-Dec-2021
Correspondence Address: H S Darling, Department of Medical Oncology and Hemato-Oncology, Command Hospital Air Force, Bengaluru - 560 007, Karnataka, India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/crst.crst_222_21
Best medical practice is thought to be based on evidence. Inferential statistics allow us to establish the strength of evidence in favor of or against a new research finding. The P value has been considered a reliable universal marker indicative of statistical significance of a study, thus driving the majority of practice-changing developments. Of late, the reign of the P value has been increasingly challenged by failure of replication of results in successive studies, necessitating withdrawals of drug approvals. For the purpose of this narrative review, we performed a detailed literature search to identify relevant articles from the PubMed database and Cochrane library. We aimed to evaluate the drawbacks of the utilization of P value in a dichotomous way around a fixed cutoff of 0.05. Our review suggests that the P value must be interpreted as a continuum, with smaller values indicating stronger evidence against the null hypothesis. The possible substitutes for P value are also discussed to enable a rational interpretation of results of new discoveries.
Keywords: Clinical research, trials, P value, statistical significance
How to cite this article: Darling H S. To “P” or not to “P”, that is the question: A narrative review on P value. Cancer Res Stat Treat 2021;4:756-62
Introduction   
All that comes to view is perishable, untrue, mythical, and changeable (Ḏaristimān hai sagal mithenā). This statement by the fifth Sikh Guru, Sri Guru Arjun Dev Ji, made more than 4 centuries ago aptly represents our dwindling state of unshakable trust in the P value.^{[1]} P value is the first and probably the only concept of complex statistics which most of us can understand and remember, since our graduation days. The medical community has based its interpretation of the majority of research studies on P value alone. The concept of interpretation of P < 0.05 as statistically significant and >0.05 as statistically nonsignificant in a dichotomous manner has generally been considered gospel truth. Recently, more and more concerns are being raised within the research community about the utility of P value as the sole invincible parameter of statistical significance. In addition, studies reporting positive outcomes on the basis of P < 0.05 have failed to replicate their findings in subsequent studies, questioning the claims of the previous studies. The P value tells us a lot about the experiment that was performed and not the hypothesis itself. Multiple experiments with different power and variables to test the same hypothesis might lead to wildly different results, with the blame placed on the poor P value. Efforts are on to find a robust substitute or adjustment to the P value, which can provide more valid and sufficiently reproducible results. Nonetheless, despite its limitations, the P value has useful qualities. It is a single number that allows for an objective interpretation of data.
Methods   
A thorough search was performed to identify relevant published literature from various sources, including the PubMed database and Cochrane library. The search and selection process for articles is depicted in [Figure 1]. The above-mentioned resources were searched using the key terms “P value,” “statistical significance,” and “clinical trials.” A total of 8258 articles were identified from the search. After removing 1245 duplicate records, a total of 7013 articles were screened for eligibility for inclusion in this review. Finally, a total of 6689 articles were excluded because the content was not relevant to this review and 34 articles with relevant information and illustrations were included. We have used certain hypothetical examples at various places in the article to explain the concepts numerically, where the complete calculations were beyond the scope of this review.
Figure 1: Search strategy for the articles for the narrative review on controversies related to the P value
The Origin and Conventional Domain of the P Value   
To begin with, let us revisit the null hypothesis and significance testing. Although the P value or probability value is credited to Pearson, the concept of significance testing as a method to test a hypothesis was first developed by Fisher in the year 1925.^{[2]} This model is based on the null hypothesis without reference to any alternative hypothesis. Fisher proposed the use of P value as a quantitative yardstick against the null hypothesis. He recommended that scientists should determine a threshold for the P value which would be sufficient to refute the null hypothesis for a particular study. He termed this as the level of significance (α), which may be 0.05 or 0.01 as decided by the researcher. In 1956, Fisher reiterated that no researcher has a fixed level of significance, using which from year to year and in all circumstances, the null hypothesis can be rejected; rather he thinks about each particular case in light of the evidence and his ideas.^{[3]} The inability to reject the null hypothesis on the basis of observed data does not imply that the null hypothesis is true. Conventionally, 5% became the commonly chosen significance value as a modest way of representing evidence opposing the null hypothesis, which is expected only 5 out of 100 times when there is nil effect. Gradually, it led to the convention of publishing selected results showing P < 0.05, leading to dichotomization of all study results as either success (significant) or failure (nonsignificant). The race to “publish or perish” inadvertently diverted attention away from interpreting the P value as a continuous measure.^{[4]}
P Value and the Alternative Hypothesis   
If we compare two randomly assigned groups, each with 40 patients receiving either therapy or placebo, on a continuous outcome with a common variance of 2, the two-sided P values corresponding to observed differences of 0.6 and 0.8 will be P = 0.058 and P = 0.011, respectively. The greater the difference, the stronger the evidence against the null hypothesis and the lower the P value. Thus, differences of 0.8 or larger have only a 1.14% probability (about 1 in 90) and are not expected if there is nil effect.
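The two P values in this example can be checked with a short calculation. The sketch below assumes a two-sided z test with a known common variance, which the text does not state explicitly; the function name and its defaults are illustrative only.

```python
from math import erfc, sqrt

def two_sided_p(diff, n_per_group=40, variance=2.0):
    """Two-sided P value for a difference in means (z test, known common variance)."""
    # Standard error of the difference between two independent group means.
    se = sqrt(2.0 * variance / n_per_group)
    z = abs(diff) / se
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)), the two-sided tail area.
    return erfc(z / sqrt(2.0))

for d in (0.6, 0.8):
    print(f"difference {d}: P = {two_sided_p(d):.3f}")
```

Under these assumptions the standard error is about 0.316, so the two differences correspond to z statistics of roughly 1.90 and 2.53.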
The P value, as described above, is used only against the null hypothesis. However, in reality, we are rather interested in knowing whether and how strongly the alternative hypothesis is true. The introduction of an alternative hypothesis of interest has generated a new concept to compare it with the null hypothesis.^{[5]} In this method, two types of errors are possible, namely the type I error (with probability α, the significance level), where the null hypothesis is falsely rejected although it is actually true (a false positive), and the type II error (with probability β), where the researcher fails to reject the null hypothesis when it is false (a false negative). A researcher fixes α and the power (1–β) in advance and plans experiments to correctly reject the null hypothesis for an effect size of interest. This allows a researcher to reject either the null or the alternative hypothesis. In the example from the above paragraph, taking α = 0.05 gives 47.6% and 81.3% power to detect the differences of 0.6 and 0.9, respectively.^{[4]}
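The power figures can be approximated in the same way. This is a minimal sketch that again assumes a two-sided z test with known variance; it reproduces the quoted values only to within rounding, and the source calculation may have used a slightly different test.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(effect, n_per_group=40, variance=2.0, alpha=0.05):
    """Approximate power of a two-sided z test to detect a true difference `effect`."""
    norm = NormalDist()
    se = sqrt(2.0 * variance / n_per_group)       # standard error of the difference
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)      # critical value, about 1.96 for alpha = 0.05
    shift = effect / se                           # expected z statistic under the alternative
    # Probability that the z statistic falls in either rejection region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

for effect in (0.6, 0.9):
    print(f"effect {effect}: power = {power_two_sided(effect):.1%}")
```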
Are we Suffering from Replication Crisis?   
We are in the era of evidence-based medicine. Our notion about the efficacy of drugs is driven by the results of clinical trials. However, in reality, many times the drugs are found to behave in a manner different than that demonstrated by a randomized, well-conducted clinical trial. Similarly, many times, we find that there are two trials on the same drug or intervention yielding contradictory results (e.g., nivolumab monotherapy in hepatocellular carcinoma and pembrolizumab for second-line treatment of metastatic small-cell lung cancer).^{[6],[7]} As we dissect deeper into the statistical aspects of these studies, it appears that these studies are vulnerable to statistical manipulation. The inability of many confirmatory studies to produce benefits similar to those seen in the initial studies is termed the replication crisis. Let us explore whether it exists in reality or is a perceived threat.
Logically, the failure to replicate should be a rarity. However, in reality, it is not uncommon to see an exciting and significant initial finding disappearing in subsequent research. A replication crisis is the consequence of excessive optimism. It is the problem with interpretation which leads to false discoveries, partially propelled by the urge to derive positive results in clinical trials. Incorrect interpretation causes failure to replicate results in successive trials. The robustness of scientific evidence is measured by its reproducibility. Assuming that more citations suggest more “interesting” findings, a negative correlation between replicability and citation count indicates a complacent, biased review process.^{[8]}
Root Cause of Replication Crisis   
Siegfried proposed that contradictory research discoveries are not uncommon, because researchers rely on P values to interpret the findings. P value is not the only cause of contradictory or weaker results in subsequent studies. Parameters are not measured using P value; instead, they are measured using the estimator β. This estimator and its standard error extract information about the systematic component of data variation represented by β. A deeper study of this suggests that P values provide diagnostic or warning mechanisms for hypothesis or model problems, and like all such mechanisms, are fallible. The difference in replication might also be due to sample size. Addressing multiple objectives with collected findings leads to unreliable biased results and exaggerates the effect size. Exclusively bringing out positive results based on very small trials increases the difference. Just highlighting P values instead of the actual difference masks the real impact of the intervention. Overlooking the basic demarcation between exploratory and confirmatory studies also accentuates the findings. Expecting the study findings to replicate the earlier studies heralds a replication crisis.^{[4]}
Conventionally, the null hypothesis presumes no effect. However, in reality, some effect is inevitable in case of any intervention, and the magnitude and direction of the effect are what one aims to assess. Phase II trials are based on the belief that the drug has some effect. A different magnitude of effect than expected changes the distribution of the P value. A failed phase III trial actually reflects a smaller than predicted effect, rather than no effect.
Colquhoun demonstrated that the so-called significant P values of just below 0.05 were extremely weak evidence against the null hypothesis, as one-third of these results were false positives. The dichotomous interpretation of “significant” or “not significant” is particularly harmful for many reasons, the most pertinent being that this approach encourages failed replication. Studies are often planned with 80% power, i.e., an 80% chance of detecting an effect if it truly exists. If there is a true difference between two arms, the probability of two similar studies with 80% power both giving P < 0.05 is at best 80% × 80% = 64%, whereas the probability of one of these studies giving P < 0.05 and the other not is 2 × 80% × 20% = 32%. This shows that a P value alone is insufficient evidence against the null hypothesis, and thus, it should be considered as providing only loose, first-pass evidence about the phenomenon being studied.^{[9]}
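The arithmetic behind the two-study example can be laid out explicitly. The sketch assumes that the two studies are independent and that each has exactly 80% power for the true effect; the variable names are illustrative.

```python
power = 0.80  # assumed probability that a single study yields P < 0.05

both_significant = power * power               # both studies reach P < 0.05
exactly_one = 2 * power * (1 - power)          # one study significant, the other not
neither = (1 - power) ** 2                     # both studies miss the effect

print(f"both significant:      {both_significant:.0%}")
print(f"exactly one:           {exactly_one:.0%}")
print(f"neither significant:   {neither:.0%}")
```

The three outcomes are exhaustive, so their probabilities sum to 1. Roughly one pair of adequately powered studies in three would therefore appear to "conflict" under a dichotomous reading, even though the effect is real.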
The idea behind exploratory trials is to generate hypotheses, which confirmatory studies then try to answer scientifically. In the early phases of clinical trials, P value variations depend on investigator sampling intentions. Post-randomization reallocations/exclusions also have a part to play. More false positive inferences in the exploratory research translate into more false discoveries, whereas more false negative inferences may miss valuable discoveries.^{[10]} The false discovery proportion is the proportion of false positive study results among all affirmative inferences.^{[11]} Reporting only positive findings, low power, and addressing multiple research questions simultaneously all increase the false discovery proportion. A relatively stringent threshold (e.g., P < 0.001) might look enticing; however, it will mandate a larger study population to achieve similar power. False discoveries are an issue in all types and phases of drug development. One study surveyed 1738 projects to evaluate the probability of a clinical trial progressing to the next level at various stages. The estimated probabilities of success were 71%, 45%, 64%, and 93% for progressing from phase I to phase II, phase II to phase III, phase III to application, and application to drug approval, respectively. The overall success rate for any drug research project from phase I to approval was 19% for the whole program. Similar results were reported by other studies too.^{[12]}
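The overall success rate follows from multiplying the stage-wise transition probabilities. The short sketch below only restates the figures quoted above; the dictionary keys are illustrative labels, not terms from the cited survey.

```python
from math import prod

# Stage-wise probabilities of progressing to the next step, as quoted in the text.
transition = {
    "phase I to phase II": 0.71,
    "phase II to phase III": 0.45,
    "phase III to application": 0.64,
    "application to approval": 0.93,
}

overall = prod(transition.values())
print(f"overall probability of approval from phase I: {overall:.0%}")
```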
Challenges Around P Value's Popularity   
A bothersome issue with P value is whether the strength of evidence dictated by it is reliable or not. P values, confidence intervals, and other statistical measures all have their place, but it is time to bid adieu to dichotomous interpretation and manipulation of statistical significance. One reason to avoid “dichotomania” is that all statistics, including P values and confidence intervals, vary remarkably from study to study. In fact, random data variation alone can produce widely differing P values. Even two perfect replication studies of a real effect, each with 80% power and α = 0.05, may yield one P < 0.01 and the other P > 0.30.^{[13]}
In fact, the P value speaks only to the compatibility of the data with the null hypothesis, but more often than not, we are actually interested in more information about the alternative hypothesis. If the P value is high and does not allow for rejecting the null hypothesis, we are left with an “open verdict.” Conversely, with a sufficiently large sample size, the null hypothesis can almost inevitably be rejected.^{[14]}
Should the P Value be Abandoned?   
Despite the mounting controversies, there is no consensus whether the use of P value should be continued or eliminated altogether. To deal with the fear of failure to replicate, various international journals have recently published several expert articles. Of late, the American Statistical Association (ASA) brought out more than 40 publications about the drawbacks of using the P value in research methods. The ASA editorial recommended that researchers should completely do away with the term “statistically significant”, although the writeups did not have similar recommendations.^{[15]} In contrast, the New England Journal of Medicine suggested that despite such concerns, the P value continues to have an important role, and significance tests must not be discarded completely.^{[16]} A Clinical Trials editorial cautioned that significance testing should still remain in place.^{[17]}
How to Correctly Interpret the Strength of Evidence from the P Value   
Judging the evidence requires a comprehensive analysis of the entire data depending on the study design, rather than looking at just the P value. For continuous data, the P value is uniformly distributed under the null hypothesis. Under the alternative hypothesis, its distribution is highly skewed, depending on the sizes of both the effect and the sample.^{[18]} Hence, for large sample sizes, the logarithm of the P value is taken, which is approximately normally distributed. The P value decreases with an increase in the effect size, and it is much smaller than 0.05 in a sufficiently powered study. Thus, the expected P value in such a study is 0.001 at 90% power, making a P value of 0.05 rather negative evidence of a real difference. P values are best compared on a log scale. For example, evidence depicted by P = 0.03 is not twice as much as that depicted by P = 0.06. In fact, P = 0.03 depicts twofold evidence as compared to P = 0.30.^{[19]}
We can also quantify the robustness of the P value as a measure of the strength of evidence using the reproducibility probability. It is the probability that a subsequent study of the same design will again produce a statistically significant result. It is derived by calculating the power of the next trial, taking the effect observed in the previous trial as the true effect. For example, the calculated reproducibility probability with a starting P_{obs} = 0.05 estimates only a 50% probability that P_{new} < 0.05. In contrast, the calculated reproducibility probability with a starting P_{obs} = 0.001 estimates a 90% probability that P_{new} < 0.05, demonstrating that P = 0.001 is sufficiently better evidence.^{[19],[20]}
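One simple way to reproduce these two figures is sketched below. It assumes a z test, a replication with the same design and sample size, and that the true effect equals the effect observed in the first study; only significance in the same direction is counted. These are assumptions of the sketch, not details given in the text.

```python
from statistics import NormalDist

def reproducibility_probability(p_obs, alpha=0.05):
    """Estimated probability that an identical replication gives P < alpha (two-sided)."""
    norm = NormalDist()
    z_obs = norm.inv_cdf(1.0 - p_obs / 2.0)    # z statistic implied by the observed P value
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)   # critical value for the replication
    # The replication z statistic is taken as normal with mean z_obs and unit variance.
    return norm.cdf(z_obs - z_crit)

for p_obs in (0.05, 0.001):
    print(f"P_obs = {p_obs}: reproducibility probability = "
          f"{reproducibility_probability(p_obs):.0%}")
```

When P_{obs} equals α, the observed z statistic sits exactly on the critical value, so a replication is as likely to fall below it as above it, which is where the 50% comes from.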
The strength of the P value correlates with the magnitude of the effect. For example, if n = 40 and variance = 2, with 90% power to find the effect size of 1, about 80% of the P values are <0.01, 50% are <0.001, and 30% are <0.0001. With the same study methods at 45% power to find the effect size of 0.6, only 30% of the P values are <0.01 and 10% are <0.001.^{[4]}
We can also look at the P value according to the range of desired effects it represents. Just as P < 0.05 serves to exclude a nil effect, a P value threshold (P*) for excluding effects smaller than a meaningful difference (δ*) can be calculated. The null hypothesis can be rejected if P < P*, or equivalently when the lower bound of the 95% confidence interval is more than the desired difference. For instance, if n = 40 and variance = 2, with the lowest desirable effect of 1, as expected, excluding a nil effect with 95% confidence requires the P value to be less than P* = 0.05. For δ* = 0.42 with 95% confidence, a P value below P* = 0.001 will be obtained only 53% of the time. For an effect of 0.61, a P < 0.0001 will be obtained only 29% of the time. For a difference of 0.78, a P < 0.00001 will be obtained only 13% of the time. Finally, for a difference of 1, a P < 0.0000001 will be obtained only 2% of the time.^{[21]}
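The link between a desired difference δ* and its threshold P* can be sketched as follows, assuming a z test with known variance and the same hypothetical design (n = 40 per group, variance = 2). The lower 95% confidence bound exceeds δ* when the observed difference is at least δ* plus 1.96 standard errors, and P* is the ordinary two-sided P value of that observed difference. The function name is illustrative, and the sketch covers only the thresholds, not the percentages quoted above.

```python
from math import erfc, sqrt

def p_star(delta_star, n_per_group=40, variance=2.0, z_conf=1.959964):
    """P value threshold below which the lower 95% confidence bound exceeds delta_star."""
    se = sqrt(2.0 * variance / n_per_group)   # standard error of the difference
    # Smallest z statistic whose 95% confidence interval lies entirely above delta_star.
    z_needed = delta_star / se + z_conf
    return erfc(z_needed / sqrt(2.0))         # two-sided P value for that z statistic

for delta in (0.0, 0.42, 0.61, 0.78):
    print(f"delta* = {delta:.2f}: P* = {p_star(delta):.1e}")
```

Under these assumptions the first four thresholds come out close to 0.05, 0.001, 0.0001, and 0.00001, respectively.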
The association between the obtained P value and the ratio of the obtained effect size to the desired effect size (R) is also an indicator of the strength of evidence. In a study with 88% power to detect a desired effect size at an α of 0.05 (two-tailed), R will be 1 if P = 0.001, 0.6 if P = 0.05, 0.5 if P = 0.10, and 0.4 if P = 0.20. Borderline results in a sufficiently powered study thus indicate a smaller than expected effect magnitude.^{[4]}
Early phases of clinical trials often evaluate several outcomes with several intervention groups, studying many participant subsets, leading to multiple statistical derivations. The potential inflation of the type I error when multiple research questions are tested is called multiplicity. Selective inference is the questionable custom of deciding the study objective depending on the most interesting results after the completion of the study. These practices augment the chances of the replication crisis. The familywise error rate is the probability of at least one type I error occurring in a single study. A clinical trial testing 10 independent hypotheses, each at α = 0.05, will have a familywise error rate of 40%. The chances of incorrect results increase, especially in underpowered studies. As a consequence of selective inference, replication as well as anticipated effects are attenuated.^{[22]}
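The 40% figure follows from the complement rule if the 10 tests are assumed to be independent and each is run at α = 0.05 (assumptions of this sketch). The Bonferroni adjustment shown alongside is one standard multiplicity correction, included here only for illustration; it is not a method named in the text.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one type I error across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

n_tests = 10
unadjusted = familywise_error_rate(n_tests)                     # each test at 0.05
bonferroni = familywise_error_rate(n_tests, alpha=0.05 / n_tests)  # each test at 0.005

print(f"unadjusted familywise error rate: {unadjusted:.1%}")
print(f"with Bonferroni-adjusted alpha:   {bonferroni:.1%}")
```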
The self-driven selection bias of publishing only the conventionally validated statistically significant results based on a fixed P value boundary is publication bias. It is seen that predicting the effect size using only supposedly positive published studies can surprisingly overestimate the true effect size and lead to an underpowered subsequent study.^{[23]} To circumvent this fallacy, the International Committee of Medical Journal Editors now mandates clinical trial registration in a centralized registry prior to the recruitment of the first participant to be eligible for publication. The researchers must disclose the statistical plan, research protocols, and participant details to enhance the transparency of research findings.^{[24]} In poorly designed and conducted studies, contradictory results are not uncommon.^{[25]}
Exploratory research is a journey of an idea to a scientific question that needs confirmation in a more rigorous study. Exploratory studies of drug development focus on establishing any measurable effect, predictive or prognostic biomarkers, and any adverse outcomes along with defining the drug efficacy. Conversely, the study protocols in confirmatory trials are preplanned and depend on the findings of prior exploratory studies. The sanctity and workability of medical research require exploratory studies to be clearly demarcated from confirmatory trials.^{[4]}
The usefulness of the P value can be augmented by additional information about its uncertainty. A P value prediction interval for the undertaken study or a future replication study is a way to achieve this. This can be accomplished using a simple online calculator for both types of studies. For example, if P = 0.01, it will have a 95% prediction interval of 5.72 × 10^{-6} to 0.54. Similarly, if P = 0.0001, the 95% prediction interval is 0–0.05. In this second scenario, the 95% prediction interval of a future replication study is 0–0.26. The width of the prediction interval is surprisingly large as compared to the naked single P value reported to great precision.^{[26]}
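The intervals quoted here can be approximated with a normal model for the z statistic. The sketch below is our own simplification and is not the cited online calculator: it assumes a two-sided z test, places an interval of ±1.96 around the observed z statistic for the undertaken study, and widens it by a factor of √2 for an independent replication of the same size.

```python
from math import erfc, sqrt
from statistics import NormalDist

def two_sided_p_from_z(z):
    """Two-sided P value corresponding to a z statistic."""
    return erfc(abs(z) / sqrt(2.0))

def p_value_prediction_interval(p_obs, replication=False, level=0.95):
    """Approximate (lower, upper) prediction interval for a two-sided P value."""
    norm = NormalDist()
    z_obs = norm.inv_cdf(1.0 - p_obs / 2.0)        # z statistic implied by p_obs
    half_width = norm.inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95% interval
    if replication:
        half_width *= sqrt(2.0)                    # two independent sources of sampling error
    z_low, z_high = z_obs - half_width, z_obs + half_width
    # If the z interval reaches zero, the P value can be as large as 1.
    p_upper = 1.0 if z_low <= 0.0 else two_sided_p_from_z(z_low)
    p_lower = two_sided_p_from_z(z_high)
    return p_lower, p_upper

for p_obs, rep in ((0.01, False), (0.0001, False), (0.0001, True)):
    low, high = p_value_prediction_interval(p_obs, replication=rep)
    label = "replication" if rep else "same study"
    print(f"P = {p_obs} ({label}): {low:.1e} to {high:.2f}")
```

Under these assumptions the upper bounds come out near 0.54, 0.05, and 0.26, and the lower bound for P = 0.01 is of the order of 10^{-6}, which illustrates how wide the interval is relative to the single reported value.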
Remedies   
A new drug approval should ideally be based on confirmatory evidence in the form of two successful, well-designed studies, each with two-tailed P < 0.05. Choosing 0.05 as a P value threshold is simply the “conventional” approach.^{[27]} Requiring two such studies ensures that the false approval rate is less than 0.00125. The second study may have a different design or population. A solitary research study demonstrating P < 0.00125 or at least P = 0.001 may also lead to regulatory approval.^{[28]}
Various solutions have been proposed to resolve the P value dilemma. Some of these include redefining statistical significance, removing statistical significance, justifying the level of significance, or using the Bayes factor as an alternative. The Bayes factor is a Bayesian alternative to the P value as a measure of the strength of the data. It measures the evidence in the collected data against the null hypothesis and in favor of the alternative hypothesis, and is calculated as the ratio of the posterior odds to the prior odds. If the null and alternative hypotheses are equally likely prior to the study, the Bayes factor is the same as the posterior odds against the null hypothesis and favoring the alternative hypothesis.^{[29]} A Bayes factor of 100 indicates 100:1 odds against the null hypothesis. P values should be reported along with the relevant Bayes factor (upper) bound (BFB). The latter represents the highest odds against the null hypothesis that are compatible with the observed P value. For example, for P = 0.005, the BFB against the null hypothesis is 13.9:1, whereas for P = 0.05 it is 2.45:1.^{[30]}
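The two BFB values quoted above are consistent with the commonly used bound BFB = 1/(−e · P · ln P), which holds for P < 1/e. The sketch below assumes that this is the formula intended; the function name is illustrative.

```python
from math import e, log

def bayes_factor_bound(p):
    """Upper bound on the Bayes factor against the null for a given P value."""
    if not 0.0 < p < 1.0 / e:
        raise ValueError("the bound is defined only for 0 < p < 1/e")
    return 1.0 / (-e * p * log(p))

for p in (0.05, 0.005):
    print(f"P = {p}: BFB = {bayes_factor_bound(p):.2f}:1")
```

Seen this way, a result that is just "significant" at P = 0.05 corresponds to odds of at most about 2.5:1 against the null hypothesis, which is far weaker than the 19:1 that a naive reading of "0.05" might suggest.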
The estimated false positive risk is the probability of a significant P value to falsely reject the null hypothesis. It can be easily estimated from a simple Bayesian framework. For example, if you choose 80% power for your study and presume 30% chance of an effect, then at P = 0.05, the estimated false positive risk will be 13%.^{[14]}
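The 13% figure can be reproduced with Bayes' rule applied to the outcomes "significant and no true effect" versus "significant and true effect". The sketch treats the result dichotomously (P ≤ 0.05) and takes the 30% chance of an effect as the prior probability; both are assumptions of this illustration.

```python
def false_positive_risk(alpha, power, prior):
    """Probability that a significant result is a false positive."""
    false_positives = alpha * (1.0 - prior)   # no real effect, yet P <= alpha
    true_positives = power * prior            # real effect, and it is detected
    return false_positives / (false_positives + true_positives)

risk = false_positive_risk(alpha=0.05, power=0.80, prior=0.30)
print(f"estimated false positive risk: {risk:.0%}")
```

The same function shows how quickly the risk grows as the prior chance of a real effect falls: with a prior of 10% instead of 30%, the risk rises to about 36%.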
To overcome publication bias, various methods have been proposed. For instance, a shrinkage method involves discounting 10% of the intervention effect from the phase II study while designing a phase III study.^{[31]} Another recommendation is using the lower bound of the observed 95% confidence interval from the phase II study when designing a phase III study.^{[32]}
A more stringent threshold may limit the false discovery rate in exploratory research. For example, with α = 0.005 and 80% power, a study with a 10% prior probability of a true effect will have a false discovery rate of 5%, but will mandate a 70% increase in the sample size, incurring a bigger financial burden. A wiser way of using the P value is to report it along with the 95% confidence interval and the degree of uncertainty. Lowering the threshold to P < 0.001 strengthens the evidence.^{[33]} Meta-analyses yield much narrower confidence intervals than individual studies. However, they suffer from the “file drawer phenomenon,” where nonsignificant results are not published.^{[14]}
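The 70% increase in sample size can be checked with the usual normal-approximation formula, in which the required sample size is proportional to (z_{1−α/2} + z_{1−β})². The sketch assumes a two-sided test and the same effect size and power before and after the change in α; the false discovery rate line assumes that 10% of tested hypotheses are true effects.

```python
from statistics import NormalDist

def sample_size_ratio(alpha_new, alpha_old=0.05, power=0.80):
    """Ratio of required sample sizes when alpha changes and power is held fixed."""
    norm = NormalDist()
    z_beta = norm.inv_cdf(power)
    z_new = norm.inv_cdf(1.0 - alpha_new / 2.0)
    z_old = norm.inv_cdf(1.0 - alpha_old / 2.0)
    # Sample size scales with the square of (z_alpha + z_beta) for a fixed effect size.
    return ((z_new + z_beta) / (z_old + z_beta)) ** 2

ratio = sample_size_ratio(alpha_new=0.005)

# False discovery rate at alpha = 0.005, 80% power, 10% of hypotheses truly non-null.
alpha, power, prior = 0.005, 0.80, 0.10
fdr = alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)

print(f"sample size increase: {ratio - 1:.0%}")
print(f"false discovery rate: {fdr:.1%}")
```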
Certain norms are proposed to make the best use of P values to measure the strength of the evidence. Rather than dichotomously fixing the P = 0.05, one of the other measures as discussed above should be adopted to convey rational information. These may include excluding the scope of nonnull effects, including the range of meaningful benefits, applying the appropriate adjustments for multiplicity and selective inference, discounting the publication and selection bias and low power, interpreting the P value on a log scale, and differentiating between exploratory and confirmatory research.^{[34]}
Conclusion   
We cannot expect each strong result of an exciting trial to be replicable in subsequent studies. Of late, the dependency on a binary interpretation of the P value has been implicated as a predominant cause of replication failure. A guideline for updating the essential statistics curriculum for trainees and scientists to focus intentionally on the new norms and customs of utilization of the P value in scientific research is the need of the hour. Scientific inferences and users' norms should not be victims of an arbitrary cutoff of the P value.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  
2.  Fisher RA. Statistical Methods for Research Workers. London: Oliver and Boyd; 1925. 
3.  Fisher RA. Statistical Methods and Scientific Inference. Edinburgh: Oliver & Boyd; 1956. 
4.  Gibson EW. The role of P values in judging the strength of evidence and realistic replication expectations. Stat Biopharm Res 2021;13:6-18.
5.  Neyman J, Pearson ES. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika 1928;20A:175-240.
6.  Yau T, Park JW, Finn RS, Cheng AL, Mathurin P, Edeline J, et al. CheckMate 459: A randomized, multicenter phase III study of nivolumab (NIVO) vs. sorafenib (SOR) as first-line (1L) treatment in patients (pts) with advanced hepatocellular carcinoma (aHCC). Ann Oncol 2019;30:874-5.
7.  Rudin CM, Awad MM, Navarro A, Gottfried M, Peters S, Csőszi T, et al. Pembrolizumab or placebo plus etoposide and platinum as first-line therapy for extensive-stage small-cell lung cancer: Randomized, double-blind, phase III KEYNOTE-604 study. J Clin Oncol 2020;38:2369-79.
8.  Serra-Garcia M, Gneezy U. Nonreplicable publications are cited more than replicable ones. Sci Adv 2021;7:eabd1705.
9.  Wasserstein RL, Lazar NA. The ASA statement on p values: Context, process, and purpose. Am Stat 2016;70:129-33.
10.  de Groot AD. The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas]. 1969. Acta Psychol (Amst) 2014;148:188-94.
11.  Staquet MJ, Rozencweig M, Von Hoff DD, Muggia FM. The delta and epsilon errors in the assessment of cancer clinical trials. Cancer Treat Rep 1979;63:1917-21.
12.  DiMasi JA, Feldman L, Seckler A, Wilson A. Trends in risks associated with new drug development: Success rates for investigational drugs. Clin Pharmacol Ther 2010;87:272-7.
13.  Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature 2019;567:305-7.
14.  Halsey LG. The reign of the p value is over: What alternative analyses could we employ to fill the power vacuum? Biol Lett 2019;15:20190174.
15.  Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p < 0.05”. Am Stat 2019;73:1-19.
16.  Harrington D, D'Agostino RB Sr., Gatsonis C, Hogan JW, Hunter DJ, Normand ST, et al. New guidelines for statistical reporting in the journal. N Engl J Med 2019;381:285-6.
17.  Cook JA, Fergusson DA, Ford I, Gonen M, Kimmelman J, Korn EL, et al. There is still a place for significance testing in clinical trials. Clin Trials 2019;16:223-4.
18.  Lambert D, Hall WJ. Asymptotic lognormality of p values. Ann Stat 1982;10:44-64.
19.  Hung HM, O'Neill RT, Bauer P, Köhne K. The behavior of the p value when the alternative hypothesis is true. Biometrics 1997;53:11-22.
20.  Goodman SN. A comment on replication, p values and evidence. Stat Med 1992;11:875-9.
21.  Betensky RA. The p value requires context, not a threshold. Am Stat 2019;73:115-7.
22.  Bretz F, Westfall PH. Multiplicity and replicability: Two sides of the same coin. Pharm Stat 2014;13:343-4.
23.  Lane DM, Dunlap WP. Estimating effect size: Bias resulting from the significance criterion in editorial decisions. Br J Math Stat Psychol 1978;31:107-12.
24.  Rockhold F, Bromley C, Wagner EK, Buyse M. Open science: The open clinical trials data journey. Clin Trials 2019;16:539-46.
25.  Gelman A, Carlin J. Beyond power calculations: Assessing type S (Sign) and type M (Magnitude) errors. Perspect Psychol Sci 2014;9:641-51.
26.  Lazzeroni LC, Lu Y, Belitskaya-Lévy I. Solutions for quantifying p value uncertainty and replication power. Nat Methods 2016;13:107-8.
27.  Kennedy-Shaffer L. When the alpha is the omega: p values, “substantial evidence,” and the 0.05 standard at FDA. Food Drug Law J 2017;72:595-635.
28.  Temple J, Wößmann L. Dualism and cross-country growth regressions. J Econ Growth 2006;11:187-228.
29.  Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc 1995;90:773-95.
30.  Benjamin DJ, Berger JO. Three recommendations for improving the use of p values. Am Stat 2019;73:186-91.
31.  Chuang-Stein C, Kirby S. The shrinking or disappearing observed treatment effect. Pharm Stat 2014;13:277-80.
32.  Hung HM, Wang SJ, O'Neill RT. Methodological issues with adaptation of clinical trial design. Pharm Stat 2006;5:99-107.
33.  Pocock SJ, McMurray JJ, Collier TJ. Making sense of statistics in clinical trial reports: Part 1 of a 4-part series on statistics for clinical trials. J Am Coll Cardiol 2015;66:2536-49.
34.  Boos DD, Stefanski LA. p value precision and reproducibility. Am Stat 2011;65:213-21.
