Misinterpretations of the ‘p value’: a brief primer for academic sports medicine
Steven D Stovitz (1), Evert Verhagen (2), Ian Shrier (3)

  1. Department of Family Medicine and Community Health, University of Minnesota, Minneapolis, Minnesota, USA
  2. Department of Public and Occupational Health Amsterdam, VU University Medical Center, Amsterdam, The Netherlands
  3. Centre for Clinical Epidemiology, Lady Davis Institute, Jewish General Hospital, McGill University, Montreal, Canada

Correspondence to Steven D Stovitz, Department of Family Medicine and Community Health, University of Minnesota, 420 Delaware Street SE, MMC 381, Minneapolis, MN 55455, USA; stovitz{at}umn.edu

Introduction

When comparing treatment groups, the p value is a statistical measure that summarises the probability (hence ‘p’) of obtaining the observed result, or a more extreme one, if the treatment is ineffective (ie, under the assumption of the ‘null’ hypothesis). The p value does not tell us the probability that the null hypothesis is true.1 This editorial discusses how some common misinterpretations of the p value may impact sports medicine research. Although presented from a treatment standpoint, the same principles hold for causes or prevention.

Probabilities do not translate into yes or no decisions

p Values are probabilities, yet they are often interpreted against a categorical cut-off, generally 0.05 (ie, 5%). Anything below the cut-off is considered a ‘statistically significant difference’, and anything above it is not. However, one would not change a decision to buy a lottery ticket if the chance of winning was 4.9% (p=0.049; footnote i) instead of 5.1% (p=0.051). Consider a study in which 100 participants given an injury prevention programme had six injuries, and 100 participants in the control group had 13 injuries (p=0.091). If the prevention group had one fewer injury (ie, 5/100 injuries), the result would become statistically significant (p=0.048). Is it appropriate to conclude the treatment is ineffective when a single injury changes the results from non-significant to significant? Using a cut-off converts a continuous variable (the probability) into a categorical variable (yes/no). Presenting actual p values rather than simply >0.05 or <0.05 may help readers understand that any cut-off is arbitrary.
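The injury example can be checked with a short sketch. It assumes the two-tailed χ² test from the footnote was run on the 2×2 table without a continuity correction; the article does not say which variant was used, and the function name is illustrative only. Under that assumption the two p values round to the figures quoted above.

```python
import math


def chi_square_p_value(events_a, n_a, events_b, n_b):
    """Two-tailed p value from an uncorrected chi-square test on a 2x2 table.

    Assumption: no continuity correction is applied.
    """
    a, b = events_a, n_a - events_a  # group A: injured, not injured
    c, d = events_b, n_b - events_b  # group B: injured, not injured
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


p_six = chi_square_p_value(6, 100, 13, 100)   # six injuries v 13
p_five = chi_square_p_value(5, 100, 13, 100)  # one fewer injury
print(f"6/100 v 13/100: p = {p_six:.3f}")
print(f"5/100 v 13/100: p = {p_five:.3f}")
```

A single injury moves the p value across the 0.05 line, although the underlying evidence has barely changed.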

Is the p value clinically meaningful?

With a sufficient number of participants, even a small difference between groups may result in p<0.05. Consider an exercise programme for weight loss that lowered the mean body weight of participants from 110 kg (SD=5 kg) to 109.5 kg. The expected p value with 100 participants is 0.43 (footnote ii), with 1000 participants it is 0.10, and with 10 000 participants it is 5×10⁻⁷. Although ‘statistically significant’ with 10 000 participants, a decrease of 0.5 kg is not clinically meaningful in terms of reduction in obesity-related health problems. Exactly what difference is clinically meaningful depends on the context of the particular problem, not on the p value, which reflects both the sample size and the size of the group difference.
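The dependence of the p value on sample size can be illustrated with a rough sketch. The article does not spell out the design behind its figures, so the code below makes its own assumptions: two equal groups, a common SD of 5 kg, a fixed 0.5 kg difference, and a normal approximation in place of the t distribution. Its output is therefore not expected to reproduce the quoted values exactly; the point is only the direction of the change.

```python
import math


def two_group_p_value(mean_diff, sd, n_per_group):
    """Two-tailed p value for a difference in means (normal approximation).

    Assumptions: two independent groups of equal size with a common SD.
    """
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_diff) / standard_error
    # Two-tailed tail area of the standard normal distribution.
    return math.erfc(z / math.sqrt(2))


# The same 0.5 kg difference at three hypothetical group sizes.
p_by_size = {n: two_group_p_value(0.5, 5.0, n) for n in (50, 500, 5000)}
for n, p in p_by_size.items():
    print(f"{n} per group: p = {p:.2g}")
```

The difference between the groups never changes, yet the p value falls steadily as participants are added.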

The power of chance

Sports medicine researchers often investigate multiple outcomes such as pain, weakness and function. If p<0.05 for any one outcome, it is often considered a statistically significant difference. Readers may conclude that the treatment was the cause of the difference in the outcome, assuming that the difference was unlikely to be due to chance. This is mathematically incorrect. Consider a 20-sided die with its sides numbered 1–20. The probability of obtaining a ‘1’ with one roll is 5%. This is equivalent to our p value of 0.05 for one outcome. The probability of obtaining a ‘1’ at least once in five rolls (equivalent to five independent outcomes) is 1 − 0.95⁵, or 22.6%. Similarly, in a study with a treatment that is ineffective for all five outcomes, one would expect to see a p value <0.05 for at least one outcome ∼22.6% of the time.
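The die arithmetic is a one-line calculation. The sketch below assumes the outcomes are independent, as the rolls of a die are; outcomes measured on the same participants are often correlated, in which case the true figure would differ. The function name is illustrative.

```python
def chance_of_any_false_positive(alpha, n_outcomes):
    """Probability that at least one of n independent tests gives p < alpha
    when the treatment is ineffective for every outcome."""
    return 1 - (1 - alpha) ** n_outcomes


five_outcomes = chance_of_any_false_positive(0.05, 5)
print(f"Five outcomes: {five_outcomes:.1%}")
```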

The problem of treating a low p value as evidence of a treatment effect when a study evaluates multiple outcomes was highlighted when a science journalist fooled major media outlets with a study suggesting that eating chocolate helps people lose weight.2 The group studied 18 different outcomes, which means there was a 60% probability (1 − 0.95¹⁸) of obtaining p<0.05 for at least one of the outcomes by chance alone. In this study, the ‘significant’ outcome happened to be weight loss. If the study were repeated, a different outcome would likely be the one with p<0.05. Replication research is essential so that practitioners are not fooled into thinking that differences are due to treatment when they may, in fact, be due to chance.3
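The 60% figure can also be approximated by simulation. The sketch below is not a reconstruction of the chocolate study. It assumes 18 independent outcomes and an ineffective treatment, so that each p value behaves like a random draw between 0 and 1. The function name, seed and number of simulated studies are arbitrary choices.

```python
import random


def simulate_null_studies(n_outcomes, n_studies, alpha=0.05, seed=0):
    """Share of simulated studies with at least one p < alpha by chance.

    Assumption: outcomes are independent and the treatment has no effect,
    so each p value is drawn uniformly from the interval [0, 1).
    """
    rng = random.Random(seed)
    studies_with_a_hit = 0
    for _ in range(n_studies):
        p_values = [rng.random() for _ in range(n_outcomes)]
        if min(p_values) < alpha:
            studies_with_a_hit += 1
    return studies_with_a_hit / n_studies


share = simulate_null_studies(n_outcomes=18, n_studies=20000)
print(f"Studies with at least one 'significant' outcome: {share:.0%}")
```

Which of the 18 outcomes falls below 0.05 varies from one simulated study to the next, which is why a repeat of the study would probably highlight a different outcome.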

Summary and recommendations

In summary, the p value is a probability calculated under the assumption that a treatment is ineffective. Sports medicine practitioners must understand that translating this probability into a categorical decision is questionable. Certainly, a value above 0.05 does not mean that the groups are the same, and a value below 0.05 does not mean that the treatment difference is meaningful, or even that the groups are different. The problems associated with misinterpreting p values, as outlined above, may be decreased if authors present the actual estimated differences with confidence intervals.4–6 Although beyond the scope of this article, confidence intervals provide readers with an estimated range of true differences that are likely compatible with the observed results and provide much more information than the p value.4–6 There is no consensus on the best way to account for multiple comparisons. Authors should nevertheless either use some type of statistical correction (and/or alter the p value threshold for rejecting the null hypothesis) or acknowledge the issue of multiple comparisons as a limitation of the study analysis.
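Both recommendations can be sketched briefly. The code below is an illustration under stated assumptions, not the authors' method. It uses a simple Wald interval for the difference in injury risk from the earlier example (6/100 v 13/100). It also uses the Bonferroni rule as one common correction among several, bearing in mind that there is no consensus on the best one. The function names are illustrative.

```python
import math
from statistics import NormalDist


def risk_difference_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Risk difference (A minus B) with a Wald confidence interval.

    Assumption: the normal approximation is adequate for these counts.
    """
    risk_a, risk_b = events_a / n_a, events_b / n_b
    diff = risk_a - risk_b
    se = math.sqrt(risk_a * (1 - risk_a) / n_a + risk_b * (1 - risk_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return diff, diff - z * se, diff + z * se


def bonferroni_threshold(alpha, n_outcomes):
    """Per-outcome cut-off that keeps the overall error rate at or below alpha."""
    return alpha / n_outcomes


diff, lower, upper = risk_difference_ci(6, 100, 13, 100)
print(f"Risk difference: {diff:.1%} (95% CI {lower:.1%} to {upper:.1%})")

threshold = bonferroni_threshold(0.05, 18)
print(f"Bonferroni cut-off for 18 outcomes: p < {threshold:.4f}")
```

The interval shows the reader both the size of the estimated effect and its uncertainty: here it runs from a sizeable reduction in injuries to a slight increase, which a bare ‘p=0.091’ does not convey.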

References

Footnotes

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • i. p value calculated via a two-tailed χ² test.

  • ii. p value calculated via a two-tailed t-test.