Was it a good idea to combine the studies? Why clinicians should care about heterogeneity when making decisions based on systematic reviews
Hege Grindem1, Mohammad Ali Mansournia2,3, Britt Elin Øiestad4, Clare L Ardern5,6

  1. Department of Sports Medicine, Norwegian School of Sport Sciences, Oslo, Norway
  2. Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran
  3. Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, Tehran, Iran
  4. Department of Physiotherapy, Oslo Metropolitan University, Oslo, Norway
  5. Division of Physiotherapy, Linköping University, Linköping, Sweden
  6. School of Allied Health, La Trobe University, Melbourne, Victoria, Australia

Correspondence to Professor Mohammad Ali Mansournia, Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran; mansournia_ma{at}

Imagine you are the clinician who is responsible for a youth athletic programme. You know there are effective injury prevention programmes to reduce the risk for ACL injuries; the coach/manager asks you to decide which programme to implement for your team. However, ACL injuries are not as common as other injuries. To help decide whether it is worth making these programmes mandatory for all athletes under your care, you need to know the risk for ACL injury. A recent systematic review of 58 studies from various sports provides some data.1 As you read the systematic review, you note that the authors advise cautious interpretation of the results because of heterogeneity. What is heterogeneity and should you care?

Heterogeneity is a challenging issue in systematic reviews and a key factor in the decision to pool or not to pool the results of available studies. Here we guide readers and authors of systematic reviews on heterogeneity and its sources: clinical and methodological diversity (box 1).2

Box 1

Definitions of key terms2

Heterogeneity: Genuine variability in the results of individual studies (above that expected by chance). The source of this variability is clinical or methodological diversity in the studies.

Clinical diversity: Between-study differences in participant characteristics, interventions or outcomes.

Methodological diversity: Between-study differences in design and risk of bias.

Bias: Methodological issues that systematically distort results.

Heterogeneity source 1: clinical diversity

Consider the individual studies in the ACL systematic review.1 Would we expect athletes of different ages, who play different sports at different levels, to have the same risk for ACL injury? BJSM readers know, either from clinical practice or research, that different factors influence the risk for injury to a greater or lesser extent. Given your knowledge, do the included studies represent athlete groups with inherently different injury risks? Some studies include athletes who perform effective injury prevention training, while others do not. One study reports all ACL injuries, while another study only reports non-contact injuries.

Clinical diversity is a double-edged sword: too little diversity and the conclusion will only apply to very small subgroups or certain outcomes. On the flip side, combining participants, interventions and outcomes that are too diverse will make it impossible to arrive at a meaningful conclusion.

Heterogeneity source 2: methodological diversity

Studies vary in how well they are designed to guard against bias. This contributes to inconsistent results. Using a tool to standardise the assessment of bias can help us recognise the most important contributors to bias. However, content-specific methodological expertise may also be needed to recognise methodological factors that skew individual study results. For example, after systematically reviewing the literature on knee osteoarthritis (OA) after ACL injury, Øiestad et al 3 saw that seven different radiological scoring methods had been used to define OA. They considered that the different ways used to define OA could lead to inconsistent results and decided a meta-analysis would not provide a meaningful summary.

Unfortunately, ACL injuries may be as difficult to synthesise as knee OA. Consider two methodological issues and how they might affect the reported injury incidence:

  1. Studies that rely on medical staff to record injuries may only capture 41% of all knee injuries, but studies with athlete-reported injuries may capture 92%.4

  2. Missing exposure data directly impact on exposure-adjusted injury incidence rates. If a study has 50% missing exposure data and this is not addressed by imputing missing values, the reported incidence will be artificially doubled.5
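The arithmetic behind both issues can be sketched in a few lines. This is a minimal illustration, not taken from the cited studies: the function name and the "true" rate of 1.0 injury per 1000 hours are hypothetical, and it assumes that all injuries are still counted when exposure data are missing.

```python
def observed_rate(true_rate, injury_capture_fraction=1.0, exposure_recorded_fraction=1.0):
    """Reported incidence when only part of the injuries and/or exposure is recorded.

    Incidence = injuries / exposure, so under-counting injuries shrinks the rate,
    while under-counting exposure (the denominator) inflates it.
    """
    return true_rate * injury_capture_fraction / exposure_recorded_fraction

# Hypothetical true rate: 1.0 injury per 1000 exposure hours.
true_rate = 1.0

# Issue 1: capture fractions quoted in the text (41% vs 92% of knee injuries).
staff_recorded = observed_rate(true_rate, injury_capture_fraction=0.41)
athlete_reported = observed_rate(true_rate, injury_capture_fraction=0.92)

# Issue 2: 50% of exposure data missing and not imputed.
half_exposure = observed_rate(true_rate, exposure_recorded_fraction=0.5)

print(staff_recorded, athlete_reported, half_exposure)
```

Under these assumptions, two studies of the same athletes could report rates that differ by a factor of two or more purely because of how injuries and exposure were recorded.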

Heterogeneity and implications for analysis

The insightful meta-analyst will first consider the diversity in participants and methods. What are the known sources of heterogeneity? To produce meaningful estimates of ACL incidence rates, separate estimates should at least be calculated for similar types of sports, athletes of similar ages and similar follow-up times. Once the authors are confident that the studies (or subgroups of studies) are sufficiently similar in terms of participants and methods to be combined, a meta-analysis may be performed. After the studies have been analysed, we can assess heterogeneity, that is, whether the results of the included studies are inconsistent with one another. As a starting point, consider a forest plot (figure 1). Heterogeneity can be a problem if the CIs of the individual study estimates do not overlap. Box 2 includes more advanced ways of assessing heterogeneity.6–9

Box 2

Ways to assess heterogeneity

  1. Graphical displays:

    1. Funnel plot6: The estimates from each study are plotted against a measure of the study’s size or precision—usually the SE. Heterogeneity can lead to an asymmetric funnel plot if there is a correlation between study size and results.

    2. Galbraith plot7: The ratio of the individual study estimate to its SE is plotted against the reciprocal of the SE. The pooled estimate and 95% confidence limits are represented by the slope of a solid line and two dotted lines. In the absence of heterogeneity, about 95% of the points should lie between the two dotted lines. Outliers indicate heterogeneity.

    3. Histogram of Z-scores8: Z-scores are computed by subtracting the pooled estimate from the individual study estimate and then dividing by the SE of the individual study estimate. If variability in results is due to chance and not heterogeneity, the Z-scores should approximate a normal curve. Large absolute Z-scores can also signal heterogeneity.

  2. Statistical test of heterogeneity,9 for example, Cochran’s Q or the I2 statistic. The I2 statistic is often preferred because it is independent of the number of included studies and easier to interpret. As a rule of thumb, I2 values between 75% and 100% represent high levels of heterogeneity (figure 1).

Note that all methods require that an adequate number of studies is included in the meta-analysis (a minimum of 10 studies is recommended for funnel plots2). Additionally, the I2 statistic can be misleading if the individual studies in the meta-analysis are very large: very precise study estimates yield high I2 values even when the absolute differences between studies are small.
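The quantities named in Box 2 can be computed directly from study estimates and their SEs. The sketch below is an illustration only: the three estimates and SEs are hypothetical, the pooled estimate is assumed to be the inverse-variance (fixed-effect) average, and I2 uses the common definition max(0, (Q − df)/Q) × 100%.

```python
def fixed_effect_pooled(estimates, standard_errors):
    """Inverse-variance (fixed-effect) pooled estimate and the study weights."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    return pooled, weights

def cochran_q(estimates, standard_errors):
    """Cochran's Q: weighted sum of squared deviations from the pooled estimate."""
    pooled, weights = fixed_effect_pooled(estimates, standard_errors)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))

def i_squared(q, n_studies):
    """I2 in percent: share of variability in excess of that expected by chance."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

def z_scores(estimates, standard_errors):
    """(study estimate - pooled estimate) / study SE, as described in Box 2."""
    pooled, _ = fixed_effect_pooled(estimates, standard_errors)
    return [(y - pooled) / se for y, se in zip(estimates, standard_errors)]

def galbraith_points(estimates, standard_errors):
    """Galbraith plot coordinates: x = 1/SE, y = estimate/SE."""
    return [(1.0 / se, y / se) for y, se in zip(estimates, standard_errors)]

# Hypothetical study results (for example, log incidence rates) and their SEs.
estimates = [0.10, 0.30, 0.50]
standard_errors = [0.05, 0.05, 0.05]

q = cochran_q(estimates, standard_errors)
i2 = i_squared(q, len(estimates))
z = z_scores(estimates, standard_errors)
points = galbraith_points(estimates, standard_errors)
print(round(q, 2), round(i2, 2), [round(v, 2) for v in z])
```

With only three studies this example falls well short of the number of studies recommended above; it is meant to show the calculations, not a realistic analysis.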

Figure 1

(A) Studies are weighted similarly despite a wide range of study sizes. (B) The I2 value is 98%, indicating very high levels of heterogeneity. Adapted from Montalvo et al 1 (figure 3). Reprinted with permission from BMJ Publishing Group Ltd.  

If there is substantial heterogeneity, authors might choose to analyse the data with a random-effects meta-analysis. One way to think about this is to assume that each study represents one subset of a larger athlete population and a random-effects meta-analysis will give us a valid summary measure for this population. But if the larger population consists of subpopulations that have widely different risks for injury (eg, paediatric athletes, male and female football players, swimmers), a summary measure would not apply to any of these individual subpopulations. Another problem with random-effects summaries is that small studies are given relatively more weight than in a fixed-effect summary (figure 1). The analysis is therefore vulnerable to biases that affect small studies more than large studies (eg, publication bias). Random-effects meta-analysis is therefore not a panacea for all problems related to heterogeneity.
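The shift in weights can be seen in a small sketch. The article does not say which random-effects estimator is meant; the code below assumes the widely used DerSimonian-Laird estimate of the between-study variance (tau2), and the three studies are hypothetical.

```python
def dersimonian_laird_tau2(estimates, standard_errors):
    """Between-study variance (tau2) by the DerSimonian-Laird method of moments."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    c = total_w - sum(w ** 2 for w in weights) / total_w
    if c <= 0:  # for example, a single study: tau2 cannot be estimated
        return 0.0
    return max(0.0, (q - df) / c)

def relative_weights(standard_errors, tau2=0.0):
    """Share of total weight per study; tau2 = 0 gives fixed-effect weights."""
    raw = [1.0 / (se ** 2 + tau2) for se in standard_errors]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical data: one large, one medium and one small study (largest SE last).
estimates = [0.10, 0.30, 0.50]
standard_errors = [0.02, 0.05, 0.10]

tau2 = dersimonian_laird_tau2(estimates, standard_errors)
fixed_w = relative_weights(standard_errors)
random_w = relative_weights(standard_errors, tau2)

for label, f, r in zip(["large", "medium", "small"], fixed_w, random_w):
    print(f"{label:6s} fixed-effect {f:.1%}  random-effects {r:.1%}")
```

Because tau2 is added to every study's variance, the weights become more equal as heterogeneity grows, which is why the smallest study gains influence in the random-effects summary.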

A rich data set also provides an opportunity for authors to explore potential sources of heterogeneity with meta-regression or subgroup analyses. Knowing why results vary across studies can be more important than knowing the overall summary for a large population. These analyses can lead to the discovery of important clinical subgroups or of methodological issues that should be given more weight when we plan and interpret studies. To avoid spurious findings, a small number of hypotheses with a reasonable rationale should be prespecified in the review protocol. If the sources of heterogeneity are successfully identified, separate estimates can be reported for different subgroups, resolving the issue. However, if the sources of heterogeneity are not identified, the most appropriate solution may be to abandon meta-analysis and conduct a high-quality qualitative synthesis (eg, best-evidence synthesis) of the results.
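One simple way to see whether a prespecified subgroup explains the heterogeneity is to split Cochran's Q into a within-subgroup and a between-subgroup part. The sketch below assumes fixed-effect (inverse-variance) pooling within each subgroup; the study values and the sport labels are hypothetical.

```python
def pooled_and_q(estimates, standard_errors):
    """Inverse-variance pooled estimate and Cochran's Q for a set of studies."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    return pooled, q

def subgroup_q(estimates, standard_errors, labels):
    """Split total Q into Q within each subgroup and Q between subgroups."""
    _, q_total = pooled_and_q(estimates, standard_errors)
    pooled_by_group, q_within = {}, {}
    for label in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        pooled_by_group[label], q_within[label] = pooled_and_q(
            [estimates[i] for i in idx], [standard_errors[i] for i in idx]
        )
    q_between = q_total - sum(q_within.values())
    return q_total, q_within, q_between, pooled_by_group

# Hypothetical studies from two sports with clearly different results.
estimates = [0.10, 0.12, 0.50, 0.52]
standard_errors = [0.05, 0.05, 0.05, 0.05]
labels = ["swimming", "swimming", "football", "football"]

q_total, q_within, q_between, pooled_by_group = subgroup_q(
    estimates, standard_errors, labels
)
print(round(q_total, 2), {k: round(v, 2) for k, v in q_within.items()}, round(q_between, 2))
```

In this constructed example almost all of the variability sits between the two sports, so reporting one estimate per sport would be more meaningful than a single overall summary.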


When you read a systematic review with meta-analysis, we recommend you ask yourself four questions (box 3) to help determine whether the studies have been combined in a meaningful way.

Box 3

Does it make sense to combine the studies?

  1. Is there too much clinical diversity for the results to make sense? Consider participant characteristics, interventions and outcomes.

  2. Are differences in study methodology likely to yield inconsistent results? Risk of bias assessment results can help you identify potential problems with study methods (eg, lack of participant and assessor blinding in a randomised controlled trial or inclusion of participants who might already have the outcome of interest before an observational study started), but content-specific methodological expertise is also important.

  3. Do any graphs suggest heterogeneity is a problem? Helpful graphs include a funnel plot, Galbraith plot and a histogram of Z-scores.

  4. Do statistical tests indicate heterogeneity? Consider Cochran’s Q or the I2 statistic.

Obtaining one answer for a large population can have merit. However, one size often does not fit all, and recognising this is vital for research that guides the decisions of individual athletes, clubs and clinicians. Careful consideration of heterogeneity will help you determine if the results are important for your clinical or sports practice.



  • Contributors HG and CLA proposed the initial idea. All authors contributed to manuscript draft and revisions. All authors approved the final version.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient consent Not required.

  • Provenance and peer review Not commissioned; internally peer reviewed.
