Systematic reviews are a valuable tool to inform healthcare decision-making.1 2 While a single randomised controlled trial (RCT) is insufficient to definitively guide healthcare decisions, a systematic review synthesising multiple RCTs can overcome this limitation. The results of rigorous systematic reviews possess wide-ranging applicability to numerous stakeholders within the evidence-based medicine ‘ecosystem’. Clinicians consult systematic reviews to inform their clinical decisions.3 Researchers rely on systematic reviews to identify knowledge gaps in existing literature.4 Health policymakers use systematic review evidence to inform practice guidelines and legislation.5 6 Journal editors often prioritise systematic reviews for their impact on readership attention and journal metrics.7 Finally, patients are empowered by systematic reviews that assess the beneficial and harmful patient-important outcomes of available management strategies.8 Evidently, systematic review authors have an important responsibility to ensure their findings provide the most accurate results possible.
The biomedical literature expands by 22 systematic reviews daily,9 with no evidence that production is waning. More systematic reviews are desirable if they identify and inform important research questions that improve patient care.10 However, production of this magnitude is problematic when systematic reviews offer ‘extensive redundancy, little value, misleading claims and/or vested interests’.11 As we outlined in part 1, bias is a systematic deviation from the truth in the results of a research study due to limitations in study design, conduct, or analysis.2 Deviations may either overestimate or underestimate a study’s true findings depending on the type and magnitude of bias. As the results of a systematic review are only as valid as the studies it includes, pooling biased results from different studies can compromise the credibility of systematic review findings when no assessment, or a poor assessment, of risk of bias is performed.3 12
Inadequate study design, conduct, or analysis devalues the credibility of biomedical research and the competence of clinicians to care for patients.13 Comprehensively assessing risk of bias instead of study quality—using a domain-based risk of bias assessment instead of a quality scale or checklist—is the best method to avoid overlooking biased evidence that propagates misleading systematic review findings. This two-part education primer focuses on critical assessments as a source of misleading systematic review conclusions. In part 1, we introduced risk of bias as the perceived risk that the results of a research study may deviate from the truth. In part 2, we:
Evaluate the prevalence and methods of critical assessments in systematic reviews published in BJSM.
Perform a risk of bias assessment on a sample of RCTs in a systematic review, to compare risk of bias assessment findings with study quality assessment findings.
Illustrate the impact that different critical assessment tools have on risk of bias assessment findings, and ultimately, systematic review findings.
Provide recommendations to systematic review authors who are planning a risk of bias assessment.
Empirical evidence from sport and exercise medicine
Sport and exercise medicine (SEM) research is vulnerable to bias across many empirical study designs.14 15 The characteristics of risk of bias assessments in SEM systematic reviews are unknown. Evaluating how risk of bias assessments are conducted is necessary to determine whether risk of bias is adequately assessed across SEM research. We performed a cross-sectional study to identify the methods used to critically appraise the credibility of original study findings in systematic reviews published in BJSM.
We searched the BJSM journal archive (http://bjsm.bmj.com/content/by/year) on 11 May 2017 to identify systematic reviews published since 1 January 2016. Systematic reviews performing a descriptive synthesis or meta-analysis, and published in BJSM between 1 January 2016 and 10 May 2017, were eligible for inclusion.
Two authors (FCB and ED) independently screened titles and abstracts to identify systematic reviews. Review article types that did not systematically identify and select eligible studies, and synthesise relevant study content (eg, narrative/critical reviews, PEDro syntheses, consensus statements and practice guidelines) were excluded. A third author (CLA or MW) arbitrated disagreements. Descriptive characteristics of each systematic review were independently abstracted by two authors (FCB and ED) using a predefined data extraction template (table 1).
One author (FCB) performed pilot data extraction on a subsample of systematic reviews prior to full data extraction. Two authors (FCB and ED) independently extracted data from each included systematic review. A third author (CLA or MW) arbitrated disagreements. Data are presented as absolute frequencies (n) and as a proportion (%) of the sample of systematic reviews.
We included 66 systematic reviews.
Sixty-five (99%) systematic reviews reported a critical assessment of included studies. Table 2 lists the characteristics of critical assessment tools used. Critical assessment tools used in systematic reviews are included in online supplementary table 2.
Forty-two (65%) systematic reviews used a standard tool, 14 (21%) used an adapted tool and 9 (14%) used a custom tool.
Eighteen (28%) systematic reviews used a checklist to critically assess included studies, 38 (58%) used a scale and 9 (14%) used a domain-based assessment tool.
Fifty (77%) systematic reviews used a ranking system that ranked studies based on critical assessment findings. Nineteen (38% of 50) systematic reviews ranked studies based on a summary score of methodological study quality. Thirty-one (62% of 50) ranked studies using a threshold summary score to classify studies in categories of ‘high’, ‘moderate’ or ‘low’ quality.
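The percentages in the preceding paragraph use two different denominators: the 65 reviews that performed a critical assessment, and the 50 reviews that ranked studies. The short sketch below recomputes them from the reported counts; the helper name `pct` is ours and is only illustrative.

```python
def pct(n, denominator):
    """Return n as a whole-number percentage of denominator."""
    return round(100 * n / denominator)

assessed = 65          # reviews that reported a critical assessment
ranked = 50            # reviews that ranked studies
by_summary_score = 19  # ranked on a summary score
by_threshold = 31      # ranked using a threshold summary score

print(pct(ranked, assessed))            # 77 (% of 65)
print(pct(by_summary_score, ranked))    # 38 (% of 50)
print(pct(by_threshold, ranked))        # 62 (% of 50)

# The same counts expressed against all 65 assessed reviews.
print(pct(by_summary_score, assessed))  # 29
print(pct(by_threshold, assessed))      # 48
```

Stating the denominator next to each percentage, as done above with ‘% of 50’, avoids reading 38% as a share of all 65 reviews.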
Of 65 systematic reviews that performed a critical assessment of included studies, 11 (17%) performed a risk of bias assessment for separate outcomes. Three (5%) performed a domain-based risk of bias assessment for separate outcomes; 10 (15%) performed domain-level risk of bias assessments but not for separate outcomes. Forty-seven (72%) systematic reviews presented risk of bias assessment findings for each tool item but not for individual risk of bias domains or for separate outcomes.
Two systematic reviews (3%) performed a meta-regression to examine the quantitative influence of risk of bias or study quality of each included study on individual study effect size. Ten (15%) performed a sensitivity analysis to compare the effect estimates of studies at ‘high’ and ‘unclear’ risk of bias to all included studies. One systematic review (2%) excluded studies at ‘high’ and ‘unclear’ risk of bias from the synthesis to restrict their evidence synthesis to studies at low risk of bias. Forty-three (66%) narratively discussed risk of bias assessment findings in the context of systematic review findings, and nine systematic reviews (14%) did not incorporate risk of bias assessment findings into review findings.
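To make the idea of a sensitivity analysis concrete, the sketch below pools effect estimates twice: once for all studies and once restricted to studies at low overall risk of bias. It assumes a simple inverse-variance fixed-effect model, which is our choice for illustration and not necessarily the model used by any review in the sample. All study names, effect sizes and standard errors are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class Study:
    name: str
    effect: float     # effect estimate, eg, a standardised mean difference
    se: float         # standard error of the effect estimate
    overall_rob: str  # 'low', 'some concerns' or 'high'

def pool_fixed_effect(studies):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    if not studies:
        raise ValueError("no studies to pool")
    weights = [1.0 / (s.se ** 2) for s in studies]
    total = sum(weights)
    pooled = sum(w * s.effect for w, s in zip(weights, studies)) / total
    return pooled, sqrt(1.0 / total)

# Hypothetical data, invented purely for illustration.
studies = [
    Study("Trial A", 0.80, 0.20, "high"),
    Study("Trial B", 0.60, 0.20, "some concerns"),
    Study("Trial C", 0.20, 0.20, "low"),
    Study("Trial D", 0.40, 0.20, "low"),
]

all_estimate, all_se = pool_fixed_effect(studies)
low_only = [s for s in studies if s.overall_rob == "low"]
low_estimate, low_se = pool_fixed_effect(low_only)

print(f"All studies:   {all_estimate:.2f} (SE {all_se:.2f})")
print(f"Low risk only: {low_estimate:.2f} (SE {low_se:.2f})")
```

In this invented example the pooled estimate shrinks when studies at ‘high’ or ‘some concerns’ of bias are removed, which is the pattern a sensitivity analysis is designed to reveal. A random-effects model or a meta-regression would be reasonable alternatives.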
Critical re-assessment using a risk of bias assessment instead of an assessment of study quality
The features of a risk of bias assessment, including the terminology, the assessment tool and method, and the incorporation method used, influence risk of bias assessment findings. In this section, we present a worked example where we re-assess the risk of bias of RCT outcomes in one BJSM systematic review.16 By assessing the risk of bias of RCT outcomes, we illustrate the impact that different critical assessment tools can have on risk of bias assessment findings, and ultimately, systematic review findings.
Methods of risk of bias assessment
We used prespecified inclusion criteria to identify a systematic review (from our original group of 65) that included intervention studies and intended to assess risk of bias but evaluated features other than risk of bias (eligibility criteria—online supplementary file 1). Our eligibility criteria identified a systematic review16 that investigated the efficacy of therapeutic interventions to improve patient-reported function in participants with chronic ankle instability (CAI). This systematic review included 17 original research studies: 11 RCTs, 1 cohort study, 2 case-control studies, and 3 case-series. For the purpose of this education review, we re-assessed only the 11 RCTs included in this systematic review.16 An author of the current study (ED) contacted a member of the systematic review team to articulate the current study’s aims and to obtain the review authors’ permission to perform a risk of bias assessment of RCTs included in their systematic review. The lead and supervising authors of this systematic review16 will provide a commentary in response to the current study’s findings in the context of their systematic review findings.
We assessed the risk of bias in outcomes of patient-reported function (ie, activities of daily living (ADL) and sports subscales) in 11 RCTs16 using the Cochrane Risk of Bias tool 2 (RoB2). RoB2 is the revised, second edition of the Cochrane Risk of Bias tool for RCTs.17 18 RoB2 is an outcome-focused, domain-based tool that assesses the risk of bias in outcomes in individually-randomised, parallel-group trials, randomised cross-over trials, and cluster-RCTs.18
RoB2 features five risk of bias domains for individually randomised, parallel-group trials:
Bias arising from the randomisation process.
Bias due to deviations from intended interventions.
Bias due to missing outcome data.
Bias in measurement of the outcome.
Bias in selection of the reported results.
Responses to RoB2 signalling questions about specific study limitations are mapped using a decision algorithm to determine each risk of bias domain judgement.18 Finally, an overall risk of bias judgement is made for each assessed outcome, in each trial, based on the domain-level assessment. An outcome is judged as ‘low risk’ when all domains are judged to be at ‘low’ risk of bias. An outcome is judged to have ‘some concerns’ of bias when at least one domain is judged to be at ‘some concerns’ of bias and no domain is judged to be at ‘high’ risk of bias. An outcome is judged as ‘high risk’ of bias when at least one domain is judged to be at ‘high’ risk of bias, or when multiple domains have ‘some concerns’. Multiple domains at ‘some concerns’ of bias increase the likelihood that the treatment effect may be distorted, which warrants a ‘high’ overall risk of bias judgement. Strong meta-epidemiological evidence supports the content of risk of bias domains included in RoB2.19
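The overall judgement rule described above can be sketched as a small function. This is a simplified illustration, not the official RoB2 algorithm: RoB2 leaves ‘multiple domains with some concerns’ to assessor judgement, so the numeric cut-off `multiple_threshold` (set to two here) is our assumption, as are the function and variable names.

```python
LOW, SOME, HIGH = "low", "some concerns", "high"
VALID = {LOW, SOME, HIGH}

def overall_judgement(domain_judgements, multiple_threshold=2):
    """Map domain-level judgements for one outcome to an overall judgement.

    domain_judgements: dict of domain name -> 'low', 'some concerns' or 'high'.
    multiple_threshold: assumed number of 'some concerns' domains that counts
    as 'multiple' (a judgement call in practice, not a fixed RoB2 value).
    """
    judgements = list(domain_judgements.values())
    if not judgements or any(j not in VALID for j in judgements):
        raise ValueError("each domain needs a valid judgement")
    if HIGH in judgements:
        return HIGH
    n_some = judgements.count(SOME)
    if n_some >= multiple_threshold:
        return HIGH
    if n_some >= 1:
        return SOME
    return LOW

# Hypothetical domain judgements for one outcome in one trial.
example = {
    "randomisation process": SOME,
    "deviations from intended interventions": LOW,
    "missing outcome data": LOW,
    "measurement of the outcome": HIGH,
    "selection of the reported results": SOME,
}
print(overall_judgement(example))  # high
```

The function is applied once per outcome, per trial, which mirrors the outcome-focused design of the tool.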
Two independent assessors (FB and ED) re-assessed the risk of bias of the 11 RCTs using the revised version of RoB2 from 9 October 2018 (http://www.riskofbias.info).18 Assessors resolved initial disagreement via discussion. A third, independent assessor (MW or RE) arbitrated any persisting disagreements. Assessors were not blinded to the findings of the Downs and Black quality assessment performed by Kosik et al.16
Methods of study quality assessment
Kosik et al16 used the Downs and Black checklist for their intended risk of bias assessment. The Downs and Black checklist is a quality assessment scale developed to evaluate the methodological quality of randomised and non-randomised trials.20 The checklist comprises 27 items across 4 subscales: completeness of reporting (10 items); internal validity (13 items); precision (1 item) and external validity (3 items). The Downs and Black checklist assigns numeric values to item responses (‘yes’=1; ‘no’=0; ‘unable to determine’=0). Item scores are summed; higher scale summary scores indicate superior study quality. The Downs and Black checklist considers the methodological quality of each study irrespective of the number and type of different outcomes assessed in each study.
Results of study quality assessment and risk of bias assessments
Risk of bias reassessment: bias arising from the randomisation process
Eight RCTs (73%) had outcomes at ‘some concerns’ of bias arising from the randomisation process. This was due to improper or unclear methods of sequence generation, allocation concealment, or due to imbalances in group baseline demographic characteristics. Outcomes in 2 RCTs (18%) had ‘high’ risk of bias arising from the randomisation process. One RCT (9%) had an outcome at ‘low’ risk of bias.
Risk of bias reassessment: bias due to deviations from intended interventions
Seven RCTs (64%) had outcomes at ‘high’ risk of bias due to deviations from intended interventions with specific interest in lack of adherence to the intervention. Two RCTs (18%) had ‘some concerns’ of bias and two trials (18%) had ‘low’ risk of bias.
Risk of bias reassessment: bias due to missing outcome data
Two RCTs (18%) had outcomes at ‘high’ risk of bias due to missing outcome data. Nine RCTs (82%) had outcomes that were at ‘low’ risk of bias due to little, or no, missing outcome data.
Risk of bias reassessment: bias in measurement of the outcome
Eight RCTs (73%) had outcomes at ‘high’ risk of bias, predominantly due to a lack of participant blinding. Trial participants were outcome assessors due to the use of patient-reported outcome measures. Two trials (18%) were at ‘low’ risk of bias and one trial (9%) was at ‘some concerns’ of bias in measurement of the outcome.
Risk of bias reassessment: bias in selection of the reported results
One RCT (9%) was at ‘high’ risk of bias in selection of the reported results. Ten RCTs (91%) had outcomes at ‘some concerns’ of bias. Judgements of ‘some concerns’ of bias were due to no available prespecified trial protocol or analysis plan, and the possibility for many statistical analyses, other than that reported in the results of each trial, to be performed.
Risk of bias reassessment: overall risk of bias for patient-reported function
All RCTs (k=11; 100%) were at ‘high’ overall risk of bias for all intervention comparisons and follow-up assessment time-points. RCTs were at ‘high’ overall risk of bias due to the presence of at least one risk of bias domain at ‘high’ risk of bias or multiple risk of bias domains at ‘some concerns’ of bias (table 3).21–31
The assessment of study quality (using the Downs and Black checklist) produced a mean scale summary score of 21 out of 31 (minimum-maximum=11–24) across 11 RCTs.16 Eight RCTs were judged by Kosik et al as high quality and three RCTs were judged by Kosik et al as moderate quality.
Using the Downs and Black checklist, the majority of included RCTs (8/11) were judged to be high-quality trials. Kosik et al interpreted study quality assessment findings to provide moderate-quality to high-quality evidence for therapeutic interventions improving patient-reported function in individuals with CAI. Using RoB2 on the same sample of RCTs, all trials were judged to be at ‘high’ overall risk of bias. Our interpretation is that the current evidence for therapeutic interventions in individuals with CAI is likely prone to bias, which limits the conclusions that can be drawn.
Implications and recommendations
We found a high prevalence of critical assessments among systematic reviews published in BJSM. However, risk of bias assessments were infrequently performed in systematic reviews. Many systematic reviews assessed study quality instead of risk of bias, which impairs accurate inferences about the credibility of study outcomes.2 32 33
Performing an assessment of study quality instead of a risk of bias assessment may lead to invalid systematic review findings, conclusions, and recommendations. The quality assessment of 11 RCTs in our worked example produced a mean summary score of 20.5 out of 31 (66%; minimum-maximum=11–24), indicating moderate-quality to high-quality evidence.16 Kosik et al did not categorise scale summary scores using cut-off tertiles.16 If categorised into low (ie, 0–10), moderate (ie, 11–20) and high quality (ie, 21–31), eight RCTs were of high quality, three RCTs were of moderate quality and no RCT was judged to be of low quality (table 4). However, using a risk of bias assessment, all RCTs were judged to be at ‘high’ overall risk of bias using RoB2 (table 4).
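The tertile categorisation can be written out explicitly. The sketch below assumes the 0 to 31 scoring range used in the worked example, split into bands of 0–10, 11–20 and 21–31; the function name `quality_band` is ours. Only values reported in the text (the minimum, maximum and mean summary scores) are used as inputs.

```python
MAX_SCORE = 31  # maximum summary score used in the worked example

def quality_band(score):
    """Place a Downs and Black summary score into a tertile band."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError("score must lie between 0 and 31")
    if score <= 10:
        return "low"
    if score <= 20:
        return "moderate"
    return "high"

print(quality_band(11))               # moderate (lowest reported score)
print(quality_band(24))               # high (highest reported score)
print(round(100 * 20.5 / MAX_SCORE))  # 66 (mean score as a percentage)
```

A band such as ‘high’ says nothing about which items were met, so two trials in the same band can have very different limitations. This is the information that a domain-level judgement retains and a summary score discards.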
Domains should dominate risk of bias assessments
Despite empirical evidence to support the use of domain-based risk of bias assessment tools, approximately one-third of SEM systematic reviews included in our sample did not use an empirically supported tool. Only 14% of systematic reviews performed a domain-based risk of bias assessment while most used scales or checklists. This estimate of 14% in our sample of SEM systematic reviews is lower than in biomedicine where 41% of non-Cochrane systematic reviews used a domain-level risk of bias assessment tool.9
Scale summary scores often lack meaning and omit valuable information about specific study limitations in individual studies.34 35 For example, quality summary scores presented in the systematic review example ranged from 11/31 to 24/31 (table 4).16 Reporting only one numeric value provides insufficient detail to highlight specific limitations in trial design, conduct, or analysis that threaten trial validity.36 37
Using RoB2, in every RCT we identified at least one domain at ‘high’ risk of bias or multiple domains at ‘some concerns’ of bias for each assessed outcome. Due to available evidence on the relationships between various risk of bias domains and distorted effect estimates,9 38 39 the use of an established domain-based tool enables a deeper understanding about the likely direction and size of trial effect estimates associated with bias. Our risk of bias assessment findings highlight that despite high trial quality,16 all RCTs were at ‘high’ overall risk of bias. This finding renders the current evidence of therapeutic interventions for CAI prone to bias, limiting conclusions about patient-reported function.
Focus on outcomes, not trials
Systematic review authors should perform separate risk of bias assessments for different outcomes rather than assess all review outcomes simultaneously as one general risk of bias assessment.2 Study limitations can distort separate outcomes differently,9 40 necessitating separate risk of bias assessments for different outcome types. For example, a subjective outcome, such as self-reported pain, is more likely to be overestimated when a patient is aware of their allocation to a specific intervention group (due to lack of patient blinding) than if they were not aware of their group allocation.9 Conversely, a patient’s awareness of their allocation to a therapeutic intervention group is less likely to influence an objective outcome such as re-injury.
Only half of systematic reviews across biomedicine, and 17% in our sample, performed separate risk of bias assessments for subjective and objective outcomes.41 Kosik et al 16 reported patient-reported function in ADL and sporting activities. Due to the subjective nature of patient-reported outcomes, the effect of study limitations due to deviation from intended interventions (eg, lack of patient blinding) may overestimate differences between intervention and control groups.
Incorporating risk of bias assessment findings into systematic review findings
Incorporating risk of bias assessment findings into systematic review findings allows an interpretation of overestimated or underestimated study outcomes, to avoid misleading conclusions (table 5). High numbers of studies at ‘some concerns’ or ‘high’ risk of bias necessitate a more cautious interpretation of review findings. Two-thirds of systematic reviews in our sample presented a qualitative description of risk of bias assessment findings. However, these systematic reviews did not estimate the likely impact of bias on systematic review outcomes.19 Approximately one in five systematic reviews used quantitative methods to adjust the review effect estimate based on the risk of bias present in included studies (table 2).2 41–44
How believable is the review effect estimate if bias is present?
In the RCTs descriptively synthesised by Kosik et al,16 small, non-significant between-group differences could represent an underestimate due to bias that under-represents the true magnitude of an intervention’s effect. The true difference between intervention and control groups may actually be larger than concluded, but we cannot be certain because of a ‘high’ risk of bias across many domains. Due to the influence of bias, therapeutic interventions may be more or less effective than concluded in this systematic review.16 Applying an evidence-grading tool to ‘high’ overall risk of bias judgements in this systematic review lowers the quality of evidence and strength of recommendation for balance training and multimodal treatment (Strength of Recommendation Taxonomy used by Kosik et al16).
In the presence of inarguable demonstrations that substantiate why ‘most published research findings are false’,45 researchers must be able to identify study outcomes at ‘some concerns’ or ‘high risk’ of bias. To accomplish this, researchers need to perform separate domain-based risk of bias assessments (for separate subjective and objective outcomes) that evaluate potential threats to a study’s internal validity. In box 1, we present recommendations for systematic review authors and editorial team members to inform the conduct of a valid risk of bias assessment. These recommendations reflect a blend of information from contemporary texts in evidence synthesis methods, peer-reviewed meta-research, editorial and evidence synthesis experience, and the findings of our methodological study.
Guidance checklist for systematic review authors, peer-reviewers, and editorial team members when performing and interpreting risk of bias assessments:
Assess risk of bias: avoid assessing study quality in place of risk of bias. Use a rigorously developed, study design-specific risk of bias tool such as RoB2 (for randomised intervention trials), ROBINS-I (for non-randomised intervention studies), ROBIS (for systematic reviews), PROBAST (for prediction modelling studies), QUIPS-II (for prognostic studies) or QUADAS-II (for diagnostic accuracy studies).
Use a risk of bias assessment tool in its original form: do not modify risk of bias tools by adding new items or omitting existing items that are deemed to be relevant or irrelevant to the assessment of risk of bias, respectively. Modifying a standard risk of bias assessment tool can negatively impact on the sensitivity of the risk of bias assessment by potentially including an item that does not address risk of bias. Systematic review authors should not develop their own critical assessment tool to assess risk of bias.
Evaluate risk of bias and present risk of bias assessment findings using a domain-level risk of bias assessment: domain-level risk of bias assessments consider specific study limitations, rather than one summary score, that can introduce different biases (eg, bias arising from the randomisation process). Individual risk of bias domains contribute differently to the extent that study limitations may distort study results.
Avoid using scales and checklists to assess risk of bias: scales and checklists frequently generate summary scores by using one numeric value, which omits information about diverse sources of bias. Judging the risk of bias for an outcome based on individual domains is recommended because several domains at ‘some concerns’ or ‘high’ risk of bias raise the suspicion that an effect estimate may be distorted.
Avoid cut-off thresholds that categorise studies based on risk of bias or study quality: cut-off thresholds that dichotomise or categorise ordinal scales of study quality into nominal groups (of ‘high’, ‘moderate’ or ‘low’ study quality) omit valuable information about methodological differences between studies. There is no evidence to support the numeric ranking of studies based on overall study quality, particularly when independent items contribute towards the overall quality score of a study.
Assess risk of bias separately for different outcomes: subjective outcomes, such as pain, are more strongly influenced by participants’ and outcome assessors’ awareness of methodological phenomena such as knowledge of group assignment. An objective outcome, such as death or reinjury, is more resistant to the influence of this type of bias.
Incorporate risk of bias assessment findings into systematic review findings using quantitative or qualitative (descriptive) methods: use quantitative methods, such as sensitivity analyses or meta-regression, when applicable in meta-analyses to investigate the association of bias with meta-analysis effect estimates. In a systematic review without meta-analysis, visually present study findings according to risk of bias assessment findings or provide an informative discussion to speculate about the influence of risk of bias on the credibility of research findings.
Provide justification, based on the available risk of bias assessment criteria, to support each risk of bias judgement: online supplementary file 2 provides the final judgements allocated by assessors to each signalling question of RoB2 in our risk of bias re-assessment of eleven RCTs included in the systematic review by Kosik et al.16
Consistent, valid, and trustworthy assessments of risk of bias in RCTs are essential to judge the credibility of a body of evidence. Nearly all systematic reviews published in BJSM between January 2016 and May 2017 included a critical assessment of included studies. However, few systematic reviews used domain-level risk of bias assessment tools or performed domain-level risk of bias assessments. Outcomes were rarely considered separately in risk of bias assessments. Quantitative methods were infrequently used to incorporate risk of bias assessment findings into systematic review findings. We identified discrepancies between risk of bias and study quality assessment findings in our re-assessment of studies that were included in a selected systematic review. Using a risk of bias assessment tool instead of a study quality assessment tool generated different critical assessment findings, which affected inferences about the credibility of review findings. Risk of bias assessment tools must be correctly selected and administered, and assessment findings interpreted and incorporated, to inform the extent to which risk of bias likely impacts a body of systematically reviewed evidence.
Twitter @peanutbuttner, @marinuswinters, @EamonnDelahunt, @clare_ardern
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; externally peer reviewed.