Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
Correction notice This paper has been amended since it was published Online First. The complete list of authors was inadvertently omitted and this has now been rectified.
In 2006, we reviewed shoulder physical examination (ShPE) and in 2008 our work was published in this journal.1 This publication was followed by a series of similar or otherwise redundant publications, addressing all or dedicated pathognomonic components of shoulder testing.2–7 The majority of those subsequent articles did not meta-analyse the ShPE tests' accuracy, evaluate risk of bias among the studies, or identify studies unique to our 2008 publication.1 The fact that so many review articles analysed the diagnostic accuracy of clinical shoulder tests in a period of three years speaks to the need to clearly address the question: ‘Which physical examination tests provide clinicians with the most value for diagnosis when examining the shoulder?’
Since 2006, there have been many changes necessitating an update of the original article. First and foremost, the publication of diagnostic articles on the use of ShPE tests in the clinical examination has continued at a brisk pace, resulting in numerous new publications on the accuracy of established tests and the development of new tests. Next, the methodology by which a systematic review on diagnostic accuracy is conducted has been updated from the Quality of Reporting of Meta-analyses (QUOROM)8 with the publication of Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA).9 Third, the criterion standard method of performing a meta-analysis has become a unification10 of the bivariate model11 and the hierarchical summary receiver operating characteristic (HSROC) model.12 Finally, the method by which the quality of individual studies is examined has been updated from the original Quality Assessment of Diagnostic Accuracy Studies (QUADAS)13 to the newly published QUADAS-2.14 These changes over the last five years have been extensive but the goal of this systematic review and meta-analysis has remained the same: to subject the literature on ShPE tests to careful analysis in order to determine their clinical utility in adult (18 or older) patients.
This systematic review with meta-analysis was conducted and reported according to the protocol outlined by PRISMA9 using a research question framed by PICOS methodology. PICOS is a mnemonic representing population (eg, adults), intervention (eg, diagnostic test), comparison (eg, control group), outcome (eg, accuracy) and study design (eg, cohort). In order to be eligible for this review, diagnostic accuracy studies, written in English, had to report both the sensitivity and specificity of ShPE tests in adults with shoulder pain due to musculoskeletal pathology. Excluded from this review were articles using equipment or devices that are not readily available to most clinicians during physical examination and articles in which subjects were tested under anaesthesia or in which subjects were cadavers.
Since this review is an update of our previous work,1 the terms in our Medline and CINAHL search strategies remained the same with the exception that the search was confined to the dates November, 2006 through February, 2012. Our previous study dates were 1966 – October, 2006. Further, the original search was expanded, without date restrictions, to include two new databases: EMBASE and the Cochrane Library. A hand search was also conducted, which included the authors' private collections and the searching of previous systematic reviews. Two authors (EH and AW) read titles and abstracts of all database-captured articles, applying the a priori inclusion/exclusion criteria, and agreement was measured using the κ statistic (figure 1). Disagreement was then resolved by discussion between the two authors and, in the event that agreement could not be reached, a third author (CC) served as the deciding vote. With the remaining articles, the same two authors (EH and AW) read the entire paper and, again, a κ value was calculated to measure agreement as to which articles to retain for final analysis (figure 1). Once the final group of 32 articles was determined, 2x2 table data were extracted and saved for meta-analysis. Only data from studies in which the 2x2 data were reported, or could be inferred from stated positive likelihood ratios, negative likelihood ratios, positive predictive values and negative predictive values, were retained for meta-analysis. If 2x2 data could not be discerned, the article was excluded from meta-analysis but still retained for systematic review and qualitative analysis.
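As an illustration of the agreement statistic used during screening, the following is a minimal sketch of an unweighted Cohen's κ for two reviewers' include/exclude decisions. The decision lists are hypothetical and are not data from this review; the function name `cohen_kappa` is likewise an assumption made for the example.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum((count_a[lab] / n) * (count_b[lab] / n) for lab in labels)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)


# Hypothetical screening decisions (1 = include, 0 = exclude) for ten abstracts.
reviewer_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reviewer_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this hypothetical example the reviewers agree on 8 of 10 abstracts, but because half of that agreement would be expected by chance, κ is noticeably lower than the raw agreement.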
Once the final group of articles was agreed upon, two authors (EH and AW) independently examined the quality of each article using the QUADAS-2 tool.14 QUADAS-2 is a 4-phase tool, the last phase of which assists authors of systematic reviews in rating: 1) bias and 2) applicability. The risk of bias is assessed in four key areas: patient selection, index test, reference standard, and flow and timing. Concern for applicability is assessed in three key areas: patient selection, index test, and reference standard. For both categories, risk of bias and concern for applicability, the individual criteria were classified as low risk, high risk, or unclear and the results were presented using tables from the QUADAS web site (www.quadas.org).
In order to maximise the potential for meta-analysis, we added 2x2 data from our first meta-analysis1 to data gathered from the 32 additional articles included in this review. Hierarchical summary receiver operating characteristic (HSROC) curve12 and bivariate11 models were used to combine estimates of sensitivity (SN), specificity (SP), positive likelihood ratios (LR+), negative likelihood ratios (LR−) and diagnostic OR (DOR) with their 95% CI. Sensitivity measures the proportion of actual positives which are correctly identified as such (eg, the percentage of sick people who are correctly identified as having the condition). Specificity measures the proportion of negatives which are correctly identified (eg, the percentage of healthy people who are correctly identified as not having the condition). The positive likelihood ratio (LR+) indicates how much the odds of the disease increase when a test is positive.15 The negative likelihood ratio (LR−) indicates how much the odds of the disease decrease when a test is negative.15 The diagnostic OR expresses the strength of association between the test result and disease. These models, in the absence of covariates, are different parameterisations of the same model10 and take into account the correlation between sensitivity and specificity and both the within and the between study variances.16 The 95% prediction region, which is provided graphically, is the region expected to include the true sensitivity and specificity of a future study with the given probability (ie, 95%).17 DerSimonian-Laird18 random-effects models were used where fewer than four studies were eligible for statistical pooling. Heterogeneity was explored graphically with forest plots and statistically with Cochran's Q with p<0.10 to indicate significant heterogeneity. When appropriate, meta-regression or subgroup analysis using study level characteristics was used to explore heterogeneity with a p<0.10 to indicate a significant difference in stratified estimates. 
A p value of <0.10 was chosen to indicate a significant difference in stratified estimates because of the low power of the test used to detect such differences.19 A value of 0.5 was added to all four cells of the 2x2 table when a zero was encountered in any cell as suggested by Cox.20
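The definitions above can be sketched for a single study's 2x2 table as follows. This is a minimal illustration only, not the pooled HSROC/bivariate analysis run in Stata; the function name `accuracy_measures` and the example counts are hypothetical, and the 0.5 correction is applied to all four cells whenever any cell is zero, as described in the text.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, LR+, LR- and DOR from one 2x2 table.

    tp/fn: diseased patients with a positive/negative index test.
    fp/tn: non-diseased patients with a positive/negative index test.
    """
    cells = (tp, fp, fn, tn)
    if any(c < 0 for c in cells):
        raise ValueError("Cell counts must be non-negative")
    # Continuity correction: add 0.5 to every cell if any cell is zero.
    if any(c == 0 for c in cells):
        tp, fp, fn, tn = (c + 0.5 for c in cells)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": sensitivity / (1.0 - specificity),
        "lr_neg": (1.0 - sensitivity) / specificity,
        "dor": (tp * tn) / (fp * fn),
    }


# Hypothetical study: 50 patients with and 50 without the target pathology.
plain = accuracy_measures(tp=40, fp=10, fn=10, tn=40)
# Hypothetical study with a zero cell, which triggers the 0.5 correction.
corrected = accuracy_measures(tp=20, fp=0, fn=5, tn=25)
for name, value in plain.items():
    print(f"{name}: {value:.2f}")
```

Note that the DOR equals LR+ divided by LR−, so a test with modest likelihood ratios in both directions can still show a seemingly large DOR.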
Publication bias was analysed statistically with the Egger21 test with a p<0.05 to indicate significant publication bias. Threshold effects were tested using Spearman correlation coefficients.22 Influential studies on summary estimates were assessed with Cook's D and standardised residuals according to Rabe-Hesketh23 with sensitivity analyses to determine if influential studies should be removed from the analyses. All statistical analyses were conducted in Stata 11 (Stata, College Station, Texas, USA) by one of the authors (AG).
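A threshold effect shows up as a negative correlation between sensitivity and specificity across studies. The sketch below computes a Spearman rank correlation in plain Python for hypothetical study-level estimates; it is an illustration of the idea and not the Stata routine used in this review, and the helper names are assumptions.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-study estimates in which sensitivity falls as specificity rises.
sensitivities = [0.90, 0.80, 0.70, 0.60]
specificities = [0.50, 0.60, 0.70, 0.80]
rho = spearman(sensitivities, specificities)
print(f"Spearman rho = {rho:.2f}")
```

A strongly negative coefficient, as in this contrived example, would suggest that studies differ mainly in the positivity threshold applied to the test.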
Since our previous meta-analysis,1 there were 32 new studies addressing the diagnostic accuracy of ShPE tests in adults (figure 1). A summary of the characteristics of each study is presented in table 1.
Twelve of these studies26,28,29,35,38,39,45–49,53 added 13 new tests to the literature, the majority of which attempted to detect a SLAP lesion. New tests were defined as those for which diagnostic accuracy statistics were reported for the first time in peer-reviewed literature. Clinically, many of these tests are not new. The 32 studies addressed the categories of: Rotator cuff tears (RCTs), Tendinopathy, Subacromial impingement, Instability, Labral tears, Biceps pathology, Stiffness-related disorders and Other. The most frequent topics of focus were RCTs, Tendinopathy, Subacromial impingement and Labral tears. Many would consider tendinopathy and impingement different labels for the same syndrome and, further, that both labels capture a continuum of disease that includes RCTs. We concur with this thought but separated these pathologic entities in order to simplify analysis. Therefore, the rotator cuff tear group included those studies where diagnostic accuracy was examined inclusive of any size of tear or classification system used. Three studies25,30,33 in the RCT category addressed full-thickness tears, one study39 addressed massive RCTs, and six studies41,42,46,52–54 addressed RCTs regardless of size or classification. Of the 10 RCT studies, five used tests designed to test specific, individual muscles of the rotator cuff. An example of this methodology was the Kim et al42 study that examined the accuracy of the empty can for supraspinatus pathology, Patte's test for infraspinatus tendinopathy, and the lift-off for subscapularis tendinopathy (and Yergason's test for biceps tendinopathy).
There were some trends observed in categories other than RCTs. In the labral tear group, two studies examined the use of tests to detect any labral tear, while six studies addressed superior labral anterior to posterior (SLAP) lesions and one study37 addressed both labral tears generally and SLAP lesions specifically. Of the three studies in the Instability category,29,37,39 one39 addressed soft tissue-related instability and two29,37 addressed bony instability, a pathology attracting increased attention since our last review. The Stiffness-related group included studies addressing either glenohumeral OA or adhesive capsulitis. Two studies28,39 in this category actually used the same data for the shrug sign and published that data in two separate papers. All three of the stiffness-related papers28,39,48 addressed adhesive capsulitis, another new pathology in the diagnostic literature since our last review. Finally, the Other category consists of two articles38,39 on detecting acromioclavicular (AC) pathology and one addressing bony abnormality.47
The sensitivity and specificity of most ShPE tests examined in all 32 studies and the risk of bias in each study are summarised in table 2. In the interest of efficient reporting, test data were omitted from table 2 if diagnostic accuracy figures were reported for pathologies which the test was never intended to detect. For example, if an author reported values for the lift-off test (subscapularis) in a population with adhesive capsulitis, those data were not reported.
Quality assessment – risk of bias and concern for applicability
Each of the 32 papers qualifying for final review was scrutinised, via the QUADAS-2 (Q2),14 in the areas of risk of bias and concern for applicability (Appendix). Concern for applicability, for this review, was defined as concern for external validity, the degree to which results of a research study can be applied to practice. The two authors (EH and AW) independently used the Q2,14 blinded from each other's assessments. The number of low risk/concern scores was tallied into a total score for each article and agreement was calculated using a weighted κ statistic. The weighted κ was poor (κ=0.31 with 95% CI 0.10 to 0.52). Summaries of risk of bias and concern for applicability for each pathological group are presented in figure 2. The greatest risk of bias was most often associated with the Q2 items Patient Flow and Reference Standard. The greatest concern in the category of applicability was also the reference standard with the addition of the index test. Patient flow concerns became apparent when there was an indeterminate or excessive time between the issuing of the index test and the criterion standard, when patients received different reference standards, or when it was difficult to discern if all patients were included in the analysis. Most of the studies where patient flow was an issue failed to note the length of time between the index test and reference standard, or did not make clear whether all patients were included in the analysis. Often, there was an inability to reconstruct the 2x2 tables accurately from the data reported in the original article. The concern for bias in the reference standard was most often due to a failure to use a double blind design (issuer of the criterion standard was not blinded to index test result) or the failure to use the criterion standard to confirm diagnosis. 
The obvious gain in popularity of diagnostic ultrasound (n=12 studies in this review) had the deleterious effect of increasing concern for bias since ultrasound is not the criterion standard for shoulder diagnosis.56–58 Lastly, the concern for applicability as it relates to the index test arose because the authors failed to describe the index test.
Publication bias was not evident in graphical or statistical analysis. However, publication bias cannot be completely ruled out since these tests have decreased statistical power when analysing fewer than 10 studies.59 No significant negative correlations were found to indicate the influence of threshold effects. Table 3 presents the results of meta-analysis for the individual ShPE tests by diagnosis, number of studies and sample size for the analyses.
The Neer, Hawkins-Kennedy and painful arc tests for subacromial impingement were summarised for their diagnostic properties and associations. The strongest summary sensitivity was for the Hawkins-Kennedy test (0.80; 0.72, 0.86). However, this value sits only at the sensitivity threshold (80%) for assisting in ruling out subacromial impingement and, because of poor specificity, the LR− shows this test to have little effect on post-test probability when negative. In fact, none of the three diagnostic tests demonstrated likelihood ratios that would be likely to result in important changes in post-test probability. The pooled DORs indicate that none of these three tests is likely to discriminate well between those with subacromial impingement and those without. Figure 3 (Neer), figure 4 (Hawkins-Kennedy) and figure 5 (painful arc) illustrate the included studies with both the 95% confidence and prediction regions indicating the probable wide variability of the true sensitivity and specificity in future studies.
Meta-regression was conducted for both the Neer and Hawkins-Kennedy tests in order to determine if the summary DOR was biased as a result of differing reference standards. For the Neer test, there was a substantially greater DOR among the studies which used the gold standard of surgery for index test confirmation (4.85 (95% CI 3.46 to 6.79)) than other reference standards (1.28 (95% CI 0.31 to 5.19)). The ratio of DORs was strong (3.79 (95% CI 0.87 to 16.14)) and the stratified estimates were statistically significant at the prespecified p<0.10 threshold (p=0.07). Similarly, the DOR for the Hawkins-Kennedy test was stronger among those studies with the gold standard of surgery (6.41 (95% CI 3.33 to 12.35)) than for studies using other than the gold standard (3.14 (95% CI 1.37 to 7.22)). However, the stratified estimates were not significantly (p=0.18) different from one another.
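The ratio of DORs for the Neer test can be checked approximately from the two stratified estimates. The sketch below is not the meta-regression used in this review: it assumes the two strata are independent and that each log DOR is approximately normal, so that a standard error can be recovered from each reported 95% CI. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # Normal quantile for a two-sided 95% interval.


def log_se_from_ci(lower, upper):
    """Standard error of a log odds ratio recovered from its 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


def ratio_of_dors(dor_a, ci_a, dor_b, ci_b):
    """Ratio dor_a / dor_b with an approximate 95% CI (independent strata assumed)."""
    log_ratio = math.log(dor_a) - math.log(dor_b)
    se = math.sqrt(log_se_from_ci(*ci_a) ** 2 + log_se_from_ci(*ci_b) ** 2)
    return (
        math.exp(log_ratio),
        math.exp(log_ratio - Z_95 * se),
        math.exp(log_ratio + Z_95 * se),
    )


# Neer test: surgery as reference standard versus other reference standards.
neer_ratio, neer_lower, neer_upper = ratio_of_dors(
    4.85, (3.46, 6.79), 1.28, (0.31, 5.19)
)
print(f"ratio of DORs = {neer_ratio:.2f} ({neer_lower:.2f} to {neer_upper:.2f})")
```

Under these assumptions the point estimate reproduces the reported 3.79 and the interval lies close to the reported 0.87 to 16.14; small differences are expected because the published figures come from a fitted model and rounded inputs.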
None of the eight ShPE tests for which meta-analysis was possible (table 3) demonstrated sensitivity values that would likely rule out a SLAP lesion with a negative test. Yergason's test had the strongest summary specificity (95.3; 90.6, 98.1) but, again, the sensitivity was so poor that the LR+ demonstrates little ability of this test to rule in a SLAP lesion when positive. All eight diagnostic tests for a SLAP lesion had likelihood ratios and DORs that were weak and their CIs contained the null value (table 3).
The active compression test analysis found the O’Brien et al60 study to have a large Cook's D and standardised residuals influencing the summary estimates. Cook's D is a measure of the influence that a particular study may have on the model parameters and can be used to check for particularly influential studies. Sensitivity analysis, with removal of the O’Brien et al60 study, resulted in substantial attenuation of the DOR from 3.14 (95% CI 0.42 to 23.40) to 1.19 (95% CI 0.76 to 1.86). As such, this study was not included in summary estimates for the active compression test. Figure 6 illustrates the HSROC curves of the active compression test both with and without the outlier study.60
Statistical pooling was done individually for three tests for the diagnosis of anterior instability: the apprehension, relocation and surprise tests. The surprise test demonstrated the strongest sensitivity (81.8; 69.1, 90.9) and, accordingly, a negative likelihood ratio (0.25; 0.08, 0.78) that would likely rule out anterior instability when negative. All three tests demonstrated the ability to rule in anterior instability due to strong specificity. The apprehension test had the strongest positive likelihood ratio (17.2; 10.02, 29.55) and was the only one of the three in which the CI did not contain the null value. The apprehension test had the strongest DOR (53.6; 24.3, 118.3), indicating some evidence for this test's overall diagnostic discriminative performance.
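To show what these likelihood ratios mean for post-test probability, the sketch below converts a pre-test probability to odds, multiplies by the likelihood ratio, and converts back. The pre-test probability of 0.30 is purely hypothetical and is not taken from this review; only the LR+ of 17.2 (apprehension test) and the LR− of 0.25 (surprise test) come from the pooled results above.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Post-test probability via odds: post-test odds = pre-test odds x LR."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("Pre-test probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("Likelihood ratio must be non-negative")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


PRE_TEST = 0.30  # Hypothetical prevalence of anterior instability in a clinic.

# Positive apprehension test (pooled LR+ = 17.2).
after_positive = post_test_probability(PRE_TEST, 17.2)
# Negative surprise test (pooled LR- = 0.25).
after_negative = post_test_probability(PRE_TEST, 0.25)
print(f"after positive apprehension test: {after_positive:.2f}")
print(f"after negative surprise test: {after_negative:.2f}")
```

Under this assumed pre-test probability, a positive apprehension test moves the probability to roughly 0.88 and a negative surprise test to roughly 0.10; a different starting prevalence would give different shifts.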
Significant heterogeneity was found in the properties and associations for the relocation test. Subgroup analysis, accomplished by removing the study by Lo et al61 based upon the non-criterion reference standard used, did not improve the overall heterogeneity.
In pooled analyses, the crank test for labral tear demonstrated limited ability to rule in a labral tear with an LR+ of 2.4 and specificity of 76%, indicating a likely small change in post-test probability.
In pooled analyses, the Hawkins-Kennedy test for tendinopathy demonstrated no evidence for the ability to rule in or out, change post-test probability or have overall diagnostic discriminative performance.
What this study adds
This is the first meta-analysis to study ShPE tests and use the QUADAS-2 document to assist in the qualitative review and the HSROC/bivariate models for meta-analysis
There is less optimism that the biceps load II is diagnostic for SLAP lesions
The belly-off and modified belly press tests may be helpful in diagnosing subscapularis tendinopathy
The bony apprehension test may help diagnose bony instability
The olecranon-manubrium percussion test may be useful in a traumatic injury for bony abnormality requiring referral for x-ray
The passive compression test may be helpful in diagnosing a SLAP lesion
The modified dynamic labral shear test may be diagnostic of labral tears
The lateral Jobe test may be useful for diagnosing a rotator cuff tear
The shrug sign appears to be a sensitive test for stiffness-related disorders (osteoarthritis and adhesive capsulitis) as well as rotator cuff tendinopathy
The passive distraction test may be able to rule in a SLAP tear if positive
To our knowledge, this is the first study on diagnostic accuracy that has incorporated HSROC/bivariate models as the criterion standard during performance of a meta-analysis of ShPE tests. We feel that the use of this criterion standard promotes increased attention on and isolation of studies that demonstrate results dramatically outside others of similar context. Of particular interest is the dramatic change in both the 95% CI and 95% prediction region of the active compression test for a SLAP lesion when the original study60 is eliminated (figure 6). Further, this study60 is a good example of the effect of bias on estimates of diagnostic accuracy given that the publication possesses examples of at least seven kinds of bias. Most notable of these biases is partial verification bias, which has been shown to overestimate the diagnostic accuracy of a test.62
For each diagnostic category, the overall results of this systematic review and meta-analysis indicate that a few tests are helpful to confirm or screen for a given diagnosis. There is a preponderance of evidence about individual physical examination tests that could not be combined for the meta-analysis. For those tests, we have used the diagnostic values and risk of bias from the Q2 to determine which tests are recommended for use as a screen or those recommended as a confirmatory test using the benchmarks of specificity >80%, sensitivity >80%, LR+ ≥ 5.0 and LR− ≤0.20. The list is short, and confidence in the diagnostic accuracy estimates is tenuous.
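The benchmarks named above can be written as a simple check. This is a sketch only: the text does not state how the four benchmarks were combined with the Q2 risk of bias when forming recommendations, so the hypothetical helper `meets_benchmarks` reports each benchmark separately and makes no overall recommendation.

```python
def meets_benchmarks(sensitivity=None, specificity=None, lr_pos=None, lr_neg=None):
    """Report which of the stated benchmarks a test meets.

    Proportions are on a 0-1 scale. Only the measures supplied are checked.
    Benchmarks: sensitivity > 0.80, specificity > 0.80, LR+ >= 5.0, LR- <= 0.20.
    """
    checks = {}
    if sensitivity is not None:
        checks["sensitivity"] = sensitivity > 0.80
    if specificity is not None:
        checks["specificity"] = specificity > 0.80
    if lr_pos is not None:
        checks["lr_pos"] = lr_pos >= 5.0
    if lr_neg is not None:
        checks["lr_neg"] = lr_neg <= 0.20
    return checks


# Pooled surprise test values reported in this review for anterior instability.
surprise = meets_benchmarks(sensitivity=0.818, lr_neg=0.25)
# Pooled apprehension test LR+ reported in this review.
apprehension = meets_benchmarks(lr_pos=17.2)
print(surprise)
print(apprehension)
```

Reporting the checks individually makes visible cases such as the surprise test, whose pooled sensitivity clears the 80% benchmark while its pooled LR− of 0.25 falls just short of the 0.20 benchmark.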
From the meta-analysis portion of this review, the Hawkins-Kennedy initially appears to be of value in ruling out subacromial impingement when negative. However, the LR− is poor and, further, a strong argument can be made that subacromial impingement is not a valuable diagnosis but rather a cluster of diagnoses.63 The diagnosis of subacromial impingement encompasses such a broad range of pathologies, from bursitis to a complete rotator cuff tear,64 that a label of subacromial impingement may not help guide treatment.65 Yergason's test, used for detection of a SLAP lesion, has high (95%) pooled specificity. However, the sensitivity is so low that a positive test modifies the post-test probability of detecting a SLAP lesion only a small amount. In a similar perspective to subacromial impingement, authors have argued that test results for SLAP may be affected by the percentage of different forms of Snyder classifications present within the sample.50
Therefore, the only tests that appear to have good clinical utility are the apprehension, relocation, and surprise tests to diagnose anterior instability and these tests are primarily a continuum of the apprehension test. When a patient registers apprehension with this test, the relocation manoeuvre should then decrease apprehension, whereupon, the relocation force is removed causing a surprised reaction (surprise test) by the patient as the apprehension reappears.
While the results of the meta-analysis were, perhaps, not inspiring to the clinician searching for diagnostic answers, there are some individual tests that warrant further investigation. The posterior apprehension test for posterior instability demonstrated a high specificity and positive likelihood ratio but these values came from a high bias study.39 Another highly specific test, but from a low bias study,45 is the passive distraction test for a SLAP lesion. This test may rule in a SLAP lesion when positive. Sensitive tests of note are the shoulder shrug sign, for stiffness-related disorders (osteoarthritis and adhesive capsulitis) as well as rotator cuff tendinopathy, and the Whipple test for massive rotator cuff tears. However, the diagnostic properties of the Whipple test come from a high bias study.39 Other tests of possible value from high bias studies included the AC resisted extension,39 the resisted belly press,38 and coracoid palpation.48 There are six additional tests with high sensitivities, specificities, or both but caution is urged since all of these tests have been studied only once and more than one ShPE test (ie, active compression, biceps load II) has been introduced with great diagnostic statistics only to have further research fail to replicate the results of the original authors. The belly-off and modified belly press tests for subscapularis tendinopathy, bony apprehension test for bony instability, olecranon-manubrium percussion test for bony abnormality, passive compression for a SLAP lesion, and the lateral Jobe test for rotator cuff tear give reason for optimism since they demonstrated both high sensitivities and specificities reported in low bias studies. Finally, one additional test was studied in two separate papers.35,50 The dynamic labral shear may be sensitive for SLAP lesions but, when modified, be diagnostic of labral tears generally.
Looking back to our initial publication and combining that data with the current review certainly expands the clinician's diagnostic arsenal. The external rotation lag sign continues to be recommended as it was in 20081 to confirm full-thickness rotator cuff tears of the infraspinatus. The hornblower's sign may be diagnostic of severe degeneration or absence of the teres minor muscle, and the active compression test may have value as a confirmatory test for AC joint pathology when positive due to its high specificity.
Despite some cause for optimism when looking at some of the individual studies and tests, the more prudent method may be to look at clusters or combinations of tests, since that more closely resembles the way in which most ShPE tests are used in the clinic. Table 4, while not all-inclusive, shows the best test combinations to date for detecting various pathologies.
Unfortunately, even many of these test clusters modify the post-test probability by a small to minimal amount. Of note in this group of clustered tests is the combination of age>39, painful arc, and self-report of popping and clicking32 and the combination of the apprehension and relocation tests,68 both of which produce a large post-test shift toward the diagnoses of supraspinatus tendinopathy, and anterior instability, respectively.
Any review is limited by the quality of studies contained therein. Many of the studies in this review had issues with the reference standard and subject flow and timing. There was clearly a rise in the use of diagnostic ultrasound as a criterion standard, and evidence to support its use is currently poor.56–58 Further, we limited our articles to those in the English language, which may make this review more prone to dissemination bias. However, publication bias was not evident in graphical or statistical analysis. Finally, this is the first meta-analysis on diagnostic accuracy of ShPE tests that was performed using the Q2 document. The original authors piloted the Q2 on five studies and found that reliability varied considerably.14 Our weighted κ (κ=0.31; 0.10, 0.52) was likewise only fair.
Based on data from our original review1 and this update, the use of any single ShPE test to make a pathognomonic diagnosis cannot be unequivocally endorsed due to continued quality issues in publications. Some ShPE tests are beginning to stand the tests of scrutiny and time but there are far more tests that need to be validated in more than one study. Combinations of ShPE tests provide better accuracy, but only marginally so. These findings seem to provide support for stressing a comprehensive clinical examination including history and physical examination. However, there is a great need for large, prospective, well-designed studies that examine the diagnostic accuracy of the many aspects of the clinical examination and what combinations of these aspects are useful in differentially diagnosing pathologies of the shoulder.
The authors would like to acknowledge Ms Connie Schardt for her invaluable assistance in the search process and the authors from the original paper whose initial work was foundational: S Campbell, A Morin, M Tamaddoni, C T Moorman III.