Objective: To compile and critique research on the diagnostic accuracy of individual orthopaedic physical examination tests in a manner that would allow clinicians to judge whether these tests are valuable to their practice.
Methods: A computer-assisted literature search of MEDLINE, CINAHL, and SPORTDiscus databases (1966 to October 2006) using keywords related to diagnostic accuracy of physical examination tests of the shoulder. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool was used to critique the quality of each paper. Meta-analysis through meta-regression of the diagnostic odds ratio (DOR) was performed on the Neer test for impingement, the Hawkins−Kennedy test for impingement, and the Speed test for superior labral pathology.
Results: Forty-five studies were critiqued, with only half demonstrating acceptably high quality and only two having adequate sample size. For impingement, the meta-analysis revealed that the pooled sensitivity and specificity for the Neer test were 79% and 53%, respectively, and for the Hawkins−Kennedy test were 79% and 59%, respectively. For superior labral anterior to posterior (SLAP) tears, the summary sensitivity and specificity of the Speed test were 32% and 61%, respectively. Regarding orthopaedic special tests (OSTs) where meta-analysis was not possible either due to lack of sufficient studies or heterogeneity between studies, the list that demonstrates both high sensitivity and high specificity is short: hornblower's sign and the external rotation lag sign for tears of the rotator cuff, biceps load II for SLAP lesions, and apprehension, relocation and anterior release for anterior instability. Even these tests have been under-studied or are from lower quality studies or both. No tests for impingement or acromioclavicular (AC) joint pathology demonstrated significant diagnostic accuracy.
Conclusion: Based on pooled data, the diagnostic accuracy of the Neer test for impingement, the Hawkins−Kennedy test for impingement and the Speed test for labral pathology is limited. There is a great need for large, prospective, well-designed studies that examine the diagnostic accuracy of the numerous physical examination tests of the shoulder. Currently, almost without exception, there is a lack of clarity with regard to whether common OSTs used in clinical examination are useful in differentially diagnosing pathologies of the shoulder.
History and physical examination of patients with shoulder pain have traditionally been a cornerstone of the diagnostic process. Diagnosis based on physical findings is important to determine a treatment path and because the ability to correctly diagnose the source of shoulder pain can save the patient from further diagnostic tests that are more costly, painful or inconvenient. Physical examination tests or orthopaedic special tests (OSTs) have historically been an integral part of this process. However, despite the fact that studies on the diagnostic accuracy of OSTs in the shoulder have been published at an accelerated rate, the quality of these publications has been reported to be somewhat suspect.1–3 Further, many studies4–8 have questioned both the accuracy and reliability of the clinical examination, especially as it relates to a pathoanatomical model. Despite the accelerated rate of publication of diagnostic accuracy studies, we are aware of only two previous systematic reviews that address multiple pathologies of the shoulder.9–11 The two-article review by Tennent et al10 11 gave thorough descriptions of each test but failed to examine the diagnostic accuracy of the tests and made no comment on the quality of literature supporting the use of individual OSTs. The review by Dinnes et al9 focused primarily on diagnostic imaging but did include 10 articles related to the use of OSTs in the clinical examination process. Our current systematic review includes over four times as many articles as the study by Dinnes et al.9
The purpose of this article is to subject the literature on OSTs of the shoulder to a systematic review and meta-analysis to provide clinicians with enough information to determine whether these OSTs are appropriate for clinical practice.
To make the retrieval of articles on diagnostic accuracy as comprehensive as possible, a generic search strategy as reported by Haynes et al12 was employed using the MEDLINE, CINAHL and SPORTDiscus databases (1966 to October 2006) through OVID. This generic strategy to find studies on diagnostic accuracy was then combined with a subject-specific strategy addressing the shoulder and pathologies of the shoulder, and physical examination (table 1).13 In addition to the database searches, personal files were hand searched by one of the authors (EH) for publications, posters or abstracts. The reference lists in review articles were cross-checked and the individual name of each special test was queried using MEDLINE and PubMed.
All abstracts for 686 articles from MEDLINE, 182 articles from CINAHL, 54 articles from SPORTDiscus and 7 articles from the hand search were reviewed by two of the authors (EH and SC) independently. Agreement between the two authors regarding which articles to read in full was determined by consensus. Articles were eligible for inclusion if the criterion standard was surgery, magnetic resonance imaging (MRI) or injection (acromioclavicular joint only), if at least one physical examination test/special test was studied, if one of the paired statistics of sensitivity and specificity was reported or could be discerned for an individual test, and if the article was in the English language. Studies were excluded if the special test was performed under anaesthesia or in cadavers, if a group of special tests was assigned the status of “composite physical examination” or if the article was a review.
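As an illustration of how the paired statistics can be discerned when a study reports only raw counts, the following minimal Python sketch computes sensitivity and specificity from a 2×2 table of index test result against criterion standard. The class name `TwoByTwo` and the counts are hypothetical and are not taken from any of the reviewed studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of index test result versus criterion standard."""
    tp: int  # test positive, pathology present
    fp: int  # test positive, pathology absent
    fn: int  # test negative, pathology present
    tn: int  # test negative, pathology absent

    def sensitivity(self) -> float:
        # Proportion of patients with the pathology who test positive.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of patients without the pathology who test negative.
        return self.tn / (self.tn + self.fp)


# Hypothetical counts, for illustration only.
example = TwoByTwo(tp=40, fp=20, fn=10, tn=30)
print(f"sensitivity = {example.sensitivity():.2f}")
print(f"specificity = {example.specificity():.2f}")
```

A study reporting only the true positives and false negatives would allow sensitivity, but not specificity, to be calculated.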
Each selected study was independently assessed by all reviewers. The reviewers were familiar with the literature and, thus, were not blinded to the authors, the date of publication or the journals in which they were published. If there was disagreement as to the final selection, a third author made the conclusive decision. A summary of our search procedure is presented in fig 1 and articles pulled for review based on a consensus of the authors are presented in tables 3–7.
After all relevant articles were obtained, the quality of the articles was assessed and data were extracted from each article. The quality of a study was determined by examining that study’s internal and external validity. Internal and external validity were evaluated (unmasked) by the primary author (EH) using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool developed by Whiting et al (table 2).14 QUADAS involves individual scoring of 14 components. Each of the 14 items is scored as “yes”, “no”, or “unclear”. Individual procedures for scoring each of the 14 items, including operational standards for each question, have been published, although a cumulative methodological score is not advocated.15 Past studies16–18 have used a score of 7 of 14 or greater “yeses” to indicate a high-quality diagnostic accuracy study whereas scores below 7 were indicative of low quality. Based on the experience of two of the authors (EH and CC) in use of the QUADAS tool in their textbook,19 we established a higher quality score as 10 of 14 or greater unequivocal “yeses”, whereas below 10 was associated with a lower quality study.
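The quality threshold described above can be expressed as a short sketch. The function names and the example responses are hypothetical; only the 14 items, the three response options and the thresholds of 10 and 7 “yeses” come from the text.

```python
from typing import Sequence

QUADAS_ITEMS = 14
HIGHER_QUALITY_THRESHOLD = 10  # threshold adopted in this review
PRIOR_THRESHOLD = 7            # threshold used in the past studies cited
ALLOWED = {"yes", "no", "unclear"}


def count_yes(responses: Sequence[str]) -> int:
    """Count unequivocal 'yes' responses across the 14 QUADAS items."""
    if len(responses) != QUADAS_ITEMS:
        raise ValueError(f"expected {QUADAS_ITEMS} responses, got {len(responses)}")
    if any(r not in ALLOWED for r in responses):
        raise ValueError("each response must be 'yes', 'no' or 'unclear'")
    return sum(1 for r in responses if r == "yes")


def is_higher_quality(responses: Sequence[str],
                      threshold: int = HIGHER_QUALITY_THRESHOLD) -> bool:
    """Return True when the number of 'yes' responses meets the threshold."""
    return count_yes(responses) >= threshold


# Hypothetical study scoring 8 of 14, for illustration only.
responses = ["yes"] * 8 + ["no"] * 3 + ["unclear"] * 3
print(count_yes(responses), is_higher_quality(responses))
```

The same hypothetical study would be labelled high quality under the 7 of 14 threshold and lower quality under the 10 of 14 threshold, which shows how much the classification depends on the cut-off chosen.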
Meta-analysis was performed using dr-ROC software version 2.00 (dr2 Consulting, Glenside, PA, USA). Only studies presenting data representing both sensitivity and specificity were selected for pooling of results. Data were eligible for pooling in three special tests: Neer for impingement, Hawkins−Kennedy for impingement and Speed for labral pathology. Raw data from each individual study for these three tests were placed in a 2×2 table. The dr-ROC software was used to pool sensitivities and specificities using the inverse-variance method, which gives greater weight to individual studies with more subjects. Both the fixed effects and random effects models were used to pool information with similar outcome. The results of the fixed effects analysis are presented here. The diagnostic odds ratio (DOR) and the area under the curve (AUC) of the summary receiver operating characteristic (SROC) curve were both calculated as summary statistics indicating the overall diagnostic power of each of the three tests. Cochran’s Q was used to test for heterogeneity and the I² statistic20 was used to quantify the percentage of variation across the studies that was associated with heterogeneity.
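To make the pooling terminology concrete, the following is a minimal sketch of a fixed effects, inverse-variance pooling of the log DOR, together with Cochran's Q and I². It is not a reproduction of the dr-ROC software, whose internal calculations are not described here. The 0.5 continuity correction, the function names and the example tables are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Table = Tuple[int, int, int, int]  # (tp, fp, fn, tn)


def log_dor_and_variance(table: Table) -> Tuple[float, float]:
    """Log diagnostic odds ratio and its approximate variance for one study."""
    # Assumption: add 0.5 to every cell so that zero cells do not break the log.
    tp, fp, fn, tn = (x + 0.5 for x in table)
    log_dor = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return log_dor, variance


def pool_fixed_effect(tables: List[Table]) -> Dict[str, float]:
    """Inverse-variance fixed effects pooling with Cochran's Q and I-squared."""
    estimates = [log_dor_and_variance(t) for t in tables]
    weights = [1 / v for _, v in estimates]  # larger studies get larger weights
    total_weight = sum(weights)
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1 / total_weight)
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, estimates))
    df = len(tables) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "dor": math.exp(pooled),
        "ci_low": math.exp(pooled - 1.96 * se),
        "ci_high": math.exp(pooled + 1.96 * se),
        "q": q,
        "i2_percent": i2,
    }


# Hypothetical 2x2 tables from four studies, for illustration only.
tables = [(40, 20, 10, 30), (55, 35, 15, 45), (30, 25, 12, 28), (70, 40, 20, 60)]
for key, value in pool_fixed_effect(tables).items():
    print(f"{key}: {value:.2f}")
```

In this sketch a pooled DOR whose 95% confidence interval includes 1 would indicate that the test does not discriminate between patients with and without the pathology.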
Forty-five articles met our inclusion criteria including the reporting of one of the paired statistics of sensitivity and specificity or the reporting of raw data that allowed the calculation of one of these paired statistics. Meta-analysis could be performed on three tests that had the requisite statistical homogeneity to draw summary conclusions from the meta-analysis: Neer for impingement, Hawkins−Kennedy for impingement and Speed for a SLAP lesion. Unfortunately, even for these well-known tests, there were only four articles for each test that met the inclusion criteria. The results of the meta-analyses are presented in the following subsections based on pathology.
Impingement
Four21–24 of the six articles that addressed the diagnostic accuracy of OSTs for impingement were of lower design/reporting quality by our definition (table 3). Bias in these studies was mostly related to lack of masking on the part of the physician who performed both the OST and the criterion standard to confirm diagnosis. In addition, in two of the studies addressing impingement,21 24 the criterion standard of surgery was not used. Because of consistent bias in articles investigating OSTs designed to detect impingement, we guardedly report that either the supraspinatus/empty can test or the infraspinatus test may serve as a confirmatory test for impingement because of its higher specificity.
In the meta-analysis, only the Neer test and the Hawkins−Kennedy test had homogeneous data from four articles each. The pooled sensitivity and specificity for the Neer test were 0.79 (95% confidence interval (CI) 0.75 to 0.82) and 0.53 (95% CI 0.48 to 0.58), respectively, and for the Hawkins−Kennedy test were 0.79 (95% CI 0.75 to 0.82) and 0.59 (95% CI 0.53 to 0.64), respectively. However, the pooled diagnostic odds ratio (DOR) for both tests is around 1 and the 95% CI crosses 1, indicating that neither test has diagnostic utility for impingement (figs 2 and 3). Figures 4 and 5 show the fitted summary receiver operating characteristic (ROC) curves for the Neer and Hawkins−Kennedy tests, respectively. The area under the curve (AUC) for the Neer test is 0.74 (95% CI 0.70 to 0.78) and for the Hawkins−Kennedy test is 0.76 (95% CI 0.72 to 0.80), confirming the limited usefulness of these tests in the diagnosis of impingement.
Rotator cuff integrity
Of the 15 studies that examined OSTs to assess rotator cuff integrity, nine22 25 27–33 examined the ability of the tests to detect any rotator cuff tear and five26 34–37 examined the ability to assess the tear of a specific rotator cuff muscle (table 4). One article38 reported the sensitivity only (not the specificity) of four OSTs to detect subscapularis lesions alone and also to detect any rotator cuff tear. Eight of the fifteen articles were of high design/reporting quality by our definition. Of the remaining seven articles, issues included a lack of stated inclusion/exclusion criteria, lack of masking of the physician to the results of the OST, and no reporting of intermediate or uninterpretable results. None of the 10 OSTs for rotator cuff pathology that were examined in more than one study proved consistently diagnostic. Two tests, the external rotation lag sign (ERLS) and the drop arm test, demonstrated value as specific tests for a tear of any rotator cuff muscle, and the supine impingement test may have value, when negative, in ruling out a rotator cuff tear. Further, two tests, the bear-hug and belly press tests, appear to be valuable as specific tests for ruling in a subscapularis muscle tendon tear when positive. The necessary bolus of information to perform a meta-analysis was not available since the studies either examined different OSTs or the studies that examined the same OST focused on the detection of different pathologies.
Glenoid labrum integrity/long head of biceps pathology
Fourteen of the 21 studies examining the diagnostic accuracy of OSTs for labral pathology focused on the detection specifically of superior labrum pathology (table 5). Of the remaining seven studies, three39–41 focused on detection of any labral pathology, two32 33 examined the integrity of the posterior labrum, two27 42 43 examined detection of either a long head of the biceps or superior labral anterior to posterior (SLAP) lesion and two 31 44 examined detection of biceps pathology alone. Twelve of the 21 articles were of high design/reporting quality by our definition. Similar to the rotator cuff article group, problems with design of the studies of the glenoid labrum included lack of stated inclusion/exclusion criteria, lack of masking of the physician to the results of the OST and no reporting of intermediate or uninterpretable results. Two OSTs for posterior labral tears, the Kim test and the Jerk test, and one for SLAP lesions, the biceps load II test, appear to be diagnostic but more studies are needed to investigate these tests.
In the meta-analysis, only the Speed test and only as an OST for a SLAP lesion had homogeneous data from four articles describing the diagnostic accuracy of this test. The summary sensitivity and specificity of the Speed test were 0.32 (95% CI 0.24 to 0.42) and 0.61 (95% CI 0.54 to 0.68), respectively. However, the pooled DOR for the Speed test is less than 1 and the 95% confidence interval (CI) crosses 1 indicating that the Speed test has no diagnostic utility for a SLAP lesion (fig 6). Figure 7 shows the fitted summary ROC curve for the Speed test. The AUC for the Speed test is 0.54 (95% CI 0.44 to 0.64) confirming the use of this test in the diagnosis of a SLAP lesion is no better than chance.
Instability
Five articles examined OSTs for instability with all articles specifically attempting to identify anterior shoulder instability (table 6). Three59–61 of the articles were of high design/reporting quality by our definition. There was not sufficient data to perform a meta-analysis in this sub-group of articles but the apprehension, relocation and anterior release tests appear diagnostic of anterior instability, especially when apprehension and not pain is used as the definition for a positive test.
Acromioclavicular (AC) joint pathology
Three articles50 64 65 examined the diagnostic accuracy of OSTs for AC joint pathology (table 7). Two64 65 of the three articles were of high design/reporting quality by our definition. The active compression test may be diagnostic of AC joint pathology but is troublesome in that as the quality of the study improves, the statistics monitoring diagnostic accuracy worsen. There was not sufficient power to perform a meta-analysis in this sub-group of articles.
We examined the quality of 45 articles reporting on the diagnostic accuracy of OSTs of the shoulder. If our definition (10/14 or greater on the QUADAS14 tool) of a higher quality article is accepted, then 22 of 45 articles were of higher quality with 10 of the higher quality articles being published since 2004.
Several studies have called into question the ability of clinicians to diagnose shoulder problems based on pathology.4–8 Coincidentally, OSTs are designed to do exactly that: to differentially diagnose pathologies of the shoulder. Therefore, it is imperative, before we abandon the idea that accurate diagnosis in the shoulder is possible, that we thoroughly examine the body of literature related to OSTs of the shoulder. The most powerful method to accomplish such a goal is meta-analysis. Meta-analysis of three tests (Neer, Hawkins−Kennedy and Speed) examining two pathologies (impingement and SLAP lesion) demonstrated that none of these tests is diagnostic for their stated pathology.
In addition to the meta-analyses, we elected to perform a comprehensive systematic review using the QUADAS14 document to assist with the quality assessment of 44 articles that reported on the diagnostic accuracy of almost 50 OSTs. For ease of analysis and for the convenience of the reader, we divided these OSTs into five categories based on pathology: impingement, rotator cuff pathology, labral/biceps pathology, instability and AC joint pathology. Impingement, the final common pathway for many pathologies of the shoulder, was subdivided in most of the studies into stages I−III based on Neer’s original classification.66 Stage I was defined as subacromial bursitis or tendonitis. Stage II was defined as a partial rotator cuff tear and Stage III was defined as a full-thickness or complete tear of the rotator cuff. Each successive stage is considered a worsening progression along a continuum of pathology. We elected to report the values from articles that specifically reported diagnostic values for Stage III impingement in the pathological category of “rotator cuff pathology” but reported all other individual stages and combined stage data in the “impingement” category. When examining likelihood ratios of the OSTs that attempt to detect impingement, there is one non-subacromial impingement test (internal rotation resistance strength test) and no subacromial impingement tests that improve the post-test probability of detecting subacromial impingement by a moderate or large amount.67 The internal rotation resistance strength test was examined in only one article23 judged to be of lower quality (QUADAS 8/14) so the reported values should be viewed with caution. 
As for the subacromial impingement tests, value can sometimes be found in an OST with either high sensitivity or high specificity.68 69 OSTs with high sensitivity are valuable as a screen where a negative test can rule out a pathology while OSTs with a high specificity can be used as a confirmatory test where a positive finding rules in the pathology.69 When viewed in this context as either a screen or a confirmatory test, no impingement test seems to serve as a screen and either the supraspinatus/empty can or infraspinatus tests may serve as confirmatory tests for impingement. We urge caution with this conclusion since there is only one study that examined the diagnostic accuracy of each of the supraspinatus/empty can and infraspinatus tests.
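The likelihood ratio reasoning used in this discussion can be sketched as follows. The example uses the pooled sensitivity and specificity reported above for the Neer test (0.79 and 0.53); the pre-test probability of 0.50 and the function names are hypothetical and chosen only to show the arithmetic.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Pooled Neer test values from the meta-analysis reported in this review.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.79, specificity=0.53)

# Hypothetical pre-test probability, for illustration only.
pre_test = 0.50
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
print(f"post-test probability after a positive test: "
      f"{post_test_probability(pre_test, lr_pos):.2f}")
print(f"post-test probability after a negative test: "
      f"{post_test_probability(pre_test, lr_neg):.2f}")
```

A likelihood ratio of 1 leaves the pre-test probability unchanged, so the further a test's likelihood ratios are from 1, the more a positive or negative result shifts the probability of the pathology.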
As previously mentioned, studies that reported on the diagnostic accuracy of Stage III impingement were grouped with studies reporting on rotator cuff pathology. One study4 estimated that rotator cuff lesions account for 70% of painful shoulder episodes. Of the nine OSTs for rotator cuff pathology that were examined in more than one study, none consistently exhibited likelihood ratios that would modify post-test probability of detecting a tear of the rotator cuff by a moderate or large amount.67 However, in one study37 with small sample size and numerous design faults, the hornblower's sign was diagnostic of severe degeneration or absence of the teres minor muscle. In this same study,37 the external rotation lag sign (ERLS) was found to be diagnostic of an infraspinatus muscle tear. A second study28 demonstrated value in the ERLS as a specific test for any rotator cuff tear. Further, two tests, the bear-hug and belly press tests, were shown in one well-conducted study34 to be valuable as specific tests for ruling in a subscapularis muscle tendon tear when positive, and the supine impingement test32 was found to be a sensitive test to screen for any rotator cuff tear. Unfortunately, despite the high quality of the Barth et al34 study and the Litaker et al32 study, both were underpowered according to Flahault et al.70
The glenoid labrum works in conjunction with the biceps and glenohumeral ligaments to provide shoulder stability.71 As with overlapping, pathology-based diagnoses such as impingement and Stage III rotator cuff tear, we made the decision to separate studies that examined the detection of instability from those that examined pathology of the glenoid labrum. Of all possible labral pathologies, the superior labral anterior-to-posterior or SLAP lesion was the most researched, being the focus of 12 of the 21 studies analysed in the category of labral and biceps pathology. Eleven OSTs were examined in more than one study and of those 11, the active compression, anterior slide, crank and compression−rotation tests had likelihood ratios indicating a moderate or large effect on the post-test probability of diagnosing a SLAP lesion. Unfortunately, each test failed to perform consistently well when examined in more stringent studies. Of the remaining single-study OSTs designed to diagnose a SLAP lesion, the biceps load I54 and the biceps load II55 modified the post-test probability by a large amount67 and appear to be useful in diagnosing a SLAP lesion. However, both tests have been examined in only one study each with small sample size and the biceps load I test54 was performed only on patients who had dislocated their shoulder. The biceps load II test55 may be the most promising but other OSTs like the anterior slide and active compression tests have performed far worse when used by other than the originator of the test. Beyond superior labral pathology, two studies57 58 examined OSTs for posterior labral tears. The Kim test,57 the posterior impingement sign58 and the Jerk test all modified post-test probability by a moderate to large amount.67 Again, however, we urge caution since each OST has been studied only once and in two of the three tests, the originator of the OST was also the author of the paper. 
As for OSTs that were studied as diagnostic for non-specific labral tears, the Crank test was again promising in one study39 but not in two others.40 52
Instability may come from a labral tear, trauma or a connective tissue disease like Ehlers−Danlos syndrome. Instability can be multidirectional or unidirectional. Unidirectional instability is most often in the anterior direction.72 Not coincidentally, all of the instability studies in our review attempted to assess the diagnostic accuracy of OSTs for anterior instability. Three of the OSTs, apprehension, relocation and anterior release, could each be viewed as merely a progression of the preceding test. With both the apprehension test and the relocation test, the use of “apprehension” as a positive test for anterior instability improves both the sensitivity and specificity over the use of “pain” as a positive sign. With the use of “apprehension” as a positive test for anterior instability, both tests modify the post-test probability by a moderate to large amount.67 The anterior release test appears to be a strong diagnostic test regardless of whether “pain” or “apprehension” is used as the definition of a positive test. The remaining OSTs designed to detect anterior instability with an anterior supraspinatus tear, designated a superior labral anterior cuff (SLAC) lesion, all appear to be sensitive and to have value as screening tests when negative, but all of the data come from one underpowered study and no specificity values were reported.
Finally, AC joint pathology is a common source73 of shoulder pain and can be a contributor to outlet or subacromial impingement or can be an entity in itself, often confounding shoulder diagnosis. None of the OSTs appear valuable as a diagnostic test based on the likelihood ratios. However, in three studies,50 64 65 the active compression test was shown to be a specific test that would rule in the AC joint as a source of shoulder pain if positive. Pain with palpation of the AC joint may serve as a screening test for AC joint pathology when negative, but surprisingly, only one study65 with a small sample size exists to confirm this clinically common use of palpation.
After an extensive qualitative review and meta-analysis of OSTs of the shoulder, there are very few that appear to be diagnostically discriminatory and, therefore, useful in the clinic. Either the supraspinatus/empty can or infraspinatus test may serve as a confirmatory test for impingement. For rotator cuff tears, the hornblower's sign may be diagnostic of severe degeneration or absence of the teres minor muscle, the external rotation lag sign (ERLS) may be diagnostic of an infraspinatus muscle tear, and the bear-hug and belly press tests may be valuable for ruling in a subscapularis muscle tear. Further, two tests may have value as tests for any rotator cuff tear: the ERLS as a specific confirmatory test and the supine impingement test as a screening test. Of all the pathologies of the shoulder, glenoid labrum pathology and more specifically SLAP lesions have generated the most enthusiasm in researchers. Many of the OSTs have shown great promise in studies conducted by the originator of the test only to prove far less diagnostic in later studies. With caution, we say that the biceps load II test is diagnostic for SLAP lesions. With regard to anterior instability, the apprehension, relocation and anterior release tests all appear to be diagnostic, especially when apprehension is used as a “positive” test instead of pain. For AC joint pathology, pain with palpation may be valuable as a screen when negative due to high sensitivity and the active compression test may have value as a confirmatory test when positive due to its high specificity.
Overall, these recommendations should be viewed as a guide and not an absolute since only two studies25 64 in our entire review are adequately powered to detect an OST that has high sensitivity or specificity,70 one of which64 is a case-control design, which has been shown to overestimate diagnostic accuracy.74 75 We repeat the words of McAlister et al76 from 1999, “Clearly we need large methodologically robust studies on history and physical examination” (p1723).
What is already known on this topic
Orthopaedic special tests (OSTs) are used extensively in clinical practice to detect shoulder pathology.
OSTs are reported extensively upon in peer-reviewed articles and textbooks.
Varying levels of diagnostic accuracy have been reported for individual OSTs.
The literature examining the diagnostic accuracy of OSTs is generally of poor quality.
What this study adds
This is the most comprehensive systematic review with meta-analysis of the diagnostic value of individual orthopaedic special tests to date.
Meta-analysis for the Neer test of impingement, the Hawkins−Kennedy test of impingement and the Speed test for a SLAP lesion shows these tests to have no discriminatory ability for shoulder diagnosis.
Meta-analysis for other OSTs was not possible either because there is not enough diagnostic accuracy research about the test or because statistical heterogeneity between studies did not allow for summary results.
Recommendations are as follows:
The Hawkins−Kennedy test may serve as a screen and either the supraspinatus/empty can or infraspinatus test may serve as a confirmatory test for impingement.
The supine impingement test may be valuable, when negative, as a screen for any rotator cuff tear.
The external rotation lag sign (ERLS) may have value as a specific test for any rotator cuff tear.
The hornblower's sign may be diagnostic of severe degeneration or absence of the teres minor muscle.
The external rotation lag sign (ERLS) may be diagnostic of an infraspinatus muscle tear.
The bear-hug and belly press tests may be valuable as specific tests for ruling in a subscapularis muscle tear.
The biceps load II test appears diagnostic for SLAP lesions.
The apprehension, relocation and anterior release tests all appear to be diagnostic of anterior instability, especially when apprehension is used as a “positive” test instead of pain.
For AC joint pathology, pain with palpation may be valuable as a screen when negative due to high sensitivity and the active compression test may have value as a confirmatory test when positive due to its high specificity.
Competing interests: None.