Background Shoulder pain is common in the general population. To identify its aetiology, history taking, motion and muscle testing, and physical examination tests are usually performed.
Objective The aim of this systematic review was to summarise and evaluate intrarater and inter-rater reliability of physical examination tests in the diagnosis of shoulder pathologies.
Methods A comprehensive systematic literature search was conducted using MEDLINE, EMBASE, Allied and Complementary Medicine Database (AMED) and Physiotherapy Evidence Database (PEDro) through 20 March 2015. Methodological quality was assessed using the Quality Appraisal of Reliability Studies (QAREL) tool by 2 independent reviewers.
Results The search strategy revealed 3259 articles, of which 18 finally met the inclusion criteria. These studies evaluated the reliability of 62 tests and test variations used as specific physical examination tests for the diagnosis of shoulder pathologies. Methodological quality ranged from 2 to 7 positive criteria of the 11 items of the QAREL tool.
Conclusions This review identified a lack of high-quality studies evaluating inter-rater as well as intrarater reliability of specific physical examination tests for the diagnosis of shoulder pathologies. In addition, reliability measures differed between included studies, hindering proper cross-study comparisons.
Trial registration number PROSPERO CRD42014009018.
Shoulder pain in the general population is common, with a reported prevalence of 7–26%.1 Patients suffering from shoulder pain are often limited in performing activities of daily living and therefore seek help from healthcare professionals, resulting in substantial utilisation of healthcare resources.2 ,3 To identify the aetiology of shoulder pain, history taking, motion and muscle testing, and physical examination tests are usually performed.4 Physical examination tests aim to reproduce the patient’s symptoms (pain), in contrast to other physical examination tests and outcome tests performed by clinicians, such as range of motion and muscle tests, as reviewed by Roy and Esculier.5 Several systematic reviews have evaluated the validity of physical examination tests, concluding that most research is of insufficient methodological quality or that consistently solid measures for validity obtained from studies with higher methodological quality are lacking.6–11 Sciascia et al12 performed a survey among orthopaedic shoulder surgeons and identified that a wide variety of tests were used to evaluate patients with shoulder symptoms. Notably, a lack of evidence regarding diagnostic accuracy did not preclude use of the tests in clinical practice.12 However, both validity and reliability are of concern if physical examination tests are applied.13 ,14 Poor reliability has a negative influence on a test’s validity,15 thus a test will not be valid if it does not measure consistently.16 Insufficient reliability of tests (eg, related to training of examiners or variation in test execution between examiners) might be the reason for varying results regarding the validity of physical tests.17 To date, one systematic review, published in 2010, has evaluated the reliability of physical examination tests for the shoulder,18 concluding that there is no consistent evidence that any test has acceptable levels of reliability.
Within the past few years more research on the reliability of physical examination tests has been performed and recent studies have been published. Therefore the objective of this systematic review is to systematically summarise and critically appraise research on the reliability of physical examination tests used for the diagnosis of shoulder pathologies.
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were used.19 The PRISMA statement aims to improve the reporting of systematic reviews and meta-analyses. This systematic review was registered a priori within the International Prospective Register of Systematic Reviews (PROSPERO; CRD42014009018).
Studies assessing the intrarater and/or inter-rater reliability of specific physical examination tests for the diagnosis of shoulder pathologies applied as a single test or in combination with other tests were included if written in English or German. Studies on patients of every age and setting were considered eligible. Studies were excluded if they did not name or describe the physical tests or did not refer to a source that did so. Studies were excluded if the overall reliability of a group of tests was reported but individual tests were not specified/named or if the authors made use of generic terms such as physical examination to describe an unspecified combination of physical tests. In addition, studies were excluded if only asymptomatic patients were evaluated or if the physical examination test was performed under anaesthesia or immediately postoperatively. Animal studies, cadaveric studies and studies which used device-supported testing procedures (defined as devices which are deemed too expensive or time-consuming for daily clinical practice) were also excluded.
A comprehensive systematic literature search was performed in the following databases via the Ovid interface, accessed via the Saxon State and University Library Dresden (SLUB), from inception until 18 March 2014: MEDLINE from 1946, EMBASE from 1974, and the Allied and Complementary Medicine Database (AMED) from 1985. The search strategy included terms about diagnostic tests, the conditions of interest, structures at risk, and reliability (see online supplementary appendix). Additionally, the Physiotherapy Evidence Database (PEDro) was searched with a modified search strategy using the body part filter (upper arm, shoulder or shoulder girdle) in combination with the terms for reliability as used for the search in MEDLINE, EMBASE and AMED. Furthermore, reference lists of all eligible articles were screened for further relevant studies. A search update using the same search strategy and electronic databases was conducted on 20 March 2015 to identify recently published articles. The original search strategy was designed to identify studies on the reliability of specific physical examination tests evaluating specific structures (eg, rotator cuff tear) and general physical examination tests (eg, strength or range of motion testing for the shoulder as well as shoulder girdle). However, in this review only the results of physical examination tests for the diagnosis of shoulder pathologies are reported.
Study selection and data abstraction
Identified titles and abstracts were screened by two independent reviewers (TL and CK), according to the described inclusion criteria. Subsequently, full texts were checked independently for eligibility by the same two reviewers. Any disagreements were resolved by consensus and if needed by a third reviewer (JS). Before screening of titles and abstracts began, two pretests were performed, each on a subsample of 50 randomly selected titles and abstracts from all identified articles. Afterwards the two reviewers (TL and CK) discussed their procedure to avoid subsequent discrepancies and started screening titles and abstracts once almost perfect agreement (according to the classification system proposed by Landis and Koch20) was reached in the second pretest subsample (subsample 1: Cohen’s κ=0.22, percentage agreement=88.00%; subsample 2: Cohen’s κ=1.00, percentage agreement=100.00%). Data extraction was done by one reviewer (TL) and checked in duplicate by the other reviewer (CK). Standardised forms, created according to the Quality Appraisal of Reliability Studies (QAREL) checklist,21 were used for data extraction. Data on the objectives, patients, raters, physical examination tests, outcome variables and results were extracted. Authors of primary studies were contacted if additional data were needed. If the authors provided the requested information, the appropriate reliability measures were calculated if possible.
Quality assessment of all included studies was carried out independently by the two reviewers (TL and CK) using the QAREL checklist.21 QAREL is specifically designed for the quality assessment of reliability studies and is considered to be reliable for use.22 The checklist consists of 11 items evaluating seven methodological domains of reliability studies (spectrum of patients and of examiners, examiner blinding, time interval between repeated measures, test application and interpretation, order of examination and statistical analysis of the data). Items can be answered with ‘yes’, ‘no’ or ‘unclear’ (and in addition if necessary with ‘not applicable’). Fulfilled quality aspects of studies are indicated with a ‘yes’, whereas unfulfilled aspects are indicated with a ‘no’. If insufficient information is provided to properly judge the quality aspect of studies, this is indicated with an ‘unclear’. As recommended,22 criteria by which judgments were made for each item of QAREL were a priori defined and tested by the two reviewers (TL and CK) (see online supplementary appendix).
If both intrarater and inter-rater reliability of a physical examination test were evaluated in one single publication, the quality assessment using QAREL was performed separately for (a) the intrarater and (b) the inter-rater reliability to account for specific possibilities for bias.
Reliability measures are presented as reported by the authors of primary studies. Cohen's κ values <0.00 indicate poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial and 0.81–1.00 almost perfect agreement.20 Intraclass correlation coefficient (ICC) values <0.40 represent poor, values between 0.40 and 0.75 represent fair to good, and values above 0.75 represent excellent reliability.23
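The two classification schemes above can be written down as simple lookup functions. The following is a minimal sketch in Python, not part of the original analysis: the function names are hypothetical, and the handling of values that fall between the published bands (eg, κ=0.205) is an assumption, here assigned to the lower band.

```python
def classify_kappa(kappa: float) -> str:
    """Strength-of-agreement label for Cohen's kappa (Landis and Koch bands).

    Assumption: values lying between two published bands (e.g. 0.205)
    are assigned to the lower band.
    """
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def classify_icc(icc: float) -> str:
    """Reliability label for an ICC value (Fleiss bands as quoted in the text)."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.75:
        return "fair to good"
    return "excellent"


# Kappa values reported for the two screening pretests in this review.
for value in (0.22, 1.00):
    print(value, classify_kappa(value))
```

Applied to the pretest values reported in this review, the first subsample (κ=0.22) falls in the fair band and the second (κ=1.00) in the almost perfect band.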
The agreement among reviewers of title and abstract screening was measured with percentage agreement and Cohen’s κ statistic (95% CI), prevalence-adjusted bias-adjusted κ (PABAK), positive and negative percentage agreement as well as bias index and prevalence index. The agreement among reviewers of methodological quality using the QAREL tool was measured with percentage agreement and Cohen’s κ statistic (95% CI).
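For two raters and a dichotomous decision (eg, include/exclude), all of the agreement statistics listed above can be derived from a single 2×2 table. The sketch below is an illustration only, written in Python rather than the R code used for the review. The function and field names are hypothetical, the example counts are invented, and the confidence interval uses a simple large-sample approximation for the standard error of κ, which is an assumption since the text does not state how the intervals were obtained.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class AgreementStats:
    percentage_agreement: float
    kappa: float
    kappa_ci: tuple
    pabak: float
    positive_agreement: float
    negative_agreement: float
    prevalence_index: float
    bias_index: float


def agreement_stats(a: int, b: int, c: int, d: int) -> AgreementStats:
    """Agreement statistics for a 2x2 table of two raters.

    a: both raters 'yes'; b: rater 1 'yes', rater 2 'no';
    c: rater 1 'no', rater 2 'yes'; d: both raters 'no'.
    Degenerate tables (e.g. all counts in one cell) are not handled.
    """
    n = a + b + c + d
    po = (a + d) / n  # observed agreement
    # Agreement expected by chance, from the marginal totals.
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (po - pe) / (1 - pe)
    # Simple large-sample approximation of the standard error (assumption).
    se = sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return AgreementStats(
        percentage_agreement=100 * po,
        kappa=kappa,
        kappa_ci=(kappa - 1.96 * se, kappa + 1.96 * se),
        pabak=2 * po - 1,
        positive_agreement=2 * a / (2 * a + b + c),
        negative_agreement=2 * d / (2 * d + b + c),
        # Signed indices; some authors report absolute values instead.
        prevalence_index=(a - d) / n,
        bias_index=(b - c) / n,
    )


# Hypothetical screening table, not data from this review.
stats = agreement_stats(a=40, b=5, c=5, d=50)
print(stats)
```

Keeping the four cell counts as the only input means every statistic can be recomputed by a reader from a published contingency table.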
Meta-analysis of Cohen’s κ was performed according to the statistical framework proposed by Sun,24 if raters in studies considered eligible for meta-analysis were clearly blinded to other raters.
All statistical analyses were performed using R V.3.2.0 (The R Project for Statistical Computing, Vienna, Austria) and RStudio (RStudio, Boston, Massachusetts, USA).
Results of the literature search and study selection are shown in the PRISMA flow chart (figure 1). Agreement among reviewers regarding screening of titles and abstracts yielded a Cohen’s κ of 0.76 (CI 0.70 to 0.82) and of full-text articles 0.68 (CI 0.54 to 0.81); additional reliability statistics are presented in figure 1. After full texts were reviewed, 18 publications met our criteria. These 18 studies presented data on 62 different physical examination tests and test modifications. Characteristics of included studies are summarised in online supplementary table S1.
All included studies were prospective. Inter-rater reliability alone was assessed in 17 studies,25–41 and 1 study evaluated both inter-rater and intrarater reliability.42 In 16 of the 18 publications, primary care settings were used.25–27 ,29–36 ,38–42 One study was performed in a tertiary care setting37 and one in undefined care settings.28
Results of methodological assessment using QAREL for all included studies are summarised in table 1. The number of positive ratings ranged from 2 of 11 in one study29 to 7 of 11 in another,30 and the number of unclear ratings ranged from 2 of 11 in one study37 to 8 of 11 in another.29 Recruitment of raters was not specified in any of the included studies. In nine included studies, patients were recruited consecutively25 ,26 ,29–34 ,37 and in one study through convenience sampling.42 In three studies patients were referred35 ,36 ,38 and in five studies the recruitment protocol was unclear.27 ,28 ,39–41 Blinding of raters to the findings of other raters was unclear in 4 of the 18 inter-rater reliability studies.29 ,34 ,35 ,42 In the one intrarater reliability study, the blinding of raters to their own prior findings was judged as unclear owing to insufficient information.42 Blinding to further clinical information was stated in 3 of the 18 included studies.28 ,30 ,35
Percentage agreement among reviewers regarding the rating of methodological quality of included studies using QAREL ranged for the different QAREL items from 74% to 100%. The overall agreement between raters of the methodological assessment using QAREL yielded a Cohen’s κ of 0.86 (CI 0.81 to 0.92).
Physical examination tests
Physical examination tests for the diagnosis of shoulder pathologies were categorised as follows: acromioclavicular dysfunction tests, impingement tests, torn labrum/instability tests, and torn rotator cuff/impingement tests. Altogether 62 different physical examination tests were evaluated in the studies included in this systematic review (see online supplementary table S2 and table 2). Since only one study evaluated the intrarater reliability of physical examination tests for the diagnosis of shoulder pathologies (table 2),42 a comparison between intrarater and inter-rater reliability was not possible. Cohen’s κ was the most frequently used reliability measure and was used in 77% of studies with categorical outcomes. Strength of agreement ranged from slight to moderate for acromioclavicular dysfunction tests, from slight to almost perfect for impingement tests, from poor to almost perfect for torn labrum/instability tests and from fair to almost perfect for torn rotator cuff/impingement tests (see online supplementary table S2).
Meta-analysis identified extensive heterogeneity for the Hawkins-Kennedy Test, Neer Test, Empty Can Test/Supraspinatus Test and Painful Arc Test (figures 2–5) with I² values >0.75, which can be interpreted as ‘considerable heterogeneity’ according to the Cochrane Handbook.43 Results from meta-analysis indicate moderate-to-substantial inter-rater reliability for the Hawkins-Kennedy Test, Neer Test, Empty Can Test/Supraspinatus Test and the Painful Arc Test.
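To make the heterogeneity threshold concrete, the I² statistic can be computed from study-level estimates and their variances via Cochran's Q. The sketch below is a generic inverse-variance calculation in Python and is not the statistical framework of Sun used for the pooled κ values in this review. The function name and the example κ estimates and variances are hypothetical.

```python
def i_squared(estimates, variances):
    """Return (pooled estimate, Cochran's Q, I^2 as a proportion).

    Uses fixed-effect inverse-variance weights. I^2 = max(0, (Q - df) / Q),
    where df is the number of studies minus one.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0:
        return pooled, q, 0.0
    return pooled, q, max(0.0, (q - df) / q)


# Hypothetical kappa estimates from three studies with equal variances.
pooled, q, i2 = i_squared([0.2, 0.5, 0.8], [0.01, 0.01, 0.01])
print(f"pooled={pooled:.2f}, Q={q:.1f}, I2={i2:.2f}")
if i2 > 0.75:
    print("considerable heterogeneity (I2 > 0.75)")
```

With these invented inputs the between-study spread is large relative to the within-study variance, so I² exceeds the 0.75 threshold quoted above.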
This systematic review identified 18 articles, which examined the reliability of 62 physical examination tests for the diagnosis of shoulder pathologies with varying inter-rater reliability. Intrarater reliability was investigated in only one study assessing four different tests, reporting almost perfect reliability. The included studies were of low methodological quality according to the QAREL tool.21 Meta-analysis identified extensive heterogeneity among studies for physical examination tests using the I² statistic,44 ,45 so the findings of the meta-analysis may be inaccurate and need to be interpreted with caution. Results from meta-analysis indicate moderate-to-substantial inter-rater reliability for the Hawkins-Kennedy Test, Neer Test, Empty Can Test/Supraspinatus Test and the Painful Arc Test. These examination procedures (and other tests evaluated in this systematic review) need to be used with great caution in terms of diagnostic value and clinical decision-making, because of their limited reliability and also because evidence for their validity is lacking.9–11
Physical examination tests contribute towards an overall clinical decision process that includes the patient’s history, presentation and other tests, and are therefore essential for clinical decision-making in patients with shoulder disorders. Physical examination manoeuvres are extensively described in the literature as indicative of specific shoulder pathology such as rotator cuff disease, instability, and labral tears.8–11 ,46–48 Prior results on diagnostic accuracy of physical examination for the shoulder are variable and therefore offer limited guidance to the clinician when assessing a patient with shoulder pain.8 ,49 This has led to reliance on imaging for diagnostic purposes. Such practice is expensive and possibly inaccurate since imaging abnormalities are demonstrated in asymptomatic individuals as well.50–52 Poor inter-rater and intrarater reliability is likely to be one among multiple reasons for variability in data on diagnostic accuracy when performing physical examination manoeuvres. Thus, findings from our study have implications for clinical practice and future studies on diagnostic accuracy and reliability testing. However, the decision to perform a physical examination test should not depend on its reliability values alone. It should be noted that reliable tests are not necessarily valid. For example, highly standardised tests conducted by highly trained professionals are likely more reliable than poorly standardised tests or tests conducted by untrained persons. Irrespective of standardisation or training of raters, physical examination tests may fail to measure the ‘truth’ for multiple reasons. Thus, a high rate of false-positive (or false-negative) test results might occur, although this happens in a reliable manner. Furthermore, the validity of tests conducted by inventors of tests (or highly trained professionals) is likely not comparable to the validity of tests conducted by clinicians in routine care (not highly trained professionals).
Hence, the reliability between these groups is inevitably lower than within the groups, because the validity depends on test performance and clinician experience.
May et al18 conducted a systematic review on the reliability of physical examination tests used to assess shoulder pathologies. In contrast to this systematic review, May et al18 used a self-developed tool for the quality assessment of included studies. QAREL, which was used in this systematic review, is a consensus-based21 and reliable tool22 that has been used in recently published systematic reviews.17 ,53–56
Methodological considerations and generalisability of results
The overall generalisability of this review’s results is limited due to the low quality of included studies (table 1). Reliability measures reported in included studies might be inflated due to the insufficient methodology (missing blinding and randomisation of raters and patients) and statistical analysis (missing adjustment of κ values if the prevalence differs from 50%) of included studies. Highlighting this, altogether 41.63% of the QAREL items were judged as ‘unclear’ during critical appraisal, representing insufficient reporting of methodological aspects within primary studies. Generalisability of results from included studies is limited due to differences in test conduct as well as interpretation of physical examination tests. Since test conduct and interpretation of test results differ between studies, even if the same physical examination test was evaluated in the different studies, results from individual studies should be interpreted with caution and generalisability of such results is limited.
Rater experience and training status can have a major impact on reliability results,13 ,53 ,57 ,58 but were not reported in 11 of the 18 studies included in this review.26–30 ,32 ,35 ,36 ,40 ,41 Blinding of raters to the reference standard, clinical information, and additional cues was reported sufficiently in most studies.
Reliability measures may be inflated in retrospective studies, since patients might be preselected.21 Therefore prospective studies using consecutive or randomly sampled patients should be considered to be of higher methodological quality.17 ,59 However, only half of the included studies recruited patients consecutively.25 ,26 ,29–34 ,37
For the reporting of reliability study results, the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were published.58 GRRAS is intended to improve the quality of reporting, similar to the Standards for Reporting of Diagnostic Accuracy (STARD) initiative for studies of diagnostic accuracy.60 ,61 Differentiating between poor reporting quality and poor methodological quality of studies is sometimes difficult, but poor reporting clearly impairs the proper judgement of methodological quality. Therefore, some of the included studies might have been judged to be of better methodological quality had they been reported in accordance with GRRAS. However, none of the studies included in this systematic review was published at least 1 year after the publication of GRRAS, which might be considered enough time to allow authors to use GRRAS.
An a priori sample size calculation is recommended for reliability studies,13 ,62 ,63 but none of the studies included in this systematic review reported that an a priori sample size calculation or a post hoc power analysis was performed, potentially limiting their value and generalisability from a statistical point of view. Studies with insufficient sample sizes might not be capable of producing precise estimates of agreement; therefore sample size calculations are needed and will in addition help the reader to interpret study results.
Since prevalence rates of shoulder pathologies in routine care are presumably not equally distributed, agreement on categorically judged physical examination tests might occur purely by chance.64 Relative reliability measures which take the agreement occurring by chance into account, such as Cohen’s κ, are therefore necessary.62 Cohen’s κ, a frequently used relative reliability measure, has been criticised by several authors because it is affected by the prevalence of test result categories.62 ,65–68 The further the prevalence (of the condition in the population under evaluation) departs from 50%, the greater the divergence between absolute (proportion of observed agreement) and relative reliability measures.66 ,67 To address this, Byrt et al68 introduced the PABAK, prevalence index and bias index. PABAK as a reliability measure, however, relates to a hypothetical situation without any prevalence or bias effects.62 Notably, only one25 of the 17 studies25–37 ,39–42 which reported reliability measures for categorical data provided PABAK, prevalence index and bias index values alongside Cohen’s κ values.
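The prevalence dependence of Cohen’s κ described above can be illustrated with two invented 2×2 tables that share the same observed agreement (90%) but differ in how the test results are distributed. This is a minimal Python sketch under stated assumptions: the function name and both tables are hypothetical and do not come from any included study.

```python
def kappa_and_pabak(a, b, c, d):
    """Return (observed agreement, Cohen's kappa, PABAK) for a 2x2 table.

    a: both raters positive; b and c: the two kinds of disagreement;
    d: both raters negative.
    """
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return po, (po - pe) / (1 - pe), 2 * po - 1


# Both hypothetical tables have 90 agreements out of 100 patients.
balanced = kappa_and_pabak(45, 5, 5, 45)  # positives and negatives equally common
skewed = kappa_and_pabak(85, 5, 5, 5)     # positive results dominate

for label, (po, kappa, pabak) in (("balanced", balanced), ("skewed", skewed)):
    print(f"{label}: po={po:.2f} kappa={kappa:.2f} PABAK={pabak:.2f}")
```

In the balanced table κ equals PABAK, whereas in the skewed table κ drops markedly although the proportion of observed agreement is unchanged, which is why reporting the prevalence index and bias index alongside κ helps interpretation.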
Two included studies calculated the ICC to report on the reliability of physical examination tests under evaluation;33 ,38 however, in only one study33 this was statistically appropriate since it was based on continuous data.14 ,69 Measures of uncertainty and CIs were not reported in the two studies using the ICC.
It should be acknowledged that generally accepted classification systems for reliability measures are currently lacking, although the classification systems proposed by Landis and Koch20 for categorical data and Fleiss23 for continuous data are widely used. Therefore within this review these classification systems were used to categorise the strength of agreement for individual physical examination tests. In addition, minimal requirements regarding clinically acceptable values of reliability measures are currently available neither for categorical (eg, Cohen’s κ) nor for continuous data (eg, ICC),14 ,16 ,70 but would be of great help for clinicians to decide which physical examination tests should be considered reliable for clinical use.
Implications for further research
Future reliability studies evaluating physical examination tests used for the diagnosis of shoulder pathologies should, for dichotomous outcome data, calculate and report contingency tables, absolute reliability measures (proportion of positive as well as negative agreement), relative reliability measures (κ, maximum κ and PABAK, all with 95% CI), and the prevalence and bias indices, as recommended by several authors.58 ,62 ,65–68 For continuous data, ICC values (with 95% CI) and SE of measurement should be calculated and reported.71 The aforementioned reliability measures should be calculated and reported to enable readers to interpret, compare and adopt the reliability measures into clinical practice and research. Furthermore, reliability studies should be registered prospectively in trial registers such as the International Clinical Trials Registry Platform (ICTRP) or ClinicalTrials.gov to ensure transparency. Prospective study designs with consecutive or randomly sampled patients, based on an a priori sample size calculation, should be used in reliability studies.
Evaluating the reliability of physical examination test cluster(s) as described by Hegedus et al4 is likely more beneficial for clinical practice than evaluating single tests. In addition, it seems valuable to evaluate the reliability of the use of physical examination tests only as pain or symptom-provoking procedures along with other physical movements identified by the patient that reproduce their shoulder pain as described by Lewis.72 ,73
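Two of the recommended measures, maximum κ and the SE of measurement, are less familiar than κ itself and can be sketched briefly. The Python example below is an illustration under stated assumptions: maximum κ is taken as the largest κ attainable given the observed marginal totals, and the SE of measurement is computed with the commonly used formula SD × √(1 − ICC), which may differ from the approach in the cited reference. Function names and example numbers are hypothetical.

```python
from math import sqrt


def kappa_max(a, b, c, d):
    """Largest kappa attainable for a 2x2 table given its marginal totals."""
    n = a + b + c + d
    row_pos, row_neg = a + b, c + d  # rater 1 totals
    col_pos, col_neg = a + c, b + d  # rater 2 totals
    pe = (row_pos * col_pos + row_neg * col_neg) / n**2
    # Best possible observed agreement with these marginals.
    po_max = (min(row_pos, col_pos) + min(row_neg, col_neg)) / n
    return (po_max - pe) / (1 - pe)


def standard_error_of_measurement(sd, icc):
    """SEM from the between-subject SD and the ICC (assumed formula)."""
    return sd * sqrt(1 - icc)


# Hypothetical inputs, not data from any included study.
print(kappa_max(40, 10, 5, 45))
print(standard_error_of_measurement(sd=10.0, icc=0.84))
```

Reporting maximum κ alongside κ shows how much of the shortfall from perfect agreement is attributable to unequal marginal totals rather than to rater disagreement.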
Furthermore, an international consensus is needed regarding minimal standards for the conduct of reliability studies and reporting of studies needs to be in accordance with GRRAS.58
Conclusions based on the meta-analysis results are limited due to heterogeneity and the small number of included studies. In addition, studies were included in the meta-analysis if the blinding of raters to other raters was judged as ‘unclear’ in the assessment of methodological quality using QAREL. This further limits interpretation of summary measures and results may be inaccurate and need to be interpreted with caution.
One study was excluded owing to language restrictions,74 thus the possibility of a language bias might exist.
Even though authors were contacted if incomplete reliability statistics were reported in primary studies, for several reasons not all contacted authors were able to provide the data.
Numerous physical examination tests used for the diagnosis of shoulder pathologies are described in the literature. Overall, there is a lack of high-quality studies evaluating inter-rater as well as intrarater reliability. In addition, estimates of reliability measures varied among included studies, which limits the conclusions that can be drawn. Despite existing heterogeneity, results from meta-analysis indicate moderate-to-substantial inter-rater reliability for the Hawkins-Kennedy Test, Neer Test, Empty Can Test/Supraspinatus Test and the Painful Arc Test. Findings from this systematic review have implications for clinical practice, where physical examination manoeuvres are widely used, and for future studies on diagnostic accuracy and reliability testing. The evaluated physical examination tests need to be used with great caution in terms of diagnostic value and clinical decision-making.
What are the findings?
This is the first systematic review with meta-analysis of the reliability of physical examination tests for the diagnosis of shoulder pathologies.
Estimates of reliability measures varied among included studies, which limits the conclusions that can be drawn.
Meta-analysis identified extensive heterogeneity among studies for physical examination tests; thus, the findings of the meta-analysis may be inaccurate and need to be interpreted with caution.
Despite existing heterogeneity, results from meta-analysis indicate moderate-to-substantial inter-rater reliability for the Hawkins-Kennedy Test, Neer Test, Empty Can Test/Supraspinatus Test and the Painful Arc Test.
How might it impact on clinical practice in the future?
Several systematic reviews have evaluated the validity of physical examination tests, concluding that most research is of insufficient methodological quality or that consistently solid measures for validity obtained from studies with higher methodological quality are lacking.
Tests with insufficient reliability might be one reason for varying results regarding the validity of physical tests.
Evaluating the reliability of physical examination test cluster(s) is likely more beneficial for clinical practice than evaluating single tests.
Contributors TL made a substantial contribution to the design of the study; performed the literature search; reviewed the literature; methodologically appraised the articles; extracted, analysed and interpreted the data; produced the figures and graphs; critically revised and wrote the manuscript. OM and NBJ assisted with analysis and interpretation of data; critically revised the article and wrote the manuscript. JS and JL critically commented on the design of the study; and critically revised the manuscript. CK made a substantial contribution to the design of the study; reviewed the literature; methodologically appraised the articles; extracted the data in duplicate; analysed and interpreted the data; and critically revised and wrote the manuscript.
Funding NBJ is supported by funding from National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) 1K23AR059199.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.