
Diagnostic accuracy of clinical tests of the hip: a systematic review with meta-analysis
  1. Michael P Reiman1,
  2. Adam P Goode1,
  3. Eric J Hegedus2,
  4. Chad E Cook3,
  5. Alexis A Wright2
  1. 1Community and Family Practice, Duke University School of Medicine, Durham, North Carolina, USA
  2. 2Physical Therapy, High Point University, High Point, North Carolina, USA
  3. 3Physical Therapy, Walsh University, North Canton, Ohio, USA
  1. Correspondence to Michael P Reiman, Duke University School of Medicine, Community and Family Practice, 2200 W. Main, Durham, North Carolina 27705, USA; michael.reiman{at}


Background Hip Physical Examination (HPE) tests have long been used to diagnose a myriad of intra- and extra-articular pathologies of the hip joint. Sound clinical utility is necessary to support diagnostic imaging and subsequent surgical decision making.

Objective To summarise and evaluate the current research on the diagnostic accuracy and clinical utility of HPE tests for the hip joint germane to sports related injuries and pathology.

Methods A computer-assisted literature search of the MEDLINE, CINAHL and EMBASE databases (January 1966 to January 2012) was performed using keywords related to diagnostic accuracy of the hip joint. This systematic review with meta-analysis utilised the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for the search and reporting phases of the study. DerSimonian and Laird random effects models were used to summarise sensitivities (SN), specificities (SP), likelihood ratios and diagnostic OR.

Results The search strategy revealed 25 potential articles, with 10 demonstrating high quality. Fourteen articles qualified for meta-analysis. The meta-analysis demonstrated that most tests possess weak diagnostic properties with the exception of the patellar-pubic percussion test, which had excellent pooled SN of 95% (95% CI 92 to 97%) and good SP of 86% (95% CI 78 to 92%).

Conclusion Several studies have investigated pathology in the hip. Few of the current studies are of sufficient quality to dictate clinical decision-making. Currently, only the patellar-pubic percussion test is supported by the data as a stand-alone HPE test. Further studies involving high quality designs are needed to fully assess the value of HPE tests for patients with intra- and extra-articular hip dysfunction.



Sports related hip injuries are common among athletes of all ages who participate in sports that are associated with trauma or repetitive strains at the hip joint.1 ,2 Hip specific injuries may include sports hernia,3 labral tears,4 pathological fractures,4 avascular necrosis4 and trochanteric pain syndrome.5 Associated disorders for which the athlete may present for medical assessment/evaluation are femoroacetabular impingement syndrome6 and osteoarthritis.7

With the evolution of improved diagnostic imaging and advanced surgical techniques, examination of the coxofemoral (hip) joint and periarticular structures as a primary pain source for hip related pain/dysfunction has received a significant increase in attention.8 Despite the increased attention, differential diagnosis of the hip joint continues to pose a diagnostic dilemma, particularly given pain in the hip region is often difficult to localise to a specific pathological structure.9 ,10 Although limited information exists in support of diagnostic utility, emphasis on patient history, clinical examination findings, MRI, arthrogram and anaesthetic intra-articular injection pain response is currently advocated for determining the presence of intra-articular hip joint pathology.11 Combined use of these examination processes by healthcare practitioners has been only marginally effective. Authors have reported that patients visit, on average, 3.3 healthcare providers before being correctly diagnosed with a hip labral tear over a period of 21 months.12

Further complicating the diagnostic challenge for hip joint pathologies is the complex regional anatomy and biomechanics of the hip joint.13 The concept of overlapping and multifarious referral pain patterns, along with the recognition of the mechanical relationship between the hip and spine,14–21 has led to various differential diagnostic pathways.9 ,22–25

In an attempt to improve the level of consensus among specialists treating hip pathology, a common language and protocol, although specific to labral pathology,9 has been described, including commonly used hip physical examination (HPE) tests.26 Lacking in this detailed description of the hip clinical examination is the diagnostic accuracy and subsequent quality of such tests. To our knowledge, a comprehensive review of the diagnostic accuracy of HPE tests does not exist, specifically for sports related hip injuries. Consequently, the purpose of this study was to conduct a systematic review and meta-analysis of the literature regarding the diagnostic accuracy of HPE tests relevant to patients injured in sport related activities and with diagnoses affiliated with intra- and extra-articular pathology of the hip with appropriate criterion reference standards in cohort, case control and/or cross sectional design studies.


Study design

The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were utilised during the search and reporting phase of this systematic review/meta-analysis. The PRISMA statement includes a 27-item checklist that is designed to be used as a basis for reporting systematic reviews of randomised trials.27 The PRISMA checklist and flow diagram were created to be used prospectively during the creation of systematic reviews and meta-analyses and were used as such in this systematic review.

Search strategy

Identification and selection of the literature

A systematic, computerised search of the literature in MEDLINE, CINAHL and EMBASE databases (the search strategy is shown in Appendix I) was concluded in January 2012. The reference lists of all selected publications were checked to retrieve relevant publications that were not identified in the computerised search. Grey literature was also hand searched by one of the authors (MPR) and included publications, posters, abstracts, or conference proceedings. To identify relevant articles, titles and abstracts of all identified citations were independently screened by 2 reviewers (AAW and EJH). Full-text articles were retrieved if the abstract provided insufficient information to establish eligibility or if the article had passed the first eligibility screening.

Selection criteria

All articles examining HPE tests specific to the hip were eligible for this study. A HPE test was operationally defined as a stand-alone, clinical test (eg, special test) representative of a pathological condition of the hip joint. An article was further eligible if it met all of the following criteria: 1) patients presented with hip or groin pain, 2) a cohort, case control, and/or cross sectional design was used, 3) at least one clinical examination test was used to evaluate intra- or extra-articular pathology, 4) the test was compared against an acceptable criterion reference, 5) the diagnostic accuracy of the measures (eg, sensitivity (SN) and specificity (SP)) was reported and 6) the article was in English.

An article was excluded if 1) the pathology was associated with a condition that was isolated elsewhere (eg, lumbar spine) but referred pain to the hip, 2) values of either SN or SP were omitted, 3) the clinical examination test was performed under any form of anaesthesia or in cadavers, 4) the study used instrumentation that was not readily available to all clinicians, 5) only individual physical clinical tests were included or 6) the study was performed on infants/toddlers.

All criteria were independently applied by 2 reviewers (AAW and EJH) to the full text of the articles that passed the first eligibility screening. In case of disagreement, a third author (MPR) was consulted to discuss and solve the disagreement.

Quality assessment

The original Quality Assessment of Diagnostic Accuracy Studies tool (QUADAS 1) was used to analyse the quality of each study (Appendix II). QUADAS 1 consists of 14 items, each with a ‘yes/no/unclear’ answer option. A ‘yes’ score indicated sufficient information, with bias considered unlikely. A ‘no’ score indicated sufficient information, but with potential bias from inadequate design or conduct. An ‘unclear’ score indicated that insufficient information was provided in the article or the methodology was unclear. The total score was the count of all of the criteria that scored ‘yes’, each valued as ‘1’, whereas ‘no’ and ‘unclear’ scores carried a zero score value. The maximum attainable score on the criteria list was 14. The methodological quality of each of the studies was independently assessed by two additional reviewers (MPR and CEC). Disagreements among the reviewers were discussed and resolved by consensus. Inter-rater reliability was calculated with weighted κ. Qualitatively, studies that exhibit higher QUADAS values are associated with less risk of design bias than those of lower values. We stratified studies as high quality/low risk of bias if the QUADAS score was 10 or greater, and as low quality/high risk of bias if the study scored less than 10 on the QUADAS, since this dichotomising stratification level had been utilised previously.28
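The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the counting and the cut off of 10 stated in the text; the function names and the example answers are hypothetical and are not taken from any of the reviewed studies.

```python
from typing import Sequence

N_ITEMS = 14              # QUADAS 1 has 14 items
HIGH_QUALITY_CUTOFF = 10  # cut off used in this review
ALLOWED = {"yes", "no", "unclear"}


def quadas_total(answers: Sequence[str]) -> int:
    """Count 'yes' answers; 'no' and 'unclear' both score zero."""
    if len(answers) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} answers, got {len(answers)}")
    unknown = set(answers) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected answer(s): {sorted(unknown)}")
    return sum(1 for answer in answers if answer == "yes")


def quadas_stratum(total: int) -> str:
    """Dichotomise a total score at the cut off of 10."""
    if total >= HIGH_QUALITY_CUTOFF:
        return "high quality/low risk of bias"
    return "low quality/high risk of bias"


# Hypothetical study: 10 'yes', 2 'no', 2 'unclear'
example_answers = ["yes"] * 10 + ["no"] * 2 + ["unclear"] * 2
example_total = quadas_total(example_answers)
example_stratum = quadas_stratum(example_total)
```

Note that a study scoring exactly 10 falls in the high quality stratum, matching the "10 or greater" wording.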

A second iteration of QUADAS (QUADAS 2) was introduced in 2011.29 The QUADAS 2 involves qualitative scoring and at present has yielded poor agreement among tool users. The tool primarily is used to determine whether there are risks of bias within a study and whether the applicability of the study is appropriate. We opted not to use the QUADAS 2 since we could not yield a total score and since the interrater reliability of the tool appears to be questionable.

Data abstraction

One reviewer (MPR) independently extracted information and data regarding study population, setting, HPE test performance, hip pathology, diagnostic reference-standard, number of true positives, false positives, false negatives and true negatives for meta-analysis. Sensitivity, SP, positive likelihood ratio (LR+), and negative likelihood ratio (LR-) were also calculated and/or reported to determine clinical utility of the HPE tests. Sensitivity is defined as the percentage of people who test positive for a specific disease among a group of people who have the disease. Specificity is the percentage of people who test negative for a specific disease among a group of people who do not have the diagnosis/disorder. A LR+ is the ratio of the probability of a positive test result in people with the pathology to the probability of a positive test result in people without the pathology. A LR+ identifies the strength of a test in determining the presence of a finding, and is calculated by the formula: SN/(1-SP). A LR- is the ratio of the probability of a negative test result in people with the pathology to the probability of a negative test result in people without the pathology, and is calculated by the formula: (1-SN)/SP. The higher the LR+ and the lower the LR-, the more the post-test probability is altered. Post-test probability can be altered to a minimal degree (LR+ of 1 to 2, or LR- of 0.5 to 1), to a small degree (LR+ of 2 to 5, LR- of 0.2 to 0.5), to a moderate degree (LR+ of 5 to 10, LR- of 0.1 to 0.2) and to a significant and almost conclusive degree (LR+ greater than 10, LR- less than 0.1).30
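As a minimal sketch of these definitions, the following Python computes SN, SP, LR+ and LR- from the four cells of a 2×2 table and labels the LR+ according to the bands quoted above. The cell counts are hypothetical, and because the quoted bands share their end points (eg, an LR+ of exactly 5), assigning a boundary value to the lower band is an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    tp: float  # true positives
    fp: float  # false positives
    fn: float  # false negatives
    tn: float  # true negatives


def sensitivity(t: TwoByTwo) -> float:
    return t.tp / (t.tp + t.fn)


def specificity(t: TwoByTwo) -> float:
    return t.tn / (t.tn + t.fp)


def lr_positive(t: TwoByTwo) -> float:
    # SN / (1 - SP); undefined when SP is exactly 1
    return sensitivity(t) / (1.0 - specificity(t))


def lr_negative(t: TwoByTwo) -> float:
    # (1 - SN) / SP; undefined when SP is exactly 0
    return (1.0 - sensitivity(t)) / specificity(t)


def lr_positive_shift(lr: float) -> str:
    """Label an LR+ with the bands quoted in the text.

    Assumption: a value on a shared boundary goes to the lower band.
    """
    if lr > 10:
        return "significant"
    if lr > 5:
        return "moderate"
    if lr > 2:
        return "small"
    return "minimal"


# Hypothetical counts: 100 people with and 100 without the pathology
table = TwoByTwo(tp=80, fp=10, fn=20, tn=90)
sn = sensitivity(table)         # 0.80
sp = specificity(table)         # 0.90
lr_pos = lr_positive(table)     # about 8.0
lr_neg = lr_negative(table)     # about 0.22
shift = lr_positive_shift(lr_pos)
```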


Studies were explored for statistical pooling where ≥ 2 studies examined the same index test and diagnosis with the same reference standard. DerSimonian and Laird31 random effects models, which incorporate both between and within study heterogeneity into summary estimates, were used to produce summary estimates of SN, SP, LR+, LR-, and diagnostic OR (DOR) for those studies with the same reference standard. When appropriate (ie, >2 homogeneous studies), the joint distribution of SN and SP was analysed with the Moses-Shapiro-Littenberg linear model method to draw summary receiver operating characteristic (sROC) curves and calculate the area under the curve (AUC) and Q* as measures of test accuracy.32 Between-study heterogeneity was tested with χ2 tests for SN and SP and with Cochran's Q for likelihood ratios, respectively, with values of p<0.10 indicating significant heterogeneity. Publication bias was not formally tested due to limitations of the tests with less than 10 studies.33 No significant threshold effects were found using Spearman correlation coefficients. Cell counts of zero are common in diagnostic accuracy studies and when cell counts of zero were encountered, 0.5 was added to all four cells as suggested by Cox.34 All analyses were conducted by one of the authors (APG) in Meta-DiSc version 1.4.35
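To make the pooling step concrete, the following Python sketches a DerSimonian and Laird random effects summary of log diagnostic odds ratios, including the 0.5 correction for zero cells. It is a generic textbook version of the estimator written for illustration; it is not the Meta-DiSc implementation used in this review, and the function names and numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def log_dor_and_variance(tp: float, fp: float, fn: float, tn: float) -> Tuple[float, float]:
    """Log diagnostic odds ratio and its approximate variance.

    If any cell is zero, 0.5 is added to all four cells.
    """
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (cell + 0.5 for cell in (tp, fp, fn, tn))
    log_dor = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return log_dor, variance


def dersimonian_laird(effects: Sequence[float], variances: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (pooled effect, standard error, tau squared, Cochran's Q)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    # Fixed effect (inverse variance) weights and Cochran's Q
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    # Method of moments estimate of between-study variance, floored at zero
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random effects weights add tau squared to each within-study variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, tau2, q


# Hypothetical log DORs from two studies with equal variances
pooled, se, tau2, q = dersimonian_laird([0.0, 2.0], [0.5, 0.5])
ci_low = math.exp(pooled - 1.96 * se)   # 95% CI back on the DOR scale
ci_high = math.exp(pooled + 1.96 * se)
```

A pooled DOR whose 95% CI spans 1 (the null value) indicates no demonstrable discriminative ability, which is how the DOR results are read later in this review.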


Selection of studies

A computerised search, along with reference checking, yielded a total of 25 studies for inclusion in the review (figure 1). A total of 127 out of 152 articles were excluded for not reporting both SN and SP. Multiple studies were performed on cohorts of subjects with defined or suspected pathology, limiting reporting of SP. Out of the 25 total studies investigated, 20 studies examined intra-articular and/or fracture pathologies (two studies for hip osteoarthritis, 12 studies for impingement/labral tear/intra-articular pathology, five studies for fracture of the hip or femur and one study for avascular necrosis), while five studies investigated extra-articular pathologies (three studies for gluteal tendinopathy, and only one each for the diagnoses of sports related chronic groin pain and leg complaints in endurance athletes due to vascular causes).

Figure 1

Diagram of study flow.

Quality scores

The weighted κ between testers for the overall score using QUADAS was 0.68 (95% CI 0.31 to 0.73). For the individual items of the QUADAS, items 2, 5, 6, 7, 8, 9, 11, and 14 had 100% agreement, items 3 and 12 had 93% agreement, items 1 and 4 had 90% agreement, item 10 had 87% agreement, and item 13 had 78% agreement between raters. κ values in the range of 0.41 to 0.60, 0.61 to 0.80, and 0.81 to 1.00 are labelled as strength of agreement as ‘moderate’, ‘substantial’, and ‘almost perfect’ respectively.36 Using our arbitrary stratification of the QUADAS, the assessment of the 25 articles retained for this review indicated that 10 articles were of high quality/low risk of bias and the remaining 15 articles had QUADAS scores below 10 out of 14 points, yielding low quality/high risk of bias (tables 1–7).
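For readers unfamiliar with the statistic, a weighted κ for two raters can be sketched as follows. The text does not state which weighting scheme was used, so the linear weights below are an assumption, and the ratings are hypothetical rather than the reviewers' actual QUADAS scores. For total QUADAS scores, which run from 0 to 14, the number of categories would be 15.

```python
from typing import Sequence


def linear_weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int], n_categories: int) -> float:
    """Cohen's kappa with linear disagreement weights |i - j| / (K - 1).

    Ratings are integers in the range 0 .. n_categories - 1.
    """
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same non-empty set of studies")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    if any(not 0 <= r < n_categories for r in list(rater_a) + list(rater_b)):
        raise ValueError("rating outside the allowed range")

    n = len(rater_a)
    k = n_categories
    # Observed joint proportions of (rater A category, rater B category)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings: identical scores give kappa = 1, chance-level scores give 0
kappa_perfect = linear_weighted_kappa([9, 10, 12, 7], [9, 10, 12, 7], 15)
kappa_chance = linear_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2)
```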

Table 1

Summary of articles reporting on the diagnostic accuracy of OSTs for pathologies of the hip: hip osteoarthritis

Results of individual diagnostic clinical tests

The definition of each HPE test was variable among the studies. Therefore, in order to allow the clinician and researcher the ability to compare HPE diagnostic values, each HPE was grouped according to how it was performed. All similarly performed HPE tests were then grouped and compared statistically when appropriate. Additionally, the reliability and diagnostic accuracy of each test is listed to allow the reader to discern their clinical applicability.

Intra-articular pathology and/or fracture

Hip osteoarthritis

Two studies met inclusion criteria for the diagnosis of hip osteoarthritis (table 1).37 ,38 Trendelenburg's test, resisted hip abduction37 and FABER's test38 were investigated, and both studies were considered high quality. Of the three tests, the resisted hip abduction test had the greatest influence on post-test probability, although that influence was small (LR+ 3.5).

Impingement/labral/intra-articular pathology

Twelve studies met inclusion criteria for diagnosis of impingement/labral/intra-articular pathology (table 2).11 ,39–49 The FABER test was utilised in three studies.11 ,39 ,40 The SN values for this test ranged from 42 to 81%, while the SP values ranged from 18 to 75%. Maslowski et al39 was the only study to investigate the scour, internal rotation with overpressure, and resisted straight leg raise tests. Six studies investigated the FADDIR test.11 ,40–44 All of these articles were available for meta-analysis. The SN values for this test ranged from 59 to 100%, and the SP values ranged from 4 to 75%. The various pathologies captured in these studies are described in table 2. Leunig et al42 was the only study investigating the impingement provocation test. Three studies investigated the flexion-internal rotation test.45–47 All three studies were available for meta-analysis. The SN values for this test ranged from 94 to 98%, while the SP values ranged from 8 to 25%. One study comprising 18 subjects with labral tear and various other pathologies investigated the internal rotation-flexion-axial compression test.48 One study of 59 subjects with various intra-articular pathologies (labral tear, loose bodies, chondral defect and arthritic changes) investigated the Thomas test.49 All of the studies in this category were of low quality/high risk of bias with the exception of one high quality/low risk of bias study.49 In general, these tests demonstrated greater SN than SP. The Thomas test49 demonstrated value as both a screening and a diagnostic test (SN 89%; SP 92%) with a LR+ of 11.1, indicating the ability to alter post-test probability significantly.
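To illustrate what an LR+ of this size means for an individual patient, the sketch below converts a pre-test probability to a post-test probability through odds. The SN and SP are the Thomas test values reported above; the pre-test probability of 30% is a hypothetical figure chosen only for the example and does not come from any of the reviewed studies.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Apply a likelihood ratio to a pre-test probability via odds."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood ratio cannot be negative")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# LR+ = SN / (1 - SP) with the reported SN 89% and SP 92%
thomas_lr_positive = 0.89 / (1.0 - 0.92)   # about 11.1

# Hypothetical pre-test probability of 30%
example_post_test = post_test_probability(0.30, thomas_lr_positive)   # about 0.83
```

Under that assumed starting point, a positive test would move the probability of pathology from 30% to roughly 83%, whereas a test with an LR+ near 1 would leave it essentially unchanged.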

Table 2

Summary of articles reporting on the diagnostic accuracy of OSTs for pathologies of the hip: impingement/labral/intra-articular tests

Fracture of hip or femur

Five studies met inclusion criteria for diagnosis of fracture of the hip or femur (table 3).50–54 Only one of the five studies was of high quality according to our definition. This study investigated the patellar-pubic percussion test (PPPT)52 as did two other studies.50 ,51 All three studies found that the PPPT moderately influenced post-test probability as a stand-alone test with LR+ values ranging from 5.1 to 20.4, and LR- ranging from 0.06 to 0.75. The remaining two studies investigated the stress fracture (fulcrum) test.53 ,54 All three PPPT studies were included in the meta-analysis.

Table 3

Summary of articles reporting on the diagnostic accuracy of OSTs for pathologies of the hip: fracture of hip or femur

Avascular necrosis

Only one study, a high quality study of 176 subjects infected with HIV, examined the diagnostic accuracy of HPE tests for avascular necrosis of the hip (table 4).55 The SN (range of 13 to 88%) and SP (range of 34 to 92%) were variable depending on the test, but none of the findings had a LR+ greater than 2.38 (extension < 15 degrees) or a LR- less than 0.35 (exam complex), indicating that, at best, there are only small alterations in post-test probability for avascular necrosis.

Table 4

Summary of articles reporting on the diagnostic accuracy of OSTs for pathologies of the hip: avascular necrosis

Extra-articular pathology

Gluteal tendinopathy

Three studies met inclusion criteria for diagnosis of gluteal tendinopathy (table 5).56–58 All of the studies were of high quality/low risk of bias by our definition despite enrolling a maximum of 40 subjects. Both passive (SN range of 43 to 53%; SP of 86%) and active hip internal rotation (SN of 31% and SP of 86%) appeared to be more specific than sensitive for assessment of gluteal tendinopathy. The resisted external derotation test58 demonstrated high SN and SP of 88 and 97.3%, respectively.

Table 5

Summary of articles reporting on the diagnostic accuracy of OSTs for pathologies of the hip: gluteal tendinopathy

Sports related chronic groin pain

Only one study met inclusion criteria for diagnosis of sports related chronic groin pain investigating the single adductor test, squeeze test, and bilateral adductor test (table 6).59 The bilateral adductor test was found to be most diagnostic of sports related chronic groin pain with reported SP of 93% with a LR+ of 7.7, whereas the squeeze test and single adductor test reported SP of 91% and LR+ of 4.8 and 3.3 respectively.59

Table 6

Summary of articles reporting on the diagnostic accuracy of OSTs for pathologies of the hip: sports related chronic groin pain

Leg complaints in endurance athletes due to vascular causes

Only one study met inclusion criteria for assessment of leg complaints in endurance athletes due to vascular causes (table 7).60 This study was of low quality/high risk of bias as per our a priori QUADAS criteria. Assessment for a femoral bruit with the hip extended was found to be most diagnostic of leg complaints in these athletes with a SP of 94% and LR+ of 6.0. Assessment of a femoral bruit with the hip flexed and the SI joint gapping test appear to be valuable as sensitive tests.

Table 7

Summary of articles reporting on the diagnostic accuracy of OSTs for pathologies of the hip: leg complaints in endurance athletes due to vascular causes


Table 8 provides diagnostic properties and total sample sizes of the 14 studies included in the meta-analysis. One study43 investigated the FADDIR test with both MRA and arthroscopy as a reference standard. Two other studies56 ,58 investigated both the Trendelenburg and resisted hip abduction test. Figures 2 and 3 display the forest plot and sROC curve for the PPPT, respectively. Of the studies selected for inclusion in the systematic review, five tests met eligibility for analysis of heterogeneity and potential pooling of data: FADDIR for labral tear, flexion IR for labral tear, PPPT for femoral fracture, and Trendelenburg and resisted hip abduction, both for gluteal tendinopathy. A total of six articles were available for meta-analysis addressing the FADDIR test for labral tear.11 ,40–44 A total of three articles each per test were available for meta-analysis addressing the flexion internal rotation test for labral tear45–47 and the PPPT for femoral fracture.50–52

Figure 2

Pooled sensitivity, specificity, negative likelihood ratio, positive likelihood ratio and diagnostic OR with 95% CI for the patellar-pubic percussion test.

Figure 3

Symmetrical summary receiver operating characteristic (sROC) curve for the three studies of the patellar-pubic percussion test. Area under the curve (AUC) and Q* and their SE are provided as measures of test accuracy.

Table 8

Pooled diagnostic properties for the diagnosis of labral tear, femoral fracture and gluteal tendinopathy

Only the FADDIR test and flexion IR test were available for meta-analysis relative to diagnosis of hip labral tear. The FADDIR test was analysed separately for the reference standards of MRA and arthroscopy. One study used both MRA and arthroscopy as a reference standard; therefore the analysis was performed separately for each.43 One study used injection response as the reference standard for labral tear and impingement and was not included in either analysis.11 The summary SN was excellent for both the FADDIR test, 99% (95% CI 95 to 100%), and the flexion IR test, 96% (95% CI 82 to 100%), whereas the SP of both tests was poor. The DORs for both tests were weak and their CIs contained the null value; these tests are therefore unlikely to provide any diagnostic discriminative ability for labral tear. Data were extracted from three of the studies for the FABER test for intra-articular pathology; however, all three used a different criterion standard (MRA, injection, pain improvement) and were therefore not eligible for further analyses.

For the PPPT, the meta-analytic summary estimates of this test were good-to-excellent for SN 95% (95% CI 92 to 97%), SP 86% (95% CI 78 to 92%) and the DOR 96.42 (95% CI 36.34 to 255.87). The AUC (0.97) and Q* statistic (0.92) both indicate excellent accuracy of the PPPT. Data were also extracted for two studies for the fulcrum test for stress fracture; however, both of these studies used different reference standards (ie, radiography, bone scan or MRI) and were therefore not eligible for further summary analyses.

The meta-analytic summary estimates for the Trendelenburg test were good for SN 61% (95% CI 46 to 75%), good-to-excellent for SP 92% (95% CI 83 to 97%) and the DOR 26.46 (95% CI 1.92 to 365.23), while those for the resisted hip abduction test were good for SN 71% (95% CI 51 to 87%), SP 84% (95% CI 71 to 93%) and the DOR 13.24 (95% CI 0.36 to 484.13).


Our study reviewed a broad spectrum of HPE tests designed to detect intra- and extra-articular hip pathology. Similar to what others have found in the spine,61 shoulder28 and knee,62 the majority of HPE tests for the hip contribute little to post-test probability for a dedicated hip specific diagnosis. The majority of stand-alone HPE tests do not demonstrate high levels of SN and/or SP,10 ,28 ,62 which calls into question their clinical utility as stand-alone tests. Only five tests (in 14 studies) were included in the meta-analysis, and of those the PPPT was the only one to significantly alter post-test probability. Indeed, additional high quality studies are needed before most HPE tests can be recommended as stand-alone diagnostic tests.

There are a number of reasons for the poor clinical utility of the majority of the HPE tests. First, the overlap in signs, symptoms and pathomechanics between many intra-articular pathologies,63–66 as well as changes associated with disease progression,65 may lead to misdiagnoses.12 Second, the majority of the HPE tests for the intra-articular pathologies of hip OA, labral tear and avascular necrosis were highly sensitive with poor specificity. Consequently, a positive test means very little in the diagnosis of a condition such as a labral pathology since the same tests will also be positive in other disparate conditions. Only the Thomas test was found to substantially improve probability for diagnosis of labral tears. One of the primary predisposing factors to labral tear is femoroacetabular impingement,63 ,64 which involves abutment between the femoral head and acetabular rim in an adducted and internally rotated position of the hip.63 ,64 Although the Thomas test does not reproduce this position, or account for the other primary etiological factors (capsular laxity, dysplasia, trauma, or degeneration),63 ,64 it does recreate hip extension, which has been shown to recreate the greatest forces on the hip joint.67 Additional support for utilisation of the Thomas test for labral tear testing is the fact that the majority of labral tears are in the anterior portion of the hip joint.68 Last, there is a risk that the lower quality of studies fails to fully discriminate the true utility of the HPE tests, and further studies of higher quality may yield findings that are notably different from those identified in this review.

Of the studies examining diagnostic accuracy of HPE tests for gluteal tendinopathy, only the resisted external derotation test demonstrated the ability to modify the post-test probability of a gluteal tendinopathy diagnosis. Clinical features of gluteal tendinopathy include pain reproduction with passive elongation of the involved tendons, as well as active contraction of these same tendons.58 This test replicated both of these clinical features, therefore likely improving its diagnostic accuracy. Many study design related shortcomings limit the interpretability and generalisability of these tests. A prime example is the previously described Lequesne et al article.58 Despite the low bias found on the QUADAS, the sample included only 17 patients, limiting external generalisability.

The HPE test demonstrating the strongest diagnostic value was the PPPT which yielded both excellent SN and SP in three different studies. Previous investigation has also supported the tuning fork and stethoscope as a valid measure of fracture assessment in multiple other bones.69 This finding was particularly the case in transverse fractures, which likely have sufficient space created by the fracture to decrease the sound the tuning fork produces.69 The strong diagnostic value of this test provides the clinician with an increased sense of its clinical utility, especially in light of ease of application, access to radiology, and cost-effectiveness in prescription of radiology.

Worth noting is the dichotomy between the quality of a study and the results of a study. The two must be considered together when advocating the use of HPE tests for clinical practice. In our study, we used the original QUADAS tool (QUADAS 1) and a cut off of 10 or higher to define high quality. Others28 ,61 ,62 ,70 have used this value to define higher quality studies. The majority of studies scored below 10 and had notable deficiencies associated with bias that were captured by the QUADAS 1 tool. Despite the fact that many of the studies included scored as ‘high quality/low risk of bias’ in this review, there are several notable factors which may call the quality of these studies into question. Many of the studies had very low sample sizes and compared against control groups of healthy individuals, both elements which are known to influence study bias71 and which are not part of the QUADAS 1 tool. Although we are uncertain if the QUADAS 2 tool improves the capacity of identifying biases beyond QUADAS 1, it is worth noting that a better measure of study bias toward overall outcome is needed, as is a mechanism to better define high versus low quality.

Another consideration is the limited capacity of a single test for making a definitive clinical decision.61 Clustering tests does appear to provide more promising findings and the process of clustering to produce a preponderance of evidence of the existence of a hip specific diagnosis is more closely associated to actual clinical examination. One such study, eliminated from this review with our exclusion criteria, was the clinical prediction rule for detecting hip OA.38 This study appears to be the only one of any clinical value for diagnosing hip OA. Future studies should investigate the diagnostic capabilities of these stand-alone tests when used in clusters.

Finally, it is important to consider the accepted criterion standard against which HPE tests are compared in diagnostic accuracy studies. Not all criterion standards utilised in these studies were of equal value. Magnetic resonance arthrography is suggested as the preferred imaging criterion reference for labral tear pathology.10 ,72 CT and MRI are therefore less ideal imaging standards. The use of intra-articular injections has also been suggested as a less preferable alternative.11 With respect to hip OA, plain film radiographs are confirmatory for moderate to advanced hip OA. Plain radiographs are less useful in demonstrating early OA joint changes.73 Additionally, to the authors' knowledge, the diagnostic accuracy of hip OA findings on radiographs has not been determined. In light of this uncertainty, and the inconsistent use of true criterion standards, judgment of the clinical utility of each of the HPE tests investigated in this study requires careful consideration.

Therefore, the extreme number of imperfect and occasionally disparate reference standards existing across many of the hip-related diagnoses is of diagnostic importance, because the use of different reference standards has also been recognised as a large source of bias, can lead to widely different diagnostic accuracy values across studies74 ,75 and may trend toward underestimating the value of a test.76 The hip is laden with symptom-based diagnoses. Symptom-based diagnoses may look markedly different from person to person and are based on a collection of symptoms versus a true, well understood biological cause. It is our impression that FAI, sports related chronic groin pain, and OA all fall within this categorisation. This dilemma, which is not unique to the hip, creates controversy and discourse among research clinicians and is typically adjusted through selected meta-analytic statistical methods.76 ,77 The models used in this meta-analysis, DerSimonian and Laird random effects models,31 take into account heterogeneity produced from using different reference standards. If there were any differences between the estimates from different reference standards, we analysed them separately.


This review is not without limitations. One limitation of this study is our use of stratified QUADAS scores to organise study quality. Although many studies have used QUADAS summary scores,28 ,61 ,62 ,70 others have cautioned against the use of a dedicated quality score.75 As noted, it is very likely that the scores of some studies ranked as high quality were inflated, because QUADAS 1 does not have a quality item score for sample size, case control designs, or other areas which may greatly inflate or poorly represent true study quality. Our meta-analytical results for likelihood ratios and DORs were not statistically significant for most diagnostic tests and some had significant heterogeneity. The small number of studies and the small number of subjects within these studies are one possible reason for the lack of statistical significance and decreased precision of the pooled estimates. An additional limitation is the lack of comparison of patient inclusion and exclusion criteria across the studies. Because many of the studies are older and were published before prospective guidelines for diagnostic accuracy, the robustness of reporting inclusion/exclusion criteria was highly variable among studies and very difficult to comprehensively describe in this review. Lastly, only one author extracted data points, which increases the risk of error.


In terms of individual HPE tests of the hip, the PPPT demonstrated strong diagnostic accuracy properties for ruling in/ruling out femur fracture; the resisted external derotation test shows promise in the diagnosis of gluteal tendinopathy, and the Thomas test shows intriguing findings with respect to labral tear. Caution should be used, since a single HPE test may not yield diagnostic findings that are compelling.
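As a rough illustration of what these rule-in/rule-out properties mean, the sketch below converts the pooled SN (95%) and SP (86%) reported for the PPPT in the abstract into approximate likelihood ratios. The function name is hypothetical. Likelihood ratios derived from pooled SN and SP are only an approximation and need not match likelihood ratios pooled directly across studies.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test, given SN and SP as proportions in (0, 1)."""
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    # LR+ : how much more often a positive result occurs in disease than in non-disease.
    lr_pos = sensitivity / (1.0 - specificity)
    # LR- : how much more often a negative result occurs in disease than in non-disease.
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Pooled PPPT values from the abstract: SN 95%, SP 86%.
lr_pos, lr_neg = likelihood_ratios(0.95, 0.86)
print(f"LR+ {lr_pos:.2f}, LR- {lr_neg:.3f}")
```

Under these assumptions the positive likelihood ratio is roughly 6.8 and the negative likelihood ratio roughly 0.06, so a negative PPPT shifts the probability of fracture more strongly than a positive one.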

What is already known on this topic

  • Hip physical examination (HPE) tests, in combination with imaging and a detailed clinical history, are routinely used in clinical practice to detect various intra- and extra-articular hip pathologies.

  • Hip joint clinical examination has traditionally focused on variable versions of HPE tests.

  • Variable criterion reference standards have been utilised when investigating various hip pathologies, especially labral tear/impingement and fractures.

  • Various levels of diagnostic accuracy have been reported for HPE tests.

What this study adds

  • There is limited evidence to support the use of HPE tests as stand-alone clinical tests for the diagnosis of hip related pathology.

  • Meta-analysis for the patellar-pubic percussion test shows that this test has discriminatory ability for diagnosis of femur fracture.

  • Other individual HPE tests investigated in this study have limited discriminatory ability to diagnose the other hip pathologies, for various reasons including limited study quality and, in some cases, limitations in criterion reference standards.

Recommendations are as follows

  • The patellar-pubic percussion test is a useful diagnostic test for hip fracture.

  • Clinicians should not rely on a single HPE test when diagnosing other hip joint pathologies.




  • Funding Dr. Goode receives funding from the NIH Loan Repayment Program, National Institute of Arthritis Musculoskeletal and Skin Diseases (1-L30-AR057661-01) and is supported by Agency for Health Care Research and Quality (AHRQ) K-12 Comparative Effectiveness Career Development Award grant number HS19479-01. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the AHRQ or NIAMS.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • ▸ References to this paper are available online at
