Background Surgery for hip femoroacetabular impingement/acetabular labral tear (FAI/ALT) is increasing exponentially, yet the accuracy of the diagnostic measures used to identify it has received little investigation. Demonstrated clinical utility of these measures is necessary to support diagnostic imaging and subsequent surgical decision-making.
Objective Summarise/evaluate the current diagnostic accuracy of various clinical tests germane to hip FAI/ALT pathology.
Methods A computer-assisted literature search of the MEDLINE, CINAHL and EMBASE databases was performed using keywords related to diagnostic accuracy of the hip joint. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were used for the search and reporting phases of the study. Quality assessment of bias and applicability was conducted using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. Random effects models were used to summarise sensitivities (SN), specificities (SP), diagnostic odds ratios (DOR) and respective confidence intervals (CI).
Results The employed search strategy revealed 21 potential articles, with one demonstrating high quality. Nine articles qualified for meta-analysis. The meta-analysis demonstrated that flexion-adduction-internal rotation (pooled SN ranging from 0.94 (95% CI 0.90 to 0.97) to 0.99 (95% CI 0.98 to 1.00); DOR 5.71 (95% CI 0.84 to 38.86) to 7.82 (95% CI 1.06 to 57.84)) and flexion-internal rotation (pooled SN 0.96 (95% CI 0.81 to 0.99); DOR 8.36 (95% CI 0.41 to 171.3)) tests possess only screening accuracy.
Conclusions Few hip physical examination tests for diagnosing FAI/ALT have been investigated in enough studies of substantial quality to direct clinical decision-making. Further high-quality studies across a wider spectrum of hip pathology patients are recommended to discern the confirmed clinical utility of these tests.
Trials registration number PROSPERO Registration # CRD42014010144.
- Evidence based review
Femoroacetabular impingement (FAI), an abnormal articulation and abutment of the femoral head against the acetabulum, is suggested to contribute to acetabular labral tear (ALT) and chondrolabral injury.1 The prevalence of ALT in patients with hip or groin pain ranges from 22% to 55%,2–5 although higher prevalence would be expected in a hip surgeon's consultancy.
Surgery for the correction of FAI has increased significantly over the past decade.6 ,7 In fact, an 18-fold increase between 1999 and 2009 was shown in the USA.8 Given that differential diagnosis for the patient presenting with hip or groin pain is still suggested to be a diagnostic challenge,9 focus on proper diagnosis would seem warranted. A significant increase in attention has been shown for differential diagnosis of the hip joint and periarticular structures as the primary source for hip-related pain/dysfunction,10–14 although the principal focus is on radiographic imaging, and, to a lesser extent, clinical examination.15–17 The diagnostic accuracy of radiographic imaging for FAI/ALT,13 ,14 findings of radiographic asymmetries in asymptomatic individuals,18–44 as well as the availability and cost of such imaging merit determination of the clinical utility of the clinical examination for FAI/ALT pathologies.
While the diagnostic accuracy of hip physical examination (HPE) tests for FAI/ALT has previously been described,45–47 the estimation of test probability and, therefore, true clinical utility has not been synthesised. Therefore, the purpose of this study was to conduct a systematic review and meta-analysis of the literature regarding the diagnostic accuracy of HPE tests for FAI/ALT and describe their clinical utility.
The study was registered on 6 August 2014 with the International Prospective Register of Systematic Reviews (PROSPERO# CRD42014010144). PROSPERO is a database of prospectively registered systematic reviews for health and social topics. The study was registered after pilot search and prior to updated data search and extraction.
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were utilised during the search and reporting phase of this review. The PRISMA statement includes a 27-item checklist that is designed to be used as a basis for reporting systematic reviews of randomised trials,48 but the checklist can also be applied to multiple forms of research methodologies.49
A computer-assisted literature search of MEDLINE, CINAHL and EMBASE databases was performed from inception of each respective database to 10 June 2014. The goal was to optimise the sensitivity of our search strategy,50 ,51 and to increase the likelihood that all appropriate studies were identified. The search strategy was developed in collaboration with a medical information specialist and used controlled vocabulary and key words related to diagnostic accuracy of the clinical examination measures relative to hip femoroacetabular impingement and labral tear.
Screening filters were initially used during assessment of title, abstract and full text documents. The search was further limited to humans and English-only publications. The search strategy for MEDLINE is listed in online supplementary appendix 1.
Articles examining clinical examination for FAI/ALT pathology were eligible if they met all of the following criteria: (1) included participants with hip pain suspected to be related to hip FAI/ALT, (2) included at least one clinical hip FAI/ALT pathology examination measure, (3) utilised an acceptable reference standard, (4) reported the results in sufficient detail to allow reconstruction of contingency tables and calculation of diagnostic accuracy metrics and (5) were written in English.
An article was excluded if: (1) the pathology was associated with a condition located elsewhere (eg, lumbar spine) that referred pain to the hip region, (2) the clinical measures were performed under any form of anaesthesia or on cadavers, (3) specialised instrumentation not readily available to all clinicians was used, (4) physical clinical tests were included as a component of cluster tests, or (5) testing was performed on infants/toddlers.
Two reviewers (MPR, KT) independently performed the search. Since computerised search results for diagnostic accuracy data frequently omit many relevant articles,52 the reference lists of all selected publications were checked to retrieve relevant publications that were not identified in the computerised search. Grey literature was also manually searched and included publications, posters, abstracts or conference proceedings. To identify relevant articles, titles and abstracts of all identified citations were independently screened. Full-text articles were retrieved if the abstract provided insufficient information to establish eligibility or if the article passed the first eligibility screening.
All criteria were independently applied by two reviewers (MPR, KT) to the full text of the articles that passed the first eligibility screening. Disagreements among the reviewers were discussed and resolved by a third reviewer (CEC). We determined which articles to include (for meta-analysis) using clinical and statistical judgement of study heterogeneity. Clinical judgment criteria involved assessment of similarity of populations, assessment context (eg, test performed a priori), study design (eg, case–control vs case based) and method in which specific tests were applied.53 In addition, after approval using clinical judgment, studies were statistically pooled when ≥2 studies examined the same index test and diagnosis with the same reference standard.
Risk of bias/quality assessment
Each of the full text articles was independently reviewed by two reviewers (MPR, CEC) and scored with the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool.54 Disagreements among the reviewers were discussed and resolved during a consensus meeting. The QUADAS-2 is a quality assessment tool comprising four domains: patient selection, index test, reference standard, and flow and timing. The risk of bias is assessed in each of the domains, while the first three domains are also assessed for applicability by indicating a ‘low’, ‘high’ or ‘unclear’ rating. Applicability in the QUADAS-2 refers to whether certain aspects of an individual study match the review question. Unlike the QUADAS-1, the QUADAS-2 does not utilise a comprehensive quality score, but rather an overall judgement of ‘low’, ‘high’ or ‘unclear’ risk. An overall risk rating of ‘low risk of bias’ or ‘low concern regarding applicability’ requires the study to be ranked as ‘low’ on all relevant domains. A ‘high’ or ‘unclear’ rating in one or more domains may require that the study be rated as ‘at risk of bias’ or having ‘concerns regarding applicability.’
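The overall-judgement rule described above can be sketched in a few lines of Python. This is an illustration only, not part of the QUADAS-2 tool itself: the function name, the dictionary layout and the example ratings are hypothetical, and the sketch treats any 'high' or 'unclear' domain as sufficient for an 'at risk' rating, whereas the text says such a rating "may" be required.

```python
# Minimal sketch of the QUADAS-2 overall-judgement rule described in the text.
# Names and the example study are hypothetical.

ALLOWED_RATINGS = {"low", "high", "unclear"}


def overall_judgement(domain_ratings):
    """Return 'low' only if every domain is rated 'low'; otherwise 'at risk'."""
    if not domain_ratings:
        raise ValueError("at least one domain rating is required")
    unknown = set(domain_ratings.values()) - ALLOWED_RATINGS
    if unknown:
        raise ValueError(f"unexpected rating(s): {sorted(unknown)}")
    if all(rating == "low" for rating in domain_ratings.values()):
        return "low"
    return "at risk"


# Hypothetical risk-of-bias ratings for one study across the four domains.
example_bias = {
    "patient selection": "low",
    "index test": "unclear",
    "reference standard": "low",
    "flow and timing": "low",
}
example_result = overall_judgement(example_bias)
```

Applicability would be judged the same way, but over the first three domains only.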
Data extraction and analysis
One reviewer (MPR) independently extracted information and data regarding study population, setting, special test performance, pathology, diagnostic reference standard and number of true positives, false positives, false negatives and true negatives for calculation of sensitivity (SN), specificity (SP), positive likelihood ratios (+LR) and negative likelihood ratios (−LR) when not provided. SN is defined as the percentage of people who test positive for a specific pathology among a group of people who have the pathology. SP is the percentage of people who test negative for a specific pathology among a group of people who do not have the pathology. A positive likelihood ratio (+LR) is the ratio of the probability of a positive test result in people with the pathology to the probability of a positive test result in people without the pathology. A +LR identifies the strength of a test in determining the presence of a finding, and is calculated by the formula: SN/(1−SP). A negative likelihood ratio (−LR) is the ratio of the probability of a negative test result in people with the pathology to the probability of a negative test result in people without the pathology, and is calculated by the formula: (1−SN)/SP. The higher the +LR and the lower the −LR, the more the post-test probability is altered. If analysed independently, tests that demonstrate high SN and low −LR are useful in ruling out a condition (screening). In contrast, tests that demonstrate high SP and high +LR assist in ruling in a condition (confirmation).55
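As an illustration of these definitions, the following Python sketch computes SN, SP, +LR and −LR from a 2×2 contingency table. The class and function names are hypothetical, and the example counts are invented for demonstration; they are not data from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2x2 contingency table (index test vs reference standard)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def accuracy_metrics(table):
    """Compute SN, SP, +LR = SN/(1-SP) and -LR = (1-SN)/SP."""
    sn = table.tp / (table.tp + table.fn)
    sp = table.tn / (table.tn + table.fp)
    # Guard against division by zero at the extremes of SP.
    plr = sn / (1 - sp) if sp < 1 else float("inf")
    nlr = (1 - sn) / sp if sp > 0 else float("inf")
    return {"SN": sn, "SP": sp, "+LR": plr, "-LR": nlr}


# Invented counts: a sensitive but non-specific test.
example = TwoByTwo(tp=45, fp=30, fn=5, tn=20)
metrics = accuracy_metrics(example)
```

With these invented counts the test has SN 0.90 and SP 0.40, giving a +LR of about 1.5 and a −LR of about 0.25, the pattern of a screening test rather than a confirmatory one.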
It has been suggested that post-test probability can be altered to a minimal degree with +LRs of 1–2 or −LRs of 0.5–1, to a small degree with +LRs of 2–5 or −LRs of 0.2–0.5, to a moderate degree with +LRs of 5–10 or −LRs of 0.1–0.2, and to a large and almost conclusive degree with +LRs greater than 10 or −LRs less than 0.1.55 Pretest probability is defined as the probability of the target pathology before a diagnostic test result is known. It represents the probability that a specific patient, with a specific history, presenting to a specific clinical setting, with a specific symptom complex, has a specific pathology.55 The diagnostic odds ratio (DOR), defined as +LR/−LR, is a single indicator, independent of prevalence, which represents the ratio of the odds of positivity in the diseased relative to the odds of positivity in the non-diseased. The values for the DOR range from 0 to infinity, with higher values indicating better discrimination; a value of 1 indicates no test discrimination.56
DerSimonian and Laird57 random effects models, which incorporate both between-study and within-study heterogeneity, were used to produce summary estimates of SN, SP, +LR, −LR and DOR. We initially attempted to analyse these data with a bivariate/hierarchical receiver operating characteristic curve model for those diagnostic tests with at least four studies. However, these models failed to converge and we therefore report findings from the random effects models. An I-squared value of >50% and a Cochran's Q p value of <0.10 were the criteria to indicate significant between-study heterogeneity of SN and SP and of likelihood ratios, respectively. Publication bias was not formally tested due to low power of the tests with limited included studies.58 No significant threshold effects were found using Spearman correlation coefficients. When a computational problem of empty cells existed, 0.5 was added to all four cells as suggested by Cox.59 Summary receiver operating characteristic (SROC) curves were produced when ≥4 studies were pooled. These curves summarise the diagnostic performance as a single number: the area under the curve (AUC) with a standard error (SE).60 AUC values of 0.90–1.00 were considered excellent, 0.80–0.90 good, 0.70–0.80 fair, 0.60–0.70 poor and 0.50–0.60 fail. An additional measure, the Q* index with its accompanying SE, is the point on the SROC curve closest to the ideal top-left corner, where SN and SP are equal.61 All analyses were conducted by one of the authors (AG), blinded to results of the search, inclusion/exclusion and study quality, in Meta-DiSc V.1.4.62
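The relationship between pretest probability, likelihood ratio and post-test probability can be made concrete with a short Python sketch. It converts probability to odds, multiplies by the likelihood ratio and converts back. The function names are hypothetical, the example values are invented, and the assignment of boundary values (eg, an LR of exactly 2) to a category is an assumption, since the ranges quoted above share their end points.

```python
def post_test_probability(pretest, lr):
    """Post-test probability from pretest probability and a likelihood ratio."""
    if not 0 < pretest < 1:
        raise ValueError("pretest probability must lie strictly between 0 and 1")
    pre_odds = pretest / (1 - pretest)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


def diagnostic_odds_ratio(plr, nlr):
    """DOR defined as +LR / -LR."""
    return plr / nlr


def shift_category(lr):
    """Degree to which an LR alters post-test probability (boundaries assumed)."""
    if lr >= 1:  # treated as a positive likelihood ratio
        if lr > 10:
            return "large"
        if lr >= 5:
            return "moderate"
        if lr >= 2:
            return "small"
        return "minimal"
    # lr < 1: treated as a negative likelihood ratio
    if lr < 0.1:
        return "large"
    if lr < 0.2:
        return "moderate"
    if lr < 0.5:
        return "small"
    return "minimal"


# Invented example: pretest probability 50%, +LR 1.5, -LR 0.25.
after_positive = post_test_probability(0.50, 1.5)
after_negative = post_test_probability(0.50, 0.25)
dor = diagnostic_odds_ratio(1.5, 0.25)
```

In this invented case a positive result raises the probability from 50% to 60% (a minimal shift), whereas a negative result lowers it to 20% (a small shift), and the DOR is 6.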
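To illustrate the pooling approach, the Python sketch below applies the DerSimonian and Laird method-of-moments estimator to log DORs, with 0.5 added to all four cells when any cell is empty. It is a simplified illustration under stated assumptions, not a reproduction of the Meta-DiSc implementation: the function names are hypothetical, only the DOR is pooled, and the variance of the log DOR is taken as the sum of the reciprocal cell counts.

```python
import math


def log_dor_and_variance(tp, fp, fn, tn):
    """Log DOR and its approximate variance; adds 0.5 to all cells if any is 0."""
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (cell + 0.5 for cell in (tp, fp, fn, tn))
    log_dor = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return log_dor, variance


def dersimonian_laird(effects, variances):
    """Random effects pooled estimate with tau^2, Cochran's Q and I^2 (%)."""
    weights = [1 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {"pooled": pooled, "se": se, "tau2": tau2, "Q": q, "I2": i2}


def pooled_dor_with_ci(tables, z=1.96):
    """Pooled DOR and 95% CI from a list of (tp, fp, fn, tn) tuples."""
    pairs = [log_dor_and_variance(*t) for t in tables]
    result = dersimonian_laird([p[0] for p in pairs], [p[1] for p in pairs])
    low = math.exp(result["pooled"] - z * result["se"])
    high = math.exp(result["pooled"] + z * result["se"])
    return math.exp(result["pooled"]), low, high
```

Because the pooling is done on the log scale and back-transformed, small studies with sparse cells produce the very wide, asymmetric CIs seen around the pooled DORs in this kind of analysis.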
Selection of studies
The systematic search through MEDLINE, CINAHL and EMBASE netted 340 abstracts, and 10 additional abstracts were identified through an extensive manual search. In total, 350 titles were initially retained after duplicates were removed. Abstract and full text review reduced the acceptable papers to 21 (figure 1 and table 1).
This review included 1335 participants/1398 hips across 21 studies, investigating 11 different clinical special tests (tables 2 and 3). The sample size of the studies ranged from 10 participants63 to 241 participants64 (table 2). Eight of the studies were retrospective,65–73 nine were prospective5 ,64 ,74–79 and the study design was unclear in four of the studies (table 1).3 ,63 ,80 ,81
Results of individual diagnostic clinical measures
Thirteen studies investigated the FADDIR test,65–69 ,71–73 ,75–77 ,79 ,81 three studies investigated the Flex-IR test,63 ,64 ,74 one study investigated the Bilateral lower extremity squat test,80 three studies investigated the FABER test,77–79 one study investigated the Scour test,78 one study investigated the IR with overpressure test,78 one study investigated the Resisted SLR test,78 one study investigated the Thomas test,3 one study investigated the THIRD test,70 one study investigated the IR-flexion-axial compression test,5 and one study investigated the Trochanteric tenderness test.77 Table 2 reports which tests each study investigated, as well as the characteristics of the study participants and the reference standard utilised. Table 3 reports the tests investigated, their diagnostic accuracy, reference standard utilised and the pretest to post-test probability change after use of the tests in each study for each test investigated.
Table 4 provides diagnostic properties and total sample sizes of the nine studies included in the meta-analysis. Four studies (n=188 participants)65 ,68 ,79 ,81 investigated the use of the FADDIR test with MR angiogram (MRA) reference standard, four studies (n=319 participants)68 ,69 ,71 ,76 investigated the use of the FADDIR test with surgery reference standard, and two studies (n=27 participants)63 ,74 investigated the use of the Flex IR test with surgery reference standard.
Figures 2 and 3 illustrate the SROC curves for the FADDIR test with MRA reference standard and surgery reference standard, respectively. These figures illustrate the relationship between SN and 1−SP (false-positive rate) for the studies included in the pooled analyses. The AUC was considered fair (AUC=0.76 (SE=0.19)) for the FADDIR with MRA reference and poor (AUC=0.65 (SE=1.0)) for FADDIR with surgery reference. The point at which SN and SP were equal (Q*) was 0.70 (SE=0.19) for the MRA reference standard and 0.61 (SE=0.81) for the surgery reference standard.
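The AUC labels used here follow the bands given in the data analysis section. As a small illustration, the Python sketch below maps an AUC to its label; the function name is hypothetical, and two details are assumptions: a value on a shared boundary (eg, exactly 0.80) is placed in the higher band, and values below 0.50 are also labelled 'fail'.

```python
def classify_auc(auc):
    """Map an SROC area under the curve to the descriptive bands in the text."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc >= 0.90:
        return "excellent"
    if auc >= 0.80:
        return "good"
    if auc >= 0.70:
        return "fair"
    if auc >= 0.60:
        return "poor"
    return "fail"


# The two pooled AUC values reported above.
faddir_mra = classify_auc(0.76)
faddir_surgery = classify_auc(0.65)
```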
Figure 4 illustrates pretest to post-test probability changes utilising the FADDIR test with a surgery reference standard. Post-test probability after a positive test remained essentially unchanged from the pretest probability. Post-test probability after a negative test demonstrated a notable shift, although the CIs were quite large.
Our study examined the current literature investigating clinical examination tests for FAI/ALT. Owing to study heterogeneity and variable reference standards, only two of the 11 different tests (the FADDIR and Flexion IR tests, in nine studies) qualified for meta-analysis. These, as well as most of the other tests, were predominantly sensitive rather than specific. Neither of the two pooled tests significantly shifted post-test probability of the diagnosis of FAI/ALT. In fact, none of the tests investigated in our review is capable of significantly altering post-test probability. Additionally, the studies investigating all of the tests in our review, whether performed for the purpose of determining test diagnostic accuracy or not, were at risk of bias and of low quality. A significant need exists for higher-quality studies investigating these clinical tests.
The purpose of performing clinical testing is to aid with determination of whether a particular pathology exists or not. When the clinician is appraising evidence about diagnostic tests they should consider several key concepts:55 ,82 did the participating individuals present a diagnostic dilemma? how much will different levels of the diagnostic test raise or lower the pretest probability of disease? will the reproducibility of the test result and its interpretation be satisfactory in my clinical setting? are the results applicable to the patients in my practice? will the test results change my management strategy? and will the patients be better off as a result of the test?
Participants presenting to a clinician's practice with a diagnostic dilemma require utilisation of tests with properties capable of differentiating those with versus those without the disease. In order to properly assess these participants it is necessary to determine if they were drawn from a common group in which it is not known whether the condition of interest is present or absent.55 ,82 Participants in all 21 studies of this review were of high suspicion for various types of intra-articular joint pathology, most of whom were of high suspicion for FAI/ALT due to groin pain and/or other subjective symptoms (eg, clicking, catching) that are highly suggestive of these pathologies.3 ,5 ,68 The pretest probability in the investigated tests ranged from 17%5 to 94%,79 but was higher than 55% in all but five studies5 ,64 ,77 ,78 ,80 involving 10 of the 28 tests listed in table 3. Certainly, a high pretest probability can influence the likelihood of reporting a positive finding on the tests and this may be a reason we found high SN in the tests that were eligible for meta-analysis.
The single largest increase in post-test probability (in studies of low bias) after use of a particular test was only 6.1% when utilising the painful squat test by Ayeni et al.80 Most other tests investigated in this review provided very small percentage increases in post-test probability. In fact, in meta-analysis calculations (table 4), there were minimal to no increases in post-test probabilities, ranging from 1% decline in probability when using the FADDIR test (MRA reference standard), no change using the FADDIR test (surgical reference standard; figure 2), and 3% increase using the Flexion IR test with MRA reference standard. While the post-test probability could change significantly with, for example, the FADDIR test to rule out the pathology existing, the wide CIs limit the accuracy of interpretation of these findings (figure 4). Additionally, having these tests performed on participants of high suspicion of pathology results in scarce true negative findings in these studies.
The findings for the tests investigated in this review strongly suggest limitations in their clinical utility. The results of these tests appear to minimally change the clinician's treatment strategy, and the participants do not appear to be better off as a result of performing these tests. The current scope of literature investigating these tests is narrow in its focus (examining participants with high suspicion of pathology prior to test performance) and, therefore, yields tests with high SN, poor SP and limited ability to shift post-test probability in determining the diagnosis of FAI/ALT. In fact, a clinician practising in an orthopaedic/sports clinic seeing participants with high suspicion of FAI/ALT would benefit minimally by performing these tests.
An additional concern regarding diagnostic accuracy studies is the potential for bias. Only one of the included studies in this review had low risk of bias.80 This study investigated a newly described test (bilateral lower extremity squat to maximum depth). With a sample size of 76 participants, a SN of 75% and a SP of 41%, this test was also found to provide the greatest shift in pretest to post-test probability (6.1%). As mentioned previously, the large majority of these tests had much greater pretest probabilities prior to test implementation than this test. This study utilised the more potentially biased reference standard of MRI and/or MRA.
Although MRI/MRA is rightfully not considered the ‘gold standard’ for these studies, the use of surgery alone leads to biased participant sampling. Determination of the clinical utility of these tests requires their implementation in future studies across the spectrum of participants with and without undetermined hip pathology. The reference standard for these studies would thus be the more imprecise MRI/MRA. Although the diagnostic accuracy of MRI/MRA currently is limited,13 limiting image reading to those with specialist training,83 utilisation of precise, tailored protocols84 and improving imaging technology could afford suitability for MRI/MRA as acceptable reference standards for these participants. Implementation of these tests across this broad continuum of hip participants is paramount to determine their actual clinical utility.
Limitations of this study include: limiting the search strategy to only those articles written in English, one author pulling the data points, lack of comparison of participant inclusion and exclusion across the studies and studies performed in settings of high pretest probability. Limiting to English language could potentially miss some studies to be included in our review. Only one author pulling the data points increases the risk of potential error, although this author (MPR) has pulled many of these data points for a similar review of both intra-articular and extra-articular hip pathology.85 Since many of the included studies were published prior to prospective guidelines for diagnostic accuracy studies, the robustness of reporting inclusion/exclusion criteria was highly variable among the studies and very difficult to comprehensively describe in this review. Having these studies performed in hip surgeon clinical settings is a limitation of the current literature that is uncontrollable by the authors of this review.
Owing to the low quality and biased sampling of patients with high probability of disease, hip physical examination tests do not appear to currently provide the clinician any significant value in altering probability of disease with their use. Currently, only the FADDIR and Flex-IR tests are supported by the data as valuable screening tests for FAI/ALT pathology. Further studies involving high quality designs across a wider spectrum of hip pathology patients are necessary to discern the confirmed clinical utility of these tests.
What are the new findings
Few hip clinical tests, as investigated in current studies, actually make a significant change in post-test probability for the potential of femoroacetabular impingement/acetabular labral tear (FAI/ALT) existing.
These clinical tests, when utilised in clinical settings of high pretest probability, will provide the practising clinician with limited to no assistance in determination of the presence or absence of FAI/ALT.
Prospective studies examining clinical utility of these tests in patients with/without various hip pathologies are not available, but are recommended.
The authors would like to thank Leila Ledbetter, MLIS for assisting with the literature search for this study.
Contributors MPR developed the idea for the manuscript. All the authors contributed to writing, editing and approval of the final manuscript.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.