
Validity, reliability and responsiveness of patient-reported outcome questionnaires when assessing hip and groin disability: a systematic review
K Thorborg1, EM Roos2, EM Bartels3, J Petersen1, P Hölmich1

1Department of Orthopaedic Surgery, Amager Hospital, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
2Institute of Sports Science and Clinical Biomechanics, University of Southern Denmark, Odense, Denmark
3Copenhagen University Library and The Parker Institute, Copenhagen, Denmark

Correspondence to Kristian Thorborg, Department of Orthopaedic Surgery, Amager Hospital, Faculty of Health Sciences, University of Copenhagen, Italiensvej 1, 2300 Copenhagen S, Denmark; kristian.thorborg{at}amh.regionh.dk

Abstract

Background Novel treatment interventions are advancing rapidly in the management of hip and groin disability in the physically active young to middle-aged population.

Objective To recommend the most suitable patient-reported outcome (PRO) questionnaires for the assessment of hip and groin disability based on a systematic review of evidence of validity, reliability and responsiveness of these instruments.

Methods MEDLINE, EMBASE, CINAHL, Cochrane Central Register of Controlled Trials, PsycINFO, SportsDiscus and Web of Science were all searched up to January 2009. Two reviewers independently rated measurement properties of the PRO questionnaires in the included studies, according to a standardised criteria list.

Results The computerised search identified 2737 publications. Forty-one publications investigating measurement properties of PRO questionnaires assessing hip or groin disability were included in the study. Twelve different questionnaires designed for patients with hip disability and one questionnaire for patients with groin disability were identified. Hip dysfunction and Osteoarthritis Outcome Score (HOOS) contains adequate measurement qualities to evaluate patients with hip osteoarthritis (OA) or total hip replacement (THR). Hip Outcome Score (HOS) is the best available questionnaire for evaluating hip arthroscopy, but the Inguinal Pain Questionnaire, the only identified questionnaire evaluating groin disability, does not contain adequate measurement qualities.

Conclusions HOOS is recommended for evaluating patients with hip OA undergoing non-surgical treatment and surgical interventions such as THR. HOS is recommended for evaluating patients undergoing hip arthroscopy. Current and new PRO questionnaires should also be evaluated in younger patients (age <50) with hip and/or groin disability, including surgical and non-surgical patients.


Hip and groin disability is a common health problem1 2 affecting physical function and health-related quality of life.2–4

Novel treatment interventions, such as hip arthroscopy, endoscopic groin hernia repair and specific exercise regimens, are advancing rapidly in the management of hip and groin disability,5–7 increasing the possibility of offering optimal treatment if patients are assessed correctly.

At present, there is a general consensus that patient-reported outcomes (PROs) should serve as a gold standard in the assessment of musculoskeletal conditions, where the patient's perspective and health-related quality of life are of main interest.8 Different PRO questionnaires exist and are used interchangeably when assessing patients with hip and groin disability. However, knowledge of which PRO questionnaires should be recommended when evaluating interventions in clinical cohorts, randomised controlled trials and clinical databases is lacking.

Prior to recommending or discarding specific PRO questionnaires, a systematic investigation of their descriptive qualities and psychometric properties is required. Since 2004, systematic reviews on the psychometric properties of specific questionnaires assessing outcome in different anatomical regions and conditions have been published.9–12 Such reviews are a prerequisite for evidence-based healthcare,13 which has the potential of improving treatment of hip and groin disability and, in doing so, introducing considerable social and economic benefits in the future.14

The aim of this study was to provide recommendations for choice of PRO questionnaire when assessing patients with hip and/or groin disability in studies or clinical databases concerning outcomes of various types of surgical, medical or exercise treatments. The method chosen was to carry out a systematic review of studies evaluating psychometric properties of PRO questionnaires for these patients.

Methods

We performed a systematic review of the literature concerning assessment of hip and/or groin disability (1) to identify PRO questionnaires for patients with hip and/or groin disability and (2) to evaluate the psychometric properties of these outcome measures.

The groin is anatomically located in the anterior-medial part of the hip region, and the hip and groin region share vascular and neural supply.15 The pathologies of the hip joint and the groin often present simultaneously, and the symptoms can be overlapping.16–19 We, therefore, chose to search for PRO questionnaires concerning both regions.

Definitions

Psychometric properties

Psychometrics is the discipline concerned with the measurement of variables in tests and questionnaires and has more recently been introduced in health-related fields.20 Psychometric properties in this study were defined as measurement properties of questionnaires concerning validity, reliability and responsiveness.

Psychometric theory

Classical Test Theory (CTT) and Item Response Theory (IRT) are different expressions of psychometric theory. CTT predicts outcomes of testing such as the degree of difficulty of items being tested or the ability of the persons being tested. CTT assumes that an observed score may be decomposed into a “true” score and an “error” score and that the reliability coefficient can be expressed as the ratio of true variance to (true+error) variance. The term “classical” is seen in contrast to the more recent psychometric theories such as IRT. IRT has also been used to develop and internally validate measures. IRT assumes that the test scale is unidimensional and creates an interval-scaled measure.20
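The CTT decomposition described above can be illustrated with a minimal Python sketch. The function name and the variance values are hypothetical and chosen only to show the ratio; they are not data from any included study.

```python
def ctt_reliability(true_variance: float, error_variance: float) -> float:
    """Reliability under CTT: true variance / (true + error) variance."""
    total = true_variance + error_variance
    if total <= 0:
        raise ValueError("total variance must be positive")
    return true_variance / total


# Hypothetical example: true-score variance 9, error variance 1.
print(ctt_reliability(9.0, 1.0))  # 0.9
```

The smaller the error variance relative to the true variance, the closer the coefficient is to 1.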

Patient-reported outcome

A PRO is any report coming directly from patients about a health condition and its treatment.8 21 PRO questionnaires include items, instructions and guidelines for scoring and interpretation and are used to measure these patient reports.8

Disability

Disability in this study encompasses the health dimensions within the methodological framework of The International Classification of Functioning, Disability and Health (ICF) as categorised in one of three levels: impairment (body structure and function), disabilities (activities) and participation problems (participation).22

Search strategy

We searched the bibliographic databases MEDLINE via PubMed (from 1945 to January 2009), EMBASE via OVID (from 1980 to January 2009), CINAHL via EBSCO (from 1982 to January 2009), Cochrane Central Register of Controlled Trials (up to January 2009), PsycINFO via OVID (from 1806/1987 to January 2009), SportsDiscus (up to January 2009) and Web of Science (from 1900 to January 2009). Our search strategy was:

  • Hip or groin or inguinal hernia

  • and

  • Outcome assessment* OR self assessment* OR questionnaire*

  • and

  • Reliability or validity

The terms were searched as key words (MeSH terms in MEDLINE and the corresponding key words in the other databases, where possible) and also as “free-text” words appearing anywhere in the reference fields. The reference lists of the retrieved and selected references were checked for further relevant studies. Finally, specific searches for identified questionnaires were carried out, and experts in the field were contacted for possible additional references.
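The three-block structure of the search strategy can be sketched as a single Boolean string. This is only an illustration of how the blocks combine; the helper name `or_block` is hypothetical, and the exact syntax (field tags, phrase quoting, truncation) differs between databases and is not reproduced here.

```python
def or_block(terms):
    """Join alternative terms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


# The three blocks of terms listed in the search strategy.
region = ["hip", "groin", "inguinal hernia"]
instrument = ["outcome assessment*", "self assessment*", "questionnaire*"]
property_terms = ["reliability", "validity"]

# Blocks are combined with AND, terms within a block with OR.
query = " AND ".join(or_block(block) for block in (region, instrument, property_terms))
print(query)
```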

Study selection

Two reviewers (KT and EMB) independently carried out the selection of possible studies for inclusion from the retrieved references, based on titles and abstracts. All possibly eligible studies were obtained in full and evaluated based on the inclusion criteria. Excluded studies were identified and presented with the reasons for exclusion (figure 1).23 24 The exclusion sequence in figure 1 was chosen to decrease time expenditure: criteria that were directly assessable from title, abstract or methods were evaluated first, while criteria that required scrutiny of the full paper were applied second.

Figure 1

Selection of publications for the systematic review.

Inclusion criteria

We included studies which fulfilled the following criteria:

  1. The retrieved study was published in English, German or French, as a full report.

  2. Psychometric properties in the study were evaluated with CTT.20

  3. The main purpose of the study was to evaluate one or more psychometric properties of a PRO questionnaire, including patients with hip and/or groin disability.

  4. The study included a PRO questionnaire specifically concerning hip and/or groin disability, containing items related to impairment (body structure and function), disabilities (activities) or participation problems (participation), according to ICF.22

  5. Data on hip and/or groin disability could be separated from disabilities of other anatomical regions.

Characteristics of studies and instruments

The descriptive data in each study had to provide information on psychometric properties evaluated in the study, time of administration, target population (diagnosis/clinical features), study population and mode of administration. Extracted information from the identified questionnaires included full name of the questionnaire, abbreviation of the name of the questionnaire, assessment dimensions and number of rating scales.

Data extraction and evaluation of psychometric properties

Based upon the guidelines for systematic reviews,24 we used a criteria list for evaluative purposes and explicitly described its operationalisation. The criteria list in question was recently published by Terwee et al25 and is suited to give information on PRO questionnaires and their psychometric properties, where group comparisons are needed. This criteria list has recently been applied in other systematic reviews,9–12 and we considered it the best available instrument for our purpose. Methodological issues of the criteria list were discussed and refined in the study group, which is in accordance with recommendations in the original article.25 The original criteria list by Terwee et al did not include inter-tester reliability,25 but we decided to add the evaluation of inter-tester reliability, since some of the included studies in the present review used observer administration and assessed the inter-tester reliability of this procedure in their study.

The present criteria list evaluated the psychometric properties: content validity, internal consistency, construct validity, floor and ceiling effects, test–retest reliability, inter-tester reliability, agreement, responsiveness and interpretability. Inter-tester reliability was included in the overall quality evaluation only for PRO questionnaires where observer administration was introduced.

The psychometric properties were rated as positive (+), indeterminate (±), negative (−) or no information available (?) (see online Appendix 1). In order to avoid systematic errors in the study design or execution, two reviewers (KT and JP) independently rated the psychometric properties of each questionnaire according to the criteria list. Uncertainty or disagreement was resolved by discussion with a third reviewer (EMB). Where further information of the studies was needed, the authors of these studies were contacted for clarification. The ratings of the questionnaires in the individual studies can be found online in Appendix 2. A table that provides an overview of these ratings is composed and presented, in accordance with the recommendations by Terwee et al.25
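The four-level rating scheme can be represented and tallied as in the following minimal sketch. The `example_ratings` dictionary is a hypothetical placeholder and does not reproduce any rating from Appendix 2 or table 3; an ASCII hyphen stands in for the minus sign.

```python
from collections import Counter

# Rating symbols used in the criteria list and their meaning.
RATINGS = {
    "+": "positive",
    "±": "indeterminate",
    "-": "negative",
    "?": "no information",
}


def tally(ratings):
    """Count how many properties received each rating."""
    unknown = set(ratings.values()) - set(RATINGS)
    if unknown:
        raise ValueError(f"unknown rating symbols: {sorted(unknown)}")
    counts = Counter(ratings.values())
    return {label: counts.get(symbol, 0) for symbol, label in RATINGS.items()}


# Hypothetical ratings for one questionnaire (illustration only).
example_ratings = {
    "content validity": "+",
    "internal consistency": "?",
    "construct validity": "+",
    "agreement": "-",
    "responsiveness": "±",
}
print(tally(example_ratings))
```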

Statistical analysis of the reliability of the ratings

In the present study, un-weighted κ statistics were used to calculate the inter-tester reliability of the initial ratings by the two reviewers, since the ratings are considered nominal.26
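As an illustration of the un-weighted κ statistic for two raters and nominal categories, a minimal sketch follows. The function name and the example ratings are hypothetical; the sketch computes the point estimate only and does not reproduce the confidence interval calculation used in the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Un-weighted Cohen's kappa for two raters rating the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical initial ratings by two reviewers for six properties.
reviewer_1 = ["+", "+", "-", "?", "+", "-"]
reviewer_2 = ["+", "-", "-", "?", "+", "-"]
print(round(cohen_kappa(reviewer_1, reviewer_2), 3))
```

Because the categories are treated as nominal, every disagreement counts equally, regardless of which two ratings were confused.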

Results

The total search identified 2737 publications. Following the screening of titles and abstracts, 2628 publications were excluded. Out of the remaining 109 publications, which were read in full, 68 publications were excluded since they did not fulfil our predefined inclusion criteria (figure 1), leaving 41 studies, involving 12 779 patients, as our final data for reviewing (table 1).
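The selection counts reported above can be checked with simple arithmetic; the variable names are illustrative, and all numbers are taken from the text.

```python
identified = 2737                  # publications found by the search
excluded_on_title_abstract = 2628  # excluded after screening titles and abstracts
excluded_on_full_text = 68         # excluded after reading in full

read_in_full = identified - excluded_on_title_abstract
included = read_in_full - excluded_on_full_text

print(read_in_full, included)  # 109 41
```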

Table 1

Description of included studies in the systematic review

In three situations, we found publications containing information on psychometric properties of PRO questionnaires based upon the evaluation of the same group of patients: (1) refs 27 and 28; (2) refs 29–31; and (3) refs 32–34. These may, therefore, be considered as one study. We did not exclude any of these part-studies, since each part included different measurement aspects and/or results.

A total of 13 PRO questionnaires were identified in the included studies (table 2). Twelve PRO questionnaires considered the hip region, and one questionnaire considered the groin region. The PRO questionnaires were assessed in three main target populations: total hip replacement (THR), hip osteoarthritis (OA), and various forms of hip and groin pain or dysfunction (see online Appendix 2).

Table 2

Included PRO questionnaires for patients with hip and/or groin disability

The inter-tester reliability of the independent ratings based upon the criteria list was good (κ=0.79, 95% CI 0.73 to 0.84).26 Disagreement was mainly caused by reading errors where one of the reviewers had overlooked specific information on a specific psychometric property. Uncertainty or disagreement only had to be resolved by discussion with the third reviewer on two occasions, regarding internal consistency and agreement. The ratings of the questionnaires in the individual studies can be found online in Appendix 2. The ratings of the included questionnaires are synthesised in a summary and presented in table 3.

Table 3

Summary of the quality assessment of the included questionnaires

Description of included studies and identified questionnaires

Table 1 presents a detailed description of the included studies. Table 2 provides a description of the 13 identified questionnaires. Eight questionnaires were designed for evaluation of hip OA and/or THR. Four questionnaires were developed for evaluating hip arthroscopy and/or hip disability in general, and one questionnaire was developed for evaluating groin-hernia repair.

Content validity

Content validity was defined as the extent to which the domain of interest is comprehensively sampled by the items in the questionnaire.25 Hip dysfunction and Osteoarthritis Outcome Score (HOOS), Non-arthritic Hip Score (NHS), Oxford Hip Score (OHS) and Patient Specific Index (PASI) were developed involving target population and investigator/experts in the item generation process.29 35–37 Hip Outcome Score (HOS) and Total Hip Arthroplasty Outcome Questionnaire (THAOQ) were developed without involving the target population in the item generation process.38 39 For the remaining questionnaires, no information was found on content validity.

Internal consistency

Internal consistency is the extent to which items in a (sub)scale are inter-correlated and is a measure of the homogeneity of a (sub)scale.25 The dimensional structure of only two questionnaires was studied by factor analysis. Exploratory factor analysis was carried out by Martin et al38 in the development of HOS and by Dawson et al40 in the evaluation of Lequesne Index of Severity for Osteoarthritis of the Hip (LISH). Martin et al38 did not perform factor analysis on all items together but only on predefined subscales. Dawson et al40 carried out exploratory factor analysis on all items together, which showed that LISH loaded on two factors, indicating the existence of two subscales, one regarding pain and discomfort and one regarding function and mobility, but they did not calculate Cronbach's α for each of these dimensions. Principal component analysis was performed by Nilsdotter et al in the evaluation of HOOS, but Cronbach's α was not analysed.41 Information on internal consistency was found for nine questionnaires, and Cronbach's α was most often the only analysis of the dimensional structure.
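For readers unfamiliar with Cronbach's α, the following minimal sketch shows the standard formula applied to one (sub)scale. The function name and the item scores are hypothetical, and sample variances are assumed; the sketch says nothing about dimensionality, which is why factor analysis is needed in addition.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one (sub)scale.

    item_scores holds one list per item, each with one score per respondent.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    # Total score per respondent across all items.
    totals = [sum(responses) for responses in zip(*item_scores)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical scores: three items answered by four respondents.
items = [
    [1, 2, 3, 4],
    [2, 2, 3, 5],
    [1, 3, 3, 4],
]
print(round(cronbach_alpha(items), 3))
```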

Construct validity

Construct validity is the extent to which scores on a questionnaire relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the domains that are measured.25 Construct validity was studied by correlating the score of the questionnaire with other disability questionnaires, with the physical dimension of general health instruments or with other hip disability questionnaires. Twelve studies presented a priori hypotheses and showed results in support of these.28 29 34 38 40–47

Floor and ceiling effects

Floor and ceiling effects are present if the questionnaire fails to demonstrate a worse score in the patients who clinically deteriorated and an improved score in patients who are clinically improved.25 Three questionnaires showed floor or ceiling effect. Four studies48–51 assessed floor (worst possible score) and ceiling effects (best possible score) of the Western Ontario and McMaster Universities Osteoarthritis index (WOMAC) questionnaire, showing ceiling effects postoperatively for patients who in all studies had undergone THR.48–51 Ostendorf et al48 and Naal et al46 found no floor or ceiling effects of OHS preoperatively and postoperatively, but Garbuz et al49 found ceiling effects of OHS postoperatively following THR. de Groot et al43 found no floor and ceiling effects of the HOOS in patients with hip OA and following THR, but Nilsdotter et al41 found ceiling effects of the HOOS postoperatively following THR.
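In practice, floor and ceiling effects are usually checked by counting how many respondents obtain the worst or best possible score. The sketch below assumes a 15% cut-off, which is a commonly cited criterion but is not stated in the text above; the function names and the scores are hypothetical.

```python
def floor_ceiling_proportions(scores, worst, best):
    """Return the proportions of respondents at the worst and best possible score."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor = sum(s == worst for s in scores) / n
    ceiling = sum(s == best for s in scores) / n
    return floor, ceiling


def has_effect(proportion, threshold=0.15):
    """Flag an effect when the proportion exceeds the (assumed) 15% threshold."""
    return proportion > threshold


# Hypothetical postoperative scores on a 0 (worst) to 100 (best) scale.
scores = [100, 100, 90, 80, 0, 75, 100, 60, 55, 70]
floor, ceiling = floor_ceiling_proportions(scores, worst=0, best=100)
print(has_effect(floor), has_effect(ceiling))  # False True
```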

Test–retest reliability

Test–retest reliability is the extent to which the same results are obtained on repeated administrations of the same questionnaire when no change in physical function has occurred.25 Information on test–retest reliability was found for 10 questionnaires. Intraclass correlation coefficient (ICC) was reported, and test–retest was ≥0.70 for the HOS, HOOS, Hip Rating Questionnaire (HRQ), LISH, OHS, PASI and WOMAC.27 28 32 33 35 43 44 46 47 52–56 However, two studies28 33 showed an ICC <0.70 for the stiffness domain in the WOMAC but above 0.70 for pain and physical function.

Inter-tester reliability

Inter-tester reliability is the extent to which the same results are obtained on repeated administrations of the same questionnaire by different observers when no change in physical function has occurred. Inter-tester reliability was assessed in the American Academy of Orthopaedic Surgeons Hip Score (AAOS-HS), LISH, PASI and WOMAC. Inter-tester reliability was above 0.70 for the AAOS-HS and PASI28 57 but below 0.70 for the stiffness domain of WOMAC.28

Agreement

Agreement is the ability to produce exactly the same scores with repeated measurements.25 Information on agreement was only found in five studies.35 46 51–53 For HOS,52 the minimal important change (MIC) was above the smallest detectable change (SDC), showing adequate agreement. In the study by Quintana et al,51 the MIC at 6 months postoperatively was at least 25 points for the three WOMAC domains pain, stiffness and physical function. SDC was 27.98 points for the stiffness domain, indicating that measurement error for the stiffness domain is slightly above the MIC,51 affecting agreement.
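The comparison underlying this judgement can be written as a one-line check. The function name is hypothetical, and the MIC is simplified to its reported lower bound of 25 points for the purpose of the illustration.

```python
def measurement_error_exceeds_mic(sdc, mic):
    """True when the smallest detectable change is larger than the MIC."""
    return sdc > mic


# WOMAC stiffness domain: SDC 27.98 points versus an MIC of (at least) 25 points.
print(measurement_error_exceeds_mic(sdc=27.98, mic=25.0))  # True
```

When the check returns True, a change as small as the MIC cannot be distinguished from measurement error in an individual patient.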

Responsiveness

Responsiveness was defined as the ability to detect important change over time in the concept being measured.25 In the study by Martin et al,52 the SDC was below the MIC, and area under the curve was above 0.70 for both subscales of the HOS. The SDC for each subscale of HOOS could be calculated on the basis of the standard error of measurement related to the test–retest reliability. Effect size and standard response mean were presented evaluating HRQ, LISH, OHS, PASI and WOMAC. However, this does not provide information on the ability of the instrument to detect important changes over time, with measurement error subtracted, in the concept being measured.
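The quantities mentioned above can be sketched as follows. The formulas are the commonly used ones (SEM = SD × √(1 − ICC), SDC = 1.96 × √2 × SEM, effect size = mean change / SD at baseline, standardised response mean = mean change / SD of change) and are assumptions here, since the text does not state them; all function names and data are hypothetical.

```python
import math
from statistics import mean, stdev


def sem_from_icc(sd, icc):
    """Standard error of measurement from the SD and test-retest ICC."""
    if not 0 <= icc <= 1:
        raise ValueError("ICC must lie between 0 and 1")
    return sd * math.sqrt(1 - icc)


def sdc_from_sem(sem):
    """Smallest detectable change at the individual level (95% confidence)."""
    return 1.96 * math.sqrt(2) * sem


def effect_size(baseline, follow_up):
    """Mean change divided by the SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """Mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical scores for three patients before and after an intervention.
baseline = [40, 50, 60]
follow_up = [60, 65, 85]
print(effect_size(baseline, follow_up), standardised_response_mean(baseline, follow_up))
print(round(sdc_from_sem(sem_from_icc(sd=10, icc=0.75)), 2))
```

Neither the effect size nor the standardised response mean takes measurement error into account, which is why they say little about whether a change exceeds the SDC.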

Interpretability

Interpretability is the degree to which one can assign qualitative meaning to quantitative scores.25 Several studies presented mean and SD scores of at least two relevant subgroups. MIC was only presented in two studies, concerning the HOS and WOMAC.51 52 Information on scores on different hip disability groups was available for HOOS.43

Detailed information on psychometric properties of the questionnaires evaluated in the individual studies can be found online in Appendix 2. The summarised evaluation presented in table 3 is based upon these ratings.

Overall quality

Table 3 shows the quality assessment of the 13 questionnaires, summarising each item as positive, indeterminate or negative. An empty spot indicates no information available on this aspect. Inter-tester reliability is not relevant when the PRO is a self-administered questionnaire, and, therefore, this aspect was only considered for questionnaires that were administered by an observer.

Since the results are dependent on the population being studied, the specific populations are presented. The populations being studied are THR, hip OA, patients following hip arthroscopy, patients following groin-hernia repair and patients with unspecified hip pain. Overall, the HOS and the HOOS received the best ratings for their psychometric properties (six positive scores out of eight).

Discussion

In the present study, we systematically reviewed the literature concerning PRO questionnaires assessing hip and groin disability, and evaluated their psychometric properties. To our knowledge, this is the first systematic review evaluating PRO questionnaires used in the assessment of patients with hip and/or groin disability. We identified 12 PRO questionnaires applied in the assessment of patients with hip disability and one PRO questionnaire for patients with groin disability. Nine of the questionnaires for patients with hip disability were assessing disability of hip OA and/or patients waiting for, or who had undergone THR, and the questionnaires showed remarkable similarities regarding content and purpose.

A systematic review from 2008 showed a considerable variation in the types of outcome measures applied in randomised trials of hip replacement since 2000.58 The reason for this could be explained by our finding that a large number of PRO questionnaires exist and are applied in the assessment of THR, creating more confusion than consensus in the area.

Psychometric properties of a PRO questionnaire should only be related to the specific target population and the context in which it has been applied.9 We have, therefore, chosen to report the psychometric properties of a questionnaire for individual target populations (table 3). Detailed information of the target population and the context in which the questionnaires are applied (table 1) is just as important as the rating of the psychometric properties in the questionnaire when selecting the most appropriate instrument for a specific purpose.

The present study shows that HOOS has adequate psychometric properties when assessing patients with hip OA undergoing conservative treatment or THR. HOOS is a self-administered PRO questionnaire containing adequate measurement qualities for test–retest reliability, floor and ceiling effects, construct validity, responsiveness and interpretability, including patients with hip OA and THR43 and should be considered for this purpose. Furthermore, content validity has been adequately assessed including experts/investigators and patients with hip disability.35 Another advantage of HOOS when assessing patients with hip OA or THR is that it includes specific dimensions concerning function, sport/recreational activity and quality of life. Today, many patients with hip OA and THR are active in sports,59 and, therefore, the ability to perform sport and recreational activities is an important factor contributing to their well-being and health-related quality of life.

The psychometric properties of LISH and WOMAC have been evaluated in several studies. LISH, in its present form, seems to consist of two subdimensions, pain/discomfort and function/mobility.40 A composite score, therefore, seems problematic, since deterioration or improvement of these health-related aspects cannot be distinguished when choosing an aggregate score. The present study shows that the stiffness domain of the WOMAC had a test–retest reliability ICC below 0.70 for OA and THR patients28 33 and an inter-tester reliability ICC for THR patients preoperatively and postoperatively below 0.70,28 suggesting a considerable variation of measurements in this domain, in the included studies. In the study by Quintana et al,51 the SDC for the stiffness domain was above MIC at 6 months postoperatively after THR, indicating that measurement error for the stiffness domain is slightly above MIC,51 which may negatively influence the agreement and responsiveness of this domain. The stiffness domain is only based on two questions in the WOMAC questionnaire, and the reliability of the domain may, therefore, be more easily affected than the reliability of the pain or functional limitation domain, which is in accordance with previous findings.60

Three questionnaires showed ceiling effects (best possible score) following THR postoperatively. WOMAC showed in all studies ceiling effects in several patients following THR. Ceiling effects were also found in OHS in one study49 and in HOOS postoperatively in one study.41 Postoperatively, ceiling effects can either be instrument dependent or intervention dependent. An instrument with good content validity ought only to display ceiling effects when individual disability is non-existent. For the young and physically active patient after THR, a best possible score in the WOMAC domains does not necessarily indicate satisfactory health status, since possibly important and more demanding activities such as sport and recreational activities have not been evaluated. However, for the old and less physically active patient after THR, a best possible score in the WOMAC domains may indicate a very satisfactory health status and an important change caused by the intervention.

Earlier reviews have recommended WOMAC and LISH in the evaluation of hip and/or knee OA, in studies looking at effects of treatment.12 61 Our results are not in complete accordance with this due to the following: our study is restricted to look at data concerning patients with hip and/or groin disability only, where previous studies and reviews were concerned with OA of the hip and knee and included studies where hip and knee OA could not be separated.12 61 We excluded studies where data on hip and knee OA could not be separated, since we were not interested in the questionnaires' ability to measure OA in general. Our goal was to get information on the questionnaires' ability to measure clinical changes in patients with different forms of hip and groin disability. For our group of patients, WOMAC and LISH did not appear to be quite as promising, while the present study reports excellent properties in favour of HOOS.

The OHS and the PASI were promising when evaluating effects of THR, showing adequate test–retest reliability for the self-administered version of PASI and OHS, and adequate inter-tester reliability for the observer-administered version of PASI. More information on important aspects such as internal consistency, floor and ceiling effects and responsiveness, is however needed.

Only HOS, Modified Harris Hip Score (MHHS) and NHS have been evaluated in a younger group of patients (mean age <50 years) with hip and/or groin disability, and for the other PRO questionnaires, the measurement properties are unknown for this group. The present study shows that HOS has adequate psychometric properties when assessing young patients (mean <50 years) undergoing hip arthroscopy. HOS is a self-administered PRO questionnaire containing adequate measurement qualities for test–retest reliability, floor and ceiling effects, construct validity, agreement, responsiveness and interpretability, including patients undergoing hip arthroscopy, and should be considered for this purpose.

HOS and PASI involve patient-specific options, such as individual patient formulation of relevant questions in PASI,55 and the use of non-applicable boxes for irrelevant questions in HOS.38 Although patient-specific response options offer the advantage of identifying patient-relevant issues, they are not yet universally accepted by researchers.62 The lack of standardisation of the items under study means that the scales cannot be considered the same in each patient, and the numeric score may not hold a common meaning.62 Furthermore, the value of analysing the data statistically and calculating parameters such as means and correlations is questionable, and researchers should consider this issue before implementing these instruments.

Harris Hip Score (HHS) is the most widely used instrument when assessing hip disability.58 63 We only included the MHHS, and not the original HHS, since this instrument cannot be considered a true PRO questionnaire as it is a composite score that combines patient-reported information and physical assessment performed by an observer. The MHHS only contains the patient-reported part of the HHS and is currently and widely used for assessing young and active patients undergoing hip arthroscopy.64–67 The present study shows that the psychometric properties of the MHHS have not been adequately assessed in the young population and we cannot recommend the use of the MHHS in studies assessing younger and active patients undergoing hip arthroscopy.

Our study shows clearly that valid, reliable and responsive PRO questionnaires, assessing groin disability, are lacking in general. Groin disability is a common problem in young and physically active people,3 17 19 and HOS and HOOS address dimensions that are relevant to younger and physically active people such as those engaging in sports. However, HOOS and HOS do not include groin-related questions, only questions related to the hip. This is problematic since young patients often report groin symptoms16 17 19 and often do not describe their symptoms as being located to the hip. It, therefore, seems of great importance to try to incorporate groin-related items in a new PRO questionnaire aimed at young and physically active people (mean <50 years). For a large category of these patients, conservative treatment seems to be an effective intervention,68 and a new PRO questionnaire must, therefore, also include non-surgical patients in the development of a reliable, valid and responsive instrument for the assessment of treatment outcome and changes over time.

The criteria list used in our study was developed to evaluate psychometric properties of PRO questionnaires based on CTT.20 IRT is a relatively new method to evaluate questionnaires in healthcare and has some potential advantages over CTT.20 70 The Rasch model, a mathematical model applied in IRT, has been used to develop and internally validate measures, and it uses a logistic function that creates an interval-scaled measure.20 71 In the studies by Martin et al38 and Dawson et al,40 the Rasch method was applied in combination with CTT in the development and internal validation of HOS38 and LISH.40 Our criteria list was only developed to evaluate psychometric properties of questionnaires based upon CTT, and this is a limitation of our study but a limitation we could not avoid with the present available data. In the future, criteria that evaluate methods and results of studies using IRT models must be developed,25 since this method has gained acceptance and studies on developing and/or evaluating questionnaires based on IRT are now more frequent.

Another limitation of our study is that no gold standard exists to evaluate psychometric properties of PRO questionnaires, and our chosen criteria list may, therefore, be disputed. There are other criteria lists available,71 72 but none of these have such detailed criteria for adequate measurement properties as the criteria list published by Terwee et al.25 The inter-tester reliability of the independent ratings based upon the criteria list was good,26 which is in accordance with a previous finding.11

Several systematic reviews, evaluating the efficacy of different treatment modalities for patients with hip and/or groin disability, exist.5 67 73 None of these consider the quality of the outcome measures applied in the included studies. Earlier, reviews were mainly concerned with obvious methodological qualities such as randomisation procedures, control groups, blinding, compliance, drop-out, intention to treat etc.74 Measurement properties have rarely been evaluated in the same methodologically stringent manner.74 A risk of bias may, therefore, have been introduced with the possibility of unqualified instruments being selected when investigating and reporting the efficacy of different treatment modalities. With our study, a step has been taken in the direction of a more stringent documentation of psychometric properties of PRO questionnaires for patients with hip and/or groin disability.

Conclusion

Based on the results of our present study, we recommend HOOS for evaluating patients with hip OA undergoing non-surgical treatment and surgical interventions such as THR. HOS is recommended for evaluating patients undergoing hip arthroscopy. Current and new PRO questionnaires should also be evaluated in younger patients (age <50) with hip and/or groin disability, including surgical and non-surgical patients.

What is already known on this topic

There is a general consensus that patient-reported outcome (PRO) should serve as a gold standard in the assessment of musculoskeletal conditions, where the patient's perspective and health-related quality of life are of main interest. Today, different PRO questionnaires exist and are used interchangeably in the assessment of hip and groin disability, but no recommendations or consensus exists on which questionnaires to prefer.

What this study adds

Many PRO questionnaires used in the assessment of hip and groin disability are insufficiently developed. However, HOOS can be recommended for evaluating patients with hip osteoarthritis undergoing non-surgical treatment and surgical interventions such as total hip replacement. HOS can be recommended for evaluating patients undergoing hip arthroscopy. Furthermore, this study shows that a new PRO questionnaire focusing on the evaluation of hip and groin disability in young and physically active patients is needed.

Acknowledgments

This study was supported by grants from Danish Regions, The Association of Danish Physiotherapists and the Lundbeck Foundation.

Footnotes

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.