This article has a correction. Please see:

The Copenhagen Hip and Groin Outcome Score (HAGOS): development and validation according to the COSMIN checklist
  1. K Thorborg1,
  2. P Hölmich1,
  3. R Christensen2,3,
  4. J Petersen1,
  5. E M Roos2
  1. 1Arthroscopic Centre Amager, Amager Hospital, University of Copenhagen, Copenhagen, Denmark
  2. 2Research Unit for Musculoskeletal Function and Physiotherapy, Institute of Sports Science and Clinical Biomechanics, University of Southern Denmark, Odense, Denmark
  3. 3The Parker Institute: Musculoskeletal Statistics Unit, Copenhagen University Hospital, Frederiksberg, Copenhagen, Denmark
  1. Correspondence to Kristian Thorborg, Faculty of Health Sciences, Department of Orthopaedic Surgery, University of Copenhagen, DK-2300 Copenhagen S, Denmark; kristianthorborg{at}hotmail.com

Abstract

Background Valid, reliable and responsive Patient-Reported Outcome (PRO) questionnaires for young to middle-aged, physically active individuals with hip and groin pain are lacking.

Objective To develop and validate a new PRO in accordance with the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) recommendations for use in young to middle-aged, physically active patients with long-standing hip and/or groin pain.

Methods Preliminary patient interviews (content validity) included 25 patients. Validity, reliability and responsiveness were evaluated in a clinical study including 101 physically active patients (50 women); mean age 36 years, range 18–63 years.

Results The Copenhagen Hip and Groin Outcome Score (HAGOS) consists of six separate subscales assessing Pain, Symptoms, Physical function in daily living, Physical function in Sport and Recreation, Participation in Physical Activities and hip and/or groin-related Quality of Life (QOL). Test–retest reliability was substantial, with intraclass correlation coefficients ranging from 0.82 to 0.91 for the six subscales. The smallest detectable change ranged from 17.7 to 33.8 points at the individual level and from 2.7 to 5.2 points at the group level for the different subscales. Construct validity and responsiveness were confirmed with statistically significant correlation coefficients (0.37–0.73, p < 0.01) for convergent construct validity and for responsiveness from 0.56 to 0.69, p < 0.01.

Conclusion HAGOS has adequate measurement qualities for the assessment of symptoms, activity limitations, participation restrictions and QOL in physically active, young to middle-aged patients with long-standing hip and/or groin pain and is recommended for use in interventions where the patient's perspective and health-related QOL are of primary interest.

Trial registration ClinicalTrials.gov NCT00716729

Introduction

Pain in the hip and groin region is a common musculoskeletal complaint in the young to middle-aged population1 affecting physical function and health-related quality of life (QOL).2 Furthermore, hip and groin pain can be a long-standing condition from which it is difficult to recover fully.3 4 Musculoskeletal disorders such as long-standing hip and groin complaints, therefore, have a large impact on healthcare expenditure, sick leave and work disability,5 resulting in substantial social and economic costs.6

Novel treatment methods, such as hip arthroscopy, incipient groin hernia repair, ultrasound-guided corticosteroid injections and specific exercise regimens, are advancing rapidly in the management of young and middle-aged physically active patients with hip and groin pain.7–15 There is a general consensus that Patient-Reported Outcomes (PROs) should serve as the gold standard in the assessment of musculoskeletal conditions, where the patient's perspective and health-related QOL are of primary interest.16–19 However, valid, reliable and responsive PRO questionnaires for physically active patients with long-standing hip and/or groin pain are lacking.20 The need for reliable and valid instruments is emphasised in a study by Marshall et al,21 who demonstrated that clinical trials using unpublished measurement instruments were more likely to report positive effects of treatment than clinical trials using published instruments. Therefore, in order to properly evaluate the large spectrum of treatment strategies and regimens for young to middle-aged physically active patients with hip and groin pain, a valid, reliable and responsive PRO questionnaire is needed.20

In a recent international consensus process, including leading experts in the fields of psychology, epidemiology, statistics and clinical medicine from all over the world, a consensus on the taxonomy, terminology and definitions of measurement properties for health-related PROs was reached22 and formulated in a COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist.23

The objective of this study was to develop and validate a new PRO questionnaire aimed at young to middle-aged physically active people with long-standing hip and/or groin pain by following the COSMIN recommendations on terminology and research agenda.22 23

Methods

Development of the questionnaire

The methodological framework for developing and evaluating a PRO questionnaire included the following steps: (1) identification of a specific patient population, (2) item generation, (3) item reduction and (4) determination of the validity, reliability and responsiveness. Steps 1 and 2 involved developing a preliminary version of the questionnaire, which is described in the Methods section. Step 3 involved testing the individual items and subscales of the preliminary version by analysing patient responses. Based upon these analyses, a final version of the questionnaire was decided upon. Step 4 involved testing the final version of the questionnaire for validity, reliability and responsiveness. Steps 3 and 4 are described in the Results section. A flowchart of the complete study process is shown in figure 1.

Figure 1

Flowchart of the study process.

Population identification

The goal of this instrument is to evaluate hip and/or groin disability related to impairment (body structure and function), activity (activity limitations) and participation (participation restrictions) according to the International Classification of Functioning, disability and health (ICF),24 in young to middle-aged physically active patients with hip and/or groin pain. Disability in this study encompasses the health dimensions within the methodological framework of ICF as categorised in one of three levels: impairment (body structure and function), activity limitations (activities) and participation restrictions (participation).24 The objective would be to achieve a quantitative measure of the patient's hip and groin disability according to the different levels of the ICF. The measure should reflect the patient's perception of his/her disability as well as his/her actual disability. Physically active patients refer to any patient who is physically active at least 2.5 h a week.25

The groin is anatomically located in the anterior-medial part of the hip region, and the hip and groin region share vascular and neural supply.26 The pathologies of the hip joint and the groin often present simultaneously and the symptoms can be overlapping.27–30 This makes the hip and groin a complex anatomical region where validated diagnostic tools for differentiation of musculoskeletal diagnoses are lacking.31–34 We, therefore, chose not to restrict our measurement instrument to be evaluated in a patient group with a specific diagnosis, but instead we wanted to focus on the commonalities of hip and/or groin pain in physically active patients.

The patient flow is presented in figure 2. Patients with hip and/or groin pain, from primary and secondary care, who were at least 18 years of age, were recruited from January 2009 to February 2010. Patients were screened by a specialist (orthopaedic surgeon or sports physiotherapist) within the area of musculoskeletal examination of hip and/or groin pain in younger physically active patients. If the specialist suspected that hip and/or groin pain was not of musculoskeletal origin, the patient was referred for further investigation and was not invited to participate in the study. All other patients presenting with hip and/or groin pain were considered eligible for the study and were invited to participate. These patients were informed about the purpose of the research by the people responsible for the study, and written consent was obtained from those who agreed to participate. A self-reported questionnaire was used to screen for inclusion and exclusion of the patients who agreed to participate in the study. Patients seeking medical care presenting with hip and/or groin pain were included if they fulfilled all the following criteria: (1) had received treatment for their hip and/or groin pain, (2) were restricted in their activities due to hip and/or groin pain, (3) had hip and/or groin pain in the previous 14 days, (4) had hip and/or groin pain of more than 6 weeks' duration, (5) had hip and/or groin pain located in one of five predefined regions in a pain drawing (region 3, 6, 7, 8 or 9, figure 3) and (6) were physically active for at least 2.5 h per week. Patients with self-reported limiting comorbidities35 were excluded from the study. The pain drawing (figure 3) was adapted from methods for determining location of pain used in previous studies,36 37 and pain of more than 6 weeks' duration has previously been defined as long-standing in nature concerning the population under study.9

Figure 2

Clinical study profile.

Figure 3

Pain drawing showing percentages of included patients (n = 101) indicating pain in 15 predefined regions at baseline.

Item generation

The item generation phase included the following steps: a systematic review of the literature,20 a focus group involving experts and individual patient interviews. The systematic review identified existing PROs that showed adequate measurement qualities or promise concerning validity, reliability and responsiveness when assessing patients with hip and/or groin disability.20 The Hip disability and Osteoarthritis Outcome Score (HOOS) and the Hip Outcome Score (HOS) were found to be promising tools for patients with hip and/or groin disability; however, the HOOS questionnaire had only been validated in patients with hip osteoarthritis or following total hip replacement, and the HOS in patients following hip arthroscopy. Therefore, the items were not necessarily addressing our target group of young to middle-aged physically active patients with hip and/or groin pain.20

The HOOS was chosen as a template for the development of a new PRO questionnaire because HOOS consists of items and subscales related to body structure and function, activity and participation according to the ICF classification. It shows excellent measurement qualities in patients with hip disability for all dimensions. HOOS consists of five subscales: Pain, Symptoms, Function in daily living (ADL), Sport and Recreation function (Sport/Rec) and hip-related QOL.38 Furthermore, HOOS includes a format that is user friendly, self-explanatory and is already adopted in hip rehabilitation research worldwide.20 We, therefore, decided to translate and cross-culturally adapt the HOOS from the original Swedish version to a Danish version according to existing guidelines39 40 in a process that included 24 patients with hip disability.41 We then incorporated and adapted three items that seemed relevant from the HOS – Sports subscale that were not present in HOOS.42–44 The items from the HOS were named SP7, SP9 and SP10 (table 2).

Table 2

Preliminary items and subscales in HAGOS

Groin problems are common in physically active people and HOOS and HOS address dimensions, such as sport, that are relevant to young to middle-aged physically active people.20 However, HOOS and HOS do not include groin-related questions, only questions related to the hip. This is problematic because young to middle-aged physically active patients often report groin symptoms27 28 30 and often do not describe their symptoms as being located in the hip.20 All questions in the new outcome questionnaire were therefore rephrased so that they referred to the term ‘hip and/or groin’, instead of the term ‘hip’ alone, to improve the face validity of the questionnaire. We found this appropriate based on the existing data that have shown that patients with hip and groin pathology often report symptoms that do not seem to be restricted to one of these anatomical regions,27 28 30 recognising that these regions have never been precisely defined anatomically, and therefore merely reflect individual and cultural beliefs.37 By using the term ‘hip and/or groin’, we believe that the questionnaire covers a body region that also refers to the frontal and medial part of the hip region (the groin) that patients often refer to as a separate region.37 The new questionnaire was therefore named the Copenhagen Hip and Groin Outcome Score, abbreviated to HAGOS (appendices 1 and 2).

Expert focus group

The second step involved interviewing experts in the field. Three doctors (two orthopaedic surgeons and one physician) and four physiotherapists (four sports physiotherapists, one also being a musculoskeletal physiotherapist) with extensive experience and special expertise in treating physically active patients with hip and/or groin pain were interviewed. The experts underwent a semi-structured interview in which they were asked to fill out the preliminary version while commenting on issues related to questions they felt were missing, the questionnaire's readability and its ease of comprehension. The purpose of the interview was to identify relevant items that were missing and to improve the readability and comprehension of the questionnaire.

The experts commented that the introductory information on the questionnaire, where patients were asked to report disability related to the previous week, was problematic. The experts stated that many patients with hip and groin disability have had the problem for a long time and due to their disability, may not have performed these activities at all during the previous week, and therefore would not be able to answer this question in a valid way. It was therefore decided to add the following introductory information: If an item does not pertain to you or you have not experienced it in the past week please make your ‘best guess’ as to which response would be the most accurate. This solution has previously been used in the format of The Western Ontario Rotator Cuff Index and the Western Ontario Instability Score.45 46 Because the current outcome questionnaire is not only a measure of actual disability but also perceived disability, we found this solution appropriate. Based upon the focus group involving the experts, item S1 from the original HOOS38 was divided into S1 and S2 as discomfort and clicking were considered to be different symptomatic aspects. Furthermore, six items, named P12, P13, SP5, SP6, Q4 and Q5, were added after suggestions by the experts (table 2).

Patient interviews

The final step in the item generation process was to interview patients with hip and/or groin disability individually. Individual patients were specifically chosen for an interview so that there would be representation of sex, age, type of injury, time from initial injury and severity of symptoms. The preliminary questionnaire was piloted on patients until data saturation was achieved. The patients underwent a semi-structured interview in which they were asked to fill out the preliminary version while commenting on issues related to questions they felt were missing, the questionnaire's readability and its ease of comprehension. This process included 25 patients, 12 men and 13 women (34 ± 11 years) recruited from the Arthroscopic Centre Amager, Amager Hospital. Twenty patients were interviewed individually before data saturation was achieved and two items were added, P2 and SP8 (table 2). Furthermore, several patients mentioned that they did not understand the meaning of Q3 from the original HOOS: How much are you troubled with lack of confidence in your hip?38 Even though the main purpose of this process was not to omit items, we decided that the item had to be removed because too many patients did not understand the meaning of the question. This new preliminary version was piloted on five patients and did not require further modification. The preliminary questionnaire consisted, after item generation, of 52 items in five subscales (Symptoms (7), Pain (13), ADL (17), Sport/Rec (10) and QOL (5)).

Methodological testing and evaluation of measurement qualities of the new patient-reported questionnaire using the COSMIN checklist

Internal consistency

Internal consistency is the degree of interrelatedness among the items.47 A principal component factor analysis was performed on the individual subscales to assess their structural validity. Failure to load on a single major factor suggests that the items do not all measure the same construct. Cronbach's α was calculated per subscale and a score above 0.70 was taken as an indication of sufficient homogeneity of the items in the subscale.48 49
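As an illustration of the homogeneity check described above, Cronbach's α for one subscale can be computed from a table of respondents by items. The following is a minimal Python sketch, not the authors' SPSS procedure; the function name `cronbach_alpha` and the example responses are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one subscale.

    item_scores: list of respondents, each a list with one score per item.
    """
    k = len(item_scores[0])  # number of items in the subscale
    # Variance of each item across respondents.
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the summed subscale score across respondents.
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from five patients to a three-item subscale.
example = [[0, 1, 1], [1, 1, 2], [2, 3, 2], [3, 3, 4], [4, 4, 4]]
alpha = cronbach_alpha(example)
sufficient = alpha > 0.70  # threshold taken from the text above
```

The same variance estimator is used for the items and for the total score, so the choice between population and sample variance does not change the result.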

Test–retest reliability

Test–retest reliability is the extent to which scores for the same patients are unchanged for repeated measurements over time.47 Intraclass correlation coefficients (ICCs) were reported, and an ICC of ≥0.70 was required for all subscales.48 49 Test–retest reliability was evaluated after 1–3 weeks in 44 stable patients. This time interval between test and retest was chosen because we believe it is long enough to prevent recall of previous answers, though short enough to assume that the condition in most cases will not change.49 Patients reported at the retest whether their hip and/or groin pain was ‘better’, ‘not changed’ or ‘worse’ since the initial test. Patients reporting ‘not changed’ were considered stable and were included in the test–retest reliability analysis.22 23

Measurement error

Measurement error is the systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured.47 The smallest detectable change (SDC), which is the threshold for determining clinical changes beyond measurement error, was calculated on the basis of the SEM of the test–retest reliability.49 50

Construct validity

Construct validity is the degree to which the scores of a PRO instrument are consistent with a priori hypotheses, based on the assumption that the PRO instrument validly measures the construct to be measured.47 Construct validity was studied by correlating the subscale scores of the HAGOS with the subscales of the Short Form-36 items (SF-36). SF-36 (Acute version, 1.1, Health Assessment Lab, Hillerød, Denmark, 1993) was used because it is a PRO measure that contains relevant domains for assessing physically active patients with reduced physical function and pain.51–53 SF-36 is a generic measure of health status comprising eight subscales: Physical Functioning (PF), Role-Physical (RP), Bodily Pain (BP), General Health (GH), Vitality (VT), Social Functioning (SF), Role-Emotional (RE) and Mental Health (MH). The SF-36 is a valid and reliable instrument also when used in the Danish population.54–56 Convergent and divergent evidence was examined by assessment of the associations between the HAGOS and SF-36 by the use of Spearman correlation. This construct validity was determined by cross-sectional comparison of the questionnaires when first administered.
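For readers who want to see how a Spearman correlation between two subscale scores is obtained, the sketch below ranks both score lists (ties receive the average rank) and then computes the Pearson correlation of the ranks. It is a plain-Python illustration under these assumptions, not the software used in the study; all function names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical subscale scores (0-100) for six patients.
hagos_pain = [35, 50, 50, 62, 78, 90]
sf36_bp = [22, 41, 52, 51, 74, 84]
rho = spearman(hagos_pain, sf36_bp)
```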

A priori hypotheses were formulated.22 23 We expected the highest correlations when comparing the scales that are supposed to measure similar constructs. Since the HAGOS is designed to measure physical health in patients with hip and/or groin pain rather than mental health, we expected to observe generally higher correlations between the HAGOS subscales and the SF-36 subscales of PF, RP and BP (convergent construct validity) than between the HAGOS subscales and the SF-36 subscales of MH, VT, RE, SF and GH (divergent construct validity).

Furthermore, we hypothesised that the correlation between the HAGOS subscales ADL and Sport/Rec and the SF-36 subscale PF was at least 0.5, and higher than for the other HAGOS subscales. The correlation between the SF-36 subscale BP and HAGOS subscales Pain and Symptoms should be at least 0.5 and 0.4, respectively, and higher than for the other HAGOS subscales. Finally, for the subscale QOL, which hypothetically relates to both physical and mental health, we expected a correlation of at least 0.4 to the SF-36 subscale MH.

Responsiveness

Responsiveness is defined as the ability of a PRO instrument to detect change over time in the construct to be measured.47 For evaluating responsiveness, a Global Perceived Effect (GPE) score, where the patients rate their condition in one of seven categories was used. At a 4-month administration (follow-up), patients were asked to rate possible change in their condition since the initial administration (baseline) in relation to their hip and/or groin pain. A 4-month follow-up was chosen since this was a reasonably long timeframe to expect clinical improvement to occur in patients with long-standing hip and/or groin pain,57 though still short enough to assume that patients would be able to recall whether any changes in their condition had occurred during this period. The GPE had the following answer options: much better (3), better (2), somewhat better (1), no change (0), somewhat worse (−1), worse (−2) and much worse (−3). A priori hypotheses were formulated for responsiveness.22 23 We hypothesised that the change in scores of the six subscales of the HAGOS between the initial administration and the 4-month administration would correlate with the GPE score, and that the correlation was at least 0.4 for all subscales. Furthermore, standardised response mean (SRM) and effect size (ES) should be higher for patients who reported their condition to be better or much better, than patients reporting no change, only somewhat better or worse on the GPE score. SRM and ES should also be lower for patients reporting worse or much worse than patients reporting no change or only somewhat better or worse on the GPE score.

Interpretability

Interpretability is the degree to which one can assign qualitative meaning to an instrument's quantitative scores or change in scores.47 Interpretability includes the distribution of total scores and change scores in the study sample and in relevant subgroups, floor and ceiling effects, estimates of minimal important change (MIC) and/or minimal important difference (MID).58 Floor and ceiling effects are present if the questionnaire fails to demonstrate a worse score in the patients demonstrating signs of clinical deterioration and an improved score in patients who show clinical improvement as this can be an indication that a scale is not sufficiently comprehensive. In this study, floor and ceiling effects were defined to be present if more than 15% of the patients were reporting worst (0) or best (100) possible score.49 59
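The 15% rule for floor and ceiling effects stated above can be expressed as a short check on a list of subscale scores. This is a minimal sketch assuming scores on the 0 to 100 scale described in the text; the function name `floor_ceiling` and the example scores are hypothetical.

```python
def floor_ceiling(scores, threshold=0.15):
    """Share of patients at the worst (0) and best (100) possible score.

    An effect is flagged when the share exceeds the threshold (15% in the text).
    """
    n = len(scores)
    floor_share = sum(1 for s in scores if s == 0) / n
    ceiling_share = sum(1 for s in scores if s == 100) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical subscale scores for ten patients.
result = floor_ceiling([0, 0, 50, 60, 70, 80, 90, 100, 100, 100])
```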

Statistical analyses

A sample size ≥100 patients and 7 times the number of items in the scale has been recommended for factor analysis.49 Unidimensionality of the different subscales was assessed by exploratory factor analysis using principal component analysis with varimax rotation in SPSS statistics (version 17.0).60 Median values were imputed in situations where missing values existed. Eigenvalues and factor loading patterns were used to identify and extract factors.61 Items with the lowest factor loading were sequentially deleted until only one eigenvalue above 1 was produced. The relative test–retest reliability was calculated based on a linear mixed model (with participants handled as random effects). To estimate the test–retest reliability of the HAGOS subscales, ICCs (3.1, two-way mixed effects model absolute agreement) with 95% CIs were calculated.61

Measurement error was expressed as the SEM, which was calculated as SD × √(1 − ICC), where SD is the standard deviation of all scores from the participants.61 62 The SEM was used for calculating the SDC at the individual level, calculated as SEM × 1.96 × √2, and at the group level, calculated as (SEM × 1.96 × √2)/√n.63 64 Internal consistency, or interitem correlation, was assessed by calculation of Cronbach's α of the baseline values.61 A 95% CI for the SDC was calculated using the upper and lower confidence limits of the ICC used to derive the SEM.
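The SEM and SDC formulas above can be worked through numerically. The sketch below is only an illustration of those formulas; the function names and the input values (an SD of 20 points, an ICC of 0.90 and a group of 44 patients) are hypothetical and are not taken from table 3.

```python
import math


def sem(sd, icc):
    """Standard error of measurement: SD x sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)


def sdc_individual(sem_value):
    """Smallest detectable change for one patient: SEM x 1.96 x sqrt(2)."""
    return sem_value * 1.96 * math.sqrt(2)


def sdc_group(sem_value, n):
    """Smallest detectable change for a group of n patients."""
    return sdc_individual(sem_value) / math.sqrt(n)


# Hypothetical inputs, chosen only to illustrate the arithmetic.
example_sem = sem(sd=20.0, icc=0.90)
example_sdc_individual = sdc_individual(example_sem)
example_sdc_group = sdc_group(example_sem, n=44)
```

With these assumed inputs the SEM is about 6.3 points, the individual-level SDC about 17.5 points and the group-level SDC about 2.6 points, which shows how strongly the group-level threshold shrinks with sample size.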

Convergent and divergent validity of the HAGOS and the SF-36 were investigated by Spearman's correlation coefficient. Likewise, associations on responsiveness were then measured by correlating the GPE with the change scores of each HAGOS subscale at the 4-month assessment, using Spearman's correlation coefficients. Correlations of 0.5 are considered large, 0.3 is moderate and 0.1 is small.65 Furthermore, to evaluate the responsiveness of the HAGOS, two distribution-based statistics were evaluated concerning different groups of GPE: (1) the SRM, calculated as the mean change in score divided by the SD of the change and (2) the ES, equal to the mean change in score divided by the SD of the baseline score.61 Both SRM and ES are calculated at the 4-month assessment, compared with baseline.
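The two distribution-based statistics defined above differ only in their denominator, which the following sketch makes explicit. It assumes paired baseline and 4-month scores for the same patients and uses the sample standard deviation; the function names and example scores are hypothetical.

```python
from statistics import mean, stdev


def srm(baseline, follow_up):
    """Standardised response mean: mean change / SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def effect_size(baseline, follow_up):
    """Effect size: mean change / SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


# Hypothetical subscale scores for four patients who reported improvement.
baseline_scores = [40, 50, 60, 70]
follow_up_scores = [50, 70, 70, 90]
example_srm = srm(baseline_scores, follow_up_scores)
example_es = effect_size(baseline_scores, follow_up_scores)
```

In the study these statistics were compared across groups defined by the GPE score, so in practice each function would be applied once per GPE group and subscale.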

Results

Prospective clinical study

A prospective clinical study was designed to assess validity, reliability and responsiveness. The study was conducted at the Arthroscopic Centre Amager, Amager Hospital, Copenhagen. The Danish ethics committee of the capital region and the Danish Data Protection Agency approved the study. Patients were recruited from primary and secondary care. One hundred and twenty-six patients were screened for eligibility during a clinical consultation by a specialist (an orthopaedic surgeon or a sports physiotherapist). One hundred and one patients were included in the study and they completed the HAGOS and SF-36 questionnaires at the initial consultation. Patients were sent the HAGOS after 1 week and asked to complete the questionnaire a second time and return it by mail as soon as possible. At the 4-month follow-up, the HAGOS and the GPE scores were sent by mail, and completed at home. At the 4-month follow-up, patients who did not respond within 3 weeks received one reminder via email or telephone. Eighty-seven patients (86%) responded at the 4-month follow-up (figure 2).

The clinical study included 50 women and 51 men, mean age 36 years, range 18–63 years. Patient characteristics including age, height, weight, body mass index, physical activity level, pain duration and pain medication use are shown in table 1. Localisation of pain according to body region was reported by all patients and the results are shown in figure 3.

Table 1

Baseline characteristics

Content validity

Item reduction

Based upon the first and second administration of the preliminary HAGOS version (table 2), item reduction was performed using the following strategy, which incorporated both quantitative and qualitative components. Individual items at the first administration (baseline) that had a median score of <1, and/or a mean score of <1, and/or where more than 50% of the respondents reported no problems, and/or more than 5% of patients had a missing response to an item, and/or a test–retest reliability (ICC 3.1, agreement) coefficient of less than 0.50 were considered possibly irrelevant for the population under study. For all 15 items identified as possibly irrelevant, four members (KT, PH, RC and EMR) of the study group voted about whether these individual items should be removed or not. Each member was told to consider the feasibility of each item based upon content, relevance, patient response and measurement qualities. Each member had one vote and items were removed if at least three of four voted for their removal. If two were for and two were against, consensus was sought by further discussion concerning the relevance of the item. Based upon this, 14 of the 15 items deemed possibly irrelevant were removed. Items P5 and P12 were removed from the Pain subscale. From the ADL subscale, items A1, A3, A4, A6, A8, A9, A10, A11, A13, A14, A15 and A17 were removed. Q4 was also considered for removal due to an ICC below 0.5 (table 2), but it was decided to keep this item, since only one person in the study group voted for its removal. After this process, the questionnaire consisted of 38 items in five subscales (Symptoms (7), Pain (11), ADL (5), Sport/Rec (10) and QOL (5)).
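The screening criteria and voting rule described above can be written as two small functions. This is a hedged sketch of the stated rules only, not the authors' actual procedure or software; the function names, argument names and example values are hypothetical.

```python
def possibly_irrelevant(median_score, mean_score, share_no_problems,
                        share_missing, icc):
    """Flag an item if any one of the quantitative criteria in the text is met."""
    return (
        median_score < 1
        or mean_score < 1
        or share_no_problems > 0.50   # more than 50% reported no problems
        or share_missing > 0.05       # more than 5% missing responses
        or icc < 0.50                 # test-retest ICC below 0.50
    )


def vote_outcome(votes_for_removal):
    """Apply the four-member voting rule to a flagged item."""
    if votes_for_removal >= 3:
        return "remove"
    if votes_for_removal == 2:
        return "discuss"  # two for, two against: seek consensus
    return "keep"


# Hypothetical item flagged only because of a low ICC, with one vote for removal.
flagged = possibly_irrelevant(2, 1.8, 0.20, 0.01, 0.45)
decision = vote_outcome(1)
```

The example mirrors the situation reported for Q4, which was flagged for a low ICC but kept because only one member voted for its removal.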

Internal consistency

Factor analysis of the five individual subscales showed that the items in the Symptoms, ADL and QOL subscales loaded on one factor with eigenvalues of 3.2 (46% of the variance), 3.3 (66% of the variance) and 2.9 (58% of the variance), respectively. Factor analysis of the Pain subscale showed that two factors with an eigenvalue greater than 1 were produced. Factor analysis was repeated after omitting item P13 (‘Do you have any pain when squeezing your legs together?’), after which the subscale loaded on a single factor, with an eigenvalue of 5.6 (56% of the variance); item P13 was therefore removed from the questionnaire. Factor analysis of the Sports subscale showed that two factors with an eigenvalue greater than 1 were produced. Items 9 and 10 seemed to form a separate subscale and these were omitted from the Sports subscale and further tested as a separate subscale. Items 1–8 in the Sports scale loaded on a single factor, with an eigenvalue of 5.3 (66% of the variance), and items 9 and 10 loaded on a single factor, with an eigenvalue of 1.8 (89% of the variance); this new subscale was named Participation in Physical Activity (PA). The final version of the HAGOS then held 37 items in six separate subscales: Pain (10 items), Symptoms (7 items), ADL (5 items), Sport/Rec (8 items), PA (2 items) and QOL (5 items) (appendix 1). For each of the six HAGOS subscales, Cronbach's α was above 0.78, indicating sufficient homogeneity of all items in the subscales (table 3).
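The percentages of variance quoted above can be related to the eigenvalues with a simple calculation. The sketch assumes that the principal component analysis was run on the correlation matrix, in which case the total variance equals the number of items and the share explained by a factor is its eigenvalue divided by the item count. That assumption, the function name and the dictionary layout are ours; the eigenvalues and item counts are those reported in the text.

```python
def percent_variance(eigenvalue, n_items):
    """Percentage of variance explained, assuming PCA on a correlation matrix."""
    return round(100 * eigenvalue / n_items)


# (eigenvalue, number of items) per subscale, as reported in the text.
reported = {
    "Symptoms": (3.2, 7),
    "ADL": (3.3, 5),
    "QOL": (2.9, 5),
    "Pain": (5.6, 10),
    "Sport/Rec": (5.3, 8),
    "PA": (1.8, 2),
}
percentages = {name: percent_variance(ev, k) for name, (ev, k) in reported.items()}
```

Under this assumption the calculation reproduces the reported percentages for five subscales. For PA it gives 90% rather than the reported 89%, presumably because the published eigenvalue is rounded to one decimal place.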

Table 3

Descriptive statistics and test–retest reliability of HAGOS

Testing the final version of HAGOS

Missing data

HAGOS: Few individual items were missing. At baseline, 9 of 3737 item responses (101 patients × 37 items; 0.2%) were missing. A total score could be calculated for all subjects for all subscales except for PA, where a total score could be calculated for all but one subject. At retest, 1 of 1628 item responses (44 patients × 37 items; 0.1%) was missing. Test–retest analyses could be performed for 44 subjects for all subscales except for PA, where test–retest analysis could be performed for 43 subjects. At the 4-month follow-up, 21 of 3219 item responses (87 patients × 37 items; 0.7%) were missing.

SF-36: Few individual items were missing. At the baseline measurement, 7 of 3636 item responses (101 patients × 36 items; 0.2%) were missing. A total score could be calculated for all subjects for all subscales.
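The missing-item percentages reported above follow from dividing the number of missing items by the product of patients and items. The short sketch below reproduces that arithmetic from the counts given in the text; the function name and dictionary keys are hypothetical.

```python
def missing_percent(n_missing, n_patients, n_items):
    """Percentage of item responses missing, rounded to one decimal place."""
    return round(100 * n_missing / (n_patients * n_items), 1)


# Counts taken from the text: (missing items, patients, items per questionnaire).
missing = {
    "HAGOS baseline": missing_percent(9, 101, 37),
    "HAGOS retest": missing_percent(1, 44, 37),
    "HAGOS 4-month": missing_percent(21, 87, 37),
    "SF-36 baseline": missing_percent(7, 101, 36),
}
```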

Test–retest reliability and measurement error

Table 3 shows ICCs, SEM and SDC of all subscales of the HAGOS. Retest was completed within a mean of 11 days, and a range of 7–21 days. For all subscales of the HAGOS, the ICCs were between 0.82 and 0.91, indicating good test–retest reliability. The SDC at the individual level ranged from 17.7 to 33.8 points and at the group level from 2.7 to 5.2 points for the different subscales.

Construct validity

Generally higher correlations were found between the HAGOS subscales and the SF-36 subscales of PF, RP and BP (convergent construct validity) than between the HAGOS and the SF-36 subscales of MH, VT, RE, SF and GH (divergent construct validity) (table 4). As hypothesised, the correlations between the HAGOS subscales ADL and Sport/Rec and the SF-36 subscale PF were at least 0.5, and higher than for the other HAGOS subscales (Pain, Symptoms, PA and QOL). The correlations between the HAGOS subscales Pain and Symptoms and the SF-36 subscale BP were at least 0.5 and 0.4, respectively, and as hypothesised, higher than for the HAGOS subscales PA and QOL, but not higher than for the HAGOS subscales ADL and Sport/Rec. The subscale QOL was moderately correlated to the SF-36 subscale MH, at 0.38 but did not reach the hypothesised threshold of being at least 0.4.

Table 4

Spearman's correlation coefficients (r) determined when comparing the six dimensions in HAGOS to the eight different subscales in SF-36, N = 101

Responsiveness

As hypothesised, change in the six subscales of the HAGOS correlated with the GPE score, and the correlation was at least 0.4 for all subscales. As hypothesised, ES and SRM were lower for patients reporting worse or much worse than patients reporting somewhat worse, no change or somewhat better on the GPE score, for all subscales. Furthermore, ES and SRM for all subscales were higher for patients who reported their condition to be better or much better than patients reporting no change or only somewhat better or worse on the GPE score (table 6).

Table 6

Responsiveness

Interpretability

Floor and ceiling effects, predefined as present if more than 15% of the patients were reporting worst (0) or best (100) possible score, were found for the HAGOS subscales PA and ADL at some time points. Much larger floor and ceiling effects (40–80%) were seen for some of the SF-36 subscales. The distributions of total scores and change scores in the study sample and in relevant subgroups are presented in tables 5 and 6, and floor and ceiling effects of the HAGOS and SF-36 are presented in table 5.

Table 5

HAGOS score, baseline and 4-month assessment and SF-36 score, baseline assessment

Discussion

The HAGOS is, to our knowledge, the first patient-reported questionnaire developed for young to middle-aged physically active patients with long-standing hip and groin pain, using a prospective research design. Furthermore, this is one of the first studies following the full COSMIN checklist in the development and testing of a PRO instrument – a checklist based on the recent international consensus process involving leading experts in the development and testing of PRO questionnaires.22 23 The current study therefore stringently follows the mandatory steps concerning reliability, validity and responsiveness.22 23

We found the checklist easy to use and helpful when designing the current study. The purpose of the COSMIN checklist is to evaluate the methodological quality of studies concerning measurement properties of PRO instruments. However, it is important to be aware that the COSMIN checklist is not yet intended for evaluating the quality of the PRO instruments themselves.22 23 In the current study, we therefore had to rely on criteria for what constitutes adequate measurement qualities previously proposed by different authors.48 49 We agree with the COSMIN panel that a future consensus on such criteria should be included in the COSMIN recommendations58 to ensure methodological standardisation of this part of the process as well.

Content validity

In contrast to the development of many previous PROs concerning hip disability,20 the HAGOS meets the standards for the development of a PRO instrument by including patients in the development process.49 61 A study by Martin et al,66 involving patients comparable with the patients in the current study, showed that large discrepancies exist between clinicians and patients when they are asked to rate the importance of different questions related to hip problems.66 This study by Martin et al66 indicates that these patients perceive questions related to sports and recreation and social-emotional aspects to be of most importance. This seems to be in accordance with the results of the current study, where the lowest baseline scores existed in the subscales Sport/Rec, PA and hip and/or groin-related QOL.

Internal consistency

Unidimensionality of a (sub)scale indicates that all the items measure the same aspect.61 The factor structures of the preliminary HAGOS subscales Pain and Sport/Rec were not unidimensional. Therefore, remodelling the factor structure of these subscales and creating a new subscale (PA) seemed warranted. In the process of remodelling the factor structure, we removed one item in the Pain subscale, since this item did not conceptually fit under any of the other factors. This item asks about pain when ‘squeezing your legs together’ and may be difficult for patients to comprehend, since this is not a frequent activity or movement that all patients perform. This item was included by the expert panel and may represent a more clinical way of thinking, since the adductor squeeze is an important clinical test performed in this population.27 28 67 68 The factor analysis revealed that two items formed a separate subscale concerning the ability to participate in physical activity (PA). The PA subscale seems highly relevant for the population that it is intended for because the inability to fully participate in sports and other physical activities often is one of the most frustrating aspects for these individuals.

Test–retest reliability and measurement error

The ICC values were adequate for all subscales, indicating adequate test–retest reliability at the group level.48 49 The SDC at the individual level (SDCindividual) ranged from 18 to 34 points across the subscales (table 3), with the largest value (34 points) found for the PA subscale. Changes above the SDC values can be considered real changes at the individual level. Large SDCindividual values, as found in the current study, are common for patient-reported questionnaires,69 70 indicating that such questionnaires can be problematic for use at the individual level because they cannot detect minimal but still clinically important changes.50 At the group level, the SDC (SDCgroup) ranged from 2.7 to 5.2 for the different subscales, which means that changes above approximately 5 points in group mean scores can be detected with 95% confidence. The fact that the SDCgroup is much smaller than the corresponding SDCindividual implies that the HAGOS is much better at detecting changes at a group level.
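The relation between the individual-level and group-level SDC can be sketched with the formulas commonly used in the measurement literature: SDCindividual = 1.96 × √2 × SEM and SDCgroup = SDCindividual / √n. It is an assumption that these are the formulas applied in the current study, but the reported values are consistent with them when n is the number of retest subjects (44, or 43 for PA). The function names are hypothetical.

```python
import math


def sdc_individual(sem):
    """Smallest detectable change for one patient, given the SEM (95% level)."""
    return 1.96 * math.sqrt(2) * sem


def sdc_group(sdc_ind, n):
    """Smallest detectable change in a group mean for n patients."""
    return sdc_ind / math.sqrt(n)


# Reported extremes of the individual-level SDC, with the retest sample sizes
# given in the Results (44 subjects; 43 for the PA subscale).
lowest_group = round(sdc_group(17.7, 44), 1)   # compare with the reported 2.7
highest_group = round(sdc_group(33.8, 43), 1)  # compare with the reported 5.2
```

Because the group-level value shrinks with the square root of the sample size, larger study groups allow smaller mean changes to be distinguished from measurement error.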

Construct validity

Validation of instruments assessing PROs is a challenge since no gold standard is available for comparisons.58 Instead, construct validity has been assessed by correlating the new measure with already existing well-validated measures for similar constructs (convergent construct validity) and dissimilar constructs (divergent construct validity).58 Because the HAGOS is the first PRO for physically active patients with hip and/or groin pain, no ideal instrument for comparison existed. We therefore chose to use the SF-36, since this is a well-validated measure54–56 with adequate measurement qualities, which has been used in similar populations with similar musculoskeletal complaints from other anatomical regions.51–53

Responsiveness

Responsiveness is a very important measurement quality in an outcome score,48 because it is an indication of the PRO's ability to detect when patients are undergoing relevant clinical changes.48 49 In the COSMIN process, it was recommended that appropriate measures to evaluate responsiveness are the same as those for hypotheses testing and construct validity, with the only difference being that the hypotheses should focus on the change score of the instrument.58 The GPE score is only based on one transition question and has therefore been assumed to be less reliable than a multi-item instrument.71 However, despite its possible lack of measurement precision, all a priori hypotheses concerning responsiveness of all the HAGOS subscales were confirmed in the current study and showed high correlations between the GPE score and the change scores of the HAGOS subscales ranging between 0.56 and 0.69. ESs for the different subscales for patients reporting to be ‘better’ or ‘much better’ ranged from 0.9 to 1.2 for Symptoms, Sport/Rec and PA, whereas it was 0.77 for ADL and 1.78 for QOL. This indicates that more patients are needed for a clinical trial when the ADL subscale is the primary outcome, and fewer patients are needed when QOL is the primary outcome, compared with when using the subscales Symptoms, Sport/Rec and PA as primary outcomes.
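For readers unfamiliar with the two responsiveness indices, the following minimal sketch uses their conventional definitions: the ES is the mean change divided by the SD of the baseline scores, and the SRM is the mean change divided by the SD of the change scores. It is an assumption that the study used exactly these definitions; the function names and the four patient scores are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """ES: mean change divided by the SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """SRM: mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical 0-100 subscale scores for four patients (illustration only).
baseline = [40, 50, 60, 70]
follow_up = [60, 65, 85, 80]
es = effect_size(baseline, follow_up)
srm = standardised_response_mean(baseline, follow_up)
```

Both indices express change in SD units, which is why a subscale with a larger ES requires fewer patients to detect the same underlying improvement.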

Interpretability

Few patients reported a floor or ceiling score for the HAGOS, indicating a possibility to measure both improvement and deterioration over time. The exception was the subscale PA, where 39 patients reported the worst possible score (floor effect) at the initial administration and 28 patients reported the worst possible score at the 4-month administration. A floor effect of the PA subscale was, however, not surprising considering the response options in these items. The answer options to the questions concerning the ability to participate in physical activities range from ‘always’ to ‘never’. It is not possible to participate to a degree less than ‘never’, and therefore the high number of patients answering ‘never’ to these questions does not seem problematic for the subscale because further deterioration is not possible. Instead, we believe that the floor effects in this subscale emphasise the relevance of these items for the population under study. The floor effect could most likely be avoided in the future if easier items are added to the PA scale. However, items concerning PA should be patient derived (in order for the subscale to have true content validity), and thus should be based on further patient interviews focusing on this particular issue. For the ADL subscale, a ceiling effect was present at the 4-month assessment. Again, this is hardly surprising since the items concerning function and ADL are usually not the most important for the population under study.66 However, for patients with severe hip and groin pain, assessing their limitations in daily activities may still be relevant.
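The predefined criterion for floor and ceiling effects (more than 15% of patients at the worst or best possible score) can be sketched as follows. The function name, the returned keys and the ten scores are hypothetical and serve only to illustrate the rule.

```python
def floor_ceiling(scores, threshold=0.15, worst=0, best=100):
    """Share of patients at the worst/best score and whether it exceeds the threshold."""
    n = len(scores)
    floor_share = sum(s == worst for s in scores) / n
    ceiling_share = sum(s == best for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical subscale scores for ten patients (illustration only).
scores = [0, 0, 0, 25, 50, 62.5, 75, 100, 37.5, 12.5]
result = floor_ceiling(scores)
```

In this invented sample, three of ten patients score 0, so a floor effect is flagged, whereas one of ten at 100 stays below the 15% criterion.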

Large ceiling effects were seen in the SF-36 for the subscales RP, SF and RE, indicating that these subscales may not be very relevant for the population in the current study. However, for the subscales PF and BP, which were primarily used for testing convergent validity in the current study, no floor or ceiling effects existed.

The MIC or the MID has been proposed for establishing cut-points for minimal but still patient-relevant clinical improvements. The MIC is the smallest change in score (within a patient) in the construct that can be measured that patients still perceive as important.58 The MID is the smallest difference in the construct that can be measured (between patients) that is considered important.58 There is an ongoing debate in the literature about which methods should be used to determine the MIC and/or the MID of a PRO instrument.58 Within the COSMIN Delphi process, no consensus on standards for assessing MIC or MID could be reached,58 which is also reflected in the large variation in reporting and interpretation of these concepts in the literature.71 However, it has been shown that under many circumstances, when patients with a chronic disease are asked to identify minimal change, the estimates fall very close to half an SD.72 Using this approach, the MIC of the HAGOS subscales would fall between 10 and 15 points for the six subscales (table 5). We recognise that future research on the interpretability of PRO instruments may provide new evidence that necessitates a different approach. Until then, we agree with Norman et al72 that applying the rule of thumb that the estimates of the MIC fall very close to half an SD seems reasonable in the absence of more specific information.
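The half-SD rule of thumb can be illustrated with a short sketch. Which SD was used (for example, the baseline SD of each subscale in table 5) is an assumption here; the function name and the four scores are hypothetical.

```python
from statistics import stdev


def half_sd_mic(baseline_scores):
    """Approximate the MIC as half the sample SD of the baseline scores."""
    return 0.5 * stdev(baseline_scores)


# Hypothetical baseline subscale scores (illustration only).
baseline_scores = [30, 50, 70, 90]
mic = half_sd_mic(baseline_scores)
```

With subscale SDs between roughly 20 and 30 points, this rule yields MIC estimates of 10 to 15 points, matching the range stated above.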

Methodological limitations

For practical reasons, the second and third administrations of the questionnaire were completed by the patients at home, and therefore in an environment different from the hospital setting. Since all the questionnaires used in this study are entirely self-administered, we do not believe that this poses a methodological problem. However, whether this approach has any impact on the results remains uncertain.

Item response theory (IRT) is a relatively new method to evaluate questionnaires in healthcare and has some potential advantages over classical test theory.61 73 The Rasch model, a mathematical model applied in IRT, has been used to develop and internally validate measures, and it uses a logistic function that creates an interval-scaled measure.61 74 The sample size of the current study was too small for Rasch analysis, since a sample of at least 200 patients is needed to analyse this kind of instrument.75 However, Rasch analysis should certainly be considered for possible improvements of the HAGOS in the future when a larger sample can be included. Moreover, testing of the reliability, validity and responsiveness of PROs should be an ongoing process, and the most constructive approach concerning the HAGOS is to modify the scale if new knowledge about its psychometric properties emerges. We are, however, confident that the HAGOS in its present form will improve the current evaluation of physically active patients with hip and groin pain.

Another limitation of the HAGOS is that it was only tested in Denmark. However, based upon the experiences with the HOOS, which was originally developed in Swedish,38 this should not be a barrier to translation into other languages. Since Danish is not a world language, we translated and cross-culturally adapted the HAGOS to an English version according to existing guidelines.39 40 This version is given in online appendices 1 and 2. The HAGOS can be downloaded from http://www.koos.nu/.

Conclusion

The HAGOS questionnaire has adequate measurement qualities for the assessment of symptoms, activity limitations, participation restrictions and QOL in physically active young to middle-aged patients with long-standing hip and/or groin pain. The HAGOS should be implemented in the evaluation of treatment strategies and regimens for physically active patients with long-standing hip and/or groin pain in relevant situations where the patient's perspective and health-related QOL are of primary interest.

Acknowledgments

The authors would like to thank all the people involved in the study: patients, doctors, nurses and physiotherapists at the Arthroscopic Centre Amager, Amager Hospital for participating or helping out during the study; the expert group who contributed to the development of the HAGOS: Physiotherapist Niels Bo Schmidt from the Sportsmedicine Clinic, Amager Hospital, Physiotherapists Pernille Mogensen and Theresa Bieler from the Department of Physiotherapy, Bispebjerg Hospital; orthopaedic surgeons Torsten Warming from the Sportsmedicine Clinic, Hamlet, Frederiksberg, Claus Ol Hansen and Otto Kraemer from the Arthroscopic Centre Amager, Amager Hospital for assisting in screening patients for the study; Professor Peter Magnusson and associate professor Nina Beyer, from the Musculoskeletal Research Unit, Department of Physiotherapy, Bispebjerg Hospital and Senior Research Fellow Anthony Schache and PhD student Joanne Kemp from the Department of Engineering, Melbourne University for assisting in the translation and cross-cultural adaptation of the HAGOS from Danish to English.

References

Footnotes

  • Funding This work was funded by the Arthroscopic Centre Amager, Department of Orthopaedic Surgery, Amager University Hospital, Denmark, The Association of Danish Physiotherapists, Danish Regions, The Lundbeck Foundation and the Danish Rheumatism Association. RC is funded by grants from the OAK foundation.

  • Competing interests None.

  • Ethics approval The Danish ethics committee of the capital region approved the trial protocol (H-C-2007-0129), which was registered with the Danish Data Protection Agency (2007-41-1606).

  • Provenance and peer review Not commissioned; externally peer reviewed.
