Selecting outcome measures in sports medicine: a guide for practitioners using the example of anterior cruciate ligament rehabilitation
N P Bent1, C C Wright1, A B Rushton1, M E Batt2

1 School of Health and Population Sciences, University of Birmingham, Birmingham, UK
2 Centre for Sports Medicine, Nottingham University Hospitals, Nottingham, UK

Correspondence to Mr N P Bent, School of Health and Population Sciences, University of Birmingham, 52 Pritchatts Road, Edgbaston, Birmingham B15 2TT, UK; n.p.bent{at}bham.ac.uk

Abstract

Using examples from the field of anterior cruciate ligament rehabilitation, this review provides sports and health practitioners with a comprehensive, user-friendly guide to selecting outcome measures for use with active populations. A series of questions are presented for consideration when selecting a measure: is the measure appropriate for the intended use? (appropriateness); is the measure acceptable to patients? (acceptability); is it feasible to use the measure? (feasibility); does the measure provide meaningful results? (interpretability); does the measure provide reproducible values? (reliability); does the measure assess what it is supposed to assess? (validity); can the measure detect change? (responsiveness); do substantial proportions of patients achieve the worst or best scores? (floor and ceiling effects); is the measure structured and scored correctly? (dimensionality and internal consistency); has the measure been tested with the types of patients with whom it will be used? (sample characteristics). Evaluating a measure against these questions will assist practitioners in making their judgements.

Sports and health practitioners who are responsible for the management of injured athletes are routinely required to make decisions regarding the timing of exercise progression, the commencement of functional activities and return to competitive play.1 The desire for the quickest possible return to sport must be balanced by considerations of athletes’ safety and minimisation of re-injury risk. This might be accomplished by adopting an outcomes-based approach to treatment progression, in which patients must achieve specific outcomes before proceeding to more advanced levels of activity.2 Such a strategy allows the quantification of an athlete’s functional ability, comparison with pre-injury status and verification that an appropriate level of rehabilitation has been achieved.1

A number of outcome measures exist for monitoring progress and facilitating clinical decision-making during the rehabilitation of active individuals.3 In addition, these measures might be used to determine the clinical and cost effectiveness of treatments, as well as providing benchmarks of pre-injury performance.4 However, such measures are infrequently incorporated into routine practice. For example, a survey of Australian orthopaedic surgeons revealed that the majority would allow return to sport following anterior cruciate ligament (ACL) reconstruction without first assessing muscular strength.5 In addition, an investigation into professional rugby union in New Zealand found that players do not always undergo fitness testing before returning to competition.6

A possible reason for this lack of use is that selecting an outcome measure is a difficult task. Practitioners need to be familiar with the range of measures available to them, as well as published studies in which measurement properties (eg, the reliability and validity) of these measures have been assessed. Although aware of such properties, practitioners might feel insufficiently familiar with them to make definitive judgements regarding a measure’s suitability. This critical review therefore aims to provide a comprehensive, yet user-friendly, guide to selecting outcome measures for use with active populations. Using illustrative examples from the field of ACL rehabilitation, it is intended to assist practitioners in judging the suitability of measures, as well as better understanding the published literature surrounding them.

Types of outcome measure

Rehabilitation outcome measures can be categorised, based upon their method of data acquisition, as either patient-reported or clinician-reported measures,7 often referred to as subjective and objective measures, respectively. Patient-reported measures are questionnaires, such as the International Knee Documentation Committee (IKDC) subjective knee form, which contains items relating to symptoms and functional limitations experienced during activities of daily living and sports.8 Clinician-reported measures incorporate performance-based tests (eg, strength and hopping ability), as well as passive, clinical tests (eg, the instrumented measurement of anterior knee laxity).9 10

Traditionally perceived as less valid than clinician-reported measures, patient-reported measures have gained favour in recent years due to their focus on issues important to patients. Nevertheless, the poor correlation between these two types of measure has prompted suggestions that they might provide different, yet complementary, information regarding a patient’s functional ability, and that a combination of the two might provide the most comprehensive method of assessment.11 12

Questions to ask when selecting an outcome measure

Various authors have described properties that should be considered when selecting an outcome measure.13 14 15 Based on such recommendations, a series of questions are presented that should be considered when selecting a patient-reported or clinician-reported measure for use with active patients (see box 1). These are now discussed in greater detail.

Box 1 Questions to ask when selecting an outcome measure

  • Is the measure appropriate for the intended use? (appropriateness)

  • Is the measure acceptable to patients? (acceptability)

  • Is it feasible to use the measure? (feasibility)

  • Does the measure provide meaningful results? (interpretability)

  • Does the measure provide reproducible values? (reliability)

  • Does the measure assess what it is supposed to assess? (validity)

  • Can the measure detect change? (responsiveness)

  • Do substantial proportions of patients achieve the worst or best scores? (floor and ceiling effects)

  • Is the measure structured and scored correctly? (dimensionality and internal consistency)

  • Has the measure been tested with the types of patients with whom it will be used? (sample characteristics)

Is the measure appropriate for the intended use? (appropriateness)

An outcome measure should match the purpose for which it is used.13 Known as appropriateness, this is a matter of selecting “the right tool for the job” and requires consideration of who and what is being measured, and why.

Who is being measured?

It is important to select a measure that has been purposely designed for the types of patients with whom it will be used.16 First, a measure should be appropriate for the patient’s injury or condition. Generic measures can be used with various patient groups because they address all aspects of quality of life. Specific measures are intended for patients who share a particular feature. This could be the same injury or condition (condition-specific), injury to the same body part (site-specific), or the same signs and symptoms of injury (dimension-specific).17 Examples are shown in table 1. The more condition-specific a measure, the more likely it is to measure outcomes specific to that particular condition, but the less likely it is to measure overall health and quality of life.13

Table 1

Types of outcome measure

A measure should also be appropriate for the patient’s activity level. For example, measures used with elite athletes will need to be capable of measuring higher levels of physical function than those used with recreational athletes. As many measures are not designed for use by highly active individuals, it is important to examine any candidate measure thoroughly for appropriateness before use.

What is being measured?

When deciding which outcomes to measure, it is useful to consider four separate categories: body structure impairments (the injury itself); body function impairments (signs and symptoms of injury); activity limitations (inability to perform functional activities) and participation restrictions (inability to participate in life situations).21 Examples are shown in table 2. The choice of which to measure should be based upon their relevance to the patient, the stage of rehabilitation and clinical judgement.

Table 2

Outcome categories, adapted from World Health Organization21

Why is the measurement being taken?

The choice of measure will depend on whether it is being used for discriminative (telling patients apart) or evaluative (monitoring patient change) purposes.22 Discriminative measures categorise patients based upon their scores at a particular point in time. For example, the Cincinnati knee rating system (Cincinnati system) ranks patients from “poor” to “excellent”.23 Evaluative measures are used to monitor patient improvement or deterioration by comparing their scores at different time points. These measurement aims are not mutually exclusive, with many measures, including the Cincinnati system, fulfilling both.

Is the measure acceptable to patients? (acceptability)

A measure should be acceptable to patients, ie, it should not take an unreasonable length of time to complete, or expose patients to an unacceptable level of injury risk or physical and emotional strain.13 In addition, a patient-reported measure should contain clear, concise and unambiguous questions, written in easily intelligible language.24 Ideally, patients’ views on these issues should be canvassed during the design of a measure; however, acceptability can be assessed by reviewing studies that have used a particular measure, for evidence of patient complaints, completion rates, and missing data. For example, in one study, 60% of ACL-deficient “non-copers” (patients with poor dynamic knee stability) refused to undertake hop testing due to fear of injury, suggesting poor acceptability for this patient group.25

Is it feasible to use the measure? (feasibility)

The feasibility of using a measure (ie, the time and resources required to conduct, score and analyse it) is another important consideration.13 For example, a questionnaire such as the Lysholm score26 can be completed independently by patients, scored and analysed quickly using a calculator and incurs only photocopying costs. Conversely, the isokinetic assessment of muscular strength requires that clinicians be trained in the use of an expensive and space-consuming piece of equipment, supervise patients during familiarisation and testing sessions, and be competent in the use of computer software required for data analysis. Whether or not such considerations influence the choice of measure will depend upon the circumstances in which it is to be used and the resources available at that time.

Does the measure provide meaningful results? (interpretability)

A measure must provide results that are meaningful to practitioners and patients. This is known as interpretability.27 For discriminative purposes, a patient’s score should indicate whether they are functioning below, above, or at a normal level.28 In the case of many clinician-reported measures, such as hop and strength testing, this is simply achieved through comparison with the uninvolved limb.29 Although easily interpretable, such comparisons assume limb equality before injury and that the uninjured limb will be unaffected by injury and unimproved by rehabilitation. There is evidence casting doubt on these assumptions.29 30 31

Scores on patient-reported measures are often translated into meaningful labels, based on little more than the subjective judgements of their creators. For example, a Lysholm score higher than 94 is said to indicate a “normal” knee.32 However, using the same benchmark for all patients is problematical, as what is normal for a 50-year-old office worker might be subnormal for a 20-year-old soccer player. To provide more meaningful comparisons, normative data should be presented so that an individual’s score can be compared with uninjured people of the same age, gender and activity level.33 Ideally, pre-injury performance records would be available for each individual patient. These could then act as benchmarks to be reached before return to full activity.34

For evaluative purposes, it is necessary to know whether a change in a patient’s score is important or trivial.28 For this reason, the minimal important difference (MID) of a measure should be considered.35 This is the smallest change in score that a patient, or practitioner, perceives as important. For example, the MID of the IKDC form was found to be 11.5 points.36 A patient whose score increases by less than this amount might still be improved, but not enough to be considered important.

There are several methods of estimating the MID of a measure, a discussion of which is beyond the scope of this paper but can be found elsewhere.37 38 39 What is important for practitioners is that a justified MID is available. Unfortunately, there are many measures, such as the Cincinnati system, for which this is, as yet, not the case.

Does the measure provide reproducible values? (reliability)

A measure should provide similar values on repeated administrations in unchanged patients, a concept referred to as reliability.40 The different types of reliability are explained in box 2.41 For measures involving a practitioner (eg, clinician-reported measures), good intrarater and interrater reliability is important.42 For measures in which no practitioner is involved (eg, patient-reported measures), good test–retest reliability is required.43

Box 2 Types of reliability

  • Intrarater reliability is the degree to which measurements taken by the same practitioner are consistent. In an intrarater reliability study, one practitioner takes measurements from the same group of patients on two or more occasions.

  • Interrater reliability is the extent to which measurements taken by different practitioners are similar. In an interrater reliability study, two or more practitioners take measurements from the same group of patients on the same occasion.

  • Test–retest reliability is the extent to which patients completing a measure provide consistent results. In a test–retest reliability study, the same group of patients completes a measure on two or more occasions.

When reading the results of a reliability study, there are two questions to consider.

Do scores on the measure change with repeat assessments? (systematic bias)

The trend for all patients’ scores to either improve or worsen between repeat assessments is termed systematic bias.44 For example, in a study investigating the intrarater reliability of isokinetic testing, every patient might show improvement between their first and second assessments because of increased familiarity with the use of the dynamometer.45 This is a form of systematic bias known as a learning effect.

An example of systematic bias in an interrater reliability study would be when patients undertaking hop tests all jump further when measured by a practitioner who provides greater verbal encouragement than others. An example in a test–retest reliability study would be when patients all report greater pain levels when completing a questionnaire in a cold room on their first assessment, compared with a warm room on their second assessment.

Systematic bias in a reliability study will be reflected by a difference in mean scores between assessment occasions.42 If such a difference exists, and is large enough to be considered clinically important, then the study’s reliability statistics are of little value and it is not worth reading further. Reliability studies should be designed to minimise systematic bias.46
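As a rough illustration of this check, the sketch below computes the mean difference between two assessment occasions. The torque values are hypothetical and are not taken from any cited study; whether the resulting difference is clinically important remains a matter of judgement.

```python
from statistics import mean


def mean_difference(first_session, second_session):
    """Mean of the per-patient differences (second occasion minus first)."""
    differences = [s - f for f, s in zip(first_session, second_session)]
    return mean(differences)


# Hypothetical isokinetic scores for four patients on two occasions.
first_session = [100, 110, 120, 130]
second_session = [104, 115, 123, 136]

bias = mean_difference(first_session, second_session)
print(f"Mean difference between occasions: {bias:.1f}")
```

A positive value here would be consistent with a learning effect, because every patient scored higher on the second occasion.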

Which reliability statistics have been calculated?

When important systematic bias is not present in a reliability study, it is appropriate to consider which reliability statistics have been reported. Two commonly utilised statistics are the intraclass correlation coefficient (ICC), and the kappa coefficient, with possible values for both ranging from 0 (no reliability) to 1 (perfect reliability).43 The ICC is used for measurements on a continuous scale (eg, hop distance) and the kappa for measurements on a categorical scale (eg, mild, moderate and severe). It is recommended that a reliability of 0.7 is required when using a measure for research but a value of 0.9 is necessary when making decisions regarding individuals.47 In reality, few measures are this reliable. In a recent study, only one of four hop tests had a reliability exceeding 0.9.29 Using measures that fall somewhat short of this benchmark (0.7 and above) is still preferable to not using any measures at all; however, practitioners should be cautious about making important treatment decisions based on their results.48
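For readers who wish to see how a kappa coefficient arises from raw ratings, the following is a minimal sketch. It assumes the unweighted Cohen's kappa for two raters, which is one common form of the statistic; the ratings are hypothetical and the function name is illustrative only.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters classifying the same patients."""
    n = len(rater_a)
    # Observed agreement: proportion of patients given the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical severity ratings of ten patients by two practitioners.
rater_a = ["mild", "mild", "moderate", "moderate", "severe",
           "severe", "mild", "moderate", "severe", "mild"]
rater_b = ["mild", "moderate", "moderate", "moderate", "severe",
           "severe", "mild", "mild", "severe", "mild"]

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

With these invented ratings the raters agree on eight of ten patients, and kappa falls just short of the 0.7 benchmark discussed above once chance agreement is removed.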

A useful and complementary statistic to the ICC, for measurements on a continuous scale, is the standard error of measurement (SEM).43 The SEM represents the amount of error associated with a measure and is expressed in the actual units of measurement. For example, a reported SEM of the single-hop test is 4.56 cm.49

The SEM can be used to estimate a range of scores that contains a patient’s “true score”. This is known as a confidence interval (see box 3).29 The SEM can also be used to estimate the minimum detectable change (MDC)—the smallest change in an individual’s score that is considered to be a true change and not measurement error (see box 4).50 When estimating a confidence interval or the MDC, practitioners may choose how confident they want to be that the estimation is correct. Ninety per cent confidence is recommended when dealing with individual patients29 50 and is used for the examples shown.

Box 3 Estimating a patient’s true score: an example using the single-hop test

  • Measure the distance hopped by the patient

    • eg, 100.0 cm

  • Multiply the SEM by 1.64

    • eg, SEM for the single-hop test  =  4.56 cm

    • 4.56 cm × 1.64  =  7.5 cm

  • Add 7.5 cm to the patient’s score

    • eg, 100.0 cm + 7.5 cm  =  107.5 cm

  • Subtract 7.5 cm from the patient’s score

    • eg, 100.0 cm − 7.5 cm  =  92.5 cm

  • The 90% confidence interval is 92.5 cm to 107.5 cm

  • We can be 90% confident that the patient’s true score lies between 92.5 cm and 107.5 cm.
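The steps in box 3 can be sketched in a few lines of code. This is an illustration only: the function name is hypothetical, and the multiplier of 1.64 is the one used in the box for 90% confidence.

```python
def true_score_interval(observed, sem, multiplier=1.64):
    """Return (lower, upper) bounds for a patient's true score.

    multiplier=1.64 corresponds to the 90% confidence level used in box 3.
    """
    margin = sem * multiplier
    return observed - margin, observed + margin


# Values from box 3: a 100.0 cm hop and an SEM of 4.56 cm.
lower, upper = true_score_interval(100.0, 4.56)
print(f"90% confidence interval: {lower:.1f} cm to {upper:.1f} cm")
```

A different confidence level could be obtained by passing a different multiplier, for example 1.96 for 95% confidence.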

Box 4 Estimating the MDC: an example using the single-hop test

To estimate the MDC with 90% confidence (MDC90):

  • Multiply the SEM by 2.32

    • eg, SEM for the single-hop test  =  4.56 cm

    • 4.56 cm × 2.32  =  10.6 cm

  • MDC90 is 10.6 cm

  • If a patient’s hop score improves or worsens by less than 10.6 cm, we can be 90% confident that they are unchanged.
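The calculation in box 4 can be sketched as follows. The function names are hypothetical; the multiplier of 2.32 is taken from the box and is consistent with 1.64 multiplied by the square root of 2, reflecting the error present in both of the scores being compared.

```python
def minimum_detectable_change(sem, multiplier=2.32):
    """MDC at the confidence level implied by the multiplier (2.32 for MDC90)."""
    return sem * multiplier


def exceeds_mdc(score_before, score_after, sem, multiplier=2.32):
    """True if the observed change is at least as large as the MDC."""
    change = abs(score_after - score_before)
    return change >= minimum_detectable_change(sem, multiplier)


# SEM for the single-hop test from box 4.
mdc90 = minimum_detectable_change(4.56)
print(f"MDC90: {mdc90:.1f} cm")

# Hypothetical patients: one improves by 8 cm, another by 12 cm.
print(exceeds_mdc(100.0, 108.0, 4.56))
print(exceeds_mdc(100.0, 112.0, 4.56))
```

Only the 12 cm improvement exceeds the MDC90, so only that change would be regarded as more than measurement error.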

An alternative approach to the MDC is Bland and Altman’s limits of agreement.51 For example, reported limits of agreement for the Lysholm score are −4.2 to 11.8,52 meaning that a patient deteriorating by less than 4.2 or improving by less than 11.8 points would be considered unchanged.
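As a sketch of how such limits are obtained, the code below assumes the conventional calculation of the mean difference between repeat assessments plus or minus 1.96 standard deviations of those differences. The scores are hypothetical and do not reproduce the Lysholm data cited above.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second, multiplier=1.96):
    """Return (lower, upper) 95% limits of agreement for paired scores."""
    differences = [s - f for f, s in zip(first, second)]
    centre = mean(differences)
    spread = multiplier * stdev(differences)
    return centre - spread, centre + spread


# Hypothetical questionnaire scores for five unchanged patients.
first = [80, 85, 90, 70, 95]
second = [84, 83, 95, 76, 97]

lower, upper = limits_of_agreement(first, second)
print(f"Limits of agreement: {lower:.1f} to {upper:.1f}")
```

Asymmetric limits, as in the Lysholm example, arise when the mean difference between assessments is not zero.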

Does the measure assess what it is supposed to assess? (validity)

The extent to which a measure assesses what it is supposed to assess is termed validity.43 There are four types of validity that practitioners need to consider.

Does the measure appear to be valid? (face validity)

The simple question of whether a measure appears to be valid is known as face validity.24 This is an important consideration as patients are more likely to cooperate fully during assessments that they perceive to be relevant.53

Is the measure comprehensive? (content validity)

The extent to which a measure covers all important aspects of the constructs (concepts such as knee symptoms or quality of life) being measured is known as content validity.4 For example, a questionnaire assessing ACL injury symptoms would have poor content validity if it neglected to include questions about pain, an essential element of the construct being investigated. Content validity is only applicable to measures that comprise more than one component. For example, the Lysholm score consists of a number of separate questions,26 whereas hopping ability is often measured using a battery of several different hop tests.29

When assessing a measure’s content validity, practitioners should look for published evidence that its creators were thorough and systematic in deciding which components to include.54 55 Authors should first specify and justify the constructs that they propose to measure and the patient groups for whom their measure is intended. Components for inclusion should then be selected on the basis of literature reviews, expert panel discussions and, because a measure should include components important to its target population, the views of patients, ascertained through interviews and focus group surveys.

Do scores on the measure correlate with those of a “gold standard”? (criterion validity)

A measure should correlate highly with other measures that assess the same construct and are already known to have excellent validity (gold standard measures). This is known as criterion validity, in which correlations of at least 0.7 are considered acceptable (0, no correlation; 1, perfect correlation).15 For example, in a study investigating the validity of goniometry for measuring knee range of motion, correlation with the gold standard of radiographic imaging was as high as 0.99.56 Because few gold standard measures exist, criterion validity is rarely assessed and practitioners will need to look for evidence of construct validity instead.53

Does the measure relate to other measures and variables as expected? (construct validity)

The degree to which a measure relates to other measures and variables in accordance with theoretically derived hypotheses is termed construct validity.15 There are three main types of construct validity.

Does the measure correlate well with related measures? (convergent validity)

A measure should show correlation with other valid measures to which it is related, a concept called convergent validity.43 Because convergent validity does not involve comparison with a gold standard, very high correlations are not expected.24 Instead, the extent of any anticipated correlation should be postulated and justified in advance.15 For example, as hypothesised, the knee outcome survey activities of daily living scale showed a correlation greater than 0.6 with the Lysholm score.57

Does the measure correlate poorly with unrelated measures? (divergent validity)

As well as correlating well with related measures, a measure should not correlate too strongly with unrelated measures (ie, the correlation should be below 0.3). This is referred to as divergent validity.24 For example, as hypothesised, a low correlation (0.18) was observed between Cincinnati system scores and patients’ ages.23
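The convergent and divergent checks described above both rest on a correlation coefficient. The sketch below assumes Pearson's correlation, which is only one possible choice, and uses hypothetical scores; the thresholds applied to the results would need to be justified in advance, as the text recommends.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data for five patients.
new_measure = [10, 20, 30, 40, 50]
related_measure = [20, 10, 40, 30, 50]   # expected to correlate well
age = [33, 31, 34, 32, 33]               # expected to correlate poorly

r_related = pearson_r(new_measure, related_measure)
r_age = pearson_r(new_measure, age)
print(f"Correlation with related measure: {r_related:.2f}")
print(f"Correlation with age: {r_age:.2f}")
```

In this invented example the correlation with the related measure is high and the correlation with age is below 0.3, the pattern that would support convergent and divergent validity respectively.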

Can the measure detect differences between subgroups of patients? (known-groups validity)

A measure should be able to discriminate between subgroups of patients who differ in some respect, such as age, gender, injury severity, or disability level. This is called known-groups validity.58 For example, as hypothesised, ACL-reconstructed patients with deteriorated articular cartilage were found to have significantly lower Cincinnati system scores than those with healthy cartilage.23

Can the measure detect change? (responsiveness)

Evaluative measures must be able to detect real change in a patient’s condition, a property termed responsiveness.59 Responsiveness studies usually involve a measure being used with patients before and after a period of treatment to see whether it can detect the changes that occur. As with construct validity, responsiveness is assessed by testing theoretically derived hypotheses.15 These hypotheses are usually associated with the following three questions.

Can the measure detect the effects of treatment?

A measure should be able to detect the effects of treatment.60 Two statistics commonly used for this purpose are the effect size (ES) and standardised response mean (SRM) (see box 5). For both statistics, values of at least 0.2, 0.5 and 0.8 indicate that small, moderate and large changes have been detected, respectively.61 The magnitude of any expected treatment effect will depend on the type of treatment given and so should be postulated and justified in advance.62 For example, as hypothesised, the IKDC form detected a large improvement in knee-injured patients following a course of treatment, demonstrated by an ES and SRM larger than 0.8.36

Box 5 Effect size and standardised response mean

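\[
\mathrm{ES} = \frac{\text{mean change in score}}{\text{standard deviation of baseline scores}}
\qquad
\mathrm{SRM} = \frac{\text{mean change in score}}{\text{standard deviation of change scores}}
\]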
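The two statistics can be sketched in code, assuming their common definitions: the ES divides the mean change in score by the standard deviation of baseline scores, and the SRM divides it by the standard deviation of the change scores. The patient scores and function names below are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical questionnaire scores before and after treatment.
baseline = [50, 55, 60, 65, 70]
follow_up = [62, 64, 75, 73, 86]

es = effect_size(baseline, follow_up)
srm = standardised_response_mean(baseline, follow_up)
print(f"ES = {es:.2f}, SRM = {srm:.2f}")
```

Both values exceed 0.8 for these invented data, which under the benchmarks above would be read as a large change.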

Do changes on the measure correlate with changes on related measures? (longitudinal convergent validity)

The change recorded by a measure should show correlation with the change recorded by a related and valid measure.60 This is referred to as longitudinal convergent validity.63 The extent of any anticipated correlation should be postulated and justified in advance.15 For example, the ability of four hop tests to detect change in patients undergoing postoperative ACL rehabilitation was assessed by correlating change in hop scores with change on a patient-reported measure (the lower extremity functional scale).29 A correlation of 0.5 was prespecified as evidence of good responsiveness but was not achieved.

Can the measure discriminate between subgroups of patients who change by different amounts? (longitudinal known-groups validity)

A measure should be able to discriminate between identifiable subgroups of patients who change by different amounts.60 This is known as longitudinal known-groups validity.63 For example, as hypothesised, the IKDC form detected greater improvement in patients undergoing treatment for ACL injury than for osteoarthritis.36

Do substantial proportions of patients achieve the worst or best scores? (floor and ceiling effects)

On some measures, particularly patient-reported measures, it is possible to achieve a worst possible or best possible score. If substantial proportions of patients (15–20%) achieve the worst possible score, floor effects are said to be present, indicating that the measure is too difficult. If substantial proportions of patients achieve the best possible score, ceiling effects are present and the measure is too easy.15 33 For example, 37% of patients awaiting ACL reconstruction achieved the worst possible score on the Cincinnati system’s sports function subscales, and 39% achieved the best possible score at final postoperative follow-up.23
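Checking for floor and ceiling effects amounts to counting how many patients sit at each end of the scale. The sketch below uses hypothetical scores on a 0 to 100 scale and takes 15%, the lower end of the range quoted above, as the threshold; both choices are assumptions made for illustration.

```python
def floor_ceiling_proportions(scores, worst, best):
    """Return the proportions of patients at the worst and best possible scores."""
    n = len(scores)
    floor = sum(1 for s in scores if s == worst) / n
    ceiling = sum(1 for s in scores if s == best) / n
    return floor, ceiling


# Hypothetical scores for ten patients on a 0 to 100 scale.
scores = [0, 0, 40, 55, 60, 75, 90, 100, 100, 100]
THRESHOLD = 0.15  # assumed cut-off, the lower end of the 15-20% range

floor, ceiling = floor_ceiling_proportions(scores, worst=0, best=100)
print(f"Floor: {floor:.0%} (effect present: {floor >= THRESHOLD})")
print(f"Ceiling: {ceiling:.0%} (effect present: {ceiling >= THRESHOLD})")
```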

When using a measure with sporting individuals, ceiling effects are of particular concern.3 As their normal level of physical function is very high, these patients might achieve the best possible score on a measure and be deemed “normal” well before reaching full fitness. In addition, no further improvements in function would be detectable.

Is the measure structured and scored correctly? (dimensionality and internal consistency)

Measures comprising more than one component should be structured and scored in a way that is consistent with the number of constructs they measure (their dimensionality).14 Measures that assess only one construct, such as the IKDC form,8 are called unidimensional and scores from their components can be summed together to form an overall score.13 Tests that measure more than one construct are called multidimensional; their components are grouped into distinct, unidimensional sections, each measuring a single construct and with its own total score.14 For example, the knee injury and osteoarthritis outcome score comprises five distinct subscales: “pain”, “symptoms”, “activities of daily living”, “sport and recreation” and “quality of life”.64 The dimensionality of a measure can be assessed through statistical techniques such as factor analysis or Rasch analysis.14 For example, factor analysis demonstrated that all questions on the IKDC form were measuring a single construct.8

When a measure’s dimensionality has been determined, there is one further consideration. All components within a section should be measuring the same construct and therefore be highly correlated with one another. This concept is called internal consistency and is usually measured using a statistic called Cronbach’s alpha.65 A value of 0.7 is recommended for research purposes but a value of 0.9 for use with individuals.47 For example, Cronbach’s alpha for the IKDC form was 0.92.8
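For illustration, Cronbach's alpha can be computed from item-level responses using its standard formula, in which the sum of the item variances is compared with the variance of patients' total scores. The responses below are hypothetical and the function name is illustrative only.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; responses has one row per patient, one column per item."""
    k = len(responses[0])                 # number of items
    items = list(zip(*responses))         # regroup scores by item
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variances / total_variance)


# Hypothetical responses of five patients to a three-item section.
responses = [
    [1, 2, 2],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

The invented items rise and fall together across patients, so alpha is high; items measuring unrelated constructs would pull the value down.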

Has the measure been tested with the types of patients with whom it will be used? (sample characteristics)

When reading the results of a study in which the measurement properties of a measure have been investigated, it is essential to note the characteristics of the patients involved. First, it is important that the sample size is adequate. A measure may appear reliable, but this conclusion might be dubious if only a small number of patients have been tested. Therefore, a study’s sample size should be justified, preferably using a power calculation.66

Second, it is important to consider the types of patients employed in a study, as all measurement properties discussed in this article are population-specific.43 For example, a measure that is valid for ACL-injured patients might not be valid for posterior cruciate ligament-injured patients. In addition, a measure that is reliable for recreational athletes might not be reliable for elite athletes. Therefore, when possible, a measure should be selected that has sound measurement properties for the types of patients with whom it will be used.

What is already known on this topic

Numerous outcome measures exist for use with active patients but are infrequently used in routine clinical practice. Although aware of measurement properties such as reliability and validity, sports and health practitioners might feel insufficiently familiar with them to make definitive judgements regarding an outcome measure’s suitability for their patients.

What this study adds

This paper provides sports and health practitioners with a series of questions that should be asked when selecting an outcome measure. By considering these questions, practitioners will be better able to judge the measures available to them and select those most suitable for their patients.

Conclusion

A series of questions have been presented for sports and health practitioners to consider when selecting outcome measures for use with active patients. Practitioners must judge whether a measure is appropriate, acceptable, feasible, interpretable, reliable, valid, responsive, free of floor and ceiling effects and structured correctly, for use with the types of patients of interest to them. Some of these judgements, such as whether it is feasible to use a measure, can be made based on familiarity with the measure itself. Others, such as whether a measure is reliable, must be based on published evidence. This article is intended to assist practitioners with these judgements.

Even for practitioners who know how to evaluate an outcome measure, selecting one might still seem a daunting task. There are a large number of measures available, with one review identifying 16 different questionnaires intended for use with knee-injured patients alone.67 Lack of confidence when selecting outcome measures might be a barrier that prevents their clinical use despite recognised benefits for patient management. However, practitioners should remember that selecting an outcome measure is a skill that needs to be practised like any other and will improve with use. Utilisation of outcome measures should increase as practitioners become more comfortable with evaluating them.

Footnotes

  • Competing interests None.

  • Provenance and Peer review Not commissioned; externally peer reviewed.