Article Text
Abstract
This paper will help clinicians and researchers to understand studies on the validity, responsiveness and reliability of patient-reported outcome measures (PROMs) and to interpret the scores and change scores derived from these and other types of outcome measures. Validity studies provide a method for assessing whether the underlying construct of interest is adequately assessed. Responsiveness studies explore the longitudinal validity of a test and provide evidence that an instrument can detect change in the construct of interest. Reliability is commonly assessed with correlation indices, which indicate the stability of repeated measurements and the ‘noise’ or error in the measurement. Proposed indicators for clinical interpretation of test scores are the minimum clinically important difference, the standard error of measurement and the minimum detectable change. Studies of the Victorian Institute of Sports Assessment questionnaire for patellar tendinopathy and other PROMs are used to illustrate concepts.
- Measurement
- Evaluation
Introduction
A range of clinician-applied standardised tests are used in clinical practice for the purpose of diagnosis (eg, Lachman test for anterior cruciate ligament integrity) or for the evaluation of various aspects of body function and physical performance (eg, flexibility, squat, hop and jump tests). Patient-reported outcome measures (PROMs)1 measure patient perceptions of specified aspects of their own health that either cannot be directly observed (eg, pain) or that are not practical or feasible to directly observe (eg, performance of daily activities). PROMs may be generic health status measures, such as the SF-36 Health Survey,2 region-specific, such as the Disabilities of the Arm, Shoulder and Hand (DASH)3 or condition-specific, such as the Victorian Institute of Sports Assessment (VISA-A) for Achilles4 and VISA-P for patellar5 tendinopathy. The VISA-P has been translated into at least six languages.
PROMs vary with respect to the number of items or questions, rating-scale options and anchors, and whether higher scores indicate better or worse health states. The VISA-P has eight items: 1–7 are rated on a 0–10 numerical rating scale and item 8 is scored from 0 to 30. Item scores sum to a total score between 0 and 100 with higher scores indicating better function.
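As a simple illustration of the scoring arithmetic, the sketch below (Python, with invented item responses) sums hypothetical VISA-P item scores into a total; the item values are not taken from any real patient.

```python
# Minimal sketch (hypothetical values): assembling a VISA-P total score.
# Items 1-7 are each scored 0-10; item 8 is scored 0-30; the total is 0-100.
item_scores = {
    "item_1": 7, "item_2": 6, "item_3": 8, "item_4": 5,
    "item_5": 9, "item_6": 7, "item_7": 6,   # items 1-7: 0-10 each
    "item_8": 20,                             # item 8: 0-30
}

total = sum(item_scores.values())
assert 0 <= total <= 100
print(f"VISA-P total: {total}/100 (higher scores indicate better function)")
```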
The aim of this paper is to assist readers to interpret studies on the reliability, validity and responsiveness of PROMs. Whereas these test characteristics apply to both clinician-observed and patient-reported measures, the focus of this paper is on how these are tested and reported for PROMs. We focus on how the reader should interpret scores and change in scores for individuals who complete questionnaires as part of assessment and reassessment. The values of the minimum clinically important difference (MCID),7 the standard error of measurement (SEM)8 and the minimum detectable change (MDC) are particularly useful for clinical interpretation of test scores. Used together, they describe measurement change that the clinician can distinguish from error, and change that is important to patients.9 The MCID is derived from longitudinal validity studies whereas SEM and MDC are derived from reliability studies. We use the VISA-P as one illustrative example, but our paper is aimed at clinicians and researchers who are evaluating measurement properties of any test.
Reports and reviews of PROMs are likely to become more consistent since the publication of the consensus-based standards for the selection of health-status measurement instruments (COSMIN), a consensus statement for evaluating the methodological quality of PROM studies.10 ,11 An example of a PROM that was developed using the COSMIN guidelines is the Copenhagen Hip and Groin Outcome Score (HAGOS).12
Validity
It is common to read claims that a test is ‘valid’. Validity is, however, not an immutable characteristic of a test, but defines inferences we make based on test scores. To illustrate, consider measurement in millimetres of the circumference of a person's head using a flexible tape-measure. The measurement can be used to make a valid inference about the person's hat size, but not about intelligence. So the measurement—cranial circumference in mm—is not necessarily a ‘valid test’. We can make a valid inference about hat size, or an invalid inference about intelligence. Test validity should therefore always be described relative to the standard against which it has been validated. Table 1 shows categories or types of validity. Many types of validity evidence can be considered as ways of examining construct validity.13 Innovative approaches to validation continue to emerge and table 1 is not exhaustive. We agree with Streiner and Norman13 that debates about validity terminology are not all that useful, and that ‘all validation is a process of hypothesis testing’ (p 252). Evidence for test validity is accumulated from multiple studies and cannot be demonstrated ‘once and for all’ by a single study. A test should be validated in the particular population of interest. For clinicians, it is probably sufficient that there is evidence for the valid use of test scores in patients and settings similar to their own, and that a test has face validity for their purpose.
Construct validity
Evidence for construct validity of PROMs often involves the administration of the PROM and other tests at a single point in time (ie, cross-sectional studies). Hypotheses are tested about expected relationships between the PROM and other test scores using correlation coefficients.13 Correlation coefficients indicate the strength of the association (or agreement) between two sets of scores. Positive correlations indicate that scores on the two tests move in the same direction (as one gets bigger so does the other), whereas negative correlations indicate that test scores move in opposite directions. A correlation of 1.0 or −1.0 indicates perfect agreement whereas 0 indicates no agreement.
There are many correlation coefficients, but the most frequently used in construct validation studies are the Spearman rank-order correlation coefficient (Spearman Rho) and the Pearson product–moment correlation coefficient (Pearson r). Pearson r assumes a linear relationship between the two sets of scores and is used where the data are continuous and normally distributed. Rho is a test of ranked data, requires only that the scores move together (a monotonic relationship), and is used when either of the scales is ordinal or when data are skewed (not normally distributed).15
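As a simple illustration, the following sketch computes Pearson r and Spearman Rho for two invented sets of scores using SciPy; the arrays are hypothetical and not drawn from any validation study.

```python
# Sketch: Pearson r and Spearman Rho between two sets of scores (SciPy).
# The score arrays below are invented for illustration only.
import numpy as np
from scipy.stats import pearsonr, spearmanr

visa_p = np.array([45, 52, 60, 63, 70, 74, 81, 88, 90, 95])   # hypothetical PROM scores
kujala = np.array([50, 55, 58, 66, 68, 75, 80, 84, 91, 93])   # hypothetical comparator scores

r, r_p = pearsonr(visa_p, kujala)        # assumes continuous, normally distributed data
rho, rho_p = spearmanr(visa_p, kujala)   # rank-based; suited to ordinal or skewed data

print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
print(f"Spearman Rho = {rho:.2f} (p = {rho_p:.3f})")
```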
In an earlier paper, reporting the cross-cultural adaptation of a Spanish language version of the VISA-P,16 the authors reported that VISA-P scores correlated strongly with other measures of knee function, such as the Cincinnati and Kujala scales, correlated moderately with the physical scales of the SF-36 Health Survey, but did not correlate with the non-physical scales of the SF-36. This type of validity testing is usually called convergent/discriminant validity (table 1). The same study also reported evidence for ‘known groups’ validity, in which mean VISA-P scores for people with and without knee problems were significantly different.
Predictive validity
Predictive validity and responsiveness are explored in longitudinal studies. In predictive validity studies, the PROM or some other test is administered and, at some future point, a separate outcome of interest is measured. For example, the Örebro Musculoskeletal Pain Questionnaire may be evaluated to see whether or not Örebro scores predict a specified outcome such as return to work or sickness absence.17 ,18 Predictive validity studies often report the area under the curve (AUC), the curve referring to the receiver-operating characteristic (ROC) curve. The curve is formed from a plot of the true-positive rate (sensitivity) on the vertical axis against the false-positive rate (1−specificity) on the horizontal axis, across the range of possible cut-off scores. An AUC of 0.5 indicates that the test performs no better than chance and 1.0 indicates perfect test accuracy. Readers are directed to Streiner and Norman's13 excellent text for more information on ROC curves. A useful feature of the ROC curve is that the cut-off score that provides the best trade-off between sensitivity and specificity can be identified. The cut-off score can then be used to identify individuals considered at more or less risk of the outcome so that treatment may be adjusted accordingly. In their study of the Örebro, Linton et al18 compared the ability of the short and long forms of the Örebro to predict who would have at least 14 days of sick leave. The long and short versions had similar AUCs in an occupational sample (0.72 and 0.70) and a primary care sample (0.84 and 0.81). A cut-off score of 90 on the long version, and 50 on the short form, were nominated as identifying most of the people in the occupational and primary care samples who went on to have at least 14 days of sick leave.
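To make the ROC approach concrete, the sketch below (with invented scores and outcomes, not a reanalysis of the Örebro data) shows how an AUC and a cut-off at the best sensitivity/specificity trade-off (via the Youden index) can be obtained with standard software.

```python
# Sketch: ROC curve, AUC and a cut-off chosen at the best sensitivity/specificity
# trade-off (Youden index), using scikit-learn. Data are invented for illustration.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# 1 = had the outcome (eg, >=14 days of sick leave), 0 = did not
outcome = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0])
prom_score = np.array([30, 42, 55, 60, 48, 75, 58, 65, 90, 85, 70, 35])  # hypothetical scores

auc = roc_auc_score(outcome, prom_score)
fpr, tpr, thresholds = roc_curve(outcome, prom_score)

# Youden index J = sensitivity + specificity - 1 = TPR - FPR
best = np.argmax(tpr - fpr)
print(f"AUC = {auc:.2f}; best cut-off = {thresholds[best]} "
      f"(sensitivity {tpr[best]:.2f}, specificity {1 - fpr[best]:.2f})")
```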
Responsiveness
Responsiveness is “the ability of an … instrument to detect change over time in the construct to be measured” (ref. 14, p 743) and can be conceptualised as the validity of the change in PROM scores over time, or longitudinal construct validity.14 ,19 Responsiveness is a concept that has given rise to considerable debate and disagreement. Statistical indices and approaches to demonstrating this measurement property have proliferated and produced inconsistent results.20 The recent consensus position of COSMIN is that responsiveness is best approached by testing hypotheses about the expected strength of association between PROM change scores and change scores on other tests or measures of the construct.21 In contrast, Norman et al19 recommend that studies restrict themselves to the Cohen effect size, in which the mean change is divided by the SD of the baseline scores.
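As a minimal worked illustration of this effect-size approach, the sketch below divides the mean change by the SD of the baseline scores; the baseline and follow-up values are invented for illustration.

```python
# Sketch: Cohen effect size for responsiveness = mean change / SD of baseline scores.
# Scores below are invented for illustration.
import numpy as np

baseline = np.array([48, 55, 60, 42, 65, 58, 50, 62], dtype=float)   # hypothetical baseline PROM scores
follow_up = np.array([60, 63, 72, 55, 70, 71, 58, 75], dtype=float)  # hypothetical follow-up scores

effect_size = (follow_up - baseline).mean() / baseline.std(ddof=1)
print(f"Effect size = {effect_size:.2f}")
```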
In responsiveness studies, the PROM is typically readministered after a period of time during which change in the construct of interest is expected. In the absence of a ‘gold standard’ measure of the construct with which the PROM change scores could be compared, researchers may use a ‘global rating of change’ (GROC) scale, completed by the participants at the same time the PROM is readministered. Patient-reported estimates of change provide a direct estimate of the magnitude and direction of change in health status.22 The use of GROC scales in validation studies has been criticised because the GROC is not independent of the PROM, leading to a somewhat circular argument wherein one subjective report is used as a reference standard against which another subjective report is judged.23 In addition, patients’ global ratings of change are influenced by their current health status.22 However, in the absence of a ‘gold standard’ for constructs that cannot be directly observed, the GROC will no doubt continue to appear in the responsiveness literature, and it was used in the VISA-P responsiveness study by Hernandez-Sanchez et al.6
Responsiveness evidence might be provided by the strength of association (correlation) between PROM change scores and change on another measure of the same construct or ratings on a GROC. It can also be explored using the AUC statistic, which requires that people in the sample be categorised as having improved or not improved by a clinically important amount (usually based on the GROC). In the Hernandez-Sanchez study, the VISA-P change scores correlated strongly (Spearman's Rho=0.852) with a 15-level global rating of change scale, and the AUC indicated that change scores could accurately distinguish between ‘improved’ and ‘not improved’ people in the sample (categorised using GROC scores).
These statistics are, however, of limited utility to clinicians, who need guidance on how to interpret the PROM scores and change-scores of the individuals they are working with. Closely related to the concept of responsiveness, and arguably more interpretable, is the concept of the MCID. The MCID is the smallest change in the PROM score that would typically be perceived by people as representing a meaningful improvement in their condition.24 This concept is also known as the minimum important change or minimum important difference. Unfortunately, this concept too is a topic of considerable debate and disagreement about the most suitable methods of determining the MCID.23 ,25
The Hernandez-Sanchez study of the VISA-P6 reported an MCID of between 11 and 19 points, depending on how ‘important change’ is classified. For researchers, the MCID has utility for planning the number of participants needed in a controlled trial to detect a difference between the groups that would be considered clinically important. This is important to researchers because small differences between group mean scores may be statistically significant, but too trivial to be considered important.
We suggest that the MCID provides a guide to the average magnitude of change in score that would typically be considered a meaningful improvement. MCIDs for the VISA-P in this study were derived from average GROC scores. A person-specific estimate of important change can be gathered directly from an individual's response to a GROC, such as the one used by Hernandez-Sanchez et al.6 In the therapist–client encounter, important change can be based on the whole clinical picture for the individual as well as the person's overall rating of change. The change that individuals judge to be important is likely to be highly variable.
There is another interpretable value that clinicians need to consider when evaluating change in a PROM score—the MDC. The MDC is the amount of change in the PROM score that needs to occur to be able to say (with a given level of confidence) that the score change exceeds expected errors in measurement. The MDC is derived from test–retest reliability studies where the PROM is administered on two occasions between which the variable being measured has remained stable.
Reliability
Just as test measurements are only valid for a specified purpose, reliability is also a function of the intended use of measurements. A test cannot really be said to ‘be reliable’, but measurements can be classified as adequately reliable for specific applications. Reliability can be conceptualised as the repeatability of measurements—if a test is administered on more than one occasion (and the variable being measured has not changed), how close are the repeated measurements? A measurement that is highly inconsistent contains a lot of ‘noise’ or error.13
The amount of error that is tolerable depends on the purpose of the measurement. Minimising measurement error is desirable because change in a PROM score needs to exceed the measurement error (noise) before one can be confident that real change has occurred. Sources of error in reliability estimates for PROMs include variability in the health construct of interest and variability in responses by persons taking the test. Ambiguity in items and scales may also amplify ‘noise’ or error in scores.
Reliability studies are typically conducted by repeated administration of the test over a period of time when the underlying construct (eg, function) would not be expected to change.13 The researcher can investigate variations in repeated measurements taken by a single assessor (intrarater reliability) or by different assessors (inter-rater reliability). The two measurements are separated by a specified time interval. The reliability of PROMs is termed test–retest reliability.
The way measurement errors are estimated and reported is similar, regardless of the conditions under which measurements are repeated.26 Relative score stability is commonly reported using correlation indices such as Pearson's r or intraclass correlation coefficients (ICC) with a 95% CI that indicates the uncertainty about the true correlation.13 Reliability coefficients can range from 0 to 1.0, with 1.0 indicating that the relationship between repeated measurements is predicted without error. The coefficient, however, has little utility for clinicians in practice settings. What is more helpful is knowledge of the magnitude of error (in the units of measurement, eg, VISA-P units) that must be allowed for in test-score interpretation. Error estimates expressed in the scale units of the instrument can be described in a number of related ways.13
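For readers who wish to see how one such coefficient is computed, the sketch below calculates an ICC(2,1) (two-way random effects, absolute agreement, single measures) from invented test–retest scores using the standard ANOVA decomposition; this is only one of several ICC forms and the data are hypothetical.

```python
# Sketch: ICC(2,1) (two-way random effects, absolute agreement, single measures)
# for test-retest data, computed from a two-way ANOVA decomposition.
# Scores below are invented for illustration.
import numpy as np

# rows = subjects, columns = test occasions (test, retest)
scores = np.array([
    [55, 58], [62, 60], [70, 73], [45, 50],
    [80, 78], [66, 64], [72, 75], [58, 55],
], dtype=float)

n, k = scores.shape
grand = scores.mean()
ms_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between-subjects mean square
ms_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between-occasions mean square
ss_total = ((scores - grand) ** 2).sum()
ms_error = (ss_total - (n - 1) * ms_rows - (k - 1) * ms_cols) / ((n - 1) * (k - 1))

icc_2_1 = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)
print(f"ICC(2,1) = {icc_2_1:.2f}")
```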
When a person completes the VISA-P, we can say that the person's ‘true’ score is the observed value ± the error in the measurement. The SEM is frequently reported and provides an estimate of one SD of the error associated with a single measurement. As variations in measurements that occur when no real change has taken place are typically normally distributed, the SEM provides an indication of the uncertainty around the person's observed score. There is a 68% likelihood that the person's true score is within ±1 SEM of the observed score. We can increase our confidence in decisions about the likely true boundaries of a measurement by estimating the 95% CI; this is quickly approximated by multiplying the SEM by 2. The SEM (and associated 95% CI) can be used to estimate the error that must be allowed for in interpreting measurements taken on a single occasion (discussed in detail by Weir8).
When this information is not provided by authors, it can readily be estimated. The formula for calculating the SEM is SD × √(1−r), where r is the reliability coefficient and SD is the standard deviation of test scores for a sample.13 Table 2 compares test–retest reliability studies of the VISA-P. Reported and estimated SEM values vary from 1 VISA-P unit for the English version5 to 12 for the Dutch version.28 We suggest that clinicians use values that are derived from a sample similar to their clinical population, and that have used a retest period short enough that actual change in the condition was unlikely, and long enough that people were unlikely to recall their previous answers. Longer retest periods28 typically yield lower reliability coefficients as the condition is likely to have changed. Some researchers overcome this problem by including, in the reliability analysis, only those subjects who self-rate their condition as unchanged at the second test occasion.30
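The calculation is straightforward, as the sketch below shows. The SD and reliability coefficient used here are hypothetical values (chosen so the result comes out near the 4-point SEM discussed in the next paragraph), not figures taken from any particular VISA-P study.

```python
# Sketch: estimating the SEM from a reported reliability coefficient and sample SD,
# using SEM = SD * sqrt(1 - r). Values below are hypothetical.
import math

sd = 14.0   # hypothetical SD of VISA-P scores in a reliability sample
r = 0.92    # hypothetical test-retest reliability coefficient (eg, ICC)

sem = sd * math.sqrt(1 - r)
print(f"SEM = {sem:.1f} VISA-P points")
print(f"68% interval around an observed score: +/- {sem:.1f} points")
print(f"~95% interval: +/- {2 * sem:.1f} points")   # SEM x 2 approximates the 95% CI
```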
The SEM of 4 VISA-P points reported by Hernandez-Sanchez6 seems reasonable because the sample consisted entirely of people with patellar tendinopathy and the retest period was long enough that people would be unlikely to recall their previous answers, but short enough that the condition could reasonably be assumed to be unchanged. A reasonable clinical interpretation is that, in 68% of cases, a person's ‘true’ VISA-P score is within ±4 points of the observed score.
When an individual completes the PROM on a second occasion, and change in their condition is of interest, estimates of the error in the change or difference score are applied. PROM reliability studies may report this as the MDC at 68%, 90% or 95% confidence. This is the magnitude of change that is required before we can be confident (at the stated level) that change exceeds the error in the repeated measurements.
Although many authors do not provide this information, it can readily be estimated from the SEM. If the SEM is the error around a single observation, then SEM multiplied by 1.414 (the square root of 2) provides the error around repeated measures. This is the MDC at 68% confidence or MDC68. If we want to be 90% confident, we multiply the MDC68 by 1.64 and, for 95% confidence, we multiply the MDC68 by 1.96.
Hernandez-Sanchez et al6 report an MDC95 of 11 points. Using the SEM of 4.0 points, we can also calculate the MDC68 (SEM × √2) and MDC90 (MDC68 × 1.64) of 5.7 and 9.3 points. This can be interpreted as meaning that once a person's score has improved by about 6 points we can be 68% confident that real change has occurred; at about 9 points of improvement, we can be 90% confident; and with 11 points of improvement, we are almost certain (95%) that the observed change is not measurement error.
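The sketch below reproduces these calculations directly from the reported SEM of 4.0 points, using the multipliers given above.

```python
# Sketch: deriving MDC68, MDC90 and MDC95 from the SEM, reproducing the
# VISA-P figures discussed above (SEM of 4.0 points).
import math

sem = 4.0                      # SEM reported for the VISA-P (Hernandez-Sanchez et al)
mdc68 = sem * math.sqrt(2)     # error around a difference between two measurements
mdc90 = mdc68 * 1.64
mdc95 = mdc68 * 1.96

print(f"MDC68 = {mdc68:.1f}, MDC90 = {mdc90:.1f}, MDC95 = {mdc95:.1f} VISA-P points")
# -> roughly 5.7, 9.3 and 11.1 points, matching the values in the text
```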
Another, simpler, way to summarise the magnitude of error around a change in PROM score is to use the SD of the mean change scores for the sample, if this has been reported. This yields an estimate comparable to the MDC68, if change scores are normally distributed.31
Another method of reporting test–retest reliability is the Bland-Altman method, which graphs the difference scores and uses 95% ‘limits of agreement’ to describe the agreement between repeated measurements.31 This provides estimates comparable to the MDC95. The graph provides a useful visual presentation of the distribution of difference scores26 and a method for checking that errors are relatively stable across the range of total scores.
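For readers who wish to construct such a plot, a minimal sketch is shown below; the test–retest scores are invented and the plot simply marks the mean difference (bias) and the 95% limits of agreement.

```python
# Sketch: Bland-Altman limits of agreement for test-retest data, plotting the
# difference scores against the means of the paired measurements.
# Scores below are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

test = np.array([55, 62, 70, 45, 80, 66, 72, 58], dtype=float)
retest = np.array([58, 60, 73, 50, 78, 64, 75, 55], dtype=float)

diff = retest - test
mean_pair = (test + retest) / 2
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)          # half-width of the 95% limits of agreement

plt.scatter(mean_pair, diff)
plt.axhline(bias, linestyle="-", label=f"bias = {bias:.1f}")
plt.axhline(bias + loa, linestyle="--", label=f"upper LoA = {bias + loa:.1f}")
plt.axhline(bias - loa, linestyle="--", label=f"lower LoA = {bias - loa:.1f}")
plt.xlabel("Mean of test and retest scores")
plt.ylabel("Retest minus test")
plt.legend()
plt.show()
```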
Limitations of PROMs
Practical limitations to the use of PROMs are that they cannot be used for individuals who are cognitively impaired, and they require literacy in the language of the questionnaire. Although PROMs can be administered by interview in such cases, some studies report that patients rate their health status better when the questionnaire is administered by interview than when the PROM is self-completed.32 ,33 Interviewer effects are more pronounced in face-to-face than in telephone interviews34 and when collecting sensitive personal information, such as when asking questions relating to mental health.35 There may be a discrepancy between self-rated physical activity and observed performance,36–38 so PROMs are an adjunct to, rather than a replacement for, direct measurement of performance.
Conclusion
PROMs may be useful clinical tools to capture patients’ perceived health status and may provide a useful method for clinicians to monitor change in aspects of health status over time. Evidence for the valid use of test scores is provided by studies of people like those to whom the test will be applied in practice. The SEM and MDC allow individual patients’ scores and change scores to be interpreted within the bounds of measurement error. Patient perceptions of the magnitude and importance of change can be determined using a global rating scale.
References
Footnotes
- Contributors MD was involved in the conception of the paper, drafting and revising it critically for important intellectual content and final approval, and is responsible for the overall content as guarantor. JK was involved in the conception of the paper, drafting and revising it critically for important intellectual content, and final approval.
- Funding None.
- Competing interests None.
- Provenance and peer review Not commissioned; externally peer reviewed.