Original Article
When to use agreement versus reliability measures

https://doi.org/10.1016/j.jclinepi.2005.10.015

Abstract

Background

Reproducibility concerns the degree to which repeated measurements provide similar results. Agreement parameters assess how close the results of the repeated measurements are, by estimating the measurement error in repeated measurements. Reliability parameters assess whether study objects, often persons, can be distinguished from each other, despite measurement errors. In that case, the measurement error is related to the variability between persons. Consequently, reliability parameters are highly dependent on the heterogeneity of the study sample, while the agreement parameters, based on measurement error, are more a pure characteristic of the measurement instrument.

Methods and Results

Using an example of an interrater study, in which different physical therapists measure the range of motion of the arm in patients with shoulder complaints, the differences and relationships between reliability and agreement parameters for continuous variables are illustrated.

Conclusion

If the research question concerns the distinction of persons, reliability parameters are the most appropriate. But if the aim is to measure change in health status, which is often the case in clinical practice, parameters of agreement are preferred.

Introduction

Outcome measures in medical sciences may concern the assessment of radiographs and other imaging techniques, biopsy readings, the results of laboratory tests, the findings of physical examinations, or the scores on questionnaires collecting information, for example, on functional limitations, pain coping styles, and quality of life. An essential requirement of all outcome measures is that they are valid and reproducible or reliable [1], [2].

Reproducibility concerns the degree to which repeated measurements in stable study objects, often persons, provide similar results. Repeated measurements may differ because of biologic variation in persons, because even stable characteristics often show small day-to-day differences, or follow a circadian rhythm. Other sources of variation may originate from the measurement instrument itself, or the circumstances under which the measurements take place. For instance, some instruments may be temperature dependent, or the mood of a respondent may influence the answers on a questionnaire. Measurements based on assessments made by clinicians may be influenced by intrarater or interrater variation.

This article first presents an example of an interrater study, then describes the concepts underlying various reproducibility parameters, which can be distinguished in reliability and agreement parameters. The primary aim of this article is to demonstrate the relationship and the important difference between parameters of reliability and agreement, and to provide recommendations for their use in medical sciences.


An example

In an interrater study on the range of motion of a painful shoulder, different reproducibility parameters were used to present the results [3]. To assess the limitations in passive glenohumeral abduction, the range of motion of the arm was measured with a digital inclinometer and expressed in degrees. Two physical therapists (PTA and PTB) measured the range of motion of the affected and the nonaffected shoulder in 155 patients with shoulder complaints. Table 1 presents the results.

Conceptual difference between agreement and reliability parameters

In the literature, agreement and reliability parameters are often used interchangeably, although some authors have pointed out the differences [6], [7].

Agreement and reliability parameters focus on two different questions:

  1. "How good is the agreement between repeated measurements?" This concerns the measurement error, and assesses exactly how close the scores of repeated measurements are.

  2. "How reliable is the measurement?" In other words, how well can patients be distinguished from each other, despite measurement error?

Agreement parameters are neglected in medical sciences

In the 1980s, Guyatt et al. [8] clearly emphasized the distinction between reliability and agreement parameters. They explained that reliability parameters are required for instruments used for discriminative purposes, and agreement parameters for those used for evaluative purposes. With a hypothetical example they eloquently demonstrated that discriminative instruments require a high level of reliability: that is, the measurement error should be small in comparison with the variability between persons.

Relationship between the agreement and reliability parameters

The relationship between parameters of agreement and reliability can best be illustrated by elaborating on the variances involved in the ICC formulas. To that end, we first explain the meaning of the variance components [12]. Variance (σ²) is the statistical term used to indicate variability.

The variance in observed scores can be subdivided into the variance between the objects under study, in our example the persons (σ²p), the variance between the observers, the two different PTs (σ²pt), and the residual variance (σ²residual).
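The decomposition into variance components can be made concrete with a small sketch. The code below uses hypothetical readings (not the article's data) for a persons × raters design, estimates the components with the classical two-way ANOVA method-of-moments estimators (a simpler alternative to the restricted maximum likelihood approach used in the article), and forms an ICC for agreement as the person variance over the total variance:

```python
import numpy as np

# Hypothetical range-of-motion readings (degrees): rows = persons,
# columns = the two raters (PTA, PTB). Not the article's data.
scores = np.array([
    [90.0, 88.0],
    [75.0, 78.0],
    [60.0, 57.0],
    [110.0, 112.0],
    [95.0, 91.0],
])

n, k = scores.shape          # n persons, k raters
grand = scores.mean()

# Two-way ANOVA sums of squares for a persons x raters design.
ss_p = k * ((scores.mean(axis=1) - grand) ** 2).sum()   # persons
ss_o = n * ((scores.mean(axis=0) - grand) ** 2).sum()   # observers (PTs)
ss_e = ((scores - grand) ** 2).sum() - ss_p - ss_o      # residual

ms_p = ss_p / (n - 1)
ms_o = ss_o / (k - 1)
ms_e = ss_e / ((n - 1) * (k - 1))

# Method-of-moments variance components (negative estimates set to 0).
var_p = max((ms_p - ms_e) / k, 0.0)   # between persons
var_o = max((ms_o - ms_e) / n, 0.0)   # between observers
var_e = ms_e                          # residual

# ICC(agreement): person variance relative to total variance.
icc_agreement = var_p / (var_p + var_o + var_e)
print(round(icc_agreement, 3))  # 0.987 for these hypothetical data
```

With these made-up data the between-person variance dominates, so the ICC is high; in a more homogeneous sample the same measurement error would yield a much lower ICC, which is exactly the dependence on sample heterogeneity discussed in this article.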

Illustration of ICC and SEM calculations in the example

Table 2 presents the values of the variance components for the affected and the nonaffected shoulder. The variance components were estimated with SPSS (version 10.1), with the range of motion values as dependent variable and persons and PTs as random factors, using the restricted maximum likelihood method. From these variance components, the above-mentioned SEMs can be calculated. For the affected shoulder:

SEMagreement,AB = √(σ²pt,AB + σ²residual) = √(0 + 49.98) = 7.07°

SEMconsistency,AB = √σ²residual = √49.98 = 7.07°
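The SEM calculation for the affected shoulder can be reproduced by hand from the variance components quoted in the text; a minimal sketch:

```python
import math

# Variance components for the affected shoulder, as quoted in the text.
sigma2_pt = 0.0          # systematic variance between the two PTs
sigma2_residual = 49.98  # residual variance

sem_agreement = math.sqrt(sigma2_pt + sigma2_residual)
sem_consistency = math.sqrt(sigma2_residual)
print(round(sem_agreement, 2), round(sem_consistency, 2))  # 7.07 7.07
```

Because the PT variance happens to be zero here, the agreement and consistency SEMs coincide; they differ whenever the raters show systematic differences.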

Three ways to obtain SEM values

To facilitate and encourage the use of agreement parameters we will demonstrate how agreement parameters can be derived from the ICC formula, or can be calculated in other ways.

  1. SEM values can easily be derived from the ICC formula, if all variance components are presented. In that case, the reader can calculate the ICC of his/her own choice. SEM is calculated as √σ²error, which equals √(σ²pt + σ²residual) if one wishes to take the systematic differences between the PTs into account; otherwise, it equals √σ²residual.
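Another standard route to an SEM, when the variance components are not reported, uses only the reliability coefficient and the standard deviation of the observed scores, via the well-known relation SEM = SD·√(1 − ICC). The numbers below are illustrative, not the article's values:

```python
import math

# Standard relation between SEM and a reliability coefficient:
#   SEM = SD * sqrt(1 - ICC)
# The SD and ICC below are illustrative, not taken from the article.
sd = 15.0    # standard deviation of the observed scores (degrees)
icc = 0.90   # reported reliability coefficient

sem = sd * math.sqrt(1 - icc)
print(round(sem, 2))  # 4.74
```

Note that the SD here must come from the same sample in which the ICC was estimated, since the ICC itself depends on that sample's heterogeneity.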

Typical parameters for agreement and reliability

For repeated measurements on a continuous scale, as in our example, an ICC is the most appropriate reliability parameter. An extensive overview of the various ICC formulas is provided by McGraw and Wong [11].

In our example, agreement was expressed as the percentage of observations lying between predefined values (Table 1). Presentation in this way makes sense in clinical practice, because every PT knows what 5° and 10° mean. This measure was chosen because it can easily be interpreted by PTs.
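Computing such a percentage-within-limits measure is straightforward; a minimal sketch with hypothetical paired readings (not the article's data):

```python
# Hypothetical paired readings (degrees) by the two PTs; not the article's data.
pta = [90, 75, 60, 110, 95]
ptb = [88, 78, 52, 112, 91]

diffs = [abs(a - b) for a, b in zip(pta, ptb)]
within_5 = 100 * sum(d <= 5 for d in diffs) / len(diffs)    # % of pairs within 5 degrees
within_10 = 100 * sum(d <= 10 for d in diffs) / len(diffs)  # % of pairs within 10 degrees
print(within_5, within_10)  # 80.0 100.0
```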

Clinical interpretation

Agreement parameters are expressed on the actual scale of measurement, rather than as a dimensionless value between 0 and 1, as reliability parameters are. This is an important advantage for clinical interpretation. If weights are measured in kilograms, the dimension of the SEM is kilograms. For example, if we know that a weighing scale has a SEM of 300 g, we know that we can use it to monitor adult body weight, because changes of less than 1 kg are not important. The smallest detectable change can be derived directly from the SEM.
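A common way to obtain a smallest detectable change (SDC) from the SEM is the formula SDC = 1.96·√2·SEM, which gives the change that exceeds measurement error with 95% confidence for a difference between two measurements. A minimal sketch, using the affected-shoulder SEM from the example:

```python
import math

# Smallest detectable change from the SEM, using the common formula
#   SDC = 1.96 * sqrt(2) * SEM   (95% confidence, difference of two measurements)
sem = 7.07  # SEM for the affected shoulder, degrees
sdc = 1.96 * math.sqrt(2) * sem
print(round(sdc, 1))  # 19.6
```

In words: a change in range of motion smaller than about 19.6° cannot be distinguished from measurement error in this example.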

Conclusion

In this article we have shown the important difference between the parameters of reliability and agreement, and their relationship. Agreement parameters will be more stable over different population samples than reliability parameters, as we observed in our shoulder example, in which the SEM was quite similar for the affected and the nonaffected shoulder. Reliability parameters are highly dependent on the variation in the population sample, and are only generalizable to samples with a similar heterogeneity.

