Meta-Analysis of Diagnostic Tests With Imperfect Reference Standards

https://doi.org/10.1016/S0895-4356(99)00086-4

Abstract

We present a method to estimate the summary receiver operating characteristic (SROC) curve for combining information on a diagnostic test from several different studies. Unlike previous methods that assume the reference standard to be error free, our approach allows for the possibility of errors in the reference standard, through use of a latent class model. The model provides estimates of the sensitivity and specificity of the diagnostic test and the case prevalence in each study; these parameters can then be used in a meta-analysis (for example, regressing a measure of test discrimination on a measure of the diagnostic threshold, as proposed by Moses et al.) to fit the SROC curve. The method is illustrated with an example on Pap smears that shows how adjusting for imperfection in the reference standard typically reduces the scatter of data in the SROC plot, and tends to indicate better performance of the test than otherwise.

Introduction

Meta-analysis is becoming increasingly popular as a method of summarizing the results of studies. As in the meta-analysis of data from clinical trials, the meta-analysis of diagnostic data entails more than simply pooling all the data into a fourfold table. Simple pooling may cause serious bias because of confounding of disease prevalence and test thresholds used in the contributing studies [1,2].

Although meta-analysis has been applied mostly to randomized trials, regression methods have more recently been developed specifically for the meta-analysis of diagnostic tests [3]. An alternative method is to calculate weighted averages of sensitivity and specificity separately [4], but it has been recognized that these estimates are likely to be negatively biased [1]. It therefore appears preferable to estimate sensitivity and specificity simultaneously in some way, to allow an assessment of overall test performance.

A simple and appropriate method is to plot the true-positive rate (TPR, or sensitivity) against the false-positive rate (FPR, or 1 − specificity) for each study and to fit a summary receiver operating characteristic (SROC) curve [3,5]. The method is intuitively appealing because it takes account of the fact that different studies may have explicitly or implicitly used different test thresholds to differentiate “positive” from “negative” tests, so that studies choosing less extreme thresholds may have higher TPRs (better sensitivity) at the expense of higher FPRs (poorer specificity). While this method of SROC estimation is probably the most widely used, several other meta-analytic methods have been described: latent scale logistic regression [6], a weighted combination of odds ratios that can incorporate continuous tests [7], a method for combining the areas under ROC curves [8], and an ordinal regression method [1].
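To make the SROC construction concrete, here is a minimal sketch of the Moses et al. regression [3] in Python; the per-study counts are hypothetical, invented purely for illustration. Each study contributes D = logit(TPR) − logit(FPR), a measure of discrimination (the log diagnostic odds ratio), and S = logit(TPR) + logit(FPR), a proxy for the diagnostic threshold; a line D = a + bS is fitted and back-transformed into a TPR-versus-FPR curve.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

# Hypothetical per-study 2x2 counts: columns are TP, FP, FN, TN.
counts = np.array([
    [45,  8,  5, 42],
    [30, 15, 10, 45],
    [60,  5, 20, 65],
    [25, 20,  5, 50],
])
TP, FP, FN, TN = counts.T

# Adding 0.5 to each cell is a common continuity correction for zeros.
tpr = (TP + 0.5) / (TP + FN + 1.0)
fpr = (FP + 0.5) / (FP + TN + 1.0)

# Moses et al.: regress discrimination D on the threshold proxy S.
D = logit(tpr) - logit(fpr)
S = logit(tpr) + logit(fpr)
b, a = np.polyfit(S, D, 1)            # slope b, intercept a

# Back-transform D = a + b*S into the SROC: solving for logit(TPR) gives
# logit(TPR) = (a + (1 + b) * logit(FPR)) / (1 - b).
fpr_grid = np.linspace(0.01, 0.99, 99)
sroc_tpr = 1 / (1 + np.exp(-(a + (1 + b) * logit(fpr_grid)) / (1 - b)))
```

The Moses et al. proposal also considers weighted versions of this regression; the unweighted fit above is the simplest variant.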

In meta-analysis of diagnostic tests, as in most primary studies, the usual assumption is that the test is being compared to a reference standard that is error free. However, there is a substantial body of literature documenting that reference standards often, and perhaps usually, have errors [9-11]. For example, Fleming [12] has recently summarized the levels of agreement in nine published studies of histopathology. The kappa values ranged from 0.009 to 0.68, with only one exceeding 0.6, a value that might be considered to indicate good agreement. Since histopathology is commonly taken as the reference standard, it is clear that the reference will often be imperfect.

Such referent errors can have serious consequences for the estimation of test accuracy. In the simplest case, when the referent and test errors are independent, failure to recognize the reference standard errors causes an underestimation of the test performance characteristics. The degree of underestimation is a nonlinear function of prevalence, and hence there is an apparent dependence of test performance on the prevalence of the condition [13-16]. In contrast, allowing for errors in the reference standard usually better reflects the clinical situation; for instance, even though biopsy may be regarded as the best available evidence for cervical intraepithelial neoplasia following an abnormal Pap smear test, there is still uncertainty and error associated with the biopsy result.
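The nonlinear dependence on prevalence is easy to verify numerically. The following sketch, with illustrative accuracy values of our own choosing, computes the apparent sensitivity and specificity of a test judged against an imperfect reference, under the assumption that test and referent errors are conditionally independent given true disease status.

```python
def apparent_accuracy(se_t, sp_t, se_r, sp_r, prev):
    """Apparent (observed) sensitivity and specificity of a test when
    accuracy is measured against an imperfect reference standard, assuming
    errors are conditionally independent given true disease status."""
    # Joint probabilities of (test+, reference+) and (test-, reference-),
    # obtained by mixing over the latent true disease status.
    p_tp_rp = prev * se_t * se_r + (1 - prev) * (1 - sp_t) * (1 - sp_r)
    p_rp = prev * se_r + (1 - prev) * (1 - sp_r)
    p_tn_rn = prev * (1 - se_t) * (1 - se_r) + (1 - prev) * sp_t * sp_r
    return p_tp_rp / p_rp, p_tn_rn / (1 - p_rp)

# A test with true sensitivity/specificity of 0.90/0.90, judged against a
# reference with 0.95/0.95, looks progressively worse as prevalence falls.
for prev in (0.05, 0.20, 0.50):
    se_obs, sp_obs = apparent_accuracy(0.90, 0.90, 0.95, 0.95, prev)
    print(f"prevalence {prev:.2f}: apparent se {se_obs:.2f}, sp {sp_obs:.2f}")
```

At a prevalence of 0.50 the apparent sensitivity is 0.86 rather than 0.90, and at a prevalence of 0.05 it drops to 0.50: exactly the prevalence-dependent underestimation described above.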

Correcting for referent error is relatively straightforward when accurate estimates of it are available [14]. However, this is not usually the case. Several methods have been described to correct for referent error when no such estimates exist. These include frequentist latent class methods [17-19], Bayesian methods [20], and approaches using “fuzzy gold standards” [21].

We have identified only one article [22] that corrected for unknown referent errors while pooling data from several studies. It used an EM algorithm to derive estimates of test characteristics from a series of studies, each of which involved two tests from a set of four. However, this approach does not take account of the possibility of different test thresholds in the various studies, and it does not extend to the construction of an SROC curve.

In this article, we develop a method that recognizes that the reference standard may be imperfect and that allows for between-study differences in test error rates, which may be related to differences in the definition of the diagnostic threshold. We estimate an adjusted SROC curve that takes such referent errors into account. Adjusting the SROC curve in this way will tend to reduce the bias in the estimated test performance characteristics, a tendency we illustrate by calculating SROCs before and after adjustment in a numerical example.


Methods

Our proposed method has the primary goal of estimating the SROC curve while taking errors in the referent into account. It consists of three main steps. First, we develop a latent class model for the diagnostic data arising from a set of studies being used in a meta-analysis. This model yields estimates of the disease prevalence in each study, and overall estimates of test sensitivity and specificity. The possibility of error in the reference standard is incorporated into the model. Second, we …
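As a concrete illustration of the first step, here is a minimal sketch of a two-test latent class model fitted by EM across several studies, in the spirit of the Hui and Walter two-test, multiple-population model; the shared error rates, the starting values, and the conditional-independence assumption are our illustrative simplifications, not a transcription of the authors' procedure.

```python
import numpy as np

def latent_class_em(tables, n_iter=500):
    """EM for a latent class model: K studies, each a 2x2 cross-
    classification n[k, t, r] of test result t and referent result r
    (0 = negative, 1 = positive). Prevalence varies by study; the test and
    referent sensitivities/specificities are shared across studies, and
    errors are conditionally independent given true disease status."""
    n = np.asarray(tables, dtype=float)            # shape (K, 2, 2)
    prev = np.full(len(n), 0.5)                    # study prevalences
    se_t, sp_t, se_r, sp_r = 0.8, 0.8, 0.9, 0.9    # starting values
    t = np.array([[0], [1]])                       # test result in each cell
    r = np.array([[0, 1]])                         # referent result in each cell
    for _ in range(n_iter):
        # Cell probabilities given diseased (d1) and non-diseased (d0).
        d1 = se_t**t * (1 - se_t)**(1 - t) * se_r**r * (1 - se_r)**(1 - r)
        d0 = (1 - sp_t)**t * sp_t**(1 - t) * (1 - sp_r)**r * sp_r**(1 - r)
        # E-step: posterior probability of disease in each cell.
        num = prev[:, None, None] * d1
        w = num / (num + (1 - prev)[:, None, None] * d0)
        # M-step: re-estimate parameters from expected counts.
        nd, nh = n * w, n * (1 - w)                # diseased / non-diseased
        prev = nd.sum(axis=(1, 2)) / n.sum(axis=(1, 2))
        se_t, sp_t = nd[:, 1, :].sum() / nd.sum(), nh[:, 0, :].sum() / nh.sum()
        se_r, sp_r = nd[:, :, 1].sum() / nd.sum(), nh[:, :, 0].sum() / nh.sum()
    return prev, (se_t, sp_t), (se_r, sp_r)
```

With two tests observed in two or more populations of differing prevalence, a model of this form is identifiable, and the fitted study-level parameters can then feed the SROC regression sketched in the Introduction.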

Example

Our example concerns a meta-analysis of the performance of the Papanicolaou (Pap) smear for the diagnosis of cervical precancer. The data set consists of all 59 studies comparing Pap test results with histology that could be identified by extensive literature searching, and that provided the necessary data on sensitivity and specificity [1,28]. The median sample size of the studies was 127, with an interquartile range of 87–300. Estimates of sensitivity and specificity ranged from 11% to 99% and 14% …

Discussion

Our method has combined the results of a latent class model with the conventional regression method of estimating the SROC curve for a diagnostic test. In doing so, it is, to our knowledge, the first approach to the meta-analysis of diagnostic test data that allows for the possibility of errors in the referent standard, while reflecting differences in error rates and in diagnostic threshold between studies.

Allowing for errors in the referent standard is potentially …

Acknowledgements

Dr. Walter holds a National Health Scientist Award from Health Canada.

References (33)

  • A.S. Midgette et al. A meta-analytic method for summarizing diagnostic test performance. Med Decis Making (1993)
  • L. Irwig et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med (1994)
  • C.M. Rutter et al. Regression methods for meta-analysis of diagnostic test data. Academic Radiol (1995)
  • V. Hasselblad et al. Meta-analysis of screening and diagnostic tests. Psychol Bull (1995)
  • D.K. McClish. Combining and comparing area estimates across studies or strata. Med Decis Making (1992)
  • P. de Neef. Evaluating rapid tests for streptococcal pharyngitis: The apparent accuracy of a diagnostic test when there are errors in the standard of comparison. Med Decis Making (1987)