Meta-Analysis of Diagnostic Tests With Imperfect Reference Standards
Introduction
Meta-analysis is becoming increasingly popular as a method of summarizing the results of studies. As in the meta-analysis of data from clinical trials, the meta-analysis of diagnostic data entails more than simply pooling all the data into a fourfold table. Simple pooling may cause serious bias because of confounding of disease prevalence and test thresholds across the contributing studies [1, 2].
Although meta-analysis has been applied mostly to randomized trials, regression methods have more recently been developed specifically for meta-analysis of diagnostic tests [3]. A second approach is to calculate weighted averages of sensitivity and specificity separately [4], but it has been recognized that such estimates are likely to be negatively biased [1]. It therefore appears preferable to estimate sensitivity and specificity simultaneously, allowing an assessment of overall test performance.
A simple and appropriate method is to plot the true-positive rate (TPR, or sensitivity) against the false-positive rate (FPR, or 1 − specificity) for each study and to fit a summary receiver operating characteristic (SROC) curve [3, 5]. The method is intuitively appealing because it takes account of the fact that different studies may have explicitly or implicitly used different test thresholds to differentiate “positive” from “negative” tests, so that studies choosing less extreme thresholds may achieve higher TPRs (better sensitivity) at the expense of higher FPRs (poorer specificity). While this method of SROC estimation is probably the most widely used, several other meta-analytic methods have been described: latent scale logistic regression [6], a weighted combination of odds ratios [7] that can incorporate continuous tests, a method for combining the areas under ROC curves [8], and an ordinal regression method [1].
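The regression formulation behind SROC estimation can be made concrete with a short sketch. The code below is our illustration of the widely used linear-regression (Moses–Littenberg-style) approach, in which the log diagnostic odds ratio D is regressed on a threshold proxy S; the function and variable names are ours, not the article's.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sroc_fit(tpr, fpr):
    """Fit the summary ROC line D = a + b*S, where
    D = logit(TPR) - logit(FPR) is the log diagnostic odds ratio and
    S = logit(TPR) + logit(FPR) acts as a proxy for the test threshold."""
    tpr, fpr = np.asarray(tpr), np.asarray(fpr)
    D = logit(tpr) - logit(fpr)
    S = logit(tpr) + logit(fpr)
    b, a = np.polyfit(S, D, 1)  # np.polyfit returns the slope first
    return a, b

def sroc_tpr(a, b, fpr):
    """Back-transform the fitted line to TPR as a function of FPR:
    logit(TPR) = (a + logit(FPR) * (1 + b)) / (1 - b)."""
    x = logit(np.asarray(fpr))
    return 1.0 / (1.0 + np.exp(-(a + x * (1 + b)) / (1 - b)))
```

When the slope b is zero, the curve reduces to a line of constant diagnostic odds ratio; a nonzero slope captures the dependence of accuracy on threshold across studies.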
In meta-analysis of diagnostic tests, as in most primary studies, a usual assumption is that the test is being compared to an error-free reference standard. However, a substantial body of literature documents that reference standards often, perhaps usually, have errors [9–11]. For example, Fleming [12] has recently summarized the levels of agreement in nine published studies of histopathology. The kappa values ranged from 0.009 to 0.68, and only one exceeded 0.6, a value that might be considered to indicate good agreement. Since histopathology is commonly taken as the reference standard, it is clear that the reference standard will often be imperfect.
Such referent errors can have serious consequences for the estimation of test accuracy. In the simplest case, when the referent and test errors are independent, failure to recognize the reference standard errors causes an underestimation of the test performance characteristics. The degree of underestimation is a nonlinear function of prevalence, so test performance appears to depend on the prevalence of the condition [13–16]. In contrast, allowing for errors in the reference standard usually better reflects the clinical situation; for instance, even though biopsy may be regarded as the best available evidence for cervical intraepithelial neoplasia following an abnormal Pap smear test, there is still uncertainty and error associated with the biopsy result.
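Under the conditional-independence assumption, the distortion described above can be computed directly. The sketch below (our illustration of the standard algebra, not code from the article) derives the apparent sensitivity and specificity of a test judged against an imperfect referent by summing over the true disease status.

```python
def apparent_accuracy(se_t, sp_t, se_r, sp_r, prev):
    """Apparent sensitivity and specificity of a test (se_t, sp_t) when
    judged against an imperfect referent (se_r, sp_r), assuming the two
    sets of errors are conditionally independent given true disease."""
    # Joint probabilities P(T, R), summing over true disease status D
    p_tp = prev * se_t * se_r + (1 - prev) * (1 - sp_t) * (1 - sp_r)  # T+, R+
    p_fn = prev * (1 - se_t) * se_r + (1 - prev) * sp_t * (1 - sp_r)  # T-, R+
    p_fp = prev * se_t * (1 - se_r) + (1 - prev) * (1 - sp_t) * sp_r  # T+, R-
    p_tn = prev * (1 - se_t) * (1 - se_r) + (1 - prev) * sp_t * sp_r  # T-, R-
    app_se = p_tp / (p_tp + p_fn)  # P(T+ | R+)
    app_sp = p_tn / (p_tn + p_fp)  # P(T- | R-)
    return app_se, app_sp
```

For example, with sensitivity and specificity of 0.90 for both the test and the referent, the apparent sensitivity is 0.25/0.34 ≈ 0.74 at 30% prevalence but only 0.50 at 10% prevalence, illustrating both the underestimation and its nonlinear dependence on prevalence.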
Correcting for referent error is relatively straightforward when accurate estimates of it are available [14]. However, this is not usually the case. Several methods have been described to correct for referent error when no such estimates exist. These include frequentist latent class methods [17–19], Bayesian methods [20], and approaches using “fuzzy gold standards” [21].
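When the referent error rates are known, the correction amounts to inverting a linear relation between the observed fourfold table and the true joint distribution of test result and disease. A minimal sketch, again under conditional independence (the function name and interface are ours, for illustration only):

```python
import numpy as np

def correct_for_referent_error(counts, se_r, sp_r):
    """Recover prevalence and true test accuracy from an observed 2x2
    table when the referent's error rates (se_r, sp_r) are known.
    counts = [[n(T+,R+), n(T+,R-)], [n(T-,R+), n(T-,R-)]].
    Assumes test and referent errors are conditionally independent."""
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()
    det = se_r + sp_r - 1.0  # must be > 0 for an informative referent
    # Invert P(T=i, R=j) = sum_d P(T=i, D=d) P(R=j | D=d), row by row
    a = (sp_r * probs[:, 0] - (1 - sp_r) * probs[:, 1]) / det  # P(T=i, D+)
    b = (se_r * probs[:, 1] - (1 - se_r) * probs[:, 0]) / det  # P(T=i, D-)
    prev = a.sum()
    se_t = a[0] / prev
    sp_t = b[1] / (1 - prev)
    return prev, se_t, sp_t
```

Note that the correction degrades as se_r + sp_r approaches 1 (an uninformative referent), which is one reason accurate external estimates of referent error are so valuable.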
We have identified only one article [22] which has corrected for unknown referent errors while pooling data from several studies. It used an EM algorithm to derive estimates of test characteristics using data from a series of several studies, each of which involved two tests from a set of four. However, this approach does not take account of the possibility of different test thresholds in the various studies, and does not extend to the construction of a SROC curve.
In this article, we develop a method that recognizes that the reference standard may be imperfect and allows for between-study differences in test error rates, which may be related to differences in their definition of diagnostic threshold. We estimate an adjusted SROC curve that takes such referent errors into account. Adjusting the SROC curve in this way will tend to reduce the bias in the estimated test performance characteristics. This tendency is illustrated by calculating SROCs before and after adjustment in a numerical example.
Methods
Our proposed method has the primary goal of estimating the SROC curve while taking errors in the referent into account. It consists of three main steps. First, we develop a latent class model for the diagnostic data arising from a set of studies being used in a meta-analysis. This model yields estimates of the disease prevalence in each study, and overall estimates of test sensitivity and specificity. The possibility of error in the reference standard is incorporated into the model. Second, we
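Although the full model of the Methods section is not reproduced here, the latent-class/EM machinery it builds on can be sketched generically. The code below fits a two-class latent class model with conditionally independent binary tests to a single sample; it is an illustrative simplification under assumed starting values, not the authors' model, which additionally handles multiple studies and threshold differences.

```python
import numpy as np

def latent_class_em(data, n_iter=500, prev0=0.5):
    """EM for a two-class latent class model with conditionally
    independent binary tests. data: (n_subjects, n_tests) 0/1 array.
    Returns estimated prevalence, sensitivities, and specificities."""
    n, t = data.shape
    prev = prev0
    se = np.full(t, 0.8)  # starting values (our assumption)
    sp = np.full(t, 0.8)
    for _ in range(n_iter):
        # E-step: posterior probability of disease for each subject
        l1 = prev * np.prod(se ** data * (1 - se) ** (1 - data), axis=1)
        l0 = (1 - prev) * np.prod((1 - sp) ** data * sp ** (1 - data), axis=1)
        w = l1 / (l1 + l0)
        # M-step: weighted maximum-likelihood updates
        prev = w.mean()
        se = (w[:, None] * data).sum(axis=0) / w.sum()
        sp = ((1 - w)[:, None] * (1 - data)).sum(axis=0) / (1 - w).sum()
    return prev, se, sp
```

With three or more conditionally independent tests the model is identifiable from a single population; with only two tests, identifiability requires additional structure such as the two-population design of Hui and Walter.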
Example
Our example concerns a meta-analysis of the performance of the Papanicolaou (Pap) smear for the diagnosis of cervical precancer. The data set consists of all 59 studies comparing Pap test results with histology that could be identified by extensive literature searching, and that provided the necessary data on sensitivity and specificity [1, 28]. The median sample size of the studies was 127, with an interquartile range of 87–300. Estimates of sensitivity and specificity ranged from 11% to 99% and 14%
Discussion
Our method combines the results of a latent class model with the conventional regression method of estimating the SROC curve for a diagnostic test. To our knowledge, it is the first approach to the meta-analysis of diagnostic test data that allows for the possibility of errors in the reference standard while also reflecting between-study differences in error rates and in diagnostic threshold.
Allowing for errors in the referent standard is potentially
Acknowledgements
Dr. Walter holds a National Health Scientist Award from Health Canada.
References (33)
- et al. Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol (1995).
- The discrepancy in discrepant analysis. Lancet (1996).
- et al. Inter-observer variation in cytological and histological diagnoses of cervical neoplasia and its epidemiologic implication. J Clin Epidemiol (1995).
- et al. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol (1988).
- et al. Sensitivity and specificity of a diagnostic test determined by repeated observations in the absence of an external standard. J Clin Epidemiol (1991).
- et al. Sensitivity and specificity of diagnostic tests in acute maxillary sinusitis determined by maximum likelihood in the absence of an external standard. J Clin Epidemiol (1994).
- et al. Relative observer accuracy for dichotomized variables. J Chron Dis (1985).
- et al. Comparing dichotomous screening tests when individuals negative on both tests are not verified. J Clin Epidemiol (1997).
- Statistical Analysis for Rates and Proportions (1981).
- et al. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med (1993).
- A meta-analytic method for summarizing diagnostic test performance. Med Decis Making.
- Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med.
- Regression methods for meta-analysis of diagnostic test data. Academic Radiol.
- Meta-analysis of screening and diagnostic tests. Psychol Bull.
- Combining and comparing area estimates across studies or strata. Med Decis Making.
- Evaluating rapid test for streptococcal pharyngitis: the apparent accuracy of a diagnostic test when there are errors in the standard of comparison. Med Decis Making.