STARD 2015: an updated list of essential items for reporting diagnostic accuracy studiesBMJ 2015; 351 doi: https://doi.org/10.1136/bmj.h5527 (Published 28 October 2015) Cite this as: BMJ 2015;351:h5527
All rapid responses
The group is to be congratulated on their work in setting out the basics of assessing the usefulness of diagnostic tests. They are correct to state that there are other measures of diagnostic test accuracy in addition to specificity, the likelihood ratio, and so on. In fact, the latter indices have little if any role in assessing the usefulness of diagnostic findings in day-to-day clinical practice. Doctors in wards and clinics rarely, if ever, interpret findings by applying Bayes' rule with the independence assumption to derive post-test probabilities from pre-test probabilities and likelihood ratios using nomograms or computers. The resulting calculated probabilities may be very misleading [1].
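For readers unfamiliar with the calculation being criticised here, the pre-test-to-post-test conversion works by expressing Bayes' rule in odds form. A minimal sketch (the probability and likelihood ratio below are hypothetical illustrations, not clinical values):

```python
def post_test_probability(pre_test_p, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability
    via Bayes' rule expressed in odds form."""
    pre_odds = pre_test_p / (1 - pre_test_p)   # probability -> odds
    post_odds = pre_odds * likelihood_ratio    # odds form of Bayes' rule
    return post_odds / (1 + post_odds)         # odds -> probability

# Hypothetical example: 20% pre-test probability, positive LR of 5
# pre-test odds 0.25, post-test odds 1.25, post-test probability ~0.56
p = post_test_probability(0.20, 5.0)
```

The letter's point is that this arithmetic, although simple, relies on an independence assumption that real clinical findings often violate.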
Experienced doctors reason by probabilistic elimination when they reason transparently. Individual diagnostic findings are often used to suggest lists of differential diagnoses. Other findings are then used to differentiate between the diagnoses on such a list, by occurring more often in patients with some of those diagnoses than in others. Findings are also used, alone or in combination with other findings, as ‘necessary’ and ‘sufficient’ diagnostic criteria, and as predictors of the benefits or harms of treatments. These criteria and outcome predictors can be arrived at in an evidence-based way [1]. In many of these roles the individual numerical result is used, and findings are not dichotomised (e.g. into ‘high’ or ‘low’), as this may obscure their usefulness.
1. Llewelyn H, Ang AH, Lewis K, Abdullah A. The Oxford Handbook of Clinical Diagnosis, 3rd edition. Oxford: Oxford University Press; 2014, pp 615-642.
Competing interests: No competing interests
Re: STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies
We read with great interest the manuscript of Bossuyt et al. [1] on STARD 2015 and the development of its update [2]. We also believe that diagnostic accuracy studies are prone to bias in cases of methodological deficiency and improper statistical analysis. It is therefore necessary to introduce a general and extensive standard to reduce such bias. We propose the following vital amendments to the updated STARD standard to improve its application in clinical studies.
The performance of a diagnostic method must be assessed against several criteria, since each performance index examines diagnostic accuracy from a different angle. For example, sensitivity (Se) reflects the statistical power of the test, while specificity (Sp) is related to the Type I error. Not only Se and Sp but also the precision (positive predictive value, PPV) are highly influenced by the prevalence of disease in the population (P) (Supplementary material S4). As an example, in a diagnostic test with Se of 80% and Sp of 95%, when the prevalence of disease is 10%, the PPV will be 64%: a person with a positive test result is really sick with a probability of only 64%. The prevalence of disease is usually not high, resulting in lower values of precision. Moreover, a reliable diagnostic test must satisfy conditions on the Type I and II errors, the PPV, and the diagnostic odds ratio (DOR) [3, 6].
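The worked example above follows directly from Bayes' rule applied to the 2x2 diagnostic table; a minimal sketch:

```python
def ppv(se, sp, prevalence):
    """Positive predictive value from sensitivity, specificity and
    prevalence (Bayes' rule over the 2x2 diagnostic table)."""
    tp = se * prevalence               # true positives per unit population
    fp = (1 - sp) * (1 - prevalence)   # false positives per unit population
    return tp / (tp + fp)

# The example from the text: Se = 80%, Sp = 95%, prevalence = 10%
print(round(ppv(0.80, 0.95, 0.10), 2))   # -> 0.64
```

Dropping the prevalence to 1% in the same call shows how sharply PPV falls in low-prevalence settings, which is the point being made.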
1s. In the STARD standard, the authors state that at least one measure of accuracy (such as Se, Sp, predictive values, or the area under the curve, AUC) must be reported in the title or abstract (Item no. 1; [1, 2]). We believe that Se, Sp and PPV should all be reported in the abstract. Alternatively, when the value of the prevalence P is given, PPV can be estimated from the other parameters and could thus be omitted (Supplementary material S4).
2s. Moreover, the methods for estimating or comparing measures of diagnostic accuracy must be described (Item no. 14; [1, 2]). We believe that reporting multiple indices would not only validate the test more rigorously but also make it easier to compare different diagnostic tests. A set comprising Se, Sp, accuracy, PPV, F-score, AUC, the Matthews correlation coefficient (MCC), DOR, discriminant power (DP) and kappa could be proposed (an example of such a report is given in [6]). Although related, each of these has a particular interpretation, so the proposed test can be evaluated from different perspectives.
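Most of the indices in the proposed set are simple functions of the 2x2 confusion table; a sketch of a selection of them (the helper name and grouping are our own, not part of STARD):

```python
import math

def binary_indices(tp, fp, fn, tn):
    """A selection of the indices listed above, computed from a 2x2
    confusion table (illustrative helper only)."""
    n = tp + fp + fn + tn
    se = tp / (tp + fn)                       # sensitivity (recall)
    sp = tn / (tn + fp)                       # specificity
    ppv = tp / (tp + fp)                      # precision
    acc = (tp + tn) / n                       # overall accuracy
    f1 = 2 * ppv * se / (ppv + se)            # F-score
    mcc = (tp * tn - fp * fn) / math.sqrt(    # Matthews correlation coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    dor = (tp * tn) / (fp * fn)               # diagnostic odds ratio
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (acc - pe) / (1 - pe)             # Cohen's kappa
    return dict(Se=se, Sp=sp, PPV=ppv, Acc=acc, F1=f1,
                MCC=mcc, DOR=dor, Kappa=kappa)
```

AUC and DP are omitted here because they need the continuous scores and a log-odds transform respectively, not just the four cell counts.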
3s. Meanwhile, when different diagnostic tests are compared against the gold standard, the superiority of one test over another must be demonstrated using a proper statistical test (e.g. McNemar's test [7]). Otherwise, minor, insignificant and random improvements might be erroneously reported as substantial.
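McNemar's test uses only the discordant pairs, i.e. the cases on which the two tests disagree with each other relative to the gold standard. A minimal sketch of the exact (binomial) version, assuming paired results on the same patients:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) McNemar test on the two discordant counts:
    b = cases test A got right and test B got wrong, c = the reverse.
    Returns a two-sided p-value. Illustrative sketch only."""
    n, k = b + c, min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical example: 1 vs 9 discordant pairs gives p ~ 0.021,
# whereas 5 vs 5 discordant pairs gives p = 1.0 (pure chance)
p_small = mcnemar_exact(1, 9)
```

The chi-squared approximation with continuity correction is the more common large-sample form; the exact version above avoids that approximation for small discordant counts.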
4s. It has been shown in the literature that the performance indices used for binary diagnostic tests are not suitable in multi-class cases, where the gold standard data contain more than two groups (rather than simply healthy or unhealthy). For example, the overall accuracy is biased. Proper indices, such as micro- or macro-averaged performance indices, should be reported in these cases [8].
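The difference between the two averaging schemes is that macro-averaging weights every class equally while micro-averaging weights every sample equally, so a weak minority class is hidden by the latter. A sketch using recall on a hypothetical imbalanced 3-class confusion matrix:

```python
def macro_micro_recall(confusion):
    """Macro- vs micro-averaged recall for a k-class confusion matrix
    (rows = true class, columns = predicted class). Illustrative only."""
    k = len(confusion)
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(k)]
    macro = sum(per_class) / k                              # classes weighted equally
    total = sum(sum(row) for row in confusion)
    micro = sum(confusion[i][i] for i in range(k)) / total  # samples weighted equally
    return macro, micro

# Hypothetical imbalanced data: the majority class dominates micro-averaging
cm = [[90, 5, 5],   # majority class, recall 0.90
      [2, 6, 2],    # minority class, recall 0.60
      [2, 2, 6]]    # minority class, recall 0.60
macro, micro = macro_micro_recall(cm)   # macro 0.70 vs micro 0.85
```

Note that for single-label multi-class data the micro-averaged recall equals the overall accuracy, which is exactly the biased index the text warns about.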
5s. The intended sample size and the method used to estimate it should be reported (Item no. 18; [1, 2]). This is an important point in the experimental design, needed to generalize the proposed diagnostic test to the whole population. The unified framework for sample size estimation for different indices of accuracy proposed by Hajian-Tilaki [9] could be used, for instance.
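As a sketch of the sensitivity case of that framework (the standard normal-approximation formula, with d the desired marginal error; the notation and numbers below are our own illustration, not a prescription from the cited paper):

```python
from math import ceil

def n_for_sensitivity(se, d, prevalence, z=1.96):
    """Approximate total sample size so that sensitivity is estimated
    to within a marginal error d at ~95% confidence (normal
    approximation; dividing by prevalence converts the required number
    of diseased subjects into a total study size)."""
    return ceil(z**2 * se * (1 - se) / (d**2 * prevalence))

# Hypothetical example: anticipated Se = 0.80, error d = 0.05, prevalence 10%
n = n_for_sensitivity(0.80, 0.05, 0.10)   # 2459 participants in total
```

The division by prevalence is why low-prevalence diseases demand very large diagnostic accuracy studies.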
6s. The STARD standard could be extended to the validation of computer-aided diagnosis (CAD) systems, whose purpose is to improve diagnostic accuracy. In such systems, a rule-based or black-box mathematical model is trained (i.e., its internal parameters are tuned) and then used for medical diagnosis. The training and test sets must therefore be kept separate to avoid overestimating accuracy. Moreover, the samples in those sets must be randomly permuted in each validation repetition to guard against testing hypotheses suggested by the data (Type III errors [6, 10]). Cross-validation methods are therefore generally preferred to hold-out validation. Accordingly, in addition to the validation method, the average and dispersion of the performance indices over the validation procedure should be reported in the abstract and results sections, to reflect the accuracy and repeatability of the results over the entire population.
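The validation scheme described above, shuffled k-fold splits followed by reporting the mean and dispersion of the per-fold index, can be sketched as follows (the function name and the per-fold accuracies are hypothetical; real CAD studies may also need stratification by class):

```python
import random
from statistics import mean, stdev

def kfold_splits(n_samples, k=5, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for
    k-fold cross-validation. Sketch only."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)   # random permutation of the samples
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        yield train, test

# Report the average and dispersion over folds, as suggested above
accuracies = [0.82, 0.79, 0.85, 0.80, 0.83]   # hypothetical per-fold results
summary = f"accuracy {mean(accuracies):.2f} +/- {stdev(accuracies):.2f}"
```

Because every sample serves as test data exactly once, this also makes the train/test separation auditable, which hold-out validation with a single split does not.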
1. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Lijmer JG, Moher D, Rennie D, de Vet HC. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527.
2. Korevaar DA, Cohen JF, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Moher D, de Vet HCW, Altman DG, Hooft L, Bossuyt PMM. Updating standards for reporting diagnostic accuracy: the development of STARD 2015. Research Integrity and Peer Review 2016;1:7.
3. Ghosh AK, Wittich CM, Rhodes DJ, Beckman TJ, Edson RS, McCallum DK. Mayo Clinic internal medicine review. CRC Press; 2008.
4. Ellis PD. The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press; 2010.
5. Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science 2014;1.
6. Mohebian MR, Marateb HR, Mansourian M, Mañanas MA, Mokarian F. A hybrid computer-aided-diagnosis system for prediction of breast cancer recurrence (HPBCR) using optimized ensemble learning. Computational and Structural Biotechnology Journal 2017;15:75-85.
7. Dietterich TG. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 1998;10:1895-1923.
8. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Information Processing & Management 2009;45:427-437.
9. Hajian-Tilaki K. Sample size estimation in diagnostic test studies of biomedical informatics. Journal of Biomedical Informatics 2014;48:193-204.
10. Mosteller F. A k-sample slippage test for an extreme population. In: Fienberg SE, Hoaglin DC, editors. Selected papers of Frederick Mosteller. New York, NY: Springer New York; 2006. p. 101-109.
Competing interests: None declared.
Acknowledgements: The authors would like to thank Kevin McGill for reviewing a draft of this paper.
Funding Support: This work was supported by the People Programme (Marie Curie Actions) of the European Union Seventh Framework Programme (FP7/2007–2013) under REA grant agreement no. 600388 (TECNIOspring Programme), by the Agency for Business Competitiveness of the Government of Catalonia (ACCIÓ), and by the Spanish Ministry of Economy and Competitiveness, Spain (project DPI2014-59049-R).