CC BY Open access
Research Methods & Reporting

STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies

BMJ 2015; 351 doi: (Published 28 October 2015) Cite this as: BMJ 2015;351:h5527

Re: STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies

Dear Editor,

We read with great interest the article by Bossuyt et al. [1] regarding STARD 2015 and its updates [2]. We agree that diagnostic accuracy studies are prone to bias when their methodology is deficient or their statistical analysis is improper. A general and comprehensive standard is therefore needed to reduce such bias. We propose the following vital amendments to the updated STARD standard to improve its quality in clinical studies.

The performance of a diagnostic method must be assessed against several criteria [3], because each performance index views diagnostic accuracy from a different perspective. For example, sensitivity (Se) corresponds to the statistical power of the test, and specificity (Sp) is related to the Type I error. Not only Se and Sp but also the precision (positive predictive value, PPV) are highly influenced by the prevalence of disease in the population (P) ([6]; Supplementary material S4). As an example, for a diagnostic test with Se of 80% and Sp of 95%, when the prevalence of disease is 10%, the PPV is 64%. This means that a person with a positive test result actually has the disease with a probability of 64%. The prevalence of disease is usually not high, which results in lower values of precision. Moreover, a reliable diagnostic test must satisfy conditions on the Type I and II errors [4], the PPV [5], and the diagnostic odds ratio (DOR) [3, 6].
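The dependence of PPV on prevalence in the example above follows from Bayes' rule. The short Python sketch below reproduces the 64% figure; the function name `predictive_values` is ours and purely illustrative.

```python
def predictive_values(se, sp, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence (Bayes' rule)."""
    tp = se * prevalence                    # diseased and test-positive
    fn = (1.0 - se) * prevalence            # diseased and test-negative
    fp = (1.0 - sp) * (1.0 - prevalence)    # healthy and test-positive
    tn = sp * (1.0 - prevalence)            # healthy and test-negative
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Values from the text: Se = 80%, Sp = 95%, prevalence = 10%
ppv, npv = predictive_values(se=0.80, sp=0.95, prevalence=0.10)
print(f"PPV = {ppv:.2f}")  # PPV = 0.64

# Same test at a prevalence of 1%: the PPV drops markedly
ppv_low, _ = predictive_values(se=0.80, sp=0.95, prevalence=0.01)
print(f"PPV at 1% prevalence = {ppv_low:.3f}")
```

With the same Se and Sp, lowering the prevalence lowers the PPV, which is why the prevalence should accompany any reported precision.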

1s. In the STARD standard, the authors state that at least one measure of accuracy (such as Se, Sp, predictive values, or area under the curve, AUC) must be reported in the title or abstract (Item no. 1; [1, 2]). We believe that Se, Sp, and PPV should all be reported in the abstract. Alternatively, when the value of P is reported, PPV can be estimated from the other parameters and could thus be omitted ([6]; Supplementary material S4).

2s. Moreover, methods for estimating or comparing measures of diagnostic accuracy must be described according to the STARD standard (Item no. 14; [1, 2]). We believe that reporting multiple indices would not only validate the test more rigorously but also make it easier to compare different diagnostic tests. A set comprising Se, Sp, Accuracy, PPV, F-score, AUC, Matthews Correlation Coefficient (MCC), DOR, Discriminant Power (DP), and Kappa could be proposed (an example of such a report is in [6]). Although related, each of these has a particular interpretation, so the proposed test could be evaluated from different perspectives.
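As a minimal sketch of such a report, most of the indices listed above can be derived from the four cells of a 2x2 table. The code below uses the standard textbook definitions; the function name `binary_indices` and the example counts are our own assumptions, the DP expression is the commonly quoted form (sqrt(3)/pi times the natural logarithm of the DOR), and AUC is left out because it requires continuous test scores rather than counts.

```python
import math


def binary_indices(tp, fp, fn, tn):
    """Compute common accuracy indices from the cells of a 2x2 table.

    No guard against empty cells is included; a zero FP or FN count
    makes the DOR (and hence DP) undefined.
    """
    n = tp + fp + fn + tn
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    ppv = tp / (tp + fp)
    acc = (tp + tn) / n
    f1 = 2 * ppv * se / (ppv + se)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    dor = (tp * tn) / (fp * fn)
    dp = (math.sqrt(3) / math.pi) * math.log(dor)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (acc - p_chance) / (1 - p_chance)
    return {
        "Se": se, "Sp": sp, "PPV": ppv, "Accuracy": acc, "F1": f1,
        "MCC": mcc, "DOR": dor, "DP": dp, "Kappa": kappa,
    }


# Hypothetical counts: 1000 subjects, 10% prevalence, Se = 80%, Sp = 95%
idx = binary_indices(tp=80, fp=45, fn=20, tn=855)
for name, value in idx.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical table the accuracy is 93.5% while Kappa and MCC are considerably lower, which illustrates why a single index can give an incomplete picture.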

3s. Meanwhile, when the accuracy indices of different diagnostic tests are compared against the gold standard, the superiority of one test over another must be demonstrated using a proper statistical test (e.g. McNemar's test [7]). Otherwise, minor, statistically insignificant, random improvements might be erroneously reported as substantial.
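For illustration, when two tests are applied to the same subjects, McNemar's test uses only the discordant pairs. The sketch below implements the exact (binomial) two-sided version with the Python standard library; the function name `mcnemar_exact` and the counts are hypothetical, and the chi-squared approximation described in [7] is an alternative for larger samples.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: subjects classified correctly by test A but incorrectly by test B
    c: subjects classified correctly by test B but incorrectly by test A
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for two tests on the same subjects
p_small_gap = mcnemar_exact(b=12, c=9)
p_large_gap = mcnemar_exact(b=1, c=9)
print(f"b=12, c=9: p = {p_small_gap:.3f}")
print(f"b=1,  c=9: p = {p_large_gap:.3f}")
```

Only the second comparison would support a claim of superiority at the usual 5% level, although both show a nominal difference between the tests.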

4s. It is shown in the literature that the performance indices used for binary diagnostic tests are not suitable in multi-class cases where there are more than two groups (rather than healthy or unhealthy) in the gold standard data. For example, the overall accuracy is biased. Proper indices, such as micro- or macro- averaged performance indices, should be reported in these cases [8].
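A small sketch may clarify the difference between macro- and micro-averaging for single-label multi-class data. The definitions follow the usual conventions (macro: average of per-class values; micro: pooled counts over classes); the function names and the three-class example are hypothetical.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def averaged_indices(y_true, y_pred):
    """Macro- and micro-averaged recall and precision for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    recalls, precisions = [], []
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        recalls.append(safe_div(tp, tp + fn))
        precisions.append(safe_div(tp, tp + fp))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    return {
        "macro_recall": sum(recalls) / len(labels),
        "macro_precision": sum(precisions) / len(labels),
        "micro_recall": safe_div(tp_sum, tp_sum + fn_sum),
        "micro_precision": safe_div(tp_sum, tp_sum + fp_sum),
        "overall_accuracy": tp_sum / len(pairs),
    }


# Hypothetical imbalanced data: a classifier that always predicts "healthy"
y_true = ["healthy"] * 8 + ["mild"] + ["severe"]
y_pred = ["healthy"] * 10
res = averaged_indices(y_true, y_pred)
for name, value in res.items():
    print(f"{name}: {value:.3f}")
```

Here the overall accuracy is 80% even though no diseased subject is detected, whereas the macro-averaged recall of one third exposes the failure on the minority classes.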

5s. The intended sample size and the method used to estimate it should be reported, in line with the STARD standard (Item no. 18; [1, 2]). This is an important aspect of the experimental design if the proposed diagnostic test is to be generalized to the whole population. For instance, the unified framework for sample size estimation across different indices of accuracy proposed by Hajian-Tilaki [9] could be used.
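As a rough illustration only, one widely used normal-approximation approach (often attributed to Buderer) sizes the study so that the confidence interval of Se or Sp has a chosen half-width, then scales by the expected prevalence. We do not claim that this sketch reproduces the framework in [9]; the function names and input values are assumptions.

```python
import math
from statistics import NormalDist


def n_for_sensitivity(se, prevalence, half_width, confidence=0.95):
    """Total subjects needed so the CI of Se has the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_diseased = z**2 * se * (1 - se) / half_width**2
    return math.ceil(n_diseased / prevalence)


def n_for_specificity(sp, prevalence, half_width, confidence=0.95):
    """Total subjects needed so the CI of Sp has the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_healthy = z**2 * sp * (1 - sp) / half_width**2
    return math.ceil(n_healthy / (1 - prevalence))


# Anticipated Se = 80%, Sp = 95%, prevalence = 10%, half-width = 5 points
n_se = n_for_sensitivity(se=0.80, prevalence=0.10, half_width=0.05)
n_sp = n_for_specificity(sp=0.95, prevalence=0.10, half_width=0.05)
print(f"n for Se: {n_se}, n for Sp: {n_sp}, required: {max(n_se, n_sp)}")
```

At low prevalence the sensitivity requirement dominates, because only a small fraction of the recruited subjects are diseased.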

6s. It is possible to extend the STARD standard to the validation of computer-aided diagnosis (CAD) systems. The purpose of CAD is to improve diagnostic accuracy. In such systems, a rule-based or black-box mathematical model is trained (i.e., its internal parameters are tuned) and then used for medical diagnosis. Thus, the training and test sets must be different to avoid overestimating the accuracy. Moreover, the samples of those sets must be randomly permuted in each validation repetition to guard against testing hypotheses suggested by the data (Type III errors [6, 10]). Cross-validation methods are therefore generally preferred to hold-out validation. Accordingly, in addition to the validation method, the average and dispersion of the performance indices over the validation procedure should be reported in the abstract and results sections to reflect the accuracy and repeatability of the results over the entire population.
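The validation procedure described in this point can be sketched as repeated k-fold cross-validation with a fresh random permutation in each repetition. Everything in the example is hypothetical: the data are synthetic, the "classifier" is a toy threshold fitted on the training folds only, and the folds are not stratified.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n, k, rng):
    """Randomly permute 0..n-1 and split the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy training step: midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2


def repeated_cv_accuracy(xs, ys, k=5, repeats=10, seed=0):
    """Return mean and standard deviation of accuracy over all test folds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test_idx in k_fold_indices(len(xs), k, rng):
            held_out = set(test_idx)
            train = [i for i in range(len(xs)) if i not in held_out]
            # Parameters are tuned on the training folds only
            thr = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
            correct = sum((xs[i] > thr) == (ys[i] == 1) for i in test_idx)
            scores.append(correct / len(test_idx))
    return mean(scores), stdev(scores)


# Synthetic one-feature data: 100 healthy (label 0) and 100 diseased (label 1)
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(100)]
xs += [data_rng.gauss(2, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100

avg, sd = repeated_cv_accuracy(xs, ys)
print(f"Accuracy over 10 x 5-fold CV: mean = {avg:.3f}, SD = {sd:.3f}")
```

Reporting both the mean and the standard deviation, together with k and the number of repetitions, lets readers judge the repeatability of the estimate.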


[1] Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Lijmer JG, Moher D, Rennie D, Vet HC. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527.
[2] Korevaar DA, Cohen JF, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Moher D, de Vet HCW, Altman DG, Hooft L, Bossuyt PMM. Updating standards for reporting diagnostic accuracy: the development of STARD 2015. Research Integrity and Peer Review 2016;1:7.
[3] Ghosh AK, Wittich CM, Rhodes DJ, Beckman TJ, Edson RS, McCallum DK. Mayo clinic internal medicine review. CRC Press; 2008.
[4] Ellis PD. The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press; 2010.
[5] Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science 2014;1.
[6] Mohebian MR, Marateb HR, Mansourian M, Mañanas MA, Mokarian F. A Hybrid Computer-aided-diagnosis System for Prediction of Breast Cancer Recurrence (HPBCR) Using Optimized Ensemble Learning. Computational and Structural Biotechnology Journal 2017;15:75-85.
[7] Dietterich TG. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput 1998;10:1895-1923.
[8] Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Information Processing & Management 2009;45:427-437.
[9] Hajian-Tilaki K. Sample size estimation in diagnostic test studies of biomedical informatics. Journal of Biomedical Informatics 2014;48:193-204.
[10] Mosteller F. A k-Sample Slippage Test for an Extreme Population. In: Fienberg SE, Hoaglin DC, editors. Selected Papers of Frederick Mosteller. New York, NY: Springer New York; 2006, p. 101-109.

Competing interests: None declared.

Acknowledgements: The authors would like to thank Kevin McGill for reviewing a draft of this paper.

Funding Support: This work was supported by the People Programme (Marie Curie Actions) of the European Union Seventh Framework Programme (FP7/2007–2013) under REA grant agreement no. 600388 (TECNIOspring Programme), from the Agency for Business Competitiveness of the Government of Catalonia, ACCIÓ, and from the Spanish Ministry of Economy and Competitiveness, Spain (project DPI2014-59049-R).

21 April 2017
Hamid Reza Marateb
Professor, member of the BIOsignal Analysis for Rehabilitation and Therapy Research Group (BIOART)
Marjan Mansourian (Isfahan University of Medical Sciences), Miguel Angel Mañanas (Universitat Politècnica de Catalunya, BarcelonaTech (UPC))
Biomedical Engineering Research Center, Department of Automatic Control, Universitat Politècnica de Catalunya, BarcelonaTech (UPC)
C. Pau Gargallo, 5, 08028 Barcelona, Spain.