CC BY Open access
Research

Validation of models to diagnose ovarian cancer in patients managed surgically or conservatively: multicentre cohort study

BMJ 2020; 370 doi: https://doi.org/10.1136/bmj.m2614 (Published 30 July 2020) Cite this as: BMJ 2020;370:m2614
  1. Ben Van Calster, associate professor1 2 3,
  2. Lil Valentin, professor4 5,
  3. Wouter Froyman, consultant gynaecologist1 6,
  4. Chiara Landolfo, consultant gynaecologist1 7,
  5. Jolien Ceusters, biostatistician8,
  6. Antonia C Testa, professor9 10,
  7. Laure Wynants, assistant professor1 11,
  8. Povilas Sladkevicius, associate professor4 5,
  9. Caroline Van Holsbeke, consultant gynaecologist12,
  10. Ekaterini Domali, consultant gynaecologist13,
  11. Robert Fruscio, associate professor14,
  12. Elisabeth Epstein, associate professor15 16,
  13. Dorella Franchi, consultant gynaecologist17,
  14. Marek J Kudla, consultant gynaecologist18,
  15. Valentina Chiappa, consultant gynaecologist19,
  16. Juan L Alcazar, professor20,
  17. Francesco P G Leone, consultant gynaecologist21,
  18. Francesca Buonomo, consultant gynaecologist22,
  19. Maria Elisabetta Coccia, professor23,
  20. Stefano Guerriero, professor24,
  21. Nandita Deo, consultant gynaecologist25,
  22. Ligita Jokubkiene, consultant gynaecologist4 5,
  23. Luca Savelli, consultant gynaecologist26,
  24. Daniela Fischerová, professor27,
  25. Artur Czekierdowski, professor28,
  26. Jeroen Kaijser, consultant gynaecologist29,
  27. An Coosemans, assistant professor6 8 30,
  28. Giovanni Scambia, professor9 10,
  29. Ignace Vergote, professor6 8 30,
  30. Tom Bourne, professor1 6 7,
  31. Dirk Timmerman, professor1 6
  1. Department of Development and Regeneration, KU Leuven, Herestraat 49 Box 805, 3000 Leuven, Belgium
  2. Department of Biomedical Data Sciences, Leiden University Medical Centre, Leiden, Netherlands
  3. EPI-Centre, KU Leuven, Leuven, Belgium
  4. Department of Obstetrics and Gynaecology, Skåne University Hospital, Malmö, Sweden
  5. Department of Clinical Sciences Malmö, Lund University, Lund, Sweden
  6. Department of Obstetrics and Gynaecology, University Hospitals Leuven, Leuven, Belgium
  7. Queen Charlotte’s and Chelsea Hospital, Imperial College, London, UK
  8. Laboratory of Tumour Immunology and Immunotherapy, Department of Oncology, KU Leuven, Leuven, Belgium
  9. Department of Woman, Child and Public Health, Fondazione Policlinico Universitario Agostino Gemelli, Istituto di Ricovero e Cura a Carattere Scientifico, Rome, Italy
  10. Department of Life Science and Public Health, Universita’ Cattolica del Sacro Cuore, Rome, Italy
  11. Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Maastricht, Netherlands
  12. Department of Obstetrics and Gynaecology, Ziekenhuis Oost-Limburg, Genk, Belgium
  13. First Department of Obstetrics and Gynaecology, Alexandra Hospital, Medical School, National and Kapodistrian University of Athens, Athens, Greece
  14. Clinic of Obstetrics and Gynaecology, University of Milan-Bicocca, San Gerardo Hospital, Monza, Italy
  15. Department of Clinical Science and Education, Karolinska Institutet, Stockholm, Sweden
  16. Department of Obstetrics and Gynaecology, Södersjukhuset, Stockholm, Sweden
  17. Preventive Gynaecology Unit, Division of Gynaecology, European Institute of Oncology IRCCS, Milan, Italy
  18. Department of Perinatology and Oncological Gynaecology, School of Health Sciences in Katowice, Medical University of Silesia, Katowice, Poland
  19. Department of Gynaecologic Oncology, National Cancer Institute of Milan, Milan, Italy
  20. Department of Obstetrics and Gynaecology, Clinica Universidad de Navarra, School of Medicine, Pamplona, Spain
  21. Department of Obstetrics and Gynaecology, Biomedical and Clinical Sciences Institute L. Sacco, University of Milan, Milan, Italy
  22. Institute for Maternal and Child Health, IRCCS Burlo Garofolo, Trieste, Italy
  23. Department of Experimental and Clinical Biomedical Sciences, University of Florence, Florence, Italy
  24. Department of Obstetrics and Gynaecology, University of Cagliari, Policlinico Universitario Duilio Casula, Monserrato, Cagliari, Italy
  25. Department of Obstetrics and Gynaecology, Whipps Cross Hospital, London, UK
  26. Department of Obstetrics and Gynaecology, University of Bologna, Bologna, Italy
  27. Gynaecological Oncology Centre, Department of Obstetrics and Gynaecology, First Faculty of Medicine, Charles University and General University Hospital, Prague, Czech Republic
  28. First Department of Gynaecological Oncology and Gynaecology, Medical University of Lublin, Lublin, Poland
  29. Department of Obstetrics and Gynaecology, Ikazia Hospital, Rotterdam, Netherlands
  30. Leuven Cancer Institute, University Hospitals Leuven, Leuven, Belgium
  1. Correspondence to: D Timmerman dirk.timmerman{at}uzleuven.be
  • Accepted 4 June 2020

Abstract

Objective To evaluate the performance of diagnostic prediction models for ovarian malignancy in all patients with an ovarian mass managed surgically or conservatively.

Design Multicentre cohort study.

Setting 36 oncology referral centres (tertiary centres with a specific gynaecological oncology unit) or other types of centre.

Participants Consecutive adult patients presenting with an adnexal mass between January 2012 and March 2015 and managed by surgery or follow-up.

Main outcome measures Overall and centre specific discrimination, calibration, and clinical utility of six prediction models for ovarian malignancy (risk of malignancy index (RMI), logistic regression model 2 (LR2), simple rules, simple rules risk model (SRRisk), assessment of different neoplasias in the adnexa (ADNEX) with or without CA125). ADNEX allows the risk of malignancy to be subdivided into risks of a borderline, stage I primary, stage II-IV primary, or secondary metastatic malignancy. The outcome was based on histology if patients underwent surgery, or on results of clinical and ultrasound follow-up at 12 (±2) months. Multiple imputation was used when outcome based on follow-up was uncertain.

Results The primary analysis included 17 centres that met strict quality criteria for surgical and follow-up data (5717 of all 8519 patients). Because 812 patients (14%) had a mass that was already in follow-up at study recruitment, 4905 patients were included in the statistical analysis. The outcome was benign in 3441 (70%) patients and malignant in 978 (20%). Uncertain outcomes (486, 10%) were most often explained by limited follow-up information. The overall area under the receiver operating characteristic curve was highest for ADNEX with CA125 (0.94, 95% confidence interval 0.92 to 0.96), ADNEX without CA125 (0.94, 0.91 to 0.95) and SRRisk (0.94, 0.91 to 0.95), and lowest for RMI (0.89, 0.85 to 0.92). Calibration varied among centres for all models; however, the ADNEX models and SRRisk were the best calibrated. Calibration of the estimated risks for the tumour subtypes was good for ADNEX irrespective of whether or not CA125 was included as a predictor. Overall clinical utility (net benefit) was highest for the ADNEX models and SRRisk, and lowest for RMI. For patients who received at least one follow-up scan (n=1958), overall area under the receiver operating characteristic curve ranged from 0.76 (95% confidence interval 0.66 to 0.84) for RMI to 0.89 (0.81 to 0.94) for ADNEX with CA125.

Conclusions Our study found the ADNEX models and SRRisk are the best models to distinguish between benign and malignant masses in all patients presenting with an adnexal mass, including those managed conservatively.

Trial registration ClinicalTrials.gov NCT01698632.

Introduction

Ovarian cancer is a gynaecological malignancy with a high mortality rate. In 2018, an estimated 295 400 women developed ovarian cancer worldwide, and 184 800 deaths were reported from the disease.1 The prognosis for women with ovarian cancer treated in oncology centres is better than for those managed in other settings.2 3 4 5 Methods such as risk prediction models are needed to reliably estimate the likelihood that a mass is malignant so that patients can receive the optimal treatment. Risk prediction models can be used to individualise patient management, such as setting priorities on waiting lists for further investigations and specialist consultations, and deciding whether patients need surgery performed by surgeons who specialise in oncological surgery or whether surgery is not required. Adnexal masses judged to be benign can be safely managed with follow-up.6 If a benign mass causes symptoms it can be surgically removed in a local centre; however, if malignancy is suspected the mass should be managed in an oncological referral centre.

Ultrasound based diagnostic models can be used to predict malignancy in adnexal masses. A commonly used model is the risk of malignancy index (RMI), which was developed in 1990.7 Newer models are the International Ovarian Tumour Analysis (IOTA) models: logistic regression model 1 (LR1), logistic regression model 2 (LR2), simple rules, simple rules risk model (SRRisk), and assessment of different neoplasias in the adnexa (ADNEX).8 9 10 11 The performance of the IOTA models has been externally validated and compared on thousands of patients.12 13 14 However, only three validation studies reported on model calibration, that is, the agreement between predicted malignancy risk and observed proportion of malignancy; only one compared the clinical utility of different prediction models for referring patients with an adnexal mass to an oncology centre.15 16 17 18 More importantly, all model development and validation studies, except for two small single centre validation studies,19 20 recruited patients who underwent surgery with histology as the reference standard. Therefore, current evidence is limited to patients for whom the decision to operate had already been made. To select the optimal treatment, models should perform well on patients who undergo surgery and on those who are managed conservatively.

The primary aim of this study was to evaluate the performance of the IOTA models and RMI when applied on all adnexal masses irrespective of management, that is, conservative or surgical. The secondary aim was to evaluate model performance in clinically relevant subgroups.

Methods

Study design and participants

This study was conducted by using interim data from the IOTA phase 5 study (IOTA5), an international multicentre prospective cohort study.6 IOTA5 recruited consecutive patients with an adnexal mass examined by using transvaginal ultrasound, irrespective of whether patients subsequently underwent surgery or were managed conservatively with follow-up visits. Appendix 1 presents the IOTA5 protocol. IOTA5 recruitment took place from January 2012 to October 2016, but follow-up will continue until all patients receiving conservative management have been followed up for at least five years. The current interim analysis includes patients recruited until 1 March 2015 and follow-up data until 30 June 2017. Thirty six centres in 14 countries recruited patients to the study. The contributing centres were either oncology referral centres (tertiary centres with a specific gynaecological oncology unit) or other types of centre. We obtained approval from the ethics committee of the University Hospitals Leuven as the coordinating centre (B32220095331/S51375) and the local ethics committee of each contributing centre. We report the study according to the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) guidelines.21

Patients were eligible if they were aged 18 or older at recruitment and presented with at least one adnexal mass (ovarian, para-ovarian, or tubal) on ultrasound examination. Informed consent was obtained and then local clinicians examined patients following a standardised research protocol. Exclusion criteria were lesions presumed to be physiological if the largest diameter was less than 3 cm, refusal to provide informed consent, or withdrawal of informed consent. Pregnancy was not an exclusion criterion. We excluded patients if they had an adnexal mass that was already being followed up in the recruitment centre before the start of the study.

Procedures

The ultrasound examiners who recruited participants followed the standardised research protocol. They collected clinical information and performed a transvaginal ultrasound examination, and an abdominal scan if necessary. The examination consisted of scanning the uterus, both adnexa, and the whole pelvis outside these organs. Grey scale and colour or power Doppler ultrasound was used to characterise the morphology and vascularisation of the adnexal mass. Examiners collected information on several predefined ultrasound variables, including those used in the prediction models, and ultrasound results were described by using IOTA terminology.22

We had no requirements about the level of experience of the ultrasound examiners, but all examiners were IOTA trainers or had passed the IOTA certification test (https://www.iotagroup.org/certified-members). Ultrasound examiners used subjective assessment of the ultrasound images to classify lesions as benign, borderline, or malignant, and specified the degree of certainty with which the classification was made (certain, probable, or uncertain). The presumed histology was registered according to a list of 18 predefined diagnoses. These diagnoses were based on knowledge of the typical ultrasound appearance of benign, borderline, and malignant lesions, and of different types of specific adnexal pathology.23 When examiners detected multiple masses, the dominant mass was defined as the mass with the most complex ultrasound morphology. If multiple masses had similar morphology, the largest mass or the mass that was most accessible with ultrasound was denoted dominant. We used the dominant mass in our statistical analyses. The ultrasound examiner suggested surgery or conservative management based on the ultrasound diagnosis and the patient’s symptoms. Ultimately, however, the treating clinician decided upon the management strategy together with the patient. Therefore, the suggested management and actual management might be different. We encouraged centres to measure the level of serum CA125 in all patients, but this was not a requirement for inclusion in the study. Measurement of CA125 was left to clinical judgment and local protocols.

Conservative management included ultrasound and clinical follow-up at intervals of three months, six months, and then every 12 months thereafter. At follow-up visits clinical information including symptoms was collected and an ultrasound examination was performed in the same manner as at the inclusion scan. Examiners collected data on several predefined ultrasound variables and suggested a diagnosis by subjectively assessing the ultrasound images. After one or more follow-up visits, some patients underwent surgery for a variety of reasons (eg, suspicion of malignancy or patient anxiety).6 For some patients, the mass resolved spontaneously during follow-up.

Each centre performed surgery and histological examination of surgically removed masses according to local protocols. We did not carry out a central pathology review because in a previous study we did not observe important differences in reported outcomes between local and central pathology reports.8 We classified malignant tumours according to the criteria recommended by the International Federation of Gynaecology and Obstetrics.24

Data cleaning

We collected patient level data by using a secure electronic platform developed for the study (IOTA5 Study Screen; astraia software, Munich, Germany). Patients automatically received a unique identifier upon enrolment. We encrypted all data communication to ensure data security. A team of biostatisticians and ultrasound examiners performed data cleaning. Data cleaning included sending queries to participating centres to retrieve missing information or to correct inconsistencies. Local centres used a standardised questionnaire (appendix 2) to accrue missing information by telephoning patients and managing clinicians.

For the primary analysis, we excluded centres that recruited fewer than 50 patients, those that recruited non-consecutively (focused only on patients who underwent surgery without follow-up, or only on patients managed conservatively), and those that provided poor follow-up information for more than 30% of patients (supplementary table 1, appendix 3).6 Poor follow-up information was defined as the absence of a study outcome (spontaneous resolution or histology based on surgery at any point during follow-up) and last follow-up visit less than ten months after inclusion.

Prediction models

We evaluated several ultrasound based prediction models: RMI, LR2, simple rules, SRRisk, ADNEX without CA125, and ADNEX with CA125 (table 1). Model predictions are based on information obtained at the inclusion scan and so are blinded to the outcome. RMI does not give an estimated risk, but a non-negative integer (0 or higher), with higher scores suggesting a higher likelihood of malignancy. LR2 and SRRisk calculate the risk that the tumour is malignant. ADNEX calculates the probability of five outcome categories: benign, borderline, stage I primary invasive ovarian malignancy, stage II-IV primary invasive ovarian malignancy, and metastasis in the adnexa from another primary tumour (eg, breast cancer). For ADNEX, one minus the probability of a benign tumour equals the estimated risk of malignancy. The simple rules classify tumours as benign, inconclusive, or malignant based on the presence of five typical ultrasound features of benign tumours and five typical ultrasound features of malignant tumours. The prediction is inconclusive when none of the 10 features is present, or when a mixture of benign and malignant features is present. Here, we add inconclusive tumours to those predicted to be malignant, resulting in a binary classifier. Appendix 4 gives details of predictors and model formulas.
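The classification logic described above can be illustrated with a minimal Python sketch; it is not the implementation used in the study. The function names and dictionary keys are hypothetical. The sketch assumes that a mass is called benign when only benign features are present and malignant when only malignant features are present, which is inferred from the description of inconclusive results. Appendix 4 remains the reference for the actual predictors and formulas.

```python
def simple_rules(n_benign_features: int, n_malignant_features: int) -> str:
    """Return 'benign', 'malignant', or 'inconclusive' (hypothetical helper).

    Inputs are the counts of benign and malignant ultrasound features
    observed in the mass (0 to 5 each).
    """
    if n_benign_features > 0 and n_malignant_features == 0:
        return "benign"
    if n_malignant_features > 0 and n_benign_features == 0:
        return "malignant"
    # No feature at all, or a mixture of benign and malignant features.
    return "inconclusive"


def simple_rules_binary(n_benign_features: int, n_malignant_features: int) -> bool:
    """True when the mass is treated as malignant; inconclusive counts as malignant."""
    return simple_rules(n_benign_features, n_malignant_features) != "benign"


def adnex_malignancy_risk(probabilities: dict) -> float:
    """Estimated risk of malignancy: one minus the probability of a benign tumour."""
    return 1.0 - probabilities["benign"]
```

For example, a mass with one benign and one malignant feature is inconclusive and therefore handled as malignant by the binary classifier.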

Table 1

Summary of diagnostic prediction models for ovarian malignancy

Outcomes

The reference standard describes the nature of the adnexal mass. The primary outcome was classification of tumours as benign or malignant. When patients had surgery, this classification was based on histology; when surgery was not performed, it was based on subjective assessment at inclusion and during follow-up until 12 (±2) months. We considered the outcome as uncertain when not enough information was available to make a reasonable classification of the mass as benign or malignant at inclusion. Table 2 shows detailed classification criteria. Pathologists were blinded to ultrasound predictor variables and model predictions, but might have received information on the subjective assessment by the ultrasound examiner when clinically relevant. Borderline tumours were classified as malignant.

Table 2

Definition of tumour outcome based on histology or clinical information

For a full evaluation of ADNEX, we used a multinomial reference standard describing the adnexal mass at inclusion as benign, borderline malignant, stage I primary ovarian malignancy, stage II-IV primary ovarian malignancy, or secondary metastatic malignancy (secondary outcome).

Statistical analysis

We followed a prespecified statistical analysis plan for this study. Appendix 5 presents the sample size determination for the IOTA5 study. We had missing values for CA125 and some outcomes were labelled uncertain and therefore missing. We used multiple imputation to address these missing values (appendix 5). The imputations were based on variables used as predictors in the models, and variables that are associated with CA125 or with outcome, or with their missingness. Our primary analysis included patients after multiple imputation of missing values.

We evaluated discrimination between benign and malignant tumours with the area under the receiver operating characteristic curve (AUC) for the risk prediction models and RMI. To account for variability in performance between centres (heterogeneity), we used meta-analysis of centre specific AUCs to obtain the overall AUC for each model. Heterogeneity was quantified using 95% prediction intervals, which indicate which AUC values can be expected when evaluating the model in a new centre.25 We used the DeLong method to calculate 95% confidence intervals for the difference in AUC between two models.26 For ADNEX, we calculated the AUC for each pair of tumour types.27
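As a reminder of what the AUC measures, the following minimal Python sketch computes it for a single centre as the proportion of malignant-benign pairs in which the malignant mass received the higher score, with ties counted as one half. The function name and example values are hypothetical, and the sketch does not cover the meta-analysis of centre specific AUCs, the prediction intervals, or the DeLong intervals described above.

```python
def auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores: model outputs (estimated risks or RMI values)
    labels: 1 for a malignant outcome, 0 for a benign outcome
    """
    malignant = [s for s, y in zip(scores, labels) if y == 1]
    benign = [s for s, y in zip(scores, labels) if y == 0]
    if not malignant or not benign:
        raise ValueError("need at least one malignant and one benign case")
    # Count pairs where the malignant case is ranked higher; ties count 0.5.
    concordant = sum(
        1.0 if m > b else 0.5 if m == b else 0.0
        for m in malignant
        for b in benign
    )
    return concordant / (len(malignant) * len(benign))
```

With scores 0.1, 0.4, 0.35, and 0.8 and outcomes benign, benign, malignant, and malignant, three of the four pairs are correctly ordered, giving an AUC of 0.75.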

We calculated sensitivity and specificity for prespecified thresholds for RMI, LR2, SRRisk, and ADNEX. For any threshold, patients with a result at or above the threshold were classified at high risk of malignancy. We compared RMI with other models by calculating sensitivity when fixing specificity at 90%, and specificity when fixing sensitivity at 90%. Overall sensitivity and specificity values were obtained by using a meta-analysis of centre specific results.
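The threshold rule above (a result at or above the threshold is classified as high risk) can be sketched in Python as follows. This is an illustrative sketch for a single set of patients; the function name and example values are hypothetical, and the meta-analysis of centre specific results is not reproduced.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called high risk.

    labels: 1 for a malignant outcome, 0 for a benign outcome
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one malignant and one benign case")
    return tp / (tp + fn), tn / (tn + fp)
```

The same function applies to estimated risks (for example, a 10% threshold) and to RMI values (for example, a threshold of 200), because only the comparison with the threshold matters.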

We assessed calibration of LR2, SRRisk, and ADNEX by calculating calibration intercept and slope, and used these values to generate centre specific and overall calibration curves. The calibration intercept assesses whether risks are generally overestimated (intercept <0) or underestimated (intercept >0). The calibration slope assesses whether risks are too extreme (slope <1) or too moderate (slope >1).28 When too extreme, low estimated risks are underestimated and high risks are overestimated. When too moderate, low risks are overestimated and high risks are underestimated. For RMI, we performed an analogous analysis to estimate the prevalence of malignancy conditional on the RMI value, and constructed centre specific and overall curves. For ADNEX, we assessed calibration for all five predicted outcomes.29
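For orientation, calibration intercept and slope are commonly defined through a logistic recalibration model. The notation below is an assumption made for illustration and is not taken from appendix 5: Y = 1 denotes a malignant outcome, \hat{p} the risk estimated by the model, a the calibration intercept, and b the calibration slope.

```latex
\mathrm{logit}\bigl(P(Y = 1)\bigr) = a + b \cdot \mathrm{logit}(\hat{p}),
\qquad
\mathrm{logit}(p) = \ln\frac{p}{1 - p}
```

In this common formulation the slope b is estimated from the model as written, and the intercept a is estimated with b fixed at 1. Perfect calibration corresponds to a = 0 and b = 1, which matches the interpretation given above: a < 0 indicates overestimated risks, and b < 1 indicates risks that are too extreme.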

We assessed clinical utility by using decision curve analysis for risk thresholds between 5% and 50% to decide which patients should be referred to specialised oncological care. We report overall decision curves based on a meta-analysis of centre specific curves.30
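Decision curve analysis rests on the net benefit at a given risk threshold, which weighs true positives against false positives using the odds of the threshold. The following minimal Python sketch shows the standard net benefit formula for a single set of patients; the function names and example values are hypothetical, and the meta-analysis of centre specific curves is not reproduced.

```python
def net_benefit(risks, labels, threshold):
    """Net benefit of referring patients whose estimated risk is >= threshold.

    NB = TP/N - FP/N * threshold / (1 - threshold)
    labels: 1 for a malignant outcome, 0 for a benign outcome
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for r, y in zip(risks, labels) if r >= threshold and y == 1)
    fp = sum(1 for r, y in zip(risks, labels) if r >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_refer_all(labels, threshold):
    """Net benefit of the default strategy of referring every patient."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)
```

Evaluating `net_benefit` over a grid of thresholds (for example, 5% to 50%) and plotting the result against the threshold gives a decision curve; a model is useful at a threshold when its curve lies above both the refer-all curve and zero (refer none).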

We obtained overall AUCs and calibration curves for several prespecified subgroups: actual management (surgery within 120 days without any follow-up scan v at least one follow-up scan), management suggested by ultrasound examiner (surgery v conservative management with follow-up visits), menopausal status, and type of centre. Overall AUCs and calibration curves were computed for several prespecified sensitivity analyses: an analysis that excludes masses with uncertain outcome (U1-U4 in table 2); an analysis in which the definition of an uncertain outcome is expanded to include groups B2 and M2-M3 in table 2 (all groups in which subjective assessment of ultrasound images was used to classify outcomes as benign or malignant); and an analysis from all 36 centres of patients who underwent surgery within 120 days without any follow-up scan (not restricted to centres with high quality follow-up data). Appendix 5 presents details of the statistical analysis. The analysis was performed by using R version 3.5.1.

Patient and public involvement

Patients were not involved in the study design, definition of outcome measures, recruiting plans of the study, or interpretation of study results. We discussed the study with KanActief, a cancer rehabilitation patient group at the University Hospitals Leuven (https://www.uzleuven.be/nl/kanactief).

Results

In total, 98 ultrasound examiners at 36 centres recruited 8519 patients into the interim dataset of IOTA5 (supplementary table 1). After we applied the exclusion criteria (appendix 3) and data cleaning, our primary analysis consisted of 4905 patients recruited by 58 ultrasound examiners at 17 centres (fig 1, table 3).

Fig 1

Study flowchart. Criteria for excluding centres were fewer than 50 patients recruited, non-consecutive recruitment, or insufficient quality of follow-up data (appendix 3). Eleven of 20 oncology centres and 8 of 16 non-oncology centres were excluded. Supplementary table 1 gives details of excluded centres. IOTA5=International Ovarian Tumour Analysis phase 5 study

Table 3 Overview of 17 centres included in primary analysis. Data are number or number (row percentage)

The median age of the 4905 patients was 48 years (interquartile range 36-62, range 18-98), and 2151 patients (44%) were postmenopausal (table 4). Information on CA125 was missing in 2620 of the 4905 (53%) patients: 835 of 2579 (32%) missing values when surgery was suggested and 1785 of 2326 (77%) missing values when conservative management was suggested. The outcome was benign for 3441 (70%) patients, malignant for 978 (20%), and uncertain for 486 (10%) patients. The tumours in the current cohort manifested more benign ultrasound features compared with the development datasets of the different models, which were limited to patients who underwent surgery (supplementary table 2).

Table 4

Descriptive statistics for patients in primary analysis (n=4905)

The overall AUC was highest for ADNEX with CA125 (0.94, 95% confidence interval 0.92 to 0.96), ADNEX without CA125 (0.94, 0.91 to 0.95) and SRRisk (0.94, 0.91 to 0.95), and lowest for RMI (0.89, 0.85 to 0.92; fig 2). Differences in AUC between centres (heterogeneity) were largest for RMI, with a 95% prediction interval from 0.74 to 0.96 (supplementary figs 1-5). Supplementary table 3 provides 95% confidence intervals for the difference in AUC between models. At a risk threshold of 10%, ADNEX with CA125 had an overall sensitivity of 91% (95% confidence interval 85% to 95%) and specificity of 85% (81% to 89%). At a threshold of 200, RMI had an overall sensitivity of 60% (54% to 67%) and specificity of 95% (93% to 97%; supplementary tables 4-5). When overall specificity was fixed at 90%, SRRisk had the highest sensitivity (89%) and RMI the lowest (70%), while the sensitivity for ADNEX with CA125 was 87% (table 5). When overall sensitivity was fixed at 90%, ADNEX with CA125 had the highest specificity (87%) and RMI the lowest (69%; table 5). The simple rules model had an overall sensitivity of 90% (86% to 94%) and a specificity of 87% (83% to 91%).

Fig 2

Summary forest plot with overall area under the receiver operating characteristic curve (AUC) for each model. ADNEX=assessment of different neoplasias in the adnexa; LR2=logistic regression model 2; PI=prediction interval; RMI=risk of malignancy index; SRRisk=simple rules risk model

Table 5

Sensitivity (at 90% specificity) and specificity (at 90% sensitivity) for all prediction models

ADNEX with CA125 had an overall calibration intercept of 0.19 (95% confidence interval −0.01 to 0.40) and a slope of 1.11 (0.98 to 1.25). Risk estimates were slightly underestimated (fig 3). Calibration of SRRisk was marginally better and that of LR2 was poorer. We observed heterogeneity between centres for calibration for all models, with least heterogeneity for ADNEX with CA125 (supplementary figs 6-10). The overall calibration curve for RMI (supplementary fig 11) indicated that the commonly used threshold of 200 corresponded to a risk of malignancy of 45-50% on average. Supplementary figs 12-16 present histograms of the predictions of RMI and the risk prediction models.

Fig 3

Summary figure with overall calibration curves for risk prediction models. ADNEX=assessment of different neoplasias in the adnexa; intercept=calibration intercept; LR2=logistic regression model 2; RMI=risk of malignancy index; slope=calibration slope; SRRisk=simple rules risk model

SRRisk and ADNEX with CA125 had the best overall utility to select patients for referral to a gynaecological oncology centre (fig 4). RMI at a threshold of 200 had the lowest clinical utility of all the models.

Fig 4

Overall decision curves for risk prediction models and RMI. Higher net benefit implies higher clinical utility (the higher the curve, the better the clinical utility at the chosen risk threshold).18 30 ADNEX=assessment of different neoplasias in the adnexa; LR2=logistic regression model 2; RMI=risk of malignancy index; SRRisk=simple rules risk model

For the ADNEX model with CA125, the most difficult distinctions were between borderline and stage I primary ovarian malignancy (AUC 0.77), between stage I primary ovarian malignancy and secondary metastatic cancer (AUC 0.75), and between stage II-IV primary ovarian malignancy and secondary metastatic cancer (AUC 0.78; supplementary table 6). AUCs ranged from 0.90 to 0.98 when distinguishing between benign tumours and malignant subtypes. Omitting CA125 from ADNEX mainly affected discrimination between stage II-IV and stage I primary ovarian malignancy (AUC was 0.81 when CA125 was included v 0.72 when CA125 was not included), and between stage II-IV primary ovarian malignancy and secondary metastatic malignancy (AUC 0.78 v 0.66). Calibration of the estimated risks for the five tumour subtypes was good for ADNEX irrespective of whether or not CA125 was included as a predictor (supplementary figs 17-18).

In every subgroup, RMI had the lowest overall AUC and ADNEX with CA125 the highest overall AUC (fig 5, supplementary table 7). Among the 1958 patients with at least one follow-up scan, the overall AUC was 0.76 for RMI and ranged from 0.87 to 0.89 for the IOTA models. Because of the low malignancy rate (2%) in this subgroup, the confidence interval around the AUC was wide. Calibration analysis indicated that the risk of malignancy was overestimated in this subgroup (supplementary figs 19-34). The results obtained in the sensitivity analyses were similar to those in the primary analysis (supplementary figs 35-43).

Fig 5

Summary forest plots of overall area under the receiver operating characteristic curve (AUC) for prespecified subgroups. Prediction intervals could not be calculated for two subgroups because the number of malignant outcomes for each centre was too small for meta-analysis to be possible. ADNEX=assessment of different neoplasias in the adnexa; LR2=logistic regression model 2; PI=prediction interval; RMI=risk of malignancy index; SRRisk=simple rules risk model

Discussion

Principal findings

This study is a comprehensive evaluation of RMI and IOTA models when applied to all patients presenting with an adnexal mass, irrespective of whether they received surgical or conservative management. ADNEX with CA125, ADNEX without CA125, and SRRisk were the best models to distinguish between benign and malignant adnexal masses. These models were reasonably well calibrated overall, and had the highest clinical utility. RMI had the lowest AUC and clinical utility. Performance varied between centres for all models, but it varied most for RMI. In every prespecified subgroup, ADNEX with CA125 had the highest AUC and RMI the lowest AUC.

All models, in particular RMI, had poorer discriminative ability in patients who were managed conservatively than in those who underwent surgery. This difference could be because masses managed conservatively are more homogeneous than those removed surgically. Most masses managed conservatively probably manifested clearly benign ultrasound signs, few were malignant, and most malignancies managed conservatively were borderline tumours, with ultrasound features that could be confused with those of benign tumours.31 32 All models overestimated the risk of malignancy in patients who were managed conservatively, which was expected because none of the models was developed for this population.

Strengths and limitations of the study

Our study has several strengths. We included patients irrespective of whether they were managed surgically or conservatively, and the large sample size allowed us to evaluate models in clinically relevant subgroups. Additionally, we recruited from many centres in different countries, used a large number of ultrasound examiners, and implemented a rigorous prospective protocol with agreed ultrasound terms, definitions, and measurement technique.22 Finally, we evaluated the calibration and clinical utility of all the models.

The first limitation is that several centres had to be excluded because of non-consecutive recruitment or insufficient quality of follow-up data. However, a similar proportion of oncology centres (11/20) and non-oncology centres (8/16) were excluded. The second, inevitable limitation is that our reference standard is based on two different methods: histology or results of clinical and ultrasound follow-up.33 Follow-up information is probably less accurate than histology when assigning the outcome; for some patients the outcome was partly based on subjective assessment at inclusion. We limited the risk of bias in our primary analysis by using multiple imputation to assign an outcome when clinical and ultrasound information was insufficient or inconsistent (n=486, 10% of all cases). Excluding uncertain outcomes might have induced bias because the assumption could be made that patients who do not undergo surgery without delay are more likely to have a benign tumour. Our two sensitivity analyses, one excluding all uncertain cases (U1-U4 in table 2) and one using imputation to assign an outcome when a broader definition of uncertain outcome was applied (B2, M2-M3, and U1-U4 in table 2; n=1419, 31% of all cases), showed similar results to those in our primary analysis. The third limitation is that CA125 values were missing in a substantial number of patients. We addressed the missing values using multiple imputation.34 Imputing missing values multiple times acknowledges that we are uncertain about the true value. The observation that the ranking of models in terms of AUC was the same in women who underwent surgery (low proportion of missing CA125 values) and in women who received conservative management (high proportion of missing CA125 values) suggests that our results are robust.
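To show how multiple imputation carries uncertainty about the missing values into the final result, the sketch below pools one estimate across imputed datasets with Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance adds the between-imputation variance to the average within-imputation variance. The function name and the numbers are illustrative assumptions. The sketch also pools on the original scale for simplicity, whereas analyses of AUCs often pool on a transformed scale such as the logit; it is not a description of the study's analysis code.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one estimate across m imputed datasets using Rubin's rules.

    estimates -- the estimate obtained in each imputed dataset
    variances -- the squared standard error of each of those estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up illustrative values for three imputed datasets.
toy_estimates = [0.90, 0.92, 0.94]
toy_variances = [0.0004, 0.0004, 0.0004]
pooled_estimate, total_variance = pool_rubin(toy_estimates, toy_variances)
```

The between-imputation term is what makes the pooled standard error wider than that of any single imputed dataset, reflecting that the true values are unknown.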

Comparison with other studies

Two small single centre studies have evaluated IOTA models on all patients irrespective of management. Nunes and colleagues (n=489) evaluated sensitivity and specificity of LR2 using a single risk cut-off point.19 Pereira and colleagues (n=170) evaluated the clinical utility of simple rules and SRRisk.20 Other published validation studies were limited to patients who underwent surgery. These studies showed that IOTA models distinguished better between benign and malignant adnexal masses than RMI, and that ADNEX might be the best performing model.12 13 14 16 35 36 37 38 The studies also showed that SRRisk and ADNEX had good overall calibration (the authors did not report centre specific results) and better clinical utility than other models, including RMI, to refer patients to an oncology centre.10 11 17 18 To distinguish between benign and malignant masses in patients who underwent surgery, the following AUCs have been reported for ADNEX with CA125: 0.94 in the original study, between 0.91 and 0.97 in external validation studies, and 0.92 in the subgroup that underwent surgery in the current study.10 17 37 38 39 40 41 42

Implications for practice

In our study, the ADNEX models and SRRisk were the best performing models and were similar in terms of discrimination, calibration, and clinical utility. However, we would recommend ADNEX rather than SRRisk for several reasons. Firstly, ADNEX uses only simple and robust ultrasound variables, and so less experience should be needed for correct use of this model compared with SRRisk. Secondly, ADNEX estimates the likelihood of five different tumour types. This information could help when deciding which investigations to perform (investigations would differ if a metastasis is suspected rather than primary ovarian cancer), which surgical strategy to choose (fertility sparing surgery could be considered if a borderline tumour is suspected), and how long the waiting time should be for the operation. The likelihood of different tumour types can also help when deciding on the appropriate skills of the surgeon and estimating the duration of surgery. Other models do not provide this information.

Because CA125 results are often not available when the patient is examined with ultrasound, ADNEX without CA125 can be used as a first step to distinguish between benign and malignant tumours during scanning. If ADNEX without CA125 yields a high risk of malignancy, blood sampling with analysis of CA125 can be arranged so that the most likely type of malignancy can be estimated by using ADNEX with CA125. Including CA125 in ADNEX improves discrimination between the malignant tumour types, but has little effect on the discrimination between benign and malignant masses. For application in clinical practice, ADNEX is available as an app for iOS or Android and as a web calculator (https://iotagroup.org/iota-models-software/adnex-risk-model). Some ultrasound machines have built-in functionalities that allow ADNEX to be used while the patient is being scanned.
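The two step approach described above can be sketched as a simple decision rule. The function name, the returned labels, and in particular the 10% threshold are hypothetical choices made for illustration: no risk threshold is recommended here, and the ADNEX risk itself is taken as an input rather than computed.

```python
def next_step(risk_without_ca125, threshold=0.10):
    """Suggest the next step after ADNEX without CA125 has been applied during scanning.

    risk_without_ca125 -- estimated risk of malignancy (0 to 1) from ADNEX without CA125
    threshold          -- hypothetical cut-off for "high risk"; an assumption, not a recommendation
    """
    if not 0.0 <= risk_without_ca125 <= 1.0:
        raise ValueError("risk must lie between 0 and 1")
    if risk_without_ca125 >= threshold:
        # High risk: arrange blood sampling so that ADNEX with CA125 can be used
        # to estimate the most likely type of malignancy.
        return "measure CA125, then apply ADNEX with CA125"
    return "continue without CA125"


# Illustrative calls with made-up risks.
high = next_step(0.35)
low = next_step(0.03)
```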

A prerequisite for safe use of the IOTA models is basic ultrasound skills. Additionally, ultrasound examiners must have obtained the IOTA certificate (https://www.iotagroup.org/certified-members) to show that they are able to correctly use the IOTA terminology and measurement technique (http://www.iota.education).22 43 For safe use of ADNEX, examiners need to use the IOTA definitions for solid component, papillary projection, acoustic shadowing, and ascites.

Because of the consecutive recruitment of patients into our study, the substantial sample size, and the large number of ultrasound examiners and participating centres in different countries, our results could be generalisable to other patient populations. However, for all models heterogeneity existed between centres in terms of performance, in particular calibration. The heterogeneity can probably be explained by differences in tumour characteristics that are not captured by the predictors, and in tumour mix (that is, the relative proportion of benign, borderline, primary invasive malignancies, and metastases). Further research should focus on explaining and reducing this heterogeneity. The performance of the ADNEX model could be improved by updating it with data from both patients who are conservatively managed and those who have had surgery.

Conclusions and policy implications

Our study has shown that ADNEX with or without CA125 and SRRisk are the best models for distinguishing between benign and malignant tumours in patients presenting with an adnexal mass. Because ADNEX is preferable to SRRisk for practical reasons, the model should be recommended for characterising ovarian tumours. The next step is to achieve consensus on the risk thresholds to be used when deciding whether patients with an adnexal tumour should receive conservative management, surgery in a local centre, or be referred to a gynaecological oncology centre for further evaluation.
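The choice of risk threshold matters because clinical utility depends on it. One common way to express clinical utility at a given threshold is net benefit, which credits true positives and penalises false positives by the odds of the threshold. The sketch below assumes this standard definition; the function name, the toy data, and the 20% threshold are illustrative assumptions and are not proposed thresholds.

```python
def net_benefit(risks, outcomes, threshold):
    """Net benefit of acting on patients whose predicted risk is at or above the threshold.

    risks     -- predicted risks of malignancy, one per patient
    outcomes  -- 1 for malignant, 0 for benign (same order as risks)
    threshold -- risk threshold strictly between 0 and 1
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(risks)
    if n == 0 or n != len(outcomes):
        raise ValueError("risks and outcomes must be non-empty and of equal length")
    true_pos = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    false_pos = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


# Made-up illustrative data, not taken from the study.
toy_risks = [0.02, 0.10, 0.35, 0.80, 0.05, 0.60]
toy_outcomes = [0, 0, 1, 1, 0, 0]
toy_nb = net_benefit(toy_risks, toy_outcomes, threshold=0.20)
```

A lower threshold gives less weight to each false positive, which is why the preferred model and the preferred threshold need to be considered together.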

What is already known on this topic

  • Methods are needed to predict malignancy in ovarian tumours so that optimal management can be selected, such as watch and wait, surgery in a local hospital, or treatment in a gynaecological oncology centre

  • Existing prediction models are based only on data from patients who have undergone surgery

  • Model validation has mostly used data from patients who have had surgery and has rarely included assessment of calibration or clinical utility

What this study adds

  • Assessment of different neoplasias in the adnexa (ADNEX) with or without CA125 and simple rules risk model (SRRisk) are the best models for distinguishing between benign and malignant adnexal tumours

  • ADNEX with CA125 performed best when distinguishing between benign and malignant tumours in all subgroups (including patients managed conservatively)

  • ADNEX has practical advantages over SRRisk, including the ability to estimate risk of malignant subtypes

Acknowledgments

We thank everyone who provided assistance for monitoring the study and completing the database. In particular, we thank Bavo De Cock, Willem Mestdagh, Mona Aboulghar, Ulrike Metzger, Anna Knafel, Chiara Lanzani, Fatima Alves, Thierry Van den Bosch, Samir Helmy, Alberto Rossi, and Maria Angela Pascual. We also thank librarian Maria Björklund, Lund University, for helping with literature search.

Footnotes

  • Contributors: BVC, LV, ACT, TB, and DT conceived and designed the study. LV, WF, CL, ACT, PS, CVH, ED, RF, EE, DF, MJK, VC, JLA, FPGL, FB, MEC, SG, ND, LJ, LS, DF, ACz, JK, ACo, GS, IV, TB, and DT enrolled patients and acquired data. BVC, WF, CL, JC, and DT were involved in data cleaning. BVC, LV, WF, CL, JC, LW, and DT wrote the Statistical Analysis Plan. BVC and JC analysed the data, with support from LW. BVC, LV, WF, CL, JC, LW, TB, and DT were involved in data interpretation. BVC, LV, WF, CL, JC, and DT wrote the first draft of the manuscript. All authors critically reviewed and revised the draft. All authors approved the final version of the manuscript for submission. BVC and LV contributed equally. BVC, LV, and DT are the guarantors. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: The IOTA5 study is supported by the Research Foundation – Flanders (FWO) projects G049312N/G0B4716N/12F3114N, and Internal Funds KU Leuven (project C24/15/037). DT is a senior clinical investigator of FWO, LW is a postdoctoral fellow of FWO, and WF was a clinical fellow of FWO. CL was supported by Linbury Trust Grant LIN 260. TB is supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Imperial College Healthcare National Health Service (NHS) Trust and Imperial College London. The views expressed are those of the authors and not necessarily those of the NHS, NIHR, or Department of Health. LV is supported by the Swedish Research Council (grant K2014-99X-22475-01-3, Dnr 2013-02282), funds administered by Malmö University Hospital and Skåne University Hospital, Allmänna Sjukhusets i Malmö Stiftelse för bekämpande av cancer (the Malmö General Hospital Foundation for fighting against cancer), and two Swedish governmental grants (Avtal om läkarutbildning och forskning (ALF)-medel and Landstingsfinansierad Regional Forskning). The funders played no role in study design, data collection, data analysis, data interpretation, or reporting. The guarantors had full access to all the data in the study, take responsibility for the integrity of the data and the accuracy of the data analysis, and had final responsibility for the decision to submit for publication.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: grants from Research Foundation – Flanders (FWO), Internal Funds KU Leuven, Linbury Trust, NIHR Biomedical Research Centre, and Swedish Research Council for the submitted work; TB reports grants, personal fees, and travel support from Samsung Medison, travel support from Roche Diagnostics, and personal fees from GE Healthcare, all outside the submitted work; IV reports grants, personal fees and non-financial support from Roche NV, outside the submitted work; BVC and DT report consultancy work done by KU Leuven to help implementing and testing the ADNEX model in ultrasound machines by Samsung Medison and GE Healthcare, outside the submitted work; no other relationships or activities that could appear to have influenced the submitted work; no royalties or patents related to any of these models (neither for the authors nor for their institutions).

  • Ethical approval: Approval was obtained from the ethics committee of the University Hospitals Leuven (Ethische Commissie Onderzoek UZ/KU Leuven, https://www.uzleuven.be/nl/ethischecommissie/onderzoek) as the coordinating centre (B32220095331/S51375) and the local ethics committee of each contributing centre.

  • Data sharing: Follow-up data collection for the IOTA5 study is still ongoing. When the study has closed, and results have been published, the corresponding author may share data upon reasonable request (dirk.timmerman@uzleuven.be). R code used for the statistical analysis is available on GitHub (https://github.com/benvancalster/IOTA5modelvalidation2020).

  • The lead authors affirm that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

  • Dissemination to participants and related patient and public communities: We plan to disseminate and discuss the results with patient groups and societies for gynaecology and gynaecological oncology, and through both media organisations and social media. This will be highly valuable when it comes to implementation of models into clinical practice.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/.

References
