Development and validation of a decision support tool for the diagnosis of acute heart failure: systematic review, meta-analysis, and modelling study

Abstract
Objectives To evaluate the diagnostic performance of N-terminal pro-B-type natriuretic peptide (NT-proBNP) thresholds for acute heart failure and to develop and validate a decision support tool that combines NT-proBNP concentrations with clinical characteristics.
Design Individual patient level data meta-analysis and modelling study.
Setting Fourteen studies from 13 countries, including randomised controlled trials and prospective observational studies.
Participants Individual patient level data for 10 369 patients with suspected acute heart failure were pooled for the meta-analysis to evaluate NT-proBNP thresholds. A decision support tool (Collaboration for the Diagnosis and Evaluation of Heart Failure (CoDE-HF)) that combines NT-proBNP with clinical variables to report the probability of acute heart failure for an individual patient was developed and validated.
Main outcome measure Adjudicated diagnosis of acute heart failure.
Results Overall, 43.9% (4549/10 369) of patients had an adjudicated diagnosis of acute heart failure (73.3% (2286/3119) and 29.0% (1802/6208) in those with and without previous heart failure, respectively). The negative predictive value of the guideline recommended rule-out threshold of 300 pg/mL was 94.6% (95% confidence interval 91.9% to 96.4%); despite use of age specific rule-in thresholds, the positive predictive value varied at 61.0% (55.3% to 66.4%), 73.5% (62.3% to 82.3%), and 80.2% (70.9% to 87.1%), in patients aged <50 years, 50-75 years, and >75 years, respectively. Performance varied in most subgroups, particularly patients with obesity, renal impairment, or previous heart failure. CoDE-HF was well calibrated, with excellent discrimination in patients with and without previous heart failure (area under the receiver operator curve 0.846 (0.830 to 0.862) and 0.925 (0.919 to 0.932) and Brier scores of 0.130 and 0.099, respectively). In patients without previous heart failure, the diagnostic performance was consistent across all subgroups, with 40.3% (2502/6208) identified at low probability (negative predictive value of 98.6%, 97.8% to 99.1%) and 28.0% (1737/6208) at high probability (positive predictive value of 75.0%, 65.7% to 82.5%) of having acute heart failure.
Conclusions In an international, collaborative evaluation of the diagnostic performance of NT-proBNP, guideline recommended thresholds to diagnose acute heart failure varied substantially in important patient subgroups. The CoDE-HF decision support tool incorporating NT-proBNP as a continuous measure and other clinical variables provides a more consistent, accurate, and individualised approach.
Study registration PROSPERO CRD42019159407.


S2.1 Extreme gradient boosting model
XGBoost is a supervised machine learning technique initially proposed by Chen and Guestrin. 1 In brief, gradient boosting is an ensemble technique that iteratively improves model accuracy for regression and classification problems. The ensemble is built from sequential models, using decision trees as learners, in which each subsequent model attempts to correct the errors of the preceding models. 2 3 In the boosting method, individuals who were misclassified by the previous model are assigned a higher weight to increase their chance of being selected in subsequent models. Each model is fitted in a step-wise fashion to minimise a loss function, such as absolute error or squared error, which measures how far the predicted values differ from the true values. XGBoost refers to the re-engineering of gradient boosting to significantly improve the speed of the algorithm by pushing the limits of computational resources. The output of the XGBoost model is a probability that is computed by applying an inverse-logit transformation to the sum of the weights of the terminal nodes of the trained model.
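The final step described above, turning summed terminal-node weights into a probability, can be sketched in a few lines. This is a minimal illustration, not the CoDE-HF implementation: the function names, the `base_score` argument, and the example leaf weights are hypothetical.

```python
import math


def inverse_logit(z: float) -> float:
    """Map a log-odds value to a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(leaf_weights, base_score=0.0):
    """Sum the terminal-node weight selected in each tree, then apply the inverse logit.

    leaf_weights: one weight per tree (the leaf the patient falls into).
    base_score: hypothetical starting log-odds before any tree is added.
    """
    return inverse_logit(base_score + sum(leaf_weights))


# Hypothetical weights from three trees for a single patient.
example_weights = [0.40, -0.15, 0.25]
probability = predict_probability(example_weights)
```

Each tree contributes one weight on the log-odds scale, so positive weights push the predicted probability above 0.5 and negative weights pull it below.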
The mathematical formula for the gradient boosting model can be described as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \qquad (1)$$

where $f_k$ is a function that maps each variable vector $x_i$ ($x_i = \{x_1, x_2, \dots, x_n\}$, $i = 1, 2, \dots, N$) to the outcome $y_i$, $K$ is the number of Classification and Regression Trees (CART), and $F$ is the space of functions containing all CART. 5 XGBoost optimises an objective function of the form:

$$L = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \qquad (2)$$

where the first term is a loss function, $l$, which evaluates how well the model fits the data by measuring the difference between the prediction $\hat{y}_i$ and the outcome $y_i$. The second term, the regularisation term $\Omega$, is used by XGBoost to avoid overfitting by penalising the complexity of the model. Furthermore, to fully leverage the advantages of XGBoost, we tuned the hyper-parameters of the algorithm defined below through a grid search strategy using 10-fold cross-validation.
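To make the two parts of the objective concrete, the sketch below evaluates a data-fit term plus a complexity penalty for a toy ensemble. It rests on assumptions not stated in this supplement: a logistic loss for the binary outcome, and the penalty form given by Chen and Guestrin (a cost `gamma` per terminal node plus `lam` times half the sum of squared leaf weights). All function and parameter names are illustrative.

```python
import math


def logistic_loss(y: int, raw_score: float) -> float:
    """Negative log-likelihood of a binary outcome y given a log-odds score (assumed loss)."""
    z = raw_score if y == 1 else -raw_score
    # Stable evaluation of log(1 + exp(-z)).
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)


def tree_penalty(leaf_weights, gamma: float, lam: float) -> float:
    """Complexity penalty for one tree: gamma per leaf plus an L2 penalty on leaf weights."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, raw_scores, trees, gamma=0.0, lam=1.0) -> float:
    """Regularised objective: summed loss over patients plus summed penalty over trees."""
    data_term = sum(logistic_loss(y, s) for y, s in zip(y_true, raw_scores))
    penalty_term = sum(tree_penalty(t, gamma, lam) for t in trees)
    return data_term + penalty_term


# Toy example: two patients, one tree with two leaves (all values hypothetical).
value = objective(y_true=[1, 0], raw_scores=[0.0, 0.0],
                  trees=[[1.0, -1.0]], gamma=0.5, lam=2.0)
```

A larger `gamma` or `lam` raises the cost of adding leaves or of large leaf weights, which is how the penalty discourages overly complex trees.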
The hyper-parameter values for the model in patients without prior heart failure were: the number of iterations (trees) was set to 81, the learning rate (shrinkage parameter applied to each tree in the expansion) was set to 0.1, the interaction depth (maximum depth of each tree, expresses the highest level of variable interactions allowed) was set to 8, the minimum number of observations in the terminal nodes was set to 5.7, the fraction of the training set observations randomly selected for each subsequent tree was set to 0.55 and the fraction of variables randomly sampled for each tree was set to 0.84.
The hyper-parameter values for the model in patients with prior heart failure were: the number of iterations (trees) was set to 69, the learning rate (shrinkage parameter applied to each tree in the expansion) was set to 0.06, the interaction depth (maximum depth of each tree, expresses the highest level of variable interactions allowed) was set to 5, the minimum number of observations in the terminal nodes was set to 6.7, the fraction of the training set observations randomly selected for each subsequent tree was set to 0.73 and the fraction of variables randomly sampled for each tree was set to 0.62.
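For readers who wish to reproduce the configuration, the tuned values above could be recorded as shown below. The key names follow the XGBoost scikit-learn interface and are an assumption; the supplement describes the hyper-parameters in words only. In particular, mapping "minimum number of observations in the terminal nodes" to `min_child_weight` is an interpretation, and `select_params` is a hypothetical helper.

```python
# Tuned values as reported in the text; the parameter names are assumed, not stated in the source.
PARAMS_WITHOUT_PRIOR_HF = {
    "n_estimators": 81,        # number of iterations (trees)
    "learning_rate": 0.1,      # shrinkage applied to each tree
    "max_depth": 8,            # interaction depth
    "min_child_weight": 5.7,   # assumed mapping for minimum observations in terminal nodes
    "subsample": 0.55,         # fraction of training observations sampled per tree
    "colsample_bytree": 0.84,  # fraction of variables sampled per tree
}

PARAMS_WITH_PRIOR_HF = {
    "n_estimators": 69,
    "learning_rate": 0.06,
    "max_depth": 5,
    "min_child_weight": 6.7,
    "subsample": 0.73,
    "colsample_bytree": 0.62,
}


def select_params(has_prior_heart_failure: bool) -> dict:
    """Return a copy of the settings for the model matching the patient's history."""
    source = PARAMS_WITH_PRIOR_HF if has_prior_heart_failure else PARAMS_WITHOUT_PRIOR_HF
    return dict(source)
```

Keeping the two settings side by side makes it explicit that separate models were fitted for patients with and without prior heart failure.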

S2.1.1 Relative feature importance plots
a) Relative feature importance plot for the model developed for patients without prior heart failure. b) Relative feature importance plot for the model developed for patients with prior heart failure

S2.2 Generalised linear mixed model
A study identifier was included as a random effects variable whilst all other variables (NT-proBNP, age, estimated glomerular filtration rate, hemoglobin, body mass index, heart rate, blood pressure, peripheral edema, prior history of heart failure, chronic obstructive pulmonary disease and ischemic heart disease) were fitted as fixed effects variables. Because NT-proBNP concentrations are positively skewed, we used a logarithmic transformation in the model. We further evaluated non-linear relationships between continuous variables and the diagnosis using multivariable fractional polynomial methods. 4 The model was developed using the R package 'lme4' (https://cran.r-project.org/web/packages/lme4).

S2.3 Naïve Bayes
Naive Bayes (NB) 5 is a supervised machine learning algorithm based on Bayes' Theorem with a "naive" assumption of independence among features. In brief, an NB algorithm assumes no relationship between features. We used a kernel density estimation function, rather than assuming a parametric distribution for each continuous feature, to achieve higher accuracy. The algorithm was developed using the R package 'naivebayes' (https://cran.r-project.org/web/packages/naivebayes).
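The idea of combining the independence assumption with kernel density estimates can be illustrated with a small, self-contained sketch. It is written in Python rather than R, uses a Gaussian kernel with a fixed bandwidth (both assumptions), and runs on made-up toy data; it does not reproduce the 'naivebayes' package or the study model.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def fit_kde_naive_bayes(rows, labels, bandwidth=1.0):
    """Estimate a class prior and one kernel density per (class, feature) pair."""
    model = {}
    for cls in set(labels):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        prior = len(cls_rows) / len(rows)
        densities = [gaussian_kde([r[j] for r in cls_rows], bandwidth)
                     for j in range(len(rows[0]))]
        model[cls] = (prior, densities)
    return model


def predict_proba(model, row):
    """Posterior per class: prior times the product of per-feature densities, normalised."""
    log_scores = {}
    for cls, (prior, densities) in model.items():
        score = math.log(prior)
        for value, density in zip(row, densities):
            score += math.log(density(value) + 1e-300)  # guard against log(0)
        log_scores[cls] = score
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}


# Toy data with two hypothetical continuous features per patient.
toy_rows = [[1.0, 2.0], [1.5, 1.8], [5.0, 6.0], [5.5, 6.2]]
toy_labels = [0, 0, 1, 1]
toy_model = fit_kde_naive_bayes(toy_rows, toy_labels, bandwidth=1.0)
posterior = predict_proba(toy_model, [1.2, 2.1])
```

Multiplying per-feature densities is exactly where the "naive" independence assumption enters; the kernel estimate only changes how each single-feature density is obtained.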

S2.4 Random forest
Random Forest (RF) 6 is a supervised machine learning algorithm. It is an ensemble technique that combines a large number of decision trees using a bagging approach to improve overall performance. In brief, the bagging approach grows multiple classification trees in parallel, and each tree produces a classification, which is called a vote. These votes are then aggregated to provide a more accurate and stable prediction.
We tuned the RF hyper-parameters during the development of this model through a grid search strategy using 10-fold cross-validation. The hyper-parameters tuned were the number of trees in the forest, the number of variables randomly sampled as candidates at each split, the maximum depth of the tree and the minimum number of samples required to split an internal node.
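The bagging and voting steps described in this section can be sketched as follows. For brevity the sketch uses single-split trees (decision stumps) instead of full trees and omits the random sampling of variables at each split, so it illustrates bagging only; the data, function names, and seed are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw a sample of the same size as the data, with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def fit_stump(rows, labels):
    """Fit a one-split tree: pick the feature, threshold and direction with fewest errors."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            for above in (0, 1):  # label predicted when the value exceeds the threshold
                errors = sum((above if r[j] > threshold else 1 - above) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, above)
    _, j, threshold, above = best
    return lambda row: above if row[j] > threshold else 1 - above


def fit_bagged_ensemble(rows, labels, n_trees=25, seed=0):
    """Grow each tree on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(rows, labels, rng)) for _ in range(n_trees)]


def predict_by_vote(trees, row):
    """Aggregate one vote per tree; return the majority label and its vote share."""
    votes = Counter(tree(row) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)


# Toy data: one hypothetical feature that separates the two classes.
toy_rows = [[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
forest = fit_bagged_ensemble(toy_rows, toy_labels)
low_label, low_share = predict_by_vote(forest, [0.5])
high_label, high_share = predict_by_vote(forest, [8.5])
```

The vote share returned alongside the label can be read as a rough probability, which is one way an ensemble of classification trees yields a graded rather than a binary output.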

A) Receiver operator curve for all statistical models in the patients without prior heart failure
B) Receiver operator curve for all statistical models in the patients with prior heart failure
Whilst the performance of XGBoost was similar to that of the generalised linear mixed model, a key advantage of XGBoost is its ability to compute a score despite missing values. This is a critical functionality for the application of the CoDE-HF decision support tool in clinical practice because clinicians may not always have all information available to them during the initial clinical encounter in the Emergency Department.

III. Supplementary Tables
Supplementary Table A

Bahrmann et al, 2015 7
Inclusion criteria: All consecutive non-trauma patients aged ≥70 years who were admitted to the Emergency Department.
Exclusion criteria: Patients with acute ST-elevation myocardial infarction, planned elective coronary revascularisation, hospitalization for unstable angina within the preceding 2 months, coronary-artery bypass grafting or percutaneous transluminal angioplasty within the preceding 3 months. Patients were also excluded if they had renal failure requiring dialysis, trauma with suspected myocardial contusion, life expectancy <6 months, or if they did not consent to providing a blood sample for use by the research team.

Behnes et al, 2009 8
Inclusion criteria: Consecutive patients presenting with symptoms of acute dyspnoea and/or peripheral oedema in the Emergency Department.
Exclusion criteria: Patients suffering from severe renal disease (defined as serum creatinine level greater than 2.8 mg/dl) or anemia (hemoglobin concentrations below 8.0 g/dl) were excluded. Further exclusion criteria were obvious traumatic causes of dyspnea, pregnancy, a status after immediate cardiopulmonary resuscitation, participation in another clinical trial and age under 18 years.

Bombelli et al, 2015 9
Inclusion criteria: Consecutive patients aged 80 years or more evaluated in the Emergency Department in whom NT-proBNP was measured.

Chenevier-Gobeaux et al, 2005 10
Inclusion criteria: Consecutive patients presenting to the Emergency Department with dyspnoea.

deFilippi et al, 2007 11
Inclusion criteria: Consecutive patients presenting to the Emergency Department with dyspnoea and who underwent measurement of a natriuretic peptide at presentation.
Exclusion criteria: Patients younger than 18 years or in whom there was inadequate clinical information recorded to assess the aetiology of dyspnoea were excluded.

Gargani et al, 2008 12
Inclusion criteria: Patients with dyspnoea at admission as reported on the case history, had a venous blood sample taken for NT-proBNP analysis on the day of admission, underwent assessment for ultrasound lung comets performed within 4 h of the NT-proBNP measurement and did not receive diuretic therapy between the two measurements.

Ibrahim et al, 2017 13
Inclusion criteria: Shortness of breath as the primary complaint triggering presentation to the Emergency Department.
Exclusion criteria: Age under 21 years, shortness of breath related to trauma, and current renal haemodialysis were exclusion criteria.

Januzzi et al, 2006 14
Inclusion criteria: Dyspnoeic Emergency Department patients.

Maisel et al, 2010 15
Inclusion criteria: Patients reporting shortness of breath as primary complaint in the Emergency Department.
Exclusion criteria: Patients <18 years of age, unable to provide consent, had an acute ST-segment elevation myocardial infarction, were receiving haemodialysis, or had renal failure.

Moe et al, 2007 16
Inclusion criteria: Consecutive patients >18 years of age presenting to the Emergency Department with dyspnoea of suspected cardiac origin.
Exclusion criteria: Patients with advanced renal failure (serum creatinine >250 micromol/L), acute myocardial infarction, malignant disorders, and dyspnoea from clinically overt origins, including pneumothorax and chest wall trauma.

Mueller et al, 2005 17
Inclusion criteria: Consecutive patients presenting with dyspnoea as chief complaint to the Emergency Department.
Exclusion criteria: Patients with ST elevation myocardial infarction, non-ST elevation myocardial infarction, or acute coronary syndrome troponin positive and trauma patients.

Nazerian et al, 2010 18
Inclusion criteria: Convenience sample of patients presenting to the Emergency Department with acute dyspnoea as the main symptom.
Exclusion criteria: Patients with trauma, ST-elevation myocardial infarctions, or dyspnoea clearly caused by something other than heart failure, such as pneumothorax, were excluded. Patients were also excluded if they had received intravenous therapy in the Emergency Department before echocardiogram and NT-proBNP were performed. Patients who met the inclusion criteria were invited to participate in the study. Echocardiogram was performed in all patients who met the inclusion criteria. However, if the investigator judged that both left ventricular (LV) ejection fraction and pulsed Doppler analysis of mitral inflow were not interpretable due to a very poor acoustic window, the patient was excluded from the study because echocardiogram was not feasible.

Rutten et al, 2008 19
Inclusion criteria: Patients were eligible if they presented with acute dyspnoea as their most prominent complaint.
Exclusion criteria: Patients with acute dyspnoea due to trauma or cardiogenic shock and patients with renal failure requiring haemodialysis or peritoneal dialysis were excluded.

Wussler et al, 2019 20
Inclusion criteria: Adult patients presenting with acute dyspnoea to the Emergency Department.
Exclusion criteria: Patients with terminal kidney failure requiring haemodialysis were excluded.

Author, year Diagnostic adjudication for acute heart failure Risk of bias (QUADAS-2)

Bahrmann et al, 2015 7
Diagnostic adjudication: Independent adjudication by two cardiologists based on the definition of the ESC guideline. They reviewed all available medical records of the index hospital stay, including the clinical history findings from the physical examination, results of laboratory tests (excluding NT-proBNP), radiographic studies, ECG, and echocardiography.

[Study label missing in source]
Diagnostic adjudication: Independent adjudication by an ED specialist and a cardiologist. They were blinded to NT-proBNP measurements but could access medical records, case report forms, and other test results including cardiac imaging as available.

Moe et al, 2007 16
Diagnostic adjudication: Independent adjudication by two cardiologists. They were provided with hospital records, including the discharge summary, results of laboratory and radiographic testing, echocardiograms if performed, clinical notes from the time of ED presentation to the 60-day follow-up, and outcome of a telephone interview. Using all available data, the cardiologists assigned a diagnosis without knowledge of the NT-proBNP results.

[Study label missing in source]
Diagnostic adjudication: Independent adjudication by two cardiologists and one respiratory physician, blinded to echocardiogram and NT-proBNP results. The reviewers had access to ED records, clinical notes, components and summary of the Framingham Heart Study Criteria and any additional information that became available during hospital stay.
Risk of bias (QUADAS-2): Patient selection: high; index test: low; reference standard: low; flow and timing: low; Overall: high.

Rutten et al, 2008 19
Diagnostic adjudication: Consensus between two clinicians in internal medicine, pulmonology or cardiology.

[Study label missing in source]
Diagnostic adjudication: Adjudicated by 2 independent cardiologist-internists who had access to all patients' medical records, including clinical history, physical examination, 12-lead electrocardiograms, laboratory findings, chest radiographs, echocardiograms, lung function test results, computed tomography scans, and response to therapy, as well as autopsy data for patients who died in the hospital.

Supplementary Figure M. Decision curve analysis for CoDE-HF versus NT-proBNP alone
The decision curve analysis presents the net benefit of the CoDE-HF score and NT-proBNP alone in comparison to hypothetical default approaches to diagnose all patients or no patients with acute heart failure. Net benefit for each approach is calculated across a range of possible threshold probabilities. 21 22 Threshold probability is defined in this context as the minimum probability at which a diagnosis and treatment for acute heart failure is likely to be beneficial for patients. Net benefit is calculated using the following formula:

$$\text{Net benefit} = \frac{\text{true positives}}{N} - w \times \frac{\text{false positives}}{N}$$

where $N$ is the total number of patients and $w$ is the odds at the threshold probability $p_t$, that is, $w = p_t / (1 - p_t)$.
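As a worked illustration of the net benefit calculation, the sketch below uses the standard decision curve definition: true positives per patient minus false positives per patient weighted by the odds at the threshold probability. The function names and the toy outcomes and probabilities are hypothetical and are not study data.

```python
def threshold_odds(threshold: float) -> float:
    """Odds at the threshold probability, w = p_t / (1 - p_t)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return threshold / (1.0 - threshold)


def net_benefit(y_true, y_prob, threshold: float) -> float:
    """Net benefit of diagnosing every patient whose predicted probability meets the threshold."""
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return true_pos / n - threshold_odds(threshold) * false_pos / n


def net_benefit_diagnose_all(y_true, threshold: float) -> float:
    """Default approach of diagnosing all patients (diagnosing none has net benefit 0)."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - threshold_odds(threshold) * (1.0 - prevalence)


# Toy example: 8 hypothetical patients, 3 with acute heart failure.
toy_outcomes = [1, 1, 1, 0, 0, 0, 0, 0]
toy_probabilities = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.1, 0.05]
model_nb = net_benefit(toy_outcomes, toy_probabilities, threshold=0.5)
all_nb = net_benefit_diagnose_all(toy_outcomes, threshold=0.5)
```

At a threshold of 0.5 the odds equal 1, so each false positive cancels one true positive; at lower thresholds false positives are penalised less, which is why the curves are plotted across a range of thresholds.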