External validation and comparison of three prediction tools for risk of osteoporotic fractures using data from population based electronic health records: retrospective cohort study
BMJ 2017; 356 doi: https://doi.org/10.1136/bmj.i6755 (Published 19 January 2017) Cite this as: BMJ 2017;356:i6755
- Noa Dagan, chief data officer, and PhD student1 2,
- Chandra Cohen-Stavi, chief scientific writing officer1,
- Maya Leventer-Roberts, deputy director, and adjunct assistant professor1 3,
- Ran D Balicer, director, and associate professor1 4
- 1Clalit Research Institute, Chief Physician’s Office, Clalit Health Services, Tel Aviv, Israel
- 2Computer Science Department, Ben Gurion University of the Negev, Be’er Sheba, Israel
- 3Department of Preventive Medicine and Department of Pediatrics, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- 4Epidemiology Department, Ben Gurion University of the Negev, Be’er Sheba, Israel
- Correspondence to: N Dagan
- Accepted 8 December 2016
Objective To directly compare the performance and externally validate the three most studied prediction tools for osteoporotic fractures—QFracture, FRAX, and Garvan—using data from electronic health records.
Design Retrospective cohort study.
Setting Payer provider healthcare organisation in Israel.
Participants 1 054 815 members aged 50 to 90 years for comparison between tools and cohorts of different age ranges, corresponding to those in each tool’s development study, for tool specific external validation.
Main outcome measure First diagnosis of a major osteoporotic fracture (for QFracture and FRAX tools) and hip fractures (for all three tools) recorded in electronic health records from 2010 to 2014. Observed fracture rates were compared to probabilities predicted retrospectively as of 2010.
Results The observed five year hip fracture rate was 2.7% and the rate for major osteoporotic fractures was 7.7%. The areas under the receiver operating characteristic curve (AUC) for hip fracture prediction were 82.7% for QFracture, 81.5% for FRAX, and 77.8% for Garvan. For major osteoporotic fractures, AUCs were 71.2% for QFracture and 71.4% for FRAX. All the tools underestimated the fracture risk, but the average observed to predicted ratios and the calibration slopes of FRAX were closest to 1. Tool specific validation analyses yielded hip fracture prediction AUCs of 88.0% for QFracture (among those aged 30-100 years), 81.5% for FRAX (50-90 years), and 71.2% for Garvan (60-95 years).
Conclusions Both QFracture and FRAX had high discriminatory power for hip fracture prediction, with QFracture performing slightly better. This performance gap was more pronounced in previous studies, likely because of broader age inclusion criteria for QFracture validations. The simpler FRAX performed almost as well as QFracture for hip fracture prediction, and may have advantages if some of the input data required for QFracture are not available. However, both tools require calibration before implementation.
Osteoporotic fractures cause major morbidity and mortality, with many people who experience such fractures rapidly deteriorating in health status and experiencing a lower quality of life.12 This poses a substantial economic burden to health systems, patients, and their families.3 The burden of osteoporotic fractures is expected to increase as populations age, with the incidence of hip fractures reported to increase 30-fold between the ages of 50 and 90 years.4 Osteoporotic fractures and re-fractures can be prevented and better managed when people at high risk are identified early.56
Routine scanning of bone mineral density is recommended in all women and in some guidelines also for men, but despite these recommendations, rates of screening remain low, leaving osteoporosis undiagnosed in many patients.78910 Furthermore, the criteria used for bone mineral density to identify those at high risk for osteoporotic fractures are not highly sensitive, as more than half of older women with osteoporotic fractures do not meet the bone mineral density criteria for osteoporosis (T score lower than −2.5).11
For these reasons, multiple risk assessment tools based on clinical and personal characteristics have been developed in recent years to identify those at high risk for osteoporotic fractures. The most studied tools are the World Health Organization’s FRAX, Garvan, and QFracture, which are all freely available online for public use.12 Each tool has been developed in different contexts, with FRAX and Garvan based on cohort studies using survey and doctor and patient reported data, and QFracture based on data from electronic health records.413141516 The extent to which each tool has been externally validated varies: FRAX has been validated by 26 studies in nine countries, Garvan by six studies in three countries, and QFracture by three studies within the United Kingdom and the Republic of Ireland.12 These three tools also differ in their complexity in terms of the number of input variables included, with QFracture using 26 variables, FRAX using 11, and Garvan using five. In addition, FRAX and Garvan offer predictions with or without the input of a pre-existing bone mineral density measurement, whereas QFracture does not include bone mineral density in its algorithm. Supplement 1 summarises the basic features of the three tools.
Although many predictive tools have been developed, few are used to support clinical decision making to identify patients at high risk for osteoporotic fractures.17 With increasing use of electronic health record systems there is the potential to produce automated personalised fracture risk scores to better direct treatment and reduce the overall burden of osteoporotic fractures. These risk scores can be presented both directly to the patients and made accessible to their doctors through the electronic health record system. Several studies have shown the benefits of improved management of osteoporosis and fracture prevention from electronic health records or electronic software based decision support implementations.1819 In determining which of the various prediction tools is adaptable for automatic implementation using electronic health record data, the predictive performance of each tool (both discrimination and calibration), the validation results in various populations, and the availability of types of data required for the tool must be considered.
Although numerous reviews have compared FRAX, Garvan, and QFracture,12202122 to the best of our knowledge their performance has not been directly compared within one population. A few studies have directly compared two of the tools in the same population.13232425 However, the only study to evaluate tool performance in a large population and among both men and women compared old versions of QFracture and FRAX.13 Several other studies purported to compare two tools but did not validate the predicted risk with observed events over a subsequent follow-up period.26272829
Several substantial pitfalls have been highlighted both in comparisons of performance across various tools and in external validations of specific tools. Problems with missing input variables, sample size, and the number of outcome events were noted as limiting the ability of validation studies to provide generalisable results and full validations of original tools.17 Most studies aiming to compare measures of tool performance were reviews or meta-analyses (not direct comparisons within one population) that relied on results of specific tool validations. The comparability of these validations has been critiqued, because different inclusion criteria and follow-up periods might affect their results.30 Age, for example, is a major determinant of fracture risk, and thus the choice of age ranges included in specific validation studies has been suggested to substantially affect the reported performance of the tool. Furthermore, many of these validations did not compare the validation population with the derivation population used to develop the tool, a comparison that would show what kind of validation they contribute, ranging between “reproducibility” (evaluating the tool within a population with similar characteristics) and “transportability” (evaluating the tool within a population of different characteristics).31 The lack of consistency in study designs among previous validation studies makes it difficult to draw meaningful conclusions about which tool offers the best performance.
We compared the performance of the three most commonly studied fracture prediction tools in a single, large population when computed automatically based on electronic health record data. We also conducted a tool specific external validation in an independent population to evaluate the performance of the tools in populations with the same age range as those in which they were developed, thus allowing comparison with previously reported performance.
This study used electronic health record data from Clalit Health Services, the largest of four national health funds in Israel. All Israeli residents are covered by one of the health funds and can switch between them at any time; however, switching rates are relatively low—about 1% annually32—which allows for consistent longitudinal follow-up. Clalit Health Services is both a healthcare insurer and a provider, thus financing and supplying services to its 4.3 million members, which make up more than half of Israel’s population. Membership of Clalit Health Services comprises the general population, but for historical reasons the organisation has a slightly larger proportion of the older population and those from a lower socioeconomic class.33
In this historical prospective cohort study we compared the probability of hip fracture over five years using FRAX, QFracture, and Garvan, as well as the probability of major osteoporotic fractures over five years using FRAX and QFracture, computed on 1 January 2010 (index date), with fracture events observed up to 31 December 2014 (follow-up period).
In the first part of this study we compared the performance of the three tools, and thus selected a population in which the reported age ranges for all tools overlap. This comparative analysis was conducted for risk of hip fracture. Because the definition used by Garvan for major osteoporotic fractures is much broader than the one used by QFracture and FRAX (vertebral, distal radius, proximal humerus, or hip), we conducted additional analyses only between QFracture and FRAX to compare the performance for predicting major osteoporotic fractures. In the second part of the study we conducted a tool specific external validation for performance in predicting fractures, using cohorts with varying age ranges for each tool.
The comparative analysis was performed among members of Clalit Health Services aged 50 to 90 years as of the index date, who had at least three years of continuous membership before the index date and through the follow-up period or until death (see fig 1). Therefore, the cohort did not include those who were lost to follow-up. Although FRAX was developed among a population that excluded patients who were treated for osteoporosis,34 the other two tools were not, and thus for comparative purposes, treatment for osteoporosis was not used as an exclusion criterion (a population of non-treated patients was evaluated in a separate sensitivity analysis).
For the tool specific external validation analyses we used specified age ranges corresponding to those chosen in the original tool development studies: the QFracture analysis included members aged 30-100 years,16 the FRAX analysis included members aged 50-90 years,4 and the Garvan analysis included members aged 60-95 years14 (the official calculator computes risk for people aged 50 years or older, which is why we chose age 50 as the lower limit for the comparative analysis).35 The rest of the inclusion and exclusion criteria did not differ from those used in the comparative analysis (see fig 1).
To account for real world settings, in the populations of both analyses we included those who died during the follow-up period.
The electronic health record data at Clalit Health Services contain comprehensive administrative and clinical data. These include demographic information, diagnoses given in a community or a hospital setting, chronic disease and oncology registries, laboratory results, written prescriptions and prescriptions dispensed, clinical markers (eg, body mass index, smoking status), medical procedures, and imaging data.
Input variables included clinical status, prescription drug use, and demographic characteristics, according to the variables used in each of the tools. Supplement 2 lists the codes used to define diagnoses and drug based variables.
To provide as comprehensive data as possible for the prediction tools, we based all input variables of the three prediction tools on information that was last documented as of the index date. Most study variables represent chronic conditions and were consequently taken with no date limitation before the index date. For variables that could potentially change over time (including body mass index, smoking status, alcoholism, nursing home residency, history of falls, and drug use), we took the last relevant documented history with no time limitation, and we also conducted a sensitivity analysis in which the extraction of such variables was limited to two years before the index date. The sensitivity analysis was performed to establish the implications of not limiting the time from which variable data were taken.
Clinical diagnoses—Input variables for diagnosis included history of osteoporotic fractures, secondary osteoporosis, dementia, Parkinson’s disease, epilepsy, diabetes and other endocrine conditions, obstructive airways disease, cardiovascular disease, malabsorption, chronic liver disease, chronic kidney disease, rheumatoid arthritis, systemic lupus erythematosus, and documented history of falls. We extracted these diagnoses from community and hospital records, as well as from the Clalit Health Services chronic disease registry, when appropriate. Diagnoses were defined based on the International Classification of Diseases, ninth revision (ICD-9), International Classification of Primary Care (ICPC), and chronic disease registry codes. Diagnoses made in the community setting were further validated based on doctors’ accompanying free text diagnosis description, available only in the community records.
Body mass index—This variable was computed from documented weight and height measurements.
Smoking status—In the Clalit Health Services database, smoking status is defined as non-smokers, former smokers, or current smokers. In QFracture, three current smoking categories are provided according to the number of cigarettes smoked daily.36 To avoid the bias of categorising patients in one of the outlying categories, we assigned Clalit Health Services “current smokers” to the middle category (10-19 cigarettes daily). For FRAX’s two category smoking status, we assigned former smokers in our population to the non-smokers category, as was done in the cohorts used to develop FRAX.3738
Alcohol consumption—The Clalit Health Services database does not include information on alcohol intake, so we defined alcohol consumption as a dichotomous (yes or no) variable, based on diagnoses of alcoholism or alcohol induced chronic complications (ICD-10 codes for related psychiatric diagnoses were used for alcoholism in addition to the ICD-9 and ICPC codes). Of the five alcohol consumption categories provided by QFracture, we assigned individuals with alcohol related diagnoses to the fourth level category (7-9 units daily, using the UK’s definition of alcohol unit), since the lower categories were unlikely to cause alcohol related complications, and the highest category might overestimate the alcohol consumption for some of the relevant population. Given the inability to distribute individuals without alcohol related diagnoses to the various alcohol consumption levels, we assigned them to the “none” (ie, no alcohol intake) category.
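The smoking and alcohol assignments described above amount to a small lookup. The sketch below is illustrative only: the category labels, dictionary names, and function name are hypothetical and are not the tools' internal codings, and mapping former smokers to a QFracture ex-smoker category is an assumption that the text does not state.

```python
# Hypothetical labels for illustration; the tools use their own internal codings.
QFRACTURE_SMOKING = {
    "non-smoker": "non-smoker",
    "former smoker": "ex-smoker",  # assumption: not stated in the text
    "current smoker": "current smoker, 10-19 cigarettes daily",  # middle category
}

# FRAX has two categories; former smokers are treated as non-smokers.
FRAX_CURRENT_SMOKER = {
    "non-smoker": False,
    "former smoker": False,
    "current smoker": True,
}


def qfracture_alcohol_category(has_alcohol_related_diagnosis: bool) -> str:
    """Fourth of five levels if an alcohol related diagnosis exists, else 'none'."""
    return "7-9 units daily" if has_alcohol_related_diagnosis else "none"


# One illustrative (made-up) member record.
record = {"smoking": "current smoker", "alcohol_related_diagnosis": False}
qf_smoking = QFRACTURE_SMOKING[record["smoking"]]
frax_smoker = FRAX_CURRENT_SMOKER[record["smoking"]]
qf_alcohol = qfracture_alcohol_category(record["alcohol_related_diagnosis"])
```

Keeping the mappings as explicit tables makes each coding decision easy to audit and to vary in a sensitivity analysis.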
Family history of fractures—A family history of osteoporosis and hip fractures was defined by diagnosis codes indicating such a history and by searching the medical records of the parents of study members, when the family connection was defined within the electronic health record and either parent was a member of Clalit Health Services.
Medication use—We computed variables for medication use, considered only in QFracture and FRAX, based on pharmacy dispensing data. Glucocorticoid use was defined differently by these tools—two prescription records in the last six months by QFracture versus current or past use for more than three months by FRAX. We therefore computed glucocorticoid use as two separate variables. Purchases of antidepressants and hormone replacement therapy medications were included only in the QFracture analyses.
Nursing home care—We considered an individual to be a nursing home resident when the patient’s primary clinic or treating doctor were administratively defined as institutional positions.
In cases where there was no documentation of body mass index, weight, or smoking status before the index date (the only variables for which missing data could be identified), we used multiple imputation to complete these values. We also performed a complete case sensitivity analysis without imputed variables.
Outcome variables included both hip fractures and major osteoporotic fractures; the latter were defined as fractures of the hip, vertebrae, distal radius, or proximal humerus. These variables were defined based on the records for clinical diagnoses.
Predictive tool risk computation
We computed the five year risk according to QFracture (2012 version) and Garvan based on their full tool equations.14163639 To ensure correct automation, we manually validated a few dozen cases against the official calculator sites. Since the current FRAX equations are not published by the authors, we used the FRAX 10 year probability charts calibrated for Israel, stratified by sex, age, body mass index, and number of clinical risk factors, as supplied by the official FRAX site.37 We multiplied the 10 year probabilities by 0.5 to convert to five year probabilities. The justification for this transformation was established by examining the rate of osteoporotic fracture events over a 10 year period, between 2005 and 2014 (see supplement 3 for further details). All tools were computed without the input of bone mineral density because QFracture does not include this variable and data on bone mineral density were limited in the electronic health record system for the study years.
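The 10 year to five year conversion is a fixed scaling, sketched below. The function names and the example probability are hypothetical. The second function is not the study's method: it shows a constant hazard conversion only for comparison, to illustrate that halving is a close approximation when probabilities are small.

```python
def five_year_from_ten_year(p10: float, factor: float = 0.5) -> float:
    """Scale a 10 year probability by a fixed factor (0.5 in the study)."""
    if not 0.0 <= p10 <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return p10 * factor


def five_year_constant_hazard(p10: float) -> float:
    """Comparison only (NOT used in the study): constant hazard over 10 years."""
    if not 0.0 <= p10 <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 - (1.0 - p10) ** 0.5


p10 = 0.06  # illustrative 10 year probability, not taken from the FRAX charts
linear = five_year_from_ten_year(p10)
hazard = five_year_constant_hazard(p10)
```

For this example the two conversions differ by less than 0.1 percentage points; the gap widens as the 10 year probability grows.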
To compare across the three tools, which were developed using different modelling methods, we used the provided risk probabilities for each tool respectively and treated the outcome as if it were a binary variable (fracture or no fracture). This decision was also guided by the clinical application of these risk predictions tools—that doctors and patients perceive the output as risk for the relevant follow-up period, regardless of the methods used to produce it. The closed cohort design facilitated this strategy of treating the outcome as a binary variable, because there was a known outcome for all study members in a fixed follow-up period of five years.4041 Since it is clinically important to test the accuracy of the predicted probability of fracture both for people who survive the follow-up period and for those with shorter life spans, we did not account for shortening of the follow-up period due to death.
To evaluate the overall ability of each tool to discriminate between those at low risk and those at high risk we used the area under the receiver operating characteristic curve (AUC) in both the comparative and the tool specific external validation analyses. We calculated other discriminatory measures (sensitivity, specificity, positive and negative predictive values, accuracy, and error) for the top 10% and 20% highest risk cut-offs of each tool. In three separate sensitivity analyses we further evaluated the discrimination measures in the comparative analysis: limiting the look-back period for variables that can change over time, a complete case analysis, and a subpopulation that excluded patients who were being treated for osteoporosis in the two years before the index date.
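The discrimination measures above can be sketched in a few lines. This is a minimal illustration on made-up data, not the study's code: the function names are hypothetical, and the pairwise AUC shown is quadratic in sample size, so a cohort of a million people would need a rank based implementation.

```python
def auc(scores, outcomes):
    """Probability that a random event outranks a random non-event (ties count 0.5)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_fraction_metrics(scores, outcomes, fraction=0.10):
    """Classify the top `fraction` of predicted risk as high risk and score it."""
    n_high = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    high = set(order[:n_high])
    tp = sum(1 for i in high if outcomes[i] == 1)
    fp = n_high - tp
    fn = sum(outcomes) - tp
    tn = len(scores) - n_high - fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Made-up predicted risks and observed outcomes (1 = fracture) for 10 people.
scores = [0.30, 0.20, 0.15, 0.10, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02]
outcomes = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
example_auc = auc(scores, outcomes)
top20 = top_fraction_metrics(scores, outcomes, fraction=0.20)
```

Because the cut-off is a fixed share of the population rather than a fixed probability, each tool flags the same number of people, which keeps sensitivities comparable between tools.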
Since the AUC is considered a somewhat crude overall discriminatory measure, that might overlook the contribution of specific risk factors that are not prevalent in the population but are potentially clinically significant for an individual patient’s risk prediction,30 we conducted a reclassification analysis between the two tools with the highest AUCs in the comparative analysis. We report the total numbers of patients classified as low risk and high risk using a top 10% cut-off level for the two tools, as well as measures of net reclassification index analysis.42 The net reclassification index for events is the rate of events that were correctly reclassified as high risk by the tested tool (usually the tool that incorporates more risk factors) minus the rate of events wrongly reclassified as low risk. The net reclassification index for non-events is the parallel measure, and is the rate of non-events that were correctly reclassified as low risk minus the rate of non-events that were wrongly reclassified as high risk. The overall net reclassification index is the combination of net reclassification index for events and net reclassification index for non-events, whereas the more intuitive weighted net reclassification index is the combination of the same values weighted by the relative size of the groups they represent.43 We calculated standard errors for all net reclassification index values.44
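The net reclassification index definitions above can be written out directly for a two category (low or high risk) classification. The sketch below uses made-up data and hypothetical names. The weighted index is computed here by weighting the event and non-event components by event prevalence, which is one reading of "weighted by the relative size of the groups" and should be treated as an assumption.

```python
def net_reclassification(ref_high, new_high, outcomes):
    """NRI of a new tool against a reference tool at one high risk cut-off.

    ref_high / new_high: True where each tool classifies a person as high risk.
    outcomes: 1 for an observed event (fracture), 0 otherwise.
    """
    n_events = sum(outcomes)
    n_nonevents = len(outcomes) - n_events
    up_e = down_e = up_ne = down_ne = 0
    for ref, new, y in zip(ref_high, new_high, outcomes):
        if ref == new:
            continue  # same category under both tools: not reclassified
        if y == 1:
            if new:
                up_e += 1    # event correctly moved to high risk
            else:
                down_e += 1  # event wrongly moved to low risk
        else:
            if new:
                up_ne += 1   # non-event wrongly moved to high risk
            else:
                down_ne += 1  # non-event correctly moved to low risk
    nri_events = (up_e - down_e) / n_events
    nri_nonevents = (down_ne - up_ne) / n_nonevents
    prevalence = n_events / len(outcomes)
    return {
        "nri_events": nri_events,
        "nri_nonevents": nri_nonevents,
        "nri_overall": nri_events + nri_nonevents,
        "nri_weighted": prevalence * nri_events
        + (1 - prevalence) * nri_nonevents,
    }


# Made-up example: 4 events and 6 non-events.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
ref_high = [True, True, False, False, True, False, False, False, False, False]
new_high = [True, True, True, False, False, False, False, False, False, False]
nri = net_reclassification(ref_high, new_high, outcomes)
```

In this example one event moves up and one non-event moves down, so the weighted index equals the net share of all 10 people who ended in a more appropriate category (2 of 10).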
We assessed the calibration of each tool by comparing the average predicted risk with the observed percentage of those who experienced fractures over the follow-up period, stratified by age groups and separately by 10ths of fracture risk. To provide calibration measures that are not based on grouping of individuals into strata, we compiled non-parametric calibration curves, calibration slopes, and calibration-in-the-large values45 using functions by Harrell et al46 and added these to calibration plots.
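The grouped part of this calibration assessment, observed-to-predicted ratios by strata of predicted risk, can be sketched as follows. The data and function name are made up for illustration, and the sketch does not cover the calibration slope or calibration-in-the-large, which require fitting a regression model.

```python
def observed_to_predicted_by_group(predicted, outcomes, n_groups=10):
    """Split people into equal sized groups of predicted risk and compare
    the observed event rate with the mean predicted risk in each group."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    rows = []
    for g in range(n_groups):
        start = g * len(order) // n_groups
        stop = (g + 1) * len(order) // n_groups
        idx = order[start:stop]
        if not idx:
            continue  # fewer people than groups
        observed = sum(outcomes[i] for i in idx) / len(idx)
        expected = sum(predicted[i] for i in idx) / len(idx)
        rows.append((g + 1, observed, expected, observed / expected))
    return rows


# Made-up example with two risk groups of four people each.
predicted = [0.1, 0.1, 0.1, 0.1, 0.4, 0.4, 0.4, 0.4]
outcomes = [0, 0, 0, 1, 1, 1, 0, 1]
rows = observed_to_predicted_by_group(predicted, outcomes, n_groups=2)
overall_ratio = (sum(outcomes) / len(outcomes)) / (sum(predicted) / len(predicted))
```

A ratio above 1 means that the tool underestimated risk in that group; here both groups and the overall ratio exceed 1.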
Multiple imputation was conducted using 10 iterations and 20 multiple imputations, thus creating 20 full datasets, using functions by Van Buuren et al.47 We performed all analyses separately on each of these imputed datasets and averaged these to determine the final performance measures. A 95% confidence interval for AUC measures of specific prediction tools as well as for the differences between tools was calculated using Rubin’s rules for variance estimation in multiple imputed datasets4548 (by taking into account both the AUC variance of 1000 bootstrap samples within each imputed dataset and the variance of the 20 average AUCs between the imputed datasets). Owing to the nature of the net reclassification index analysis, this analysis was only based on one random imputed dataset. Plots were created using a combined dataset that included all of the separate imputed datasets.
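Pooling an AUC across imputed datasets with Rubin's rules combines the within-imputation variance (here, from the bootstrap) with the between-imputation variance. The sketch below is a minimal illustration with made-up numbers and a hypothetical function name. It uses three imputations for brevity where the study used 20, and a normal approximation for the interval rather than a t distribution.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances, z=1.96):
    """Pool one estimate per imputed dataset using Rubin's rules.

    estimates: the estimate (eg, AUC) from each imputed dataset.
    within_variances: its variance within each dataset (eg, from bootstrap).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    half_width = z * total ** 0.5
    return pooled, total, (pooled - half_width, pooled + half_width)


# Made-up AUCs and bootstrap variances from three imputed datasets.
aucs = [0.826, 0.827, 0.828]
boot_vars = [1e-6, 1e-6, 1e-6]
pooled_auc, total_var, ci = pool_rubin(aucs, boot_vars)
```

The (1 + 1/m) factor inflates the between-imputation variance to account for using a finite number of imputations.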
No patients were involved in setting the research question or the outcome measures, nor were they involved in developing plans for design or implementation of the study. No patients were asked to advise on interpretation or writing up of results. There are no plans to disseminate the results of the research to study participants or the relevant patient community.
As of 1 January 2010, 1 085 104 members of Clalit Health Services were aged 50 to 90 years. Of those, we excluded 30 289 (2.8%) because they did not meet the criteria for continuous membership (fig 1). The final population for the comparative analysis consisted of 1 054 815 people (54.6% women). This population included 28 091 (2.7%) who experienced a hip fracture and 81 564 (7.7%) who experienced a major osteoporotic fracture during the follow-up period (table 1). Supplement 4 provides specific fracture rates stratified by age and sex. A total of 113 591 (10.8%) people died during the follow-up period. The average length of follow-up was 4.73 years, with 4 990 557 total person years of follow-up. Overall, 54 849 (5.2%) of the records were imputed for weight and body mass index values, and 34 921 (3.3%) were imputed for smoking status. Table 1 lists the characteristics of the study population by input variables of the three prediction tools, the outcome fracture rates, and which variables were included in each tool.
In examining the comparative performance across the tools, QFracture had the highest AUC for hip fracture prediction (82.7%, 95% confidence interval 82.4% to 82.9%), followed closely by FRAX (81.5%, 81.3% to 81.7%). Garvan’s AUC for hip fracture prediction (77.8%, 77.5% to 78.1%) was lower (table 2). The 95% confidence interval for the difference between the QFracture and FRAX AUCs was 1.0 to 1.3 percentage points, whereas that for the difference between the QFracture and Garvan AUCs was 4.7 to 5.1 percentage points. By targeting those in the highest 10% of risk for hip fracture, as predicted in 2010, QFracture identified 45.1% (sensitivity) of those who went on to experience a hip fracture, FRAX 43.6%, and Garvan 36.9%. By targeting those in the highest 20% of risk for hip fracture, QFracture identified 68.9% of hip fractures, FRAX 65.8%, and Garvan 57.1%. The specificity and negative predictive values were high and comparable for all three tools (table 2).
The QFracture and FRAX discriminatory measures for prediction of major osteoporotic fractures were lower than those for hip fracture prediction. AUCs for both tools were close (QFracture: 71.2%, 71.0% to 71.4%; FRAX: 71.4%, 71.2% to 71.6%; table 2). The 95% confidence interval for the difference between the FRAX and QFracture AUCs was 0.1 to 0.3 percentage points. The sensitivity for the top 10% highest risk group was 26.7% for QFracture, compared with 29.0% for FRAX, and the positive predictive value was 20.7% for QFracture and 22.4% for FRAX. Figure 2 presents the comparisons of the receiver operating characteristic curves for all three tools in predicting hip fractures and for QFracture and FRAX in predicting major osteoporotic fractures.
The results from the three sensitivity analyses were consistent on the relative performance of the tools in their discriminatory measures to that of the main analysis. Analyses limiting variable data collection to the two years before the index date can be found in supplement 5. Complete case analyses are in supplement 6, and analyses of non-treated patients in supplement 7.
We conducted a reclassification analysis between QFracture and FRAX (the two tools that yielded the highest AUCs) to compare how these tools categorised patients into low risk and high risk groups. QFracture, which incorporates more risk factors than FRAX in its prediction model, was considered the “reclassifying” model in the analysis, so that we could evaluate the prediction increment offered by its added risk factors (table 3). The net proportion of patients who experienced a hip fracture and were correctly reclassified as high risk by QFracture compared with FRAX was 1.50% (net reclassification index for events). The net proportion of patients who experienced a major osteoporotic fracture and were correctly reclassified as high risk by QFracture was −2.31% (net reclassification index for events). For both types of outcomes, the change in the correct reclassification of non-events was less than 0.2%. The net changes in the proportion of patients assigned a more appropriate risk category for prediction of hip fracture and major osteoporotic fracture by QFracture were 0.08% and −0.36%, respectively.
Table 4 presents the absolute probabilities of hip fracture that were calculated by each of the three tools, and the calibration of these probabilities with the absolute fracture rates that were observed over the five year follow-up period, by sex and age groups. A majority of the observed-to-predicted ratios for hip fractures were greater than 1, indicating underestimation of the risk by all three tools for both men and women and in almost all age groups. The QFracture and Garvan ratios presented a consistent downward trend with the increase in age groups but were steadier across the different age groups for FRAX. The risk underestimation was most prominent for women in Garvan. In addition, Garvan was the only tool to assign lower mean predicted probabilities for women compared with men in the same age groups. The observed-to-predicted ratios by 10ths of risk and sex were also more consistent for FRAX compared with QFracture and Garvan, which presented declining ratios as risk increased (table 5). Figure 3 shows a calibration plot of the observed and predicted rates for each 10th of risk, along with non-parametric calibration curves, calibration slopes, and calibration-in-the-large values.
The tool specific external validation analyses consisted of three different cohorts (fig 1): the FRAX validation population was identical to the comparison analysis population (members aged 50-90 years), the QFracture population consisted of 1 896 413 members, aged 30-100 years, and the Garvan population included 670 435 members, aged 60-95 years. The population of the QFracture external validation included 31 709 (1.7%) individuals who experienced a hip fracture and 99 058 (5.2%) individuals who experienced a major osteoporotic fracture during the follow-up period. The corresponding rates for the population of the Garvan external validation were 27 897 (4.2%) and 68 859 (10.3%), respectively. Supplement 8 provides a comparison of the prevalence of the risk factors between the populations used to develop the tools (derivation cohorts), and the population of the tool specific external validations in the current study for QFracture16 and FRAX.38 The prevalence of risk factors as defined in the final Garvan model was not available for the original Garvan population.1415 The current study’s QFracture tool specific population was relatively older than QFracture’s derivation cohort and was characterised by a greater prevalence (or greater capture rates) of most risk factors. In contrast, the current study’s FRAX tool specific population was similar in age to FRAX’s derivation cohort, with a smaller share of women and lower prevalence (or lower capture rates) of risk factors.
AUC values for hip fracture in the validation analyses were 88.0% (95% confidence interval 87.8% to 88.2%) for QFracture, 81.5% (81.3% to 81.7%) for FRAX, and 71.2% (70.9% to 71.5%) for Garvan (table 6). The Garvan hip fracture tool was the only one to present sex specific AUC and sensitivity values that were both higher than the overall values. Figure 2 presents the comparisons of the receiver operating characteristic curves for the tool specific external validations. Supplement 9 provides calibration analyses for age and 10ths of risk groups for each of the tool specific external validation cohorts.
This study included over one million adults aged 50-90 years in a single, general population and directly compared the three most studied fracture prediction tools in an electronic health record system. The discriminatory performance according to the area under the receiver operating characteristic curve (AUC) of hip fracture scores for both FRAX and QFracture was high, with the latter performing slightly better, followed by a moderate performance of Garvan. Discriminatory measures for the prediction of major osteoporotic fractures were lower overall than for hip fracture prediction, with very close AUC measures for FRAX and QFracture. Three different sensitivity analyses (see supplements 5-7), examining the impact of input data definitions as well as a different population definition among patients naïve to osteoporosis treatment, all supported these findings. Given that small differences in the overall AUC (as observed between QFracture and FRAX) may not reflect the entire difference in the discriminative performance for individual patients with a unique set of risk factors, we evaluated the reclassification of individuals between these tools. In examining the value gained from the additional risk factors included in QFracture compared with FRAX, reclassification analysis showed that QFracture had an overall 0.08% net increase and a 0.36% net decrease in the proportion of patients assigned a more appropriate risk category for hip fractures and major osteoporotic fractures, respectively. The combination of these results suggests an overall similar discriminatory performance for QFracture and FRAX, with a small advantage in hip fracture prediction for the former and a small advantage in major osteoporotic fracture prediction for the latter.
The tool specific external validation analyses presented comparable results to those reported in previous individual tool validations of the same age ranges. Despite the identical age ranges that were used for the tool specific external validations, the populations still differed to some extent from the derivation cohorts to which they were compared in terms of overall average age, sex distribution, and prevalence of risk factors (see supplement 8). In addition, the FRAX derivation cohort excluded patients treated for osteoporosis, but the current study found very similar results for FRAX when tested in a cohort with and without these patients (see supplement 7). Owing to these differences, our tool specific external validation analyses provided evidence for the transportability of the tools when considering the spectrum of external validation studies ranging from reproducible to transportable. Furthermore, by comparing performance gaps between tools both in the same population and in populations of different age ranges, our analyses substantiated previous claims of a strong correlation between age spans of the studied population and the observed performance of the tested tool.1230
In an analysis of the calibration measures, FRAX presented the best observed-to-predicted ratios, with the weighted average closest to 1, both across age groups and across predicted risk 10ths. Additionally, the calibration slopes of FRAX were closest to 1, representing better calibration across individuals, on top of the better calibration among groups. The FRAX calibration ratios were also relatively stable, whereas QFracture and Garvan presented a decline in the observed-to-predicted ratios as age increased. A possible contributor to FRAX’s more consistent observed-to-predicted ratio across age groups is that it accounts for the competing risk of death, whereas Garvan and QFracture do not.4 The integration of competing death risk into fracture prediction simulates real world behaviour by assigning lower predicted fracture rates for groups of individuals who have lower life expectancy, such as older people. The issue of whether competing risk of death should be incorporated into fracture prediction tools has been debated in the literature, with some studies accounting for it and others not.202550515253 Our comparative results observed within a single real world population illustrate that calibration is relatively more consistent when competing risk is incorporated. The observed-to-predicted ratios of QFracture and Garvan also presented a declining trend over 10ths of risk. The trend observed over 10ths is at least in part likely explained by age, since higher risk 10ths contained a larger share of older people (data not shown). The overall better calibration of FRAX may also be due to the use of country specific probability charts provided by FRAX.
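The observed-to-predicted ratios across 10ths of predicted risk discussed above can be computed along the following lines. This is a hedged sketch under simple assumptions (equal sized groups formed by sorting on predicted risk, and hypothetical function and argument names); it does not reproduce the weighting or the calibration slope estimation used in the study.

```python
def observed_to_predicted_by_tenths(predicted, events, groups=10):
    """Observed-to-predicted ratio within equal sized groups of predicted risk.

    `predicted` holds each person's predicted fracture probability and
    `events` the matching 0/1 outcome. A ratio above 1 means the tool
    underestimated risk in that group; below 1 means it overestimated.
    """
    n = len(predicted)
    if n < groups:
        raise ValueError("need at least one person per group")

    # Sort people from lowest to highest predicted risk.
    order = sorted(range(n), key=lambda i: predicted[i])

    ratios = []
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        observed = sum(events[i] for i in members)    # fractures that occurred
        expected = sum(predicted[i] for i in members)  # fractures the tool predicted
        ratios.append(observed / expected)
    return ratios
```

A declining pattern of ratios from the lowest to the highest group would correspond to the trend described for QFracture and Garvan, whereas ratios that stay roughly level would correspond to the steadier pattern described for FRAX.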
Strengths and limitations of this study
This study has several strengths in its methods, analyses, and findings and implications for practical real world application. Firstly, we directly compared three well established fracture prediction tools in the same population, thus measuring the differences in performance with minimal effect of confounding. Secondly, the population used for this study was large, had many fracture events (>100 000 major osteoporotic and hip fracture events), included both men and women, and was nationally representative, thereby minimising selection biases. In addition, the population included those who died before the end of follow-up, which simulates real life use of the tools. To our knowledge, no previous study outside of the UK and the Republic of Ireland has validated QFracture in an independent population. Additionally, the tool specific validation analyses allowed the presentation of results comparable to the previously reported performance of the specific tools, thus assuring that a relevant population and relevant input data were used and further strengthening the results of the comparative analysis.
By testing the applicability and performance of the three predictive tools using data from electronic health records, this study confirmed that the tools are transferable to an electronic health record system with the potential for automated large scale implementation, even though two of them were not originally developed in this setting. Any organisation that aims to implement these tools into an electronic health record system must make adaptations according to the available data in its database. Despite having to adapt the data according to our electronic health record system for some of the variable categorisations used by the different tools, we observed comparable performance to that found in previous studies across all tools. This shows that the application of these tools in an external electronic health record system can be replicated in other contexts. As further evidence that these tools are applicable to an environment of electronic health records, we observed that the fracture rate in strata based on 30 out of 33 total variables considered among the three tools was higher among the strata with known risk factors (see table 1). There were, however, three variables that yielded patterns contrary to the expected direction for fracture risk. Two of these (family history of osteoporosis and hip fractures) were reflections of data limitations, specifically lower rates of well defined family connections for older adults. Smoking status, the third variable that did not follow the anticipated direction, was possibly affected by confounding, since younger men were more likely to be current smokers in the study population (data not shown).
This study has several limitations. Firstly, whereas previous studies commonly report probabilities for 10 years of follow-up, we were only able to evaluate the probabilities of fracture risk for five years owing to limited availability of robust baseline data as of 1 January 2005. It has been noted that AUC performance can be potentially affected by the duration of follow-up.1230 To address this point, we conducted a preliminary analysis and found that the rate of fracture events is approximately constant, meaning that the cumulative rate of events is linear, as presented in our supplementary material as well as in previous studies.23 This trend not only substantiates the conversion of 10 year FRAX probabilities into five year probabilities, but also supports an assumption that the performance of the five year probabilities likely reflects the performance of the 10 year probabilities. Furthermore, from a clinical perspective, the five year probabilities might be more useful for prevention and intervention, since the long term safety of bisphosphonates for fracture prevention after five years is unclear.5455 Secondly, the evaluation of FRAX relied on probability charts, which provide a cruder risk assessment than complete tool equations. Since FRAX achieved good discrimination, which aligns well with previous studies, it is reasonable to conclude that this limitation did not substantially affect its performance. Finally, as is often the case in electronic health record databases, we did not have extensive data on bone mineral density (in a large enough proportion of the overall population throughout the study timeline) to include it as an input variable for the tools that do offer a bone mineral density option. Given that bone mineral density screening is not performed in a substantial part of the adult population,78910 and the fact that QFracture does not include bone mineral density as an input, we have reason to believe that despite this omission, our results are meaningful for practical application.
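The conversion of 10 year probabilities into five year probabilities mentioned above rests on the cumulative event rate being approximately linear over time. The sketch below is illustrative only: this section does not state the exact formula the authors applied, so both functions and their names are assumptions. The first halves the 10 year probability, which follows directly from a linear cumulative rate; the second assumes a constant hazard, which gives nearly the same answer when probabilities are small.

```python
def five_year_from_ten_year_linear(p10):
    """Five year probability if cumulative risk grows linearly over 10 years."""
    return p10 / 2.0


def five_year_from_ten_year_constant_hazard(p10):
    """Five year probability under a constant hazard.

    With a constant hazard, survival over 10 years is the square of
    survival over five years, so S5 = sqrt(1 - p10) and p5 = 1 - S5.
    """
    return 1.0 - (1.0 - p10) ** 0.5
```

For a 10 year probability of 10%, the linear version gives 5% and the constant hazard version gives a slightly higher value; the gap widens as the 10 year probability increases, which is why the choice between the two matters little at the fracture probabilities typical of this setting.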
Comparison to existing research
The AUC results for prediction of hip fracture in the tool specific external validation cohort were comparable to those reported in the development and validation studies for the second version of QFracture, which were 89% for women and 87-88% for men.1656 Since the original publications on FRAX and Garvan did not include AUC results for hip fracture prediction without bone mineral density,414 we identified other validation studies of these tools that reported this measure and were performed in populations comprising the same age ranges as the original development studies. AUC results for hip fracture prediction in such FRAX validations ranged from 77% to 79% for both sexes,5758 and 83% for women alone,52 comparable to the 82% that we report for both sexes and for women alone. The AUCs for Garvan from validations that were found relevant for this comparison were 76% in both sexes25 or 70% and 69% in women and men, respectively,39 compared with our study’s results of 71% in both sexes or 76-77% in women and men separately.
Comparisons of the three tools from reviews and meta-analyses have often concluded that the discrimination performance of QFracture is substantially better than that of FRAX. The conclusion that QFracture has substantially higher performance than other tools has also been cited in practice guidelines, such as the Scottish Intercollegiate Guidelines Network, as justification for using QFracture.59 However, these conclusions and practice recommendations are based on studies that did not use consistent inclusion criteria, such as sex and age ranges. While we also found QFracture to have better discrimination than either of the other two tools for hip fracture prediction, it was to a lesser extent than previously reported. Additionally, the QFracture discrimination for major osteoporotic fractures was slightly lower than that of FRAX. The possibility that differences in age ranges can substantially affect the observed discriminatory performance of a tool has been previously suggested.30 Yet this has not been shown within a real world population based study by comparing the performance of one tool in different age ranges. Our tool specific external validation results show that AUC and sensitivity values tended to be higher in populations with wider age ranges than in populations with narrower age ranges (fig 2). For example, the overall AUC of QFracture for hip fracture prediction, which spanned a 70 year age range, was higher than that of FRAX and Garvan, which spanned age ranges of 40 and 35 years, respectively. Additionally, by comparing the performance of a specific tool between the comparative and validation analyses, we illustrated how an expansion in age ranges resulted in higher observed discrimination (QFracture, with AUCs of 82.7% v 88.0%, respectively), and how a narrowing of the age range resulted in lower observed discrimination (Garvan, with AUCs of 77.8% v 71.2%, respectively).
In examining the receiver operating characteristic curves for each tool among a population with a narrower age range and among a population with a broader age range, our findings illustrate the direct effect that age has on the performance of these tools (fig 2).
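The dependence of observed AUC on the age span of the cohort can be reproduced in a small simulation. The sketch below is purely illustrative and uses no study data: the risk curve in `fracture_risk`, the sample size, and the seed are hypothetical choices, and the "tool" is simply age itself, so any difference in AUC between cohorts comes from the age range alone.

```python
import math
import random


def auc(scores, events):
    """Rank based (Mann-Whitney) AUC; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    n_events = sum(1 for e in events if e)
    n_non_events = len(events) - n_events
    rank_sum = sum(r for r, e in zip(ranks, events) if e)
    return (rank_sum - n_events * (n_events + 1) / 2) / (n_events * n_non_events)


def fracture_risk(age):
    """Hypothetical risk curve that rises steeply with age (not fitted to any data)."""
    return 0.3 / (1.0 + math.exp(-(age - 85.0) / 7.0))


def simulate_auc(age_min, age_max, n=20000, seed=0):
    """AUC of age as the only predictor in a simulated cohort of a given age range."""
    rng = random.Random(seed)
    ages = [rng.uniform(age_min, age_max) for _ in range(n)]
    events = [rng.random() < fracture_risk(age) for age in ages]
    return auc(ages, events)
```

Comparing `simulate_auc(30, 100)` with `simulate_auc(60, 95)` applies the same predictor and the same assumed risk curve to a wide and a narrow age range. The wider range yields the higher AUC in this simulation because it adds many young people at very low risk who are easy to rank below those who fracture.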
Conclusions and practice implications
Current guidelines for the prevention of osteoporotic fracture incorporate fracture prediction tools; the UK’s National Osteoporosis Guideline Group (NOGG)60 supports the use of age dependent risk cut-offs as an indication for treatment, regardless of bone mineral density, and a lower set of cut-offs as an indication for performing bone mineral density scanning. Another approach, adopted by the National Osteoporosis Foundation (NOF) in the US,55 recommends using a constant cut-off as indication for treatment in osteopenic patients (with borderline bone mineral density). However, even with clear recommendations from the professional committees, uptake in the adoption of these tools in practice is not widespread.1761 This is in part due to the lack of an automated decision support infrastructure that allows clinicians easy access to patients’ risk for fractures. The current study showed that two of the three tools assessed offer good discrimination for hip fracture prediction using electronic health record data, which could be incorporated into the electronic health record system and automatically raise an alert for clinicians when a patient is indicated to be at high risk.
Our results were consistent with previous findings that the best discrimination was associated with prediction of hip fractures,1220 which are known to be associated with the greatest morbidity and mortality.62 Because the calibration can be corrected for a local population,63 the selection of a tool should primarily be based on its discriminative ability. In an electronic health record system where all input variables are available for automatic implementation, our study suggests that QFracture yields the best discriminatory performance for hip fracture prediction. If some of the input variables are not available, FRAX, which performed almost as well despite not being developed using data from electronic health records, could be a simpler option for implementation of decision support. However, to identify those at risk for all major osteoporotic fractures, FRAX yielded slightly better discrimination, and thus may be preferable. The selected tool should only be used for the age ranges for which it was developed. Since all tools underestimated the risk of hip fractures in our population, the selected tool will require local calibration when implemented in practice, as the guidelines are based on specific risk cut-offs. This calibration will be more straightforward for FRAX, which showed steadier calibration across age groups, whereas the other tools will require age dependent calibration.
To achieve the potential utility of fracture prediction tools in clinical practice and adhere to the fracture prevention guidelines, these tools likely need to be automatically incorporated into electronic health record systems and brought to the attention of primary care doctors only when an action is required, without imposing any additional time burden. This study has shown that automatic implementation of the tools into an external electronic health record system is feasible, and has provided recommendations as to which tool is preferred under different circumstances. Additionally, our findings emphasise the importance of carefully comparing the performance of prediction tools of any kind in similar populations, and, if possible, in the same population. Further research is warranted to evaluate whether automatically generated fracture risk scores made accessible directly to patients or their doctors would increase screening for and treatment of osteoporosis, and ultimately prevent osteoporotic fractures.
What is already known on this topic
Tools for prediction of osteoporotic fractures are recognised by leading guidelines as an important component of osteoporosis prevention but are underutilised
Of the three most studied fracture prediction tools—QFracture, FRAX, and Garvan—QFracture was the only one developed using data from electronic health records
The adaptation of these tools for automatic implementation in external electronic health record systems is not clear, nor is their relative performance
What this study adds
Automatic computation of all three tools using data from external electronic health records produced results similar to those previously reported for each of the tools separately (tested in separate cohorts of the same age ranges as the derivation cohort of each tool)
When evaluated using one cohort (for which the age ranges of all tools overlap), QFracture and FRAX yielded high discriminatory performance for hip fracture prediction, with QFracture performing slightly better
This performance gap was much smaller than previously reported by reviews, which compared results from validation studies that tested each of the tools using different age ranges
We thank our colleagues at the Clalit Research Institute: Sydney Krispin and Carly Davis for their assistance in editing and reviewing the manuscript and Amichay Akriv and Moshe Hoshen for their guidance on the statistical analyses.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Funding: This study was supported internally by the Clalit Research Institute for the design and conduct of the study, data collection, analysis and interpretation of the data, and preparation and review of the manuscript.
Contributors: ND, RDB, and ML-R conceived and designed the study. ND, CC-S, and ML-R analysed and interpreted the data. ND and CC-S drafted the manuscript. All authors critically revised the manuscript for important intellectual content. ND carried out the statistical analysis. CC-S provided administrative, technical, and material support. RDB supervised the study and is the guarantor.
Data sharing: No additional data available.
Ethical approval: This study was approved by the Clalit Health Services research ethics committee.
Transparency: The lead author (ND) confirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/.