This article has a correction
- Seena Fazel, Wellcome Trust senior research fellow in clinical science1,
- Jay P Singh, postdoctoral research fellow2,
- Helen Doll, statistician3,
- Martin Grann, professor4
- 1Department of Psychiatry, University of Oxford, Warneford Hospital, Oxford OX3 7JX, UK
- 2Department of Mental Health Law and Policy, University of South Florida, Tampa, FL, USA
- 3Department of Population Health and Primary Care, University of East Anglia, Norwich, UK
- 4Centre for Violence Prevention, Karolinska Institutet, Stockholm, Sweden
- Correspondence to: S Fazel
- Accepted 15 June 2012
Objective To investigate the predictive validity of tools commonly used to assess the risk of violent, sexual, and criminal behaviour.
Design Systematic review and tabular meta-analysis of replication studies following PRISMA guidelines.
Data sources PsycINFO, Embase, Medline, and US National Criminal Justice Reference Service Abstracts.
Review methods We included replication studies from 1 January 1995 to 1 January 2011 if they provided contingency data for the offending outcome that the tools were designed to predict. We calculated the diagnostic odds ratio, sensitivity, specificity, area under the curve, positive predictive value, negative predictive value, the number needed to detain to prevent one offence, as well as a novel performance indicator—the number safely discharged. We investigated potential sources of heterogeneity using metaregression and subgroup analyses.
Results Risk assessments were conducted on 73 samples comprising 24 847 participants from 13 countries, of whom 5879 (23.7%) offended over an average of 49.6 months. When used to predict violent offending, risk assessment tools produced low to moderate positive predictive values (median 41%, interquartile range 27-60%) and higher negative predictive values (91%, 81-95%), and a corresponding median number needed to detain of 2 (2-4) and number safely discharged of 10 (4-18). Instruments designed to predict violent offending performed better than those aimed at predicting sexual or general crime.
Conclusions Although risk assessment tools are widely used in clinical and criminal justice settings, their predictive accuracy varies depending on how they are used. They seem to identify low risk individuals with high levels of accuracy, but their use as sole determinants of detention, sentencing, and release is not supported by the current evidence. Further research is needed to examine their contribution to treatment and management.
With the increasing recognition of the public health importance of violence,1 2 the prediction of violence, or violence risk assessment, has been the subject of considerable clinical and research interest. Since the late 1980s, such assessment has mostly been conducted by structured instruments after several studies found unstructured clinical opinion to have little evidence in support.3 Recent surveys have estimated that over 60% of general psychiatric patients are routinely assessed for violence risk,4 rising to above 80% in forensic psychiatric hospitals.5
The widespread use of these tools has been partly driven by public concern about the safety of mentally ill patients,6 research evidence that severe mental illness is associated with violence,7 8 9 and clinical practice guidelines in some countries, including the United Kingdom and United States,10 11 recommending violence risk assessment for all patients with schizophrenia. Furthermore, criminal justice systems in many countries have welcomed the use of risk assessment to assist sentencing and release decisions. Risk assessment has been used to inform indeterminate sentencing in the UK,12 and has become a largely uncontested part of an expanded criminal justice process in the US.13 Furthermore, a 2004 survey reported that of the 32 US states where parole is an option, 23 had used such instruments as part of these decisions.14
The current group of risk assessment tools either provide a probabilistic estimate of violence risk in a specified time period (actuarial instruments), or allow for a professional judgment to be made on risk level (for example, low, moderate, or high) after taking into account the presence or absence of a predetermined set of factors (structured clinical judgment instruments). Over 150 of these structured measures currently exist,15 and are starting to be implemented in low and middle income countries.16 17
However, these tools are time consuming and resource intensive, typically taking many hours to complete by a multidisciplinary group of professionals.18 They can also be expensive; training is required for most tools, and payment is often needed for individual use. Further, and more importantly, the instruments’ predictive accuracy remains a source of considerable uncertainty, with some reviews recommending their use in clinical and correctional settings and others finding that they lead to an unacceptably high number of false positive decisions.18 19 20 21 22 Expert opinion is equally divided.23 24 25
We have therefore conducted a systematic review and meta-analysis of the predictive accuracy of the most commonly used risk assessment instruments. To consistently report outcomes for individual studies, we requested tabular data from primary authors. We have synthesised these data across a range of accuracy estimates, one of which was developed for the purposes of this review.
We followed the preferred reporting items for systematic reviews and meta-analyses statement.26
Risk assessment tools
We identified the nine most commonly used risk assessment tools using recent reviews27 28 29 and questionnaire surveys.30 31 Actuarial instruments included the Level of Service Inventory-Revised (LSI-R),32 the Psychopathy Checklist-Revised (PCL-R),33 34 the Sex Offender Risk Appraisal Guide (SORAG),35 36 the Static-99,37 38 and the Violence Risk Appraisal Guide (VRAG).35 36 Structured clinical judgment tools included the Historical, Clinical, Risk management-20 (HCR-20);39 40 the Sexual Violence Risk-20 (SVR-20);41 the Spousal Assault Risk Assessment (SARA);42 43 44 and the Structured Assessment of Violence Risk in Youth (SAVRY).45 46 We divided tools into three categories: those designed to predict violent offending (HCR-20, SARA, SAVRY, and VRAG), sexual offending (SORAG, Static-99, and SVR-20), and any criminal offending (LSI-R and PCL-R). Although the PCL-R was originally developed to diagnose psychopathic personality disorder, it has become widely used for risk assessment purposes, as numerous studies have found the PCL-R score to be statistically significantly associated with criminal and antisocial outcomes.47 Table 1 reports additional details of all the instruments. Although these instruments were mostly designed to predict the likelihood of offending, we included violent, sexual, and antisocial outcomes (based on clinical records and other measures) even if they did not lead to convictions. For the sake of consistency, however, we refer to all outcomes as offences.
A systematic search was conducted to identify studies that measured the predictive validity of the nine tools. We searched the following databases between 1 January 1995 and 1 January 2011 using acronyms and full names of the instruments as keywords: PsycINFO, Embase, Medline, and US National Criminal Justice Reference Service Abstracts. Additional studies were identified through references, annotated bibliographies, and correspondence with researchers in the field. Studies in all languages and unpublished investigations were considered for inclusion. We excluded studies if they measured the predictive validity of select scales of a measure, if instruments were coded retrospectively without blinding to outcomes, or if they were calibration studies for the actuarial tools (which may give inflated effects).48 When studies used overlapping samples, we used the sample with the largest number of participants to avoid double-counting. Using this search strategy, we identified 251 validation studies (web figure 1).
To be included in the meta-analysis, studies had to report rates of true positives, false positives, true negatives, and false negatives at a given cut-off score for the outcome which the instrument was designed to predict. A pilot study showed that different score thresholds were used to classify people as being at low, moderate, or high risk of future offending. We contacted study authors and asked them to complete a standardised form if tabular data using the cut-off scores recommended in the most recent version of an instrument’s manual were unavailable, or if the number of participants classified as low, moderate, or high risk was missing from a study of a structured clinical judgment tool. For publications in which multiple tools designed to predict the same outcome were tested on the same sample (eight studies), we requested tabular data for all outcomes but only included those for the tool with the fewest replication studies to increase the breadth of the findings. This procedure probably did not bias results, since χ2 tests of differences in proportions found no differences in rates of true and false positives and true and false negatives in the tabular data obtained for included and excluded tools from the same study with the same outcome.
Standardised outcome data were available in the manuscripts of 30 eligible studies (32 samples). We requested additional data from the authors of 174 studies (330 samples) and obtained data for 52 studies (62 samples). Accuracy estimates from 235 of the 268 samples for which we were unable to obtain data were converted to Cohen’s d using standard methods.49 50 51 The median d value produced by those samples for which we could not obtain data (0.67, interquartile range 0.45 to 0.87) was similar to that of the 94 obtained samples (0.74, 0.54 to 0.95) (web figure 2 shows distribution of effect sizes). In addition, the Hodges-Lehmann percentile difference,52 the median difference between all possible pairs of d values from the two groups, was small (0.01, 95% confidence interval 0.00 to 0.08). Finally, of the 82 studies for which tabular data were obtained, we were able to include information from 68 (73 samples; references available in web appendix), since the other 14 studies used instruments to predict outcomes other than those for which they were designed.
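The Hodges-Lehmann comparison described above can be illustrated with a minimal sketch. It takes the definition given in the text (the median of the differences between all possible pairs of d values from the two groups) and applies it to hypothetical values; the function name and the example numbers are assumptions for illustration, not data or code from the review, and the confidence interval is not computed.

```python
from itertools import product
from statistics import median


def hodges_lehmann_shift(group_a, group_b):
    """Median of all pairwise differences (a - b) between two groups."""
    differences = [a - b for a, b in product(group_a, group_b)]
    return median(differences)


# Hypothetical Cohen's d values, NOT data from the review.
obtained = [0.54, 0.74, 0.95]
not_obtained = [0.45, 0.67, 0.87]

shift = hodges_lehmann_shift(obtained, not_obtained)
print(f"Hodges-Lehmann shift: {shift:.2f}")
```

A shift close to zero, as the review reports, would suggest that samples with and without obtainable tabular data produced similar effect sizes.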
We followed the current guidance provided by the Cochrane collaboration for systematic reviews of diagnostic test accuracy.53 The statistical methods for such reviews focus on two statistical measures of diagnostic accuracy of the test: sensitivity (the proportion of offenders who a risk assessment tool predicted would offend) and specificity (the proportion of non-offenders who a risk assessment tool predicted would not offend). The aim of the analysis was to quantify and compare these statistics as well as the error rates (false positive and false negative diagnoses) for each type of test. The required analysis is a bivariate analysis of sensitivity and specificity for each study accounting for correlation between sensitivities and specificities.54 The resulting model without covariates is a different parameterisation of the hierarchical summary receiver operating characteristic model.55 We used summary receiver operating characteristic plots to display the results of each study in receiver operating characteristic space, plotting each study as a single sensitivity-specificity point. Parameter estimates from the bivariate model produced a summary receiver operating characteristic curve with a summary operating point (that is, summary values for sensitivity and specificity), 95% confidence region, and 95% prediction region. We used the summary point from each curve to calculate the summary diagnostic odds ratio and both the sensitivity and specificity, each with 95% confidence intervals.
Since binary test outcomes are defined on the basis of a cut-off value for test positivity, we chose these values a priori. Risk assessment tools are predominantly used in clinical situations as instruments for identifying higher risk individuals;19 we therefore combined participants who were classified as being at moderate or high risk for future offending and compared them with those classified as low risk. We did secondary analyses by comparing participants classified as high risk with those classified as low or moderate risk, an approach consistent with screening, and also by completely excluding those classified as moderate risk.
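The three ways of dichotomising risk categories described above can be sketched as follows. The function, the scheme labels, and the counts are hypothetical and serve only to show how one three-category sample yields three different 2×2 tables.

```python
def collapse_to_2x2(counts, scheme):
    """Collapse low/moderate/high risk counts into (tp, fp, fn, tn).

    counts maps each risk level to (offended, did_not_offend).
    """
    low, moderate, high = counts["low"], counts["moderate"], counts["high"]
    if scheme == "moderate_or_high_vs_low":    # primary analysis
        positive, negative = [moderate, high], [low]
    elif scheme == "high_vs_low_or_moderate":  # screening approach
        positive, negative = [high], [low, moderate]
    elif scheme == "high_vs_low":              # moderate risk excluded
        positive, negative = [high], [low]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    tp = sum(offended for offended, _ in positive)
    fp = sum(not_offended for _, not_offended in positive)
    fn = sum(offended for offended, _ in negative)
    tn = sum(not_offended for _, not_offended in negative)
    return tp, fp, fn, tn


# Hypothetical sample, NOT data from the review.
sample = {"low": (5, 45), "moderate": (15, 35), "high": (30, 20)}

for scheme in ("moderate_or_high_vs_low",
               "high_vs_low_or_moderate",
               "high_vs_low"):
    print(scheme, collapse_to_2x2(sample, scheme))
```

In this made-up sample, moving the moderate group from the positive to the negative side lowers the count of true positives and raises the count of true negatives, which is the direction of change the secondary analyses would be expected to show for sensitivity and specificity.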
We used a range of accuracy estimates to report on the predictive validity of the risk assessment tools. Firstly, the summary operating point was used to estimate the summary diagnostic odds ratio and both sensitivity and specificity. We obtained estimates for the area under the curve, positive predictive value, negative predictive value, number needed to detain, and number safely discharged from the individual sample estimates.
The diagnostic odds ratio is the ratio of the odds of a positive test result in an offender relative to the odds of a positive result in a non-offender, and is recommended for use with diagnostic instruments.56 The area under the curve is an index of sensitivity and specificity across score thresholds, and is currently considered the accuracy estimate of choice in violence risk assessment when measuring predictive accuracy.57 Neither the diagnostic odds ratio nor the area under the curve is affected by base rates of offending. The positive predictive value is the proportion of participants classified as at risk who go on to offend, whereas the negative predictive value refers to the proportion of those classified as not at risk who do not go on to offend. The number needed to detain is the number of people judged to be at risk who would need to be detained to prevent one incident of subsequent violence.19 58 This outcome allows some quantification of the implications of using risk assessment tools to make detention decisions. Finally, the number safely discharged is a new performance statistic that we developed for the purposes of this review. This accuracy estimate calculates the number of participants judged to be at low risk who could be discharged into the community before a single act of violence occurs (1÷[1−negative predictive value]−1). As a complement to the number needed to detain, the number safely discharged allows researchers to quantify the implications of relying on a risk instrument to make discharge or release decisions.
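The single-sample accuracy estimates defined above can be computed from one 2×2 contingency table, as in the sketch below. The formula for the number safely discharged is the one given in the text; the number needed to detain is taken here as the reciprocal of the positive predictive value, which is an assumption because the text defines it only in words. The function name and the counts are hypothetical. The area under the curve is omitted because it needs scores across thresholds rather than a single table.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Accuracy estimates from one 2x2 table (all cells must be non-zero)."""
    sensitivity = tp / (tp + fn)   # offenders predicted to offend
    specificity = tn / (tn + fp)   # non-offenders predicted not to offend
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "diagnostic_odds_ratio": (tp * tn) / (fp * fn),
        # Assumption: reciprocal of the PPV (formula not stated in the text).
        "number_needed_to_detain": 1 / ppv,
        # Formula from the text: 1 / (1 - NPV) - 1, which equals tn / fn.
        "number_safely_discharged": 1 / (1 - npv) - 1,
    }


# Hypothetical counts, NOT data from the review.
measures = accuracy_measures(tp=45, fp=55, fn=5, tn=45)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts, 45 of 50 low risk predictions are correct, so roughly nine people are discharged without incident for each one who goes on to offend.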
Tests of assumptions
Standard meta-analytic pooling assumptions were met for diagnostic odds ratios and both sensitivity and specificity.59 60 Since there was a significant correlation between the sensitivities and specificities produced by the samples in each class of risk assessment tools, pooling assumptions for areas under the curve were not met.60 In addition, because the median base rate of offending within each class of tools varied considerably (violence 32.0%, interquartile range 22.2-46.6%; sexual 16.9%, 7.4-28.2%; criminal 28.4%, 20.7-46.0%), base rate dependent statistics were not pooled (such as the positive and negative predictive values and both the number needed to detain and the number safely discharged), and medians with interquartile ranges were calculated.
Investigation of heterogeneity
The standard Q and I2 statistics61 do not account for heterogeneity explained by phenomena such as positivity threshold effects, and the numerical estimates of the random effect terms in the bivariate regression are not easily interpreted. Therefore, the magnitude of observed heterogeneity in meta-analyses of diagnostic accuracy is instead best determined by the scatter of points in the summary receiver operating characteristic plot and from the prediction ellipse.53 In particular, the prediction region depicts a region within which, assuming the model is correct, we have 95% confidence that the true sensitivity and specificity of a future study should lie.53
Since the diagnostic odds ratios met pooling assumptions, we used random effects metaregression to investigate sources of heterogeneity between studies in sample diagnostic odds ratios for each class of tools. Metaregression investigates the relation between accuracy estimates and dichotomous or continuous sample or study characteristics.62 We formally explored the moderating role of the following variables: sex (proportion of sample that was male), ethnicity (proportion of sample that was white), mean participant age, type of instrument (actuarial v structured clinical judgment), temporal design (prospective v retrospective), setting in which assessment was conducted (correctional, forensic psychiatric, general psychiatric, or mixture), location of offending outcome (community only v inside institution or other), mean length of follow-up (months), sample size, and publication status (published in peer reviewed journal v not). We also conducted subgroup analyses using the bivariate models on these variables. Detailed examination of the overall differences between individual instruments has been reported in a subset of the samples.63 We did all analyses in Stata 10.264 using the metandi (for bivariate model meta-analysis), metan (random effects meta-analysis), and metareg (metaregression) commands.
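For readers unfamiliar with random effects pooling, the sketch below shows one common approach: pooling log diagnostic odds ratios with the DerSimonian-Laird estimator of between-study variance. It is an illustration in plain Python under stated assumptions, not the analysis the review ran in Stata; it does not include covariates (so it is not a metaregression) and it is not the bivariate model. All function names and counts are hypothetical.

```python
import math


def log_dor_with_variance(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its approximate variance."""
    log_dor = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return log_dor, variance


def pool_random_effects(estimates, variances):
    """DerSimonian-Laird pooled estimate, standard error, and tau squared."""
    k = len(estimates)
    fixed_w = [1 / v for v in variances]
    sum_w = sum(fixed_w)
    fixed_mean = sum(w * y for w, y in zip(fixed_w, estimates)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed effect mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(fixed_w, estimates))
    c = sum_w - sum(w ** 2 for w in fixed_w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    random_w = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(random_w, estimates)) / sum(random_w)
    standard_error = math.sqrt(1 / sum(random_w))
    return pooled, standard_error, tau2


# Hypothetical 2x2 tables (tp, fp, fn, tn), NOT data from the review.
tables = [(45, 55, 5, 45), (30, 40, 10, 70), (20, 60, 4, 66)]
pairs = [log_dor_with_variance(*table) for table in tables]
pooled, se, tau2 = pool_random_effects([p[0] for p in pairs],
                                       [p[1] for p in pairs])
low, high = pooled - 1.96 * se, pooled + 1.96 * se
print(f"Pooled DOR {math.exp(pooled):.1f} "
      f"(95% CI {math.exp(low):.1f} to {math.exp(high):.1f})")
```

Pooling on the log scale and exponentiating afterwards keeps the confidence interval positive and asymmetric around the odds ratio, which is how summary odds ratios are usually reported.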
We collected information for 24 847 participants in 73 samples from 68 independent studies (table 2). Standardised outcome information from 43 of the samples (14 798 (59.6%) participants) was not reported in manuscripts and was obtained directly from study authors. Of 24 847 participants, 5879 (23.7%) offended over an average of 49.6 months (standard deviation 40.5). Studies were conducted in 13 countries: Austria, Belgium, Canada, Denmark, Finland, Germany, the Netherlands, New Zealand, Serbia, Spain, Sweden, the UK, and the US.
We found differences in estimates of predictive accuracy depending on the type of risk assessment instrument (violence, sexual, or any criminal). Overall, based on diagnostic odds ratios, violence risk assessment tools performed best, and had higher positive predictive values than tools aimed at predicting sexual offending. Risk assessment instruments for violence and sexual offending produced high sensitivities and negative predictive values. In addition, risk assessment instruments for general offending had lower diagnostic odds ratios, areas under the curve, sensitivities, and negative predictive values and higher specificities and positive predictive values than the other two classes of instrument (table 3, figs 1-3).
For assessment instruments predicting the risk of violent outcomes, the summary diagnostic odds ratio was 6.1 (95% confidence interval 4.6 to 8.1) with moderate levels of heterogeneity (individual points moderately scattered in receiver operating characteristic space, fig 1) and a median area under the curve of 0.72 (interquartile range 0.68-0.78; table 3). Of those individuals who went on to violently offend, 92% (95% confidence interval 88% to 94%) had been classified as being at moderate or high risk of future violence (that is, sensitivity). Of those participants who did not go on to violently offend, 36% (28% to 44%) had been judged to be at low risk (that is, specificity). Of those predicted to violently offend, 41% did (interquartile range 27-60%; positive predictive value), which was equivalent to a median number needed to detain of two (two to four). Of those who were predicted not to violently offend, 91% did not (81-95%; negative predictive value), equivalent to a median number safely discharged of 10 (four to 18).
Similar findings were obtained when individuals judged to be at moderate risk were grouped with those judged to be at low risk for the secondary analyses, but with considerably higher specificities and lower sensitivities (web table 1). When moderate risk individuals were excluded from analyses, assessment tools for violence risk produced considerably larger summary diagnostic odds ratios (16.8, 10.8 to 26.3) and specificities (0.72, 0.63 to 0.80).
Investigation of heterogeneity
Since we saw moderate levels of heterogeneity for the instruments assessing violence risk and higher levels for instruments assessing sexual and general offending risk (scatter of points from the line being greater and the prediction ellipses larger), we did metaregression and subgroup analyses using the bivariate model to determine any possible explanations for this heterogeneity. These analyses found no evidence that sex, ethnicity, age, type of instrument, temporal design, assessment setting, location of offending outcome, length of follow-up, sample size, or publication status was associated with differences in predictive validity (web table 2). In addition, we have presented summary receiver operating characteristic curves for each type of instrument (web figures 3-5). Subtypes of tools performed similarly, lying within the 95% prediction region, with the possible exception of the SAVRY that produced higher levels of predictive accuracy than the other violence risk assessment instruments.
This systematic review and meta-analysis examined the predictive validity of violence risk assessment tools from 73 samples involving 24 847 individuals in 13 countries. Our principal finding was that there was heterogeneity in the performance of these measures depending on the purpose of the risk assessment. If used to inform treatment and management decisions, then these instruments performed moderately well in identifying those individuals at higher risk of violence and other forms of offending. However, if used as sole determinants of sentencing, and release or discharge decisions, these instruments are limited by their positive predictive values: 41% of people judged to be at moderate or high risk by violence risk assessment tools went on to violently offend, 23% of those judged to be at moderate or high risk by sexual risk assessment tools went on to sexually offend, and 52% of those judged to be at moderate or high risk by generic risk assessment tools went on to commit any offence. In samples with lower base rates than those that contributed to the review, such as in general psychiatry, positive predictive values will probably be even lower.25 However, negative predictive values were high, and suggest that these tools can effectively screen out individuals at low risk of future offending. Whether the cautious optimism13 that experts have described in relation to the ability to predict violence is justified will depend on the use to which these instruments are put.
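The dependence of predictive values on the base rate can be illustrated with Bayes' theorem. The sketch below applies the summary sensitivity (92%) and specificity (36%) reported for violence risk tools to the median base rate of violent offending in the review (32%) and to a lower, hypothetical base rate of 10%. This is an approximation for illustration only: the review's predictive values are medians across samples and were not derived this way, and the function name is an assumption.

```python
def predictive_values(sensitivity, specificity, base_rate):
    """Positive and negative predictive values for a given base rate."""
    true_pos = sensitivity * base_rate
    false_neg = (1 - sensitivity) * base_rate
    true_neg = specificity * (1 - base_rate)
    false_pos = (1 - specificity) * (1 - base_rate)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# 0.32 is the median base rate reported for violence samples;
# 0.10 is a hypothetical lower base rate chosen for illustration.
for base_rate in (0.32, 0.10):
    ppv, npv = predictive_values(0.92, 0.36, base_rate)
    print(f"base rate {base_rate:.0%}: PPV {ppv:.0%}, NPV {npv:.0%}")
```

Under these assumptions the positive predictive value falls from about 40% to about 14% as the base rate drops, while the negative predictive value rises, which is consistent with the point made above about samples with lower base rates.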
Comparisons with other medical tools
Any comparison of these risk assessment scores with other common medical diagnostic and prognostic tools poses several difficulties. Firstly, comparison with diagnostic tools is mostly inappropriate because risk assessment instruments attempt to predict the likelihood of a future outcome, whereas diagnostic instruments attempt to detect the presence of a current condition. Secondly, although it may be possible to compare performance statistics of these tools with those estimating, for example, cardiovascular risk, the implications of positive predictive values need to be considered in evaluating any comparisons. Violence risk assessment potentially leads to detention of individuals for longer than necessary, with its related economic,65 social,66 and civil rights consequences.67 By comparison with common medical prognostic tools, it is possible to argue that the predictive accuracy of violence risk assessment needs to be higher because of these consequences, which extend beyond the person to other people. On the other hand, it is precisely because of the risks to other people that low positive predictive values may not be as important as the ability of these instruments to predict those that are not at risk. Our introduction of a novel performance measure, the number safely discharged, could help quantify this in future research.
Despite these caveats, the areas under the curve found in this review (0.66 to 0.74) were not dissimilar to those found in studies examining scores from the most validated cardiovascular risk schemes in predicting cardiovascular disease events. Areas under the curve from the Framingham scoring system range from 0.57 to 0.86, the SCORE from 0.65 to 0.85, and QRISK from 0.76 to 0.79.68 Many of these studies report associations between predicted and observed risks,69 which may be helpful for future research in violence risk assessment. Finally, the standard by which these instruments are compared will differ depending on their setting. In forensic psychiatry, a more meaningful comparison will be with unstructured clinical judgment, and clinical trials are needed to test whether structured risk assessment reduces adverse outcomes.
One implication of these findings is that, even after 30 years of development, the view that violence, sexual, or criminal risk can be predicted in most cases is not evidence based. This message is important for the general public, media, and some administrations who may have unrealistic expectations of risk prediction for clinicians.70 This expectation is not as high in other medical specialties, in which the expectation that the doctor will identify the individual patient who will have an adverse event is not a primary issue whereas psychiatry, in many countries such as the UK, has developed a culture of inquiries.71
A second and related implication is that these tools are not sufficient on their own for the purposes of risk assessment. In some criminal justice systems, expert testimony commonly uses scores from these instruments in a simplistic way to estimate an individual’s risk of serious repeat offending.67 However, our review suggests that risk assessment tools in their current form can only be used to roughly classify individuals at the group level, and not to safely determine criminal prognosis in an individual case. This approach is mostly used in forensic psychiatry in the UK and other western countries, where these tools form part of a wider clinical assessment process. These instruments may also assist in developing risk management plans in selected high risk groups, as suggested by recent clinical guidelines in England and Wales.72 Furthermore, they are preferable to unstructured clinical judgment owing to their increased transparency and reliability.
Another implication is that actuarial instruments focusing on historical risk factors perform no better than tools based on clinical judgment, a finding contrary to some previous reviews.21 73 Finally, our review suggests that these instruments should be used differently. Since they had higher negative predictive values, one potential approach would be to use them to screen out low risk individuals. Researchers and policy makers could use the number safely discharged to determine the potential screening use of any particular tool, although its use could be limited for clinicians depending on the immediate and service consequences of false positives. A further caveat is that specificities were not high—therefore, although the decision maker can be confident that a person is truly low risk if screened out, when someone fails to be screened out as low risk, doctors cannot be certain that this person is not low risk. In other words, many individuals assessed as being at moderate or high risk could be, in fact, low risk. Ultimately, however, what constitutes an appropriate balance between the ethical implications of detaining people based on the predictive ability of these tools and the need for public protection will primarily be a political consideration.
Comparison with other studies
Previous meta-analyses on risk assessment have focused on comparing instruments with one another, or measuring how individual tools perform across sexes and ethnic groups.74 A systematic review published in 2001 examined the accuracy of violence risk assessment in high risk groups,19 and was based on 21 studies. It estimated that six people needed to be detained to prevent one violent offence, compared with our current review’s estimate of two people needing detention. This difference was despite the median base rate of violence being similar in both reviews (current review 32% (interquartile range 22-46%) v 2001 review 26% (15-41%)). Unlike the previous report, the present meta-analysis focused on structured assessment instruments and included both institutional and community samples. The current report reviewed more than three times as many studies as the 2001 review and a recent meta-analysis that only compared head to head investigations of tool use.75
Strengths and limitations
The strengths of the current review include the incorporation of new tabular data, the reporting of multiple accuracy estimates, and a meta-analysis using bivariate models. We received new tabular data for 14 798 people (60% of people included in the review), and hence have reported a considerable amount of new data. Finally, by using a range of accuracy estimates, we have attempted to minimise biases that may be associated with reporting only one of them.
Limitations include that we solely examined the predictive qualities of these risk assessment tools, and did not account for their potential role in informing management and reminding clinicians to enquire about potentially important prognostic and modifiable factors.76 In addition, we found moderate to high levels of heterogeneity. Heterogeneity was to be expected, in view of the different types of samples included in the primary studies (from prison, secure hospitals, and general psychiatric hospitals) and outcomes measured.77 78 We explored sources of heterogeneity and found no clear trends. Investigating heterogeneity in diagnostic odds ratios meant that incidence of the outcome was accounted for. One possible source of heterogeneity was the potential effects of intervention after a risk assessment, particularly in people deemed high risk. We compared diagnostic odds ratios between prospective and retrospective studies that would be expected, to some extent, to measure this, since high risk participants identified in prospective studies would probably have been enrolled in interventions designed to reduce violence risk. However, we found no differences in metaregression or subgroup analysis. Nevertheless, clinical trials are needed to test directly the possible effects of intervention. Although we tested for publication status and found no clear patterns, we cannot exclude the possibility that such bias could exist in the studies that we were unable to include. Registers of such investigations would assist future reviews.79 In addition, few samples reported on women and, thus, this review was underpowered to examine whether predictive validity differed between women and men.
What is already known on this topic
Instruments based on structured risk assessment predict antisocial behaviour more accurately than those based on unstructured clinical judgment
More than 100 such tools have been developed and are increasingly used in clinical and criminal justice settings
Considerable uncertainty exists about how these tools should be used and for whom
What this study adds
The current level of evidence is not sufficiently strong for definitive decisions on sentencing, parole, and release or discharge to be made solely using these tools
These tools appear to identify low risk individuals with high levels of accuracy, but have low to moderate positive predictive values
The extent to which these instruments improve clinical outcomes and reduce repeat offending needs further research
Cite this as: BMJ 2012;345:e4692
We thank the following study authors for providing tabular data for the analyses: April Beckmann, Sarah Beggs, Susanne Bengtson Pedersen, Klaus-Peter Dahle, Rebecca Dempster, Mairead Dolan, Kevin Douglas, Reinhard Eher, Jorge Folino, Monica Gammelgård, Robert Hare, Grant Harris, Leslie Helmus, Andreas Hill, Hilda Ho, Clive Hollin, Christopher Kelly, Drew Kingston, P. Randy Kropp, Michael Lacy, Calvin Langton, Henry Lodewijks, Jan Looman, Karin Arbach Lucioni, Jeremy Mills, Catrin Morrissey, Thierry Pham, Charlotte Rennie, Martin Rettenberger, Marnie Rice, Michael Seto, David Simourd, Gabrielle Sjöstedt, Jennifer Skeem, Robert Snowden, Cornelis Stadtland, David Thornton, Jodi Viljoen, Vivienne de Vogel, Zoe Walkington, and Glenn Walters.
Contributors: SF devised and coordinated the project, assisted in data acquisition and interpretation, and drafted and revised the manuscript. JPS assisted in data acquisition, performed the statistical analyses, assisted in interpreting results, and assisted in drafting and revising the report. HD assisted in statistical analysis and critically revised the manuscript for important intellectual content. MG assisted in interpreting results and critically revising the manuscript for important intellectual content. SF and JPS had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis, and will act as guarantors.
Funding: SF is funded by the Wellcome Trust.
Competing interests: All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: SF is funded by the Wellcome Trust; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: No ethics approval was sought because only secondary data were used.
Data sharing: No additional data available.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.