The association between exaggeration in health related science news and academic press releases: retrospective observational study

Objective To identify the source (press releases or news) of distortions, exaggerations, or changes to the main conclusions drawn from research that could potentially influence a reader’s health related behaviour. Design Retrospective quantitative content analysis. Setting Journal articles, press releases, and related news, with accompanying simulations. Sample Press releases (n=462) on biomedical and health related science issued by 20 leading UK universities in 2011, alongside their associated peer reviewed research papers and news stories (n=668). Main outcome measures Advice to readers to change behaviour, causal statements drawn from correlational research, and inference to humans from animal research that went beyond those in the associated peer reviewed papers. Results 40% (95% confidence interval 33% to 46%) of the press releases contained exaggerated advice, 33% (26% to 40%) contained exaggerated causal claims, and 36% (28% to 46%) contained exaggerated inference to humans from animal research. When press releases contained such exaggeration, 58% (95% confidence interval 48% to 68%), 81% (70% to 93%), and 86% (77% to 95%) of news stories, respectively, contained similar exaggeration, compared with exaggeration rates of 17% (10% to 24%), 18% (9% to 27%), and 10% (0% to 19%) in news when the press releases were not exaggerated. Odds ratios for each category of analysis were 6.5 (95% confidence interval 3.5 to 12), 20 (7.6 to 51), and 56 (15 to 211). At the same time, there was little evidence that exaggeration in press releases increased the uptake of news. Conclusions Exaggeration in news is strongly associated with exaggeration in press releases. Improving the accuracy of academic press releases could represent a key opportunity for reducing misleading health related news.

Advice. The PR, news stories and JA (abstract and discussion) were read for statements of implicit or explicit advice. Coding levels were: 0, No advice (including advice to researchers, for example to perform further study). 1, Implicit advice (e.g. "Eating chocolate might be beneficial for..."). 2, Explicit advice, but not to the reader or general public (e.g. "Doctors should advise patients to..."). 3, Explicit advice to the reader or general public (e.g. "Expectant mothers should...").
A range of examples of inflation is given below (note that not all would be considered inappropriate if they are just changes to the intended audience; our purpose is not to evaluate each inflation, but to find the source given that inflation is generally blamed on journalists). PR: Mothers who want to breastfeed should be given all the support they need (code 2); News: Mums should breastfeed for at least four months to avoid having naughty kids, experts now advise (code 3). PR: If these weather patterns continue, both forage and dairy management will have to adapt to maintain current milk quality (code 1); News: spend 9p extra a pint and save Daisy the Dairy Cow, in her straw hat (code 3). JA: the data we present add to growing justification to monitor the health of preterm men and women beyond infancy and childhood (code 1); PR: we need to monitor the health of premature babies beyond infancy and childhood (code 2). JA: These specific defects should be included in public health educational information to encourage more women to quit smoking (code 2); PR: women should quit smoking before becoming pregnant, or very early on, to reduce the chance of having a baby with a serious and lifelong physical defect (code 3). PR: It is possible that good nutrition during the first three years of life may encourage optimal brain growth (code 1); News: People should seek advice from a registered dietician, but simply it's a message of moderating fat intake, five fruit and veg a day and whole grain starchy foods (code 3). PR: Our findings support the concept of more widespread HIV testing (code 1); News: if you've been at risk for HIV, get tested now (code 3).
Causal statements from correlational research. For each PR and news story, the IV (or pseudo IV in correlational designs), DV and stated relationship between them (if any) were extracted from the main claims, which were operationalized as the title plus first two sentences in PRs and news. For the JA, main claims were defined within the abstract and discussion sections. If there were claims about more than one set of IV and DV in the PR or news, a second set was also coded and the same sets were identified in the JA, allowing us to test whether the findings for the main statements are replicated in the second statements (SI5).
In order to code the 6 levels of relationship statement consistently, we drew up a table of examples from the first stage of coding. These were:
0. No relationship stated (but could have been): the study must have contained at least two variables (IV and DV, or pseudo IV and DV) between which a relationship could have been stated; if there were not two suitable variables, the code 'not applicable' was used. 1. Statement of NO relationship/cause: e.g. 'no difference'; 'persists without'; 'does not result in'; 'no significant extra risk'; 'added no benefit'.
Where statements of different levels were made within the analysed segments of text, stronger statements trumped weaker ones. Separately, we also coded whether or not the statement of relationship was explicitly probabilistic; for example, 'correlated with the risk of...' (correlational probabilistic) or 'raises the chance of...' (causal probabilistic). Further probabilistic words/phrases included: 'likelihood'; 'makes more likely'; 'tendency'; 'rate'.
For the analysis of causal claims we focused on correlational research; we coded type of study design using 6 categories: Qualitative; Correlational cross-sectional; Correlational longitudinal; Intervention (not full RCT); Full randomised controlled trial (RCT); Modelling/Simulation. We did not detect any differences in the distribution of causal statement levels between cross-sectional and longitudinal correlational designs; therefore we grouped these together into a single correlational category for further analysis. We did not analyse qualitative, interventional or simulation studies further. We also checked whether the IVs and DVs themselves were distorted, changed or generalised in the progression of claims. We found that this happened in PRs for IVs in only 11/573 samples, and for DVs in only 6/554; similarly, in news, for IVs in only 21/726 and for DVs in only 11/740.
A range of examples of inflation follows. JA: This observational study found significant associations between use of antidepressant drugs and several severe adverse outcomes in people aged 65 and older with depression (code 2); PR: New antidepressants increase risks for elderly (code 6). JA: Reported flooding experiences had a significant relationship with perceptions relating to climate change (code 2); PR: Direct experience of extreme weather events increases concern about climate change (code 6). JA: A brief TCBT or exercise program was associated with substantial, significant, clinically meaningful improvements in self-rated global health (code 2); PR: Talking therapy over the phone improves symptoms of chronic widespread pain (code 6). JA: deregulation of a single kinase in two distinct cellular compartments... is intricately linked to implantation failure and miscarriage (code 3); News: The protein SGK1 in the lining of the womb makes it harder to get pregnant (code 6). JA: bisphosphonate use is associated with a significantly lower rate of revision surgery of up to about 50% ... in patients without a previous fracture (code 2); News: Bisphosphonates 'extend hip replacement life' (code 6). JA: human orbital volume significantly increases with absolute latitude (code 2); News: … gives you a bigger brain (code 6). JA: ... association between RXRA chr9:136355885+ methylation and mother's carbohydrate intake (code 2); PR: During pregnancy, a mother's diet can alter the function of her child's DNA (code 5).
Human conclusions from non-human studies. We coded the explicit or implicit study sample, population type or experimental participants of the main claims in JA, PR and news. We used the same code to separately identify the actual sample, population type or experimental participants of the study. If there was more than one type (e.g. rodent and human) in a JA, it was excluded from the analysis of human inference from non-human studies. For example: JA: expression is restricted to presumptive molar mesenchyme and throughout tooth development to molar mesenchyme cells (code 4); PR: Researchers have uncovered a novel mechanism they have termed 'developmental stalling', that might explain how errors in the development of human embryos are naturally corrected to prevent birth defects (code 2).

Caveats and justifications.
For each section, we searched the whole PR and news stories for any caveats stated for the advice or claims (e.g. "This is a population study. It can't say definitively that sugary drinks raise your blood pressure, but it's one piece of the evidence in a jigsaw puzzle"; "The scientists who carried out the study emphasised that they could not say for certain..."). Similarly, we searched for justifications of the advice or claims (e.g. "even after taking into account the effect of extra bodyweight on blood pressure, there was still a significant link with sweetened drinks").
Study facts and quotes. We also coded various facts about the study and PR, including sample size, duration, completion rate and the source of quotes. These are analysed in section SI11 (Indicators of news sources).

SI3. Inter-rater reliability.
We double-coded 27% of PRs and associated JAs, and 21% of news stories. This difference arose because the PRs randomly selected for double coding had a lower than average number of news stories. Inter-rater concordance was 90.5% (κ=0.87) for cells relevant to the advice analysis, 86.3% (κ=0.84) for cells in the analysis of causal claims, and 94.4% (κ=0.93) for cells analysed for human inference from non-human research. We analysed the distribution of coding disagreements where they arose in the double-coded samples (i.e. whether each disagreement was between a code 1 and 2, or between a code 2 and 3, etc.). Then, within each round of the simulations in section SI7, 10% of the samples were by chance changed to another code in line with the observed distribution of coding disagreements in the double-coded samples. This had a negligible effect on our results.
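The concordance and κ figures above follow the standard Cohen's kappa calculation from two coders' label vectors. A minimal stdlib sketch (the coder labels below are toy data for illustration, not the study's actual codes):

```python
from collections import Counter

def percent_agreement(a, b):
    """Raw concordance: share of items both coders labelled identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement (Cohen's kappa) between two coders."""
    n = len(a)
    po = percent_agreement(a, b)  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected chance agreement from each coder's marginal label frequencies
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n ** 2
    return (po - pe) / (1 - pe)

# toy example: two coders rating 8 items on a 0-3 advice scale
coder1 = [0, 1, 2, 3, 0, 1, 2, 3]
coder2 = [0, 1, 2, 3, 0, 1, 2, 2]
```

Here `percent_agreement(coder1, coder2)` is 0.875, and kappa discounts the agreement expected from the marginal label distributions alone.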

SI4. Association between advice, causal statements and human inference.
Of the studies contributing to the analysis of advice, 110 were included in the analysis of causal claims from correlation, while 19 were non-human studies included in the human inference analysis. There were only 14 studies that were both non-human and correlational. Thus, while the analyses of advice and causation share many PRs, JAs and news stories, the analysis of non-human studies is based on a largely independent sample of PRs, JAs and news.
Within the 110 correlational studies included in the advice analysis (because some level of advice was offered somewhere in JA, PR or news), the level of advice was not correlated with the level of causal claim within JA, PR or news (r=0.02, 0.05 and -0.003, respectively). Within the 19 non-human studies included in the advice analysis, the level of advice was not significantly associated with the level of human inference (r=0.07, p=0.78; r=0.29, p=0.23; r=-0.29, p=0.12; although note that N in this analysis is small).

SI5. Secondary Statements (i.e. about a second set of variables in correlational studies)
For the secondary statements 25% (95% CI: 18-34%) were more strongly deterministic than those present in the associated JA. The odds of exaggerated statements in news were 36 times higher (OR=36, 95% CI: 7.8-148) when PR statements were exaggerated (83%, 95% CI 65-100%) than when the PR was not exaggerated (12%, 95% CI: 3.2-22%; difference=70%, 95% CI: 51-90%). Thus while secondary statements tended to be exaggerated less often (presumably because they are not the leading eye-catching statement), the association between exaggeration in PR and news is still very strong, replicating the results for main statements.
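Odds ratios of this kind can be computed directly from the 2×2 counts; one common interval is the Woolf (log-scale) approximation. A minimal sketch (the cell counts in the usage example are illustrative only, and the paper's exact interval method may differ, e.g. bootstrap-based):

```python
import math

def odds_ratio_ci(a, b, c, d):
    """Odds ratio for a 2x2 table [[a, b], [c, d]] with a Woolf 95% CI.

    Rows: PR exaggerated / not; columns: news exaggerated / not.
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se_log)
    hi = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lo, hi

# illustrative counts only (not the study's data)
or_, lo, hi = odds_ratio_ci(10, 10, 10, 10)
```

With equal cell counts the OR is exactly 1 and the interval is symmetric on the log scale (so lo × hi = 1).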
For rates of news uptake, 44/76 (58%) PRs without exaggeration had news uptake vs 13/26 (50%) PRs with exaggerated claims (bootstrapped 95% confidence intervals of the difference are -30% to +15%). Non-exaggerated secondary causal claims were associated with 3.0 news stories per PR, while exaggerated causal claims were associated with 2.2 news stories per PR (confidence intervals of the difference are -1.8 to +0.3).
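A bootstrapped confidence interval for such a difference in uptake proportions, with the PR as the resampling unit, can be sketched as follows (the 44/76 and 13/26 counts are taken from the text above; the procedure is a plain percentile bootstrap, which may differ in detail from the authors' implementation):

```python
import random

def bootstrap_diff_ci(group_a, group_b, n_boot=10000, seed=1):
    """Percentile 95% bootstrap CI for the difference in proportions (a - b).

    Each group is a list of 0/1 uptake outcomes, one entry per PR;
    PRs are resampled with replacement within each group.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = [rng.choice(group_a) for _ in group_a]
        b = [rng.choice(group_b) for _ in group_b]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# 13/26 exaggerated PRs with uptake vs 44/76 non-exaggerated PRs with uptake
exag = [1] * 13 + [0] * 13
no_exag = [1] * 44 + [0] * 32
lo, hi = bootstrap_diff_ci(exag, no_exag)
```

Because the interval straddles zero, the difference in uptake between exaggerated and non-exaggerated PRs is not reliable, matching the conclusion above.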

SI6. Breakdown of PR exaggeration for exaggerated news
In the main analysis we categorized news and PR as exaggerated or not relative to the JA. This simple categorization did not distinguish between PRs that are exaggerated to the same extent as news and PRs that are exaggerated a bit, while the news is exaggerated further. In fact the latter case was relatively rare, and the most common scenario was for an identical level of exaggeration in PR and news. In the cases where news went beyond what was written in the JA, Figure S1 shows the proportions of cases when the associated PR contained no exaggeration relative to the JA (left solid bars in each plot, labeled PR ≤ JA) or when the PR did contain exaggeration relative to the JA (the three rightward solid bars in each plot, labeled PR>JA). Within the cases where the PR went beyond the JA (PR>JA), we plot the proportions when the news was further exaggerated from the PR (N>PR), when the news had equivalent statements to the PR (N=PR) and when the news was deflated again from the PR (N<PR, but remember news is still inflated relative to JA in order to qualify for this analysis). The key results are that we consistently found the largest category to be 'N=PR'; in other words, when the news was inflated relative to the JA, the most likely scenario for the PR was that it said the same as the news.
By adding this category (N=PR) to cases where the PR was even more inflated than the news (N<PR), we find that the PR was at least as inflated as the news in 70% (advice), 48% (causal claims from correlation) and 75% (human inference from non-humans) of cases. Then, by adding in the cases where there was some inflation from JA to PR followed by further inflation from PR to news (second bar, N>PR and PR>JA), we obtain the overall rates of inflation occurring between JA and PR (78%, 75%, 90%). On the other hand, the inflation occurring between PR and news (30%, 52%, 25%) can be obtained by adding the two leftward columns: PR ≤ JA (remember, all cases in this analysis have inflation from JA to news) and N>PR. Thus the rate of inflation between JA and PR consistently outweighs the rate of inflation between PR and news.
Figure S1. PR content where news contained exaggerated statements relative to the JA (A. Advice; B. Causal claims from correlation; C. Human inference from non-humans; N=131, 173, 49, respectively). In each plot, the left bars (PR≤JA) indicate the cases where the PR contained nothing stronger than the JA. The other bars (PR>JA) indicate the cases where the PR contained inflated advice or statements relative to the JA, in which case there could be further inflation in the news (N>PR), the same strength in news and PR (N=PR) or, occasionally, deflation from PR to news (N<PR). Error bars show bootstrap-estimated 95% confidence intervals (the bootstrapping preserved the clustering structure of news to PR). The consistently most frequent situation in each analysis (A-C) was that the PR and news were equivalent, occurring much more often than predicted by chance (dotted bars and associated error bars; advice, p<0.001; causal claims, p<0.001; human inference, p<0.001), estimated by simulating how often the observed distributions of coded levels in PR and news, if written independently, would produce each category plotted (see SI7). Adding the two rightmost bars together gives the proportion of cases where the PR was at least as inflated as the news (70%, 48% and 75% for A, B and C), while adding all three PR>JA bars together gives the proportion of occasions on which there was some degree of inflation in the step from JA to PR (78%, 75%, 90%). For comparison, adding the two left bars together gives the total proportion of cases where there was some inflation from PR to news (30%, 52%, 25%). Table S1 presents the distributions of coded advice levels for each category of outlet.
We simulated the expected number of times that chance selection from these distributions would lead to the four categories displayed in Figure S1: no inflation in PR but inflation in news; PR inflated from journal article and news inflated further; the same level of inflation in both PR and news; and news inflated relative to journal article but deflated relative to PR. For each of 10000 iterations, the JAs, PRs and news were randomly reordered with respect to each other, but preserving the distributions shown in Table S1 and the clustering structure of news to PRs (i.e. that more than one news article can come from the same PR), and the analysis was rerun to categorize the inflation level in PR when there was inflation in news, just as for the analysis of the actual data in Figure S1. We also incorporated an estimate for the effect of coding dis-concordance (see SI3). To do this, we analyzed the distribution of coding disagreements where they arose in the double-coded samples (i.e. whether each disagreement was between a code 1 and 2, or between a code 2 and 3 etc). Then within each round of the simulation 10% of the samples were by chance changed to another code in line with the observed distribution of coding disagreement in the double coded samples. Adding this effect of coding dis-concordance had a negligible effect on results.
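One iteration of such a permutation round can be sketched as follows. This is a simplified illustration of the procedure described above (the function name and toy inputs are hypothetical; the real analysis also applied the 10% coding-disagreement perturbation and ran 10000 rounds):

```python
import random

def one_permutation_round(ja_codes, pr_codes, news_clusters, rng):
    """Shuffle JA, PR and news codes with respect to each other, preserving
    each marginal distribution and the news-to-PR clustering
    (news_clusters[i] is the list of news codes tied to one PR).
    Returns counts of the four Figure S1 categories among news items
    that are inflated relative to the JA."""
    ja, pr, clusters = ja_codes[:], pr_codes[:], news_clusters[:]
    rng.shuffle(ja)
    rng.shuffle(pr)
    rng.shuffle(clusters)  # whole clusters move together, sizes intact
    counts = {"PR<=JA": 0, "N>PR": 0, "N=PR": 0, "N<PR": 0}
    for j, p, cluster in zip(ja, pr, clusters):
        for n in cluster:
            if n <= j:
                continue  # only news inflated relative to JA qualifies
            if p <= j:
                counts["PR<=JA"] += 1
            elif n > p:
                counts["N>PR"] += 1
            elif n == p:
                counts["N=PR"] += 1
            else:
                counts["N<PR"] += 1
    return counts

# toy data: two PRs with identical JA and PR codes, so the result is
# the same whatever order the shuffle produces
rng = random.Random(0)
counts = one_permutation_round([0, 0], [1, 1], [[2], [1]], rng)
```

Repeating this over many rounds yields the chance expectation for each bar in Figure S1, against which the observed N=PR frequency is compared.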

SI7. Permutation simulation of chance associations
Similarly for correlational/causal claims and for human/non-human claims, we performed equivalent simulations based on the distributions of each statement level found in each outlet (Tables S2 and S3), the clustering structure of news to PR and the observed coding disagreement distributions in the double coded samples.
Note that comparing the actual number of cases where PR=news to these permutation analyses is conservative, since the simulations are likely to overestimate the chance expectation of PR=news. This is because they are based on distributions for each outlet that are not, in fact, independent. If they were independent, the similarity between the distributions would likely be reduced, and this in turn would reduce the estimate of the associations that would occur by chance. In the extreme of non-independence, where most news stories copied a restricted range of phrases from the PR, the estimated chance overlap would be very high owing to the paucity of potential alternative options for the random sampling. In other words, since the occurrence of coding levels is not evenly distributed, as the real overlap between PR and news becomes larger, this simulation approach stacks the cards against finding differences between the data and the simulation. Thus we can be confident that, where a statistically significant difference between the data and the chance simulations is detected despite this bias, the difference is meaningful.

SI8. Predictors of news uptake
As shown in Figure 3 (main manuscript), inflation of advice, causal claims or human inference was not reliably associated with a higher proportion of PRs attracting news or a higher mean number of news stories per PR. That analysis compared inflated with non-inflated statements irrespective of the actual coded level of those statements (many strong statements are not inflated because they are also contained in the JA). While inflation was our main interest, we can also analyse whether the coded level of PR statements alone (irrespective of whether they were inflated relative to the JA) was associated with news uptake. The proportion of PRs with news appeared to be about 15% greater where explicit advice was present, though this was not statistically significant even without correction for multiple comparisons (χ2(3)=6.1, p=0.11). There was even less indication that the proportion of PRs with news was predicted by the strength of main causal claims (χ2(6)=2.6, p=0.86), or human inference (χ2(2)=3. . There was no indication of an increase with stronger causal claims (F(6,207)=0.7, p=0.7 uncorrected). Thus overall there is some suggestion, as would be expected, that stronger advice with relevance to humans attracted more news coverage, but these effects were, perhaps surprisingly, not strong enough to be clearly significant. Table S4 shows the number of PRs from each university included in our analyses of advice, causal claims and human inference (without double counting those included for more than one analysis), as well as the percentage of claims that were un-inflated and the percentage that had news uptake. Owing to the low numbers once broken down by university, advice, main causal claims, secondary causal claims and human inference were added together to form one category of 'statements', which is why N statements differs from N PR.
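The χ2 comparisons above are Pearson tests on an R×C contingency table (uptake vs no uptake, across statement levels). The statistic itself needs no library; a minimal stdlib sketch (the counts in the usage example are made up for illustration, not the study's data):

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for an R x C contingency table
    given as a list of rows of observed counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# hypothetical 2x4 table: PRs with/without news uptake at advice levels 0-3
stat = chi_square_stat([[20, 25, 30, 35],
                        [30, 25, 20, 15]])
```

The statistic is then compared against a χ2 distribution with (R-1)(C-1) degrees of freedom (3 for a 2×4 table) to obtain the p value, e.g. via `scipy.stats.chi2.sf`.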
Note that the percentage of statements without inflation is given (rather than with inflation) to allow straightforward multiplication with % news uptake in the combined score that estimates the % of non-inflated PRs attracting news. The table is ordered by % uninflated PR claims in order to illustrate the lack of any correlation with % news uptake (r=-0.13). Note that in 10 cases identical press releases were issued by two universities on the same research; these are included here for each university, but they were not double counted in Figure 1, or in the analyses for Figures 2 and 3.

SI9. Comparison between universities
The ranks of inflation and uptake are also shown, as well as a rank for the estimated % of non-inflated PRs attracting news. However, it is important to note the relatively large confidence intervals on these ranks. Ranks alone make it difficult for the reader to discern whether a rank order is clear-cut or largely driven by small differences and random variation. We estimated the confidence intervals using the following procedure: for Birmingham, we drew [Birmingham N] times with replacement from the pool of [Birmingham N] relevant Birmingham PRs and calculated the percent inflation and uptake; then for Bristol we drew [Bristol N] times with replacement from the pool of [Bristol N] relevant Bristol PRs and calculated percent inflation and uptake; and so on for each university. Rank orders for inflation, uptake and combined scores were then found to produce a table for that round of resampling. This procedure was repeated 100000 times to create 100000 tables, from which the 95% confidence intervals for inflation rank, uptake rank and overall rank were estimated. The CIs are generally wide, partly because of the low N for some universities.
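The resampling procedure above can be sketched as follows for the inflation ranks alone (function and variable names are hypothetical, and the toy input uses two extreme universities so the ranks are fixed; the real analysis ranked uptake and combined scores in the same rounds):

```python
import random

def bootstrap_rank_cis(uni_prs, n_boot=1000, seed=0):
    """Percentile 95% CIs on each university's inflation rank.

    uni_prs maps university name -> list of 0/1 inflation flags, one per PR.
    Within each bootstrap round, each university's PRs are resampled with
    replacement (N draws from its own pool of N PRs), rates are recomputed,
    and universities are re-ranked (rank 1 = lowest inflation rate).
    """
    rng = random.Random(seed)
    names = sorted(uni_prs)
    ranks = {u: [] for u in names}
    for _ in range(n_boot):
        rates = {}
        for u in names:
            prs = uni_prs[u]
            sample = [rng.choice(prs) for _ in prs]
            rates[u] = sum(sample) / len(sample)
        for r, u in enumerate(sorted(names, key=lambda u: rates[u]), start=1):
            ranks[u].append(r)
    cis = {}
    for u in names:
        rs = sorted(ranks[u])
        cis[u] = (rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot)])
    return cis

# toy data: one university with no inflated PRs, one with all inflated
cis = bootstrap_rank_cis({"UniA": [0] * 10, "UniB": [1] * 10}, n_boot=200)
```

With realistic data the resampled rates overlap across universities, so the rank CIs widen considerably, which is the point made in the text.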
SI10. Caveats
In the case of animal research, one possible reason for PRs to generalise to humans might be to avoid advertising animal research facilities. Of the relevant PRs, 42% did not provide any relevant caveats, and 90% of those about animal or laboratory studies lacked caveats about extrapolating to humans.

SI11. Indicators of news sources.
To estimate the relative importance of PR as the main source for the science stories in our sample, independently from the factors analysed for our main questions, we used dates of release, quotes and study details. Note that these estimates do not necessarily reflect all science news, given that the news stories in our sample were purposely selected to be on the same studies as those in our PR materials.
Dates. For selecting news stories, we used a criterion of release date being within 30 days of the PR. In fact, 580/668 (87%) of news stories were released within a day of the PR release date.
Quotes. We coded up to four quoted sources in news stories. Of the 668 analyzed news stories, 592 (89%) had quotes; 427 (72%) of these stories contained quotes identical to those included in the PR; 263 (44%) had alternative or additional quotes from the authors of the associated peer-reviewed journal article; 29 (5%) contained quotes identical to text in the journal article; 50 (8%) had quotes from other sources (e.g. funders) related to the research; and 179 (30%) had quotes from independent scientists or 'experts'.
Study details. We coded whether and how accurately/precisely each PR and news story reported sample size (N), completion rate, length of study, and number of time points for longitudinal studies.
Of these data, we asked how often news stories provided details that were not contained in the associated PR (i.e. as evidence that the journalist used a source additional to the PR).

SI12. Scientist doublethink
While instigating the main study, we performed an online survey of scientists' attitudes toward science in the media and their experiences with PR. We advertised the survey via the Guardian, the BBSRC, and social media. The sample is self-selected, and likely biased towards pre-existing interest in the topic of science news and by the subject area distribution of our advertising routes. As expected, the respondents (N=248) blamed journalists more than any other party for misreporting in science news. However, 79% of scientists who had PRs about their work reported involvement with those PRs, and despite this involvement 32% acknowledged that their PRs were exaggerated (Figure S2). Thus it appears that some scientists are aware that PRs are a source of misreporting, but as a group we appear to engage in doublethink: colluding in producing exaggerated PRs while mainly blaming the media for the shortcomings of science news.
The key results were that 40% (N=43) of respondents with experience of PR (N=107) perceived that their most recent PR was exaggerated (Figure S2A). Unsurprisingly, this proportion decreased with greater levels of declared involvement in the preparation of PRs, but remained above 30% even for those scientists who reportedly wrote the PR themselves (Figure S2B). When asked who was responsible for erroneous science news (Figure S2C), 30-60% attributed some responsibility to scientists and press offices, which may reflect awareness of some PR exaggeration; however, 100% of respondents attributed responsibility to newspapers. The survey and accompanying data can be downloaded from http://dx.doi.org/10.6084/m9.figshare.903704.

SI13. Comparison between news outlets and journalist type
Table S5 shows the rates of inflation from PR to news and from JA to news for different outlets. As in Table S4, advice, causal claims and human inference are combined to form one category of 'statements', which is why N statements differs from N PR. Note that the percentage of statements without inflation is given (rather than with inflation). Some news outlets had too few N to be included individually: The Mail on Sunday (N=1) and Mail Online (N=2) have been combined with The Daily Mail (N=89), The Sunday Sun (N=1) has been combined with The Sun (N=44), The Sunday Telegraph (N=1) has been combined with The Telegraph (N=80), and The Sunday Times (N=1) has been combined with The Times (N=31); The Daily Star, The Economist, the New Scientist, and the Press Association (each N<6) have been excluded. N statements differ slightly between the comparisons to PR and to JA because in some cases a comparison could not be made; for example, if the PR says nothing upon which to base a code for animal vs human, then human/animal claims in the news could not be compared with the PR.
The table is ordered by % claims without inflation from PR to news, though we do not draw any conclusions from this order, given the very wide confidence intervals on the ranks (calculated as for Table S4). The more appropriate conclusion appears to be that news outlets do not differ from each other as much as might generally be assumed.
We also coded whether the journalist for each news story was a generalist or a health/science specialist. Counter to expectation, we detected no differences between these categories in inflation rates. For advice, there were 23 inflations from PR in 182 news stories for specialists (13%) compared with 21/179 (12%) for generalists (difference = 0.9%, 95% CI -5.8% to 7.6%). For causal claims from correlational results, there were 71/201 (35%) for specialists vs 83/244 (34%) for generalists (difference = 1.3%, 95% CI -7.7% to 10.1%). For human inference from non-human studies, there were 5/57 (9%) for specialists vs 12/95 (13%) for generalists (difference = -3.9%, 95% CI -13.7% to 6.7%). It may be noteworthy that specialists wrote about non-human studies less frequently than generalists did, which may indicate differing awareness of the difficulties of translating animal results into treatments for humans.
Table S5. Rates of inflation for news outlets in our study. Inflation is listed relative to PRs and relative to JAs (see text above for further explanation).