

Is the SF-36 a valid measure of change in population health? Results from the Whitehall II study

BMJ 1997; 315 (Published 15 November 1997). Cite this as: BMJ 1997;315:1273
  1. Harry Hemingway (h.hemingway{at}), senior lecturer in epidemiology (b),
  2. Mai Stafford, statistician (a),
  3. Stephen Stansfeld, senior lecturer in community psychiatry (a),
  4. Martin Shipley, senior lecturer in medical statistics (a),
  5. Michael Marmot, professor of epidemiology and public health (a)

  a. International Centre for Health and Society, Department of Epidemiology and Public Health, University College London Medical School, London WC1E 6BT
  b. Department of Public Health, Kensington & Chelsea and Westminster Health Authority, London W2 6LX

  Correspondence to: Dr H Hemingway, International Centre for Health and Society, Department of Epidemiology and Public Health, University College London Medical School, London WC1E 6BT
  • Accepted 7 July 1997


Objective: To measure within-person change in scores on the short form general health survey (SF-36) by age, sex, employment grade, and disease status.

Design: Longitudinal study with a mean of 36 months (range 23–59 months) follow up, with screening examination and questionnaire to detect physical and psychiatric morbidity.

Setting: 20 civil service departments originally located in London.

Participants: 5070 male and 2197 female office based civil servants aged 39–63 years.

Main outcome measures: Change in the eight scales of the SF-36 (adjusted for baseline score and length of follow up) and effect sizes (adjusted change/standard deviation of differences).

Results: Within-person declines (worsening health) with age were greater than estimated by cross sectional data alone. General mental health showed greater declines among younger participants (P for linear trend <0.001). Employment grade was inversely related to change; lower grades had greater deteriorations than higher grades (P<0.001 for each scale in men; P<0.05 for each scale in women except general health perceptions and role limitations due to physical problems). The greatest declines were seen among participants with disease at baseline, with the effects of physical and psychiatric morbidity being additive. Effect sizes ranged from 0.20 to 0.65 in participants with both physical and psychiatric morbidity.

Conclusions: Health functioning, as measured by the SF-36, changed in hypothesised directions with age, employment grade, and disease status. These changes occurred within a short follow up period, in an occupational, high functioning cohort which has not been the subject of intervention, suggesting that the SF-36 is sensitive to changes in health in general populations.

Key messages

  • The SF-36, an inexpensive measure of health outcomes, is capable of detecting change in health in a general population

  • Health and functioning do not decline uniformly with age; general mental health shows greater declines among younger participants

  • Socioeconomic status is associated inversely with baseline functioning and, independently, with decline in health

  • The greatest declines were seen among subjects with physical and psychiatric morbidity at baseline



Measuring changes in population health is important to evaluate interventions and to predict the need for health and social care. The traditional measures of mortality and morbidity, although useful, have limitations: showing changes in mortality requires prolonged periods of observation or large numbers of events, or both, and changes in morbidity are more expensive to measure and do not take account of the functional impact on a patient's life. A given level of objectively assessed morbidity may have widely differing impacts on individuals' physical, psychological, and social functioning.1 Since levels of functioning are important in predicting demand for services, changes in such health related quality of life outcomes might complement mortality and morbidity measures. Although changes in quality of life are increasingly used as outcome measures in clinical trials,2 3 they have rarely been studied in populations other than patients.

The short form 36 health survey (SF-36)4 5 6 is a 36 item questionnaire which measures health functioning on eight scales and is among the most widely used measures of quality of life in studies of patients7 8 and the general population.9 10 11 12 13 14 15 16 17 18 19 Cross sectional data from population studies have shown that the SF-36 is reliable and able to detect differences between groups defined by age, sex, socioeconomic status, geographical region, and clinical conditions. The SF-36 may therefore be a useful tool for monitoring changes in health in the population.

There are, however, no reports of using repeated measures of the SF-36 in population studies, so it is not known whether it is sensitive in detecting changes within individuals over time. We report here individual changes in SF-36 scores in the Whitehall II study of British civil servants. On the basis of our previous cross sectional data19 we hypothesised that over a three year follow up period a decline in scale scores (worsening health) would be associated with age (directly for physical functioning, inversely for general mental health); socioeconomic status (inversely); and the presence of chronic, progressive, or recurrent disease at baseline.



All non-industrial civil servants aged 35 to 55 years working in the London offices of 20 departments were invited to participate in this study. The final cohort consisted of 10 308 participants, with an overall response rate of 73%, although the true response rate was probably higher as 4% of those on the list of employees had moved before the start of the study and were therefore not eligible for inclusion. Employment grade within the civil service was used as a measure of socioeconomic status. On the basis of salary the civil service identifies 12 non-industrial grades. To obtain sufficient numbers for meaningful analysis we combined the top six groups into grade 1 and the bottom two groups into grade 6, thus producing six grade categories. The salaries ranged from £6483-£11 917 (grade 6) to £28 904-£87 620 (grade 1) in 1992.
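The grade grouping described above can be sketched as a small lookup. This is only an illustration: the numbering of the 12 original grades from 1 (highest salary) to 12 (lowest), the function name `collapse_grade`, and the one-to-one mapping of the four middle grades onto categories 2 to 5 are assumptions inferred from the text, not details given in the paper.

```python
def collapse_grade(original_grade: int) -> int:
    """Collapse 12 salary-based grades into six analysis categories.

    Assumption for this sketch: original grades are numbered
    1 (highest salary) to 12 (lowest salary).
    """
    if not 1 <= original_grade <= 12:
        raise ValueError("original_grade must be between 1 and 12")
    if original_grade <= 6:       # top six groups -> grade 1
        return 1
    if original_grade >= 11:      # bottom two groups -> grade 6
        return 6
    return original_grade - 5     # remaining groups 7..10 -> grades 2..5


# The 12 original grades should yield exactly six categories.
categories = sorted({collapse_grade(g) for g in range(1, 13)})
```

Here `categories` contains the six grade categories used in the analyses, with grade 1 the highest and grade 6 the lowest.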


Baseline SF-36 scores (UK standard version) were measured at the third phase of the study, between August 1991 and May 1993 on 8349 participants (5763 men and 2586 women). At phase 4, between April 1995 and June 1996, an identical version of the SF-36 was completed by 7949 participants (5467 men and 2482 women). For the purposes of this paper, phase 3 measurements are referred to as baseline and phase 4 measurements as follow up. Sixty seven participants died between the end of phase 3 and the beginning of phase 4.

The SF-36 consists of 36 items scored in eight scales: general health perceptions (5 items), physical functioning (10), role limitations due to physical problems (4), bodily pain (2), general mental health (5), role limitations due to emotional problems (3), vitality (4), and social functioning (2). The remaining item, relating to change in health, is not scored as a separate dimension. As an example of scale content, the physical functioning scale covers limitations during a typical day (“a lot,” “a little,” “none”) in vigorous activities (strenuous sports, running, etc), moderate activities (housework, playing golf, etc), lifting and carrying, climbing stairs, bending, kneeling, and walking. Scores for all scales were calculated using the medical outcomes study (MOS) scoring system20 and ranged from 0 (lowest wellbeing) to 100 (highest wellbeing). These scales had high internal consistency at baseline (Cronbach's α 0.76-0.86). The mean percentage of items missing across all scales was related to sex (0.38% in men and 0.50% in women; P=0.02), age (0.65% in those 55 years and older and 0.14% in those 44 years and younger; P<0.001), and grade (0.60% in the lowest and 0.25% in the highest grade; P<0.001).
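As a rough illustration of how a scale ends up on the 0 to 100 range, the sketch below rescales a summed raw score linearly between its lowest and highest possible values. It assumes items that have already been recoded so that higher values mean better functioning (for physical functioning, 1 = limited a lot to 3 = not limited at all); item recoding and the handling of missing items in the full scoring system are not shown, and the function name `transform_scale` is hypothetical.

```python
def transform_scale(item_scores, item_min, item_max):
    """Rescale a summed raw scale score to 0 (lowest) .. 100 (highest wellbeing).

    Assumes every item shares the same response range and is coded so
    that a higher value means better functioning.
    """
    if any(not item_min <= s <= item_max for s in item_scores):
        raise ValueError("item score outside the allowed response range")
    n_items = len(item_scores)
    raw = sum(item_scores)
    lowest_possible = n_items * item_min
    possible_range = n_items * (item_max - item_min)
    return 100.0 * (raw - lowest_possible) / possible_range


# Physical functioning: 10 items, each coded 1..3 (illustrative answers).
no_limitations = transform_scale([3] * 10, 1, 3)            # best possible
severe_limitations = transform_scale([1] * 10, 1, 3)        # worst possible
some_limitations = transform_scale([3] * 5 + [2] * 5, 1, 3)
```

With these made-up answers, a participant limited "a little" on half of the items scores three quarters of the way up the scale.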

Participants were categorised into four mutually exclusive groups according to their disease status at baseline: healthy (free of the following conditions), physical disease only, minor psychiatric disorder only, and both physical disease and minor psychiatric disorder. Physical diseases (chosen on a priori grounds as likely to affect physical functioning) were defined as one or more of the following: angina (n=450),21 self report of doctor diagnosed heart attack or angina (n=150), probable or possible ischaemia on resting electrocardiogram (Minnesota codes 1-1 to 1-3, 4-1 to 4-4, 5-1 to 5-3, and 7-1-1) (n=707), hypertension (>160/90 or taking antihypertensive drugs) (n=1554), claudication (n=125),21 diabetes (self report or oral glucose tolerance test) (n=222),22 chronic bronchitis (n=914),23 musculoskeletal disorders (self report) (n=1257), and cancer (OPCS registration or self report) (n=128). Minor psychiatric disorder, principally anxiety and depression, was defined as a score of ≥5 on the 30 item general health questionnaire (n=1489).24
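The four mutually exclusive baseline groups can be expressed as a simple classification rule. The sketch below is a minimal illustration: the function name, argument names, and category labels are hypothetical, and the only thresholds used are those stated above (any one of the listed physical diseases; a score of 5 or more on the 30 item general health questionnaire).

```python
GHQ30_CASE_THRESHOLD = 5  # score of >=5 defines minor psychiatric disorder


def disease_category(has_physical_disease: bool, ghq30_score: int) -> str:
    """Assign one of four mutually exclusive baseline disease categories.

    has_physical_disease: True if the participant has one or more of the
    listed physical conditions (how each is ascertained is not modelled).
    """
    is_ghq_case = ghq30_score >= GHQ30_CASE_THRESHOLD
    if has_physical_disease and is_ghq_case:
        return "both"
    if has_physical_disease:
        return "physical disease only"
    if is_ghq_case:
        return "minor psychiatric disorder only"
    return "healthy"


# Illustrative participants (made-up values).
examples = [
    disease_category(False, 2),
    disease_category(True, 4),
    disease_category(False, 5),
    disease_category(True, 12),
]
```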

Statistical analysis

Changes in SF-36 were examined by age, employment grade, and disease status separately for men and women. A negative change reflects a decline in scores and, if valid, a deterioration in health. As expected, participants who had high scores at baseline had lower scores at follow up and vice versa, a common phenomenon known as regression to the mean. Analysing such data using simple differences is problematic as the magnitude of the change would depend on the level at baseline.25 26 Furthermore, the changes in scores, unlike single measurements, are normally distributed. Therefore, we used regression models separately for men and women for each scale of the form:

follow up score - baseline score = baseline score + covariate 1 + covariate 2 … etc
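To make the form of this model concrete, the sketch below fits change (follow up minus baseline) on baseline score and two covariates by ordinary least squares. It is not the authors' code (the analyses were run in SAS): the data are simulated without noise, the coefficients in `TRUE_COEF` are invented, and the helper `ols` is a small normal-equations solver written for this illustration.

```python
import random


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Invented coefficients: intercept, baseline score, low grade, years of follow up.
TRUE_COEF = [10.0, -0.2, -3.0, -1.0]

rng = random.Random(0)
X, change = [], []
for _ in range(200):
    baseline = rng.uniform(40.0, 100.0)       # simulated baseline scale score
    low_grade = float(rng.choice([0, 1]))     # 1 = lower employment grade
    years = rng.uniform(2.0, 5.0)             # length of follow up
    row = [1.0, baseline, low_grade, years]
    follow_up = baseline + sum(c * v for c, v in zip(TRUE_COEF, row))
    X.append(row)
    change.append(follow_up - baseline)       # outcome: follow up - baseline

coef = ols(X, change)  # estimates in the same order as TRUE_COEF
```

Because baseline score is a covariate, the coefficient for `low_grade` is the difference in change between grades at the same baseline score, which is the sense in which the change is adjusted for baseline level.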

These models give a change score adjusted for the potential bias of regression to the mean. Longitudinal estimates of change per year were obtained using these models from the coefficients for length of follow up in which the intercept term was constrained to be zero. Adjustment was made for the potential confounding of differing lengths of follow up in all models, even though age, grade, and disease status had only small effects on length of follow up. Two tailed tests were used throughout. Effect sizes for cross sectional data comparing two groups were calculated by dividing the difference between the means of the two groups by the sex specific standard deviation for that scale. Effect sizes for change were calculated by dividing the adjusted change by the sex specific standard deviation of the differences for that scale. A higher effect size indicates greater sensitivity to change. All analyses were performed using the SAS statistical package (SAS Institute, Cary, NC).
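The two effect size definitions above amount to simple ratios. The sketch below is illustrative only: the function names and scores are made up, the sample standard deviation (n-1 denominator) is an assumption, and the sign of the change is kept, so a decline gives a negative value whose magnitude corresponds to the effect size.

```python
import statistics


def cross_sectional_effect_size(mean_group_a, mean_group_b, scale_sd):
    """Difference between two group means divided by the scale's SD."""
    return (mean_group_a - mean_group_b) / scale_sd


def effect_size_for_change(adjusted_change, baseline_scores, follow_up_scores):
    """Adjusted change divided by the SD of the within-person differences."""
    differences = [f - b for b, f in zip(baseline_scores, follow_up_scores)]
    return adjusted_change / statistics.stdev(differences)


# Made-up scores for five participants on one scale.
baseline = [80.0, 90.0, 70.0, 85.0, 75.0]
follow_up = [77.0, 83.0, 67.0, 78.0, 70.0]   # differences: -3, -7, -3, -7, -5

change_es = effect_size_for_change(-1.0, baseline, follow_up)
baseline_es = cross_sectional_effect_size(70.0, 80.0, 20.0)
```

In practice the standard deviations would be computed separately for men and women, as described above.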


The median follow up was 36 months (range 23–59 months). The median age at follow up was 52 (42–65) years for men and 53 (42–65) years for women. Participants who did not respond at follow up had lower mean scores at baseline on all scales except the vitality scale for women (data not shown).

Unadjusted mean scores on all scales were lower at follow up (Table 1). Table 2 shows a comparison of cross sectional and longitudinal estimates of change in scale score per year. The cross sectional data tended to underestimate the within-person decline with increasing age in general health perceptions, physical role limitation, and bodily pain. While the cross sectional data suggested improvement with age in general mental health, emotional role limitation, vitality, and social functioning, the longitudinal estimates showed that only emotional role limitations (men and women) and general mental health (men only) improved with age.

Table 1

Absolute change in SF-36 scores between baseline and follow up (mean 36 months)

Table 2

Mean change in SF-36 scores per year of follow up


Table 3 shows the adjusted change within five year age groups at baseline. Younger men had greater declines in general mental health, emotional role limitation, vitality, and social functioning than older men (P for linear trend <0.001). Among women, a similar relation was seen in the vitality and general mental health scales (P<0.01). Indeed, adjusted change was positive on general mental health and vitality in the oldest age group. Older participants showed greater declines in physical functioning than younger participants (P<0.001).

Table 3

Mean change (adjusted for baseline score) in SF-36 scores per year of follow up within age groups


Men in the lower employment grades had greater declines on all scales than men in higher grades (P for linear trend <0.001), and women in the lower grades had significantly greater declines on all scales except physical role limitation and general health perceptions (Table 4). Women in the high grades showed positive change for bodily pain, vitality, emotional role limitation, and general mental health.

Table 4

Change (adjusted for baseline score, length of follow up, and age) in SF-36 scores by employment grade


Figure 1 shows the effect of civil service employment grade on age related declines in physical functioning. For men there was evidence of an interaction (P<0.01), with the men in the highest grade less likely than men in the lowest grade to show declines with age.


Figure 1

Employment grade and age differences in change (adjusted for baseline score and length of follow up) in physical functioning measured by SF-36 questionnaire

Participants with both physical and psychiatric morbidity had worse health at baseline than those with either physical or psychiatric morbidity alone, with effect sizes ranging from 0.99 to 1.31 (Table 5). Table 6 shows the effect of baseline disease category on change in SF-36 scores adjusting for baseline score, length of follow up, and age. For both cross sectional and longitudinal data, there was some evidence that physical diseases predominantly affected physical scales (physical functioning, physical role limitation, and pain) and psychiatric morbidity predominantly affected psychological and social functioning. Participants who had either physical disease or were defined as cases on the general health questionnaire had greater declines than those with no medical conditions (P for heterogeneity <0.01). The presence of physical disease or minor psychiatric morbidity was associated with a similar magnitude of effect for all scales except emotional role limitations, for which the decline was larger among the cases. The effects of physical disease and minor psychiatric morbidity were approximately additive; participants who had both of these experienced greater decline on all scales than participants who had either of these. Effect sizes in this group ranged from 0.20 to 0.65.

Table 5

Cross sectional mean SF-36 scores (adjusted for age) by disease category at baseline

Table 6

Change (adjusted for baseline score, length of follow up, and age) in SF-36 scores by presence of disease at baseline



This is the first report of the ability of the SF-36, a simple and inexpensive measure of health outcomes, to detect change in health in a general population. In the high functioning, occupational cohort of the Whitehall II study, each of the eight dimensions of health measured by the SF-36 showed a mean decline over three years of follow up. As hypothesised, employment grade was inversely related to decline, with lower grades experiencing greater deteriorations than higher grades. The greatest declines were seen among subjects with disease (considered to be chronic, recurrent, or progressive) at baseline, with the effects of physical and psychiatric morbidity being additive.

Changes in health

Measuring health on continuous scales, rather than dichotomising individuals as diseased or disabled or not, allows elucidation of trajectories of change. Decline in health and functioning has been assumed to be the biologically inevitable consequence of aging, although this concept is increasingly questioned27 by the heterogeneous patterns of change that are found in populations aged over 65.28 29 30 Cross sectional results in the Whitehall II study (participants aged 39-63) tended to underestimate the longitudinal decline with age in general perceptions of health, physical role limitation, and bodily pain. Furthermore, while these cross sectional data19 suggested improvement with age in general mental health, emotional role limitations, vitality, and social functioning, the longitudinal data supported this improvement only for emotional role limitations (men and women) and general mental health (men). However, general mental health and vitality showed greater declines among younger participants. This is evidence against a simple age effect and suggests the existence of cohort or period effects, or both.31 Although further measures of the SF-36 are required to distinguish between these, it is plausible that increasing privatisation in the civil service may constitute a relevant period effect.32 Sex differences also showed a different pattern in cross sectional and longitudinal results. In cross sectional results,10 women consistently had lower SF-36 scores than men, whereas in longitudinal results there were no such sex differences. Existing studies measuring change in health functioning (all in elderly people) have either used a simple measure of global health status or have concentrated on physical functioning and disability.33 34 35 36

Socioeconomic status is inversely associated with both the risk of developing disease and the risk of people with disease experiencing complications; furthermore, there is a growing recognition that these risks are mediated via mechanisms which may change through the life course.37 Few studies have examined the effects of socioeconomic status on changes in health in individuals. Socioeconomic status was inversely associated with baseline SF-36 and with change in score after adjustment for baseline score. The effect sizes for high versus low civil service employment grade on change in health functioning tended to be greater than for cross sectional effects at baseline, particularly among women. Among men, there was an interaction between age and grade in changes in physical functioning, such that men in the lowest grade showed greater declines with age than men in the highest grade. This is consistent with the hypothesis of environmental determinants of “successful aging.”27

The impact (in terms of effect sizes) of physical and psychiatric morbidity on baseline scores was similar to that reported in other cross sectional studies in patient populations.38 Effect sizes for change in score ranged from 0.20 to 0.65 in participants with both physical and psychiatric morbidity, and such changes are of comparable magnitude to short term changes after clinical interventions.39 40 It should not be assumed that effect sizes of clinical and public health significance are equivalent41; indeed, the latter may be smaller.42 The disease groups were deliberately chosen to reflect morbidity which was chronic, progressive, or recurrent. However, in the absence of independent measures of disease severity over time, it is not possible to say which of these, or other, effects were responsible for the observed changes.

The ability of the SF-36 to detect change in less healthy general populations is likely to be greater than that observed among Whitehall II participants. The Whitehall II cohort is high functioning by virtue of the comparatively young age of participants (none was older than 65 years) and the fact that all were employed as civil servants in non-industrial grades at the start of the study (in terms of mortality the Whitehall II participants enjoy a healthy worker effect of 0.5). The relatively short period of follow up (mean 36 months) and the absence of any systematic intervention further suggest the potential for the SF-36 to detect changes in health.

Potential limitations of study

Potential limitations of this study should be considered. Non-response at follow up is likely to have biased the effects conservatively, since non-responders tended to be from lower employment grades and have lower SF-36 scores at baseline. Since both the SF-36 and some of the diseases were based on self reported measures, it is possible that a reporting bias could arise. However, the use of objectively defined disease categories as well as the similarity of effect sizes with other studies which have used doctors' diagnoses38 suggests that this is unlikely to be important.

There is no gold standard measure of change in health functioning. Changes in SF-36 scores, if valid, may be expected to be associated quantitatively with use of health services, sickness absence, morbidity, and mortality and qualitatively with individuals' own accounts of their health. Quantitative studies are required for interpreting the significance of a given level of change in SF-36 score; it cannot be assumed that a small effect size (arbitrarily defined as 0.2; reference 43) is necessarily unimportant. Furthermore, it is likely that the sensitivity in detecting true change differs between the scales of the SF-36. Future analyses within the Whitehall II study will examine this and, by using further repeated measures of the SF-36, the trajectories and predictors of changes in health.


In the primary care led NHS, general practitioner and health authority commissioners have responsibility for evaluating the need for and effectiveness of health care in populations defined by a general practitioner's list or residence in a health authority. Comparisons of health outcome between differing patterns of primary care (for example, between fundholders and non-fundholders) have been limited by the lack of an outcome measure sensitive to change and the inability to adjust for differences in case mix. The SF-36 may offer a partial solution. Assessing changes in health functioning has the advantage of reflecting the impact of all causes of morbidity44 on different dimensions of health. Statistical adjustment of change in health functioning for baseline values offers the potential of a simple method for taking account of case mix, although the validity of this approach needs testing.

Changes in health functioning in hypothesised directions with age, employment grade, and disease status were observed in this young, high functioning population. These results provide sufficient support for the validity of the SF-36 in measuring change in health in populations to recommend this use in other studies, with the caveat that the validity of change continues to be tested.


We thank all participating civil service departments and their welfare, personnel, and establishment officers; the Civil Service Occupational Health Service (Dr George Sorrie, Dr Adrian Semmence, and Dr Elizabeth McCloy); the Civil Service Central Monitoring Service and Dr Frank O'Hara; the Council of Civil Service Unions and all participating civil servants. We thank Jenny Head for comments on an earlier draft.

Conflict of interest: None.

Funding: Grants from the Agency for Health Care Policy and Research (5 RO1 HS06516); New England Medical Centre-Division of Health Improvement; National Heart Lung and Blood Institute (2RO1 HL36310); National Institute on Aging (R01 AG13196-02); John D and Catherine T MacArthur Foundation Research Network on Successful Midlife Development; Institute for Work and Health, Ontario, Canada; Volvo Research Foundation, Sweden; Medical Research Council; Health and Safety Executive; and British Heart Foundation. MM is supported by an MRC research professorship.

