Endgames Statistical Question

# Units of sampling, observation, and analysis

BMJ 2015; 351 (Published 09 October 2015) Cite this as: BMJ 2015;351:h5396
Philip Sedgwick, reader in medical statistics and medical education

Institute for Medical and Biomedical Education, St George’s, University of London, London, UK

Correspondence to: P Sedgwick p.sedgwick@sgul.ac.uk

Researchers sought the views of the British public on the acceptability of the use of personal medical data for the purposes of public health research and surveillance without individual consent. A cross sectional study was performed by the Office for National Statistics. A survey was constructed to ascertain the acceptability of the use of identifiable information for public health purposes in the context of the National Cancer Registry. The participants were recruited using multistage sampling of adults in the UK during March and April 2015. The multistage sampling consisted of three stages. At the first stage a sample of postal districts in the UK was selected at random, with the probability of selection proportional to size. Within each district, a random sample of private households was selected. In households with more than one adult, one person was selected at random. Face to face interviews were carried out with 2872 adults.1

Of the 2872 respondents, 72% (95% confidence interval 70% to 74%) did not consider any of the following to be an invasion of their privacy by the National Cancer Registry: inclusion of postcode, inclusion of name and address, or the receipt of a letter inviting them to a research study on the basis of inclusion in the registry. It was concluded that most of the British public does not consider the confidential use of personal identifiable patient information by the National Cancer Registry for the purposes of public health research and surveillance to be an invasion of privacy.
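The reported interval can be checked approximately with a short calculation. The sketch below assumes a simple normal approximation (Wald) interval for a proportion, which treats the respondents as a simple random sample. The source does not state how the interval was derived, and an analysis of a multistage sample would usually also account for the sampling design, so this is only a plausibility check. The function name `wald_ci` is illustrative.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Normal approximation interval for a proportion.

    Assumes independent observations (simple random sampling);
    it ignores any design effect from multistage sampling.
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

lower, upper = wald_ci(0.72, 2872)
print(f"{lower:.1%} to {upper:.1%}")  # prints "70.4% to 73.6%"
```

Rounded to whole percentages, this agrees with the reported interval of 70% to 74%.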

Which of the following statements, if any, are true?

• a) The sampling method involved three sampling units

• b) The postal districts were the primary sampling unit

• c) The unit of observation was the household

• d) The unit of analysis was the adult

Statements a, b, and d are true, whereas c is false.

The views of the British public on the use of personal medical information for the purposes of public health research and surveillance without individual consent were sought. A cross sectional study design was used, with the aim of obtaining a representative sample by taking a cross section of the population.2 The sample was obtained using multistage sampling, which has been described in a previous question with specific reference to the above example.3 The sampling method involves two or more stages of random sampling based on the hierarchical structure of natural clusters within the population. Clusters are natural groupings of people—for example, general practices, schools, or households. A different type of cluster is randomly sampled at each stage, with the clusters nested within each other at successive stages. The final stage of sampling involves choosing a random sample of people in the clusters selected at the penultimate stage. In the above example, multistage sampling consisting of three stages was used: a random sample of postal districts in the UK was obtained, and within each district a random sample of private households obtained, with one adult selected at random in each household.
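The three stages described above can be sketched in code. This is a minimal illustration on a made-up population, not the procedure used by the Office for National Statistics: the district, household, and adult identifiers, the sample sizes, and the seed are all hypothetical. Districts are drawn one at a time with probability proportional to their number of households, which only approximates strict probability proportional to size sampling when drawing without replacement.

```python
import random

def sample_districts_pps(sizes, k, rng):
    """Stage 1: draw k districts one at a time, each draw weighted by size.

    Successive weighted draws without replacement are an approximation;
    inclusion probabilities are not exactly proportional to size.
    """
    remaining = dict(sizes)
    chosen = []
    for _ in range(k):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen

def multistage_sample(population, n_districts, n_households, seed=1):
    """population maps district -> list of households (each a list of adults)."""
    rng = random.Random(seed)
    sizes = {district: len(hhs) for district, hhs in population.items()}
    sample = []
    for district in sample_districts_pps(sizes, n_districts, rng):
        households = population[district]
        # Stage 2: simple random sample of households within the district.
        k = min(n_households, len(households))
        for hh_index in rng.sample(range(len(households)), k):
            # Stage 3: one adult selected at random within the household.
            adult = rng.choice(households[hh_index])
            sample.append((district, hh_index, adult))
    return sample

# Hypothetical population: 6 districts with 5 to 30 households of 1 to 3 adults.
population = {
    f"district_{d}": [
        [f"d{d}_h{h}_a{a}" for a in range(1 + (h + d) % 3)]
        for h in range(5 + 5 * d)
    ]
    for d in range(6)
}

sample = multistage_sample(population, n_districts=3, n_households=4)
print(len(sample))  # 3 districts x 4 households x 1 adult = 12
```

Each element of `sample` records the district, the household, and the adult, which mirrors the nesting of the clusters at successive stages.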

The unit of sampling, sometimes referred to as the sampling unit, is defined statistically as the “who” or “what” that is sampled. In the above example the postal districts, households, and adults were all sampling units (a is true). The nature of multistage sampling meant that a sampling unit was involved at each stage, with units nested within each other at successive stages. The postal districts would be referred to as the primary sampling unit (b is true), the households as the secondary sampling unit, and the adults as the ultimate sampling unit. Not all sampling methods are as complex as multistage sampling, and some involve only one sampling unit. For example, if convenience sampling had been used, the sample of adults might have been recruited from a local hospital clinic. In that case only one sampling unit would have been involved: the adult patient.

The unit of observation, sometimes referred to as the unit of measurement, is defined statistically as the “who” or “what” for which data are measured or collected. In the above example, one adult was selected at random from each household selected in the second stage, and that adult was asked to provide his or her views on the use of personal medical data for the purposes of public health research and surveillance without individual consent. Therefore, the unit of observation was the adult (c is false).

The unit of analysis is defined statistically as the “who” or “what” for which information is analysed and conclusions are drawn. In the above example, the unit of analysis was the adult (d is true). Because each adult provided only one value for each question surveyed, the data were considered independent, and standard statistical methods could be used for the analysis. Furthermore, conclusions were drawn for the adult.

It was important that the sampling units plus the units of observation and analysis were clearly identified. The sampling units needed to be defined so that at each successive stage in the multistage sampling process the sampling frame—a list of all the units (postal districts, households, or adults) in the clusters selected at the previous stage—could be constructed and a random sample obtained. It is important to define the units of observation and analysis to facilitate the correct analysis and subsequent inference of the study results. For example, in study designs such as cluster randomised controlled trials where clusters of people are randomised to an intervention,4 the unit of observation may be the trial participant. However, because measurements may not be independent within a cluster they might be averaged across a cluster for the purpose of analysis and inference. Therefore, the unit of analysis would be the cluster. If the unit of analysis was the trial participant, it would then be important to account for the possible lack of independence between trial participants and their outcome measurements when undertaking statistical analysis.
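The distinction between the unit of observation and the unit of analysis in a cluster randomised trial can be illustrated with a small sketch. The data, arm labels, and practice names below are invented for illustration. Measurements are recorded for each participant (the unit of observation) and then averaged within each cluster, so that each cluster (the unit of analysis) contributes a single summary value.

```python
from statistics import mean

# Hypothetical participant-level outcomes, grouped by cluster within each arm.
trial = {
    "intervention": {
        "practice_A": [5.1, 4.8, 5.4],
        "practice_B": [6.0, 5.7],
    },
    "control": {
        "practice_C": [4.2, 4.5, 4.0, 4.3],
        "practice_D": [4.9, 5.1],
    },
}

def cluster_means(clusters):
    """Reduce each cluster's participant measurements to one summary value."""
    return {name: mean(values) for name, values in clusters.items()}

# One value per cluster: the cluster is now the unit of analysis.
summary = {arm: cluster_means(clusters) for arm, clusters in trial.items()}

# Arm-level comparison based on cluster summaries, not individual participants.
arm_means = {arm: mean(means.values()) for arm, means in summary.items()}
print(summary)
print(arm_means)
```

Averaging within clusters is only one way to handle the lack of independence; analysing participants directly would instead need a method that accounts for clustering, as the text notes.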

In addition to the sampling unit, unit of observation, and unit of analysis, units of randomisation and of intervention also exist. These units are often confused. The sampling unit, unit of observation, and unit of analysis are relevant in all studies, both observational and experimental, whereas the unit of randomisation and the unit of intervention are relevant only to experimental studies, such as clinical trials. The unit of randomisation is defined statistically as the “who” or “what” that is randomised to treatment groups. The unit of intervention is defined statistically as the “who” or “what” to which the intervention is delivered. An overview of all of the types of units will be provided in a future question.

## Footnotes

• Competing interests: None declared.
