Statistics at Square One
2. Mean and standard deviation
The median is known as a measure of location; that is, it tells
us where the data are. As stated in , we do not need to know all
the exact values to calculate the median; if we made the smallest
value even smaller or the largest value even larger, it would
not change the value of the median. Thus the median does not use
all the information in the data and so it can be shown to be less
efficient than the mean or average, which does use all values
of the data. To calculate the mean we add up the observed values
and divide by the number of them. The total of the values obtained
in Table 1.1 was 22.5
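The contrast between the median and the mean can be checked with a minimal sketch. The readings below are hypothetical, invented for illustration, and are not the values of Table 1.1. Making the smallest value smaller and the largest value larger leaves the median where it was but moves the mean.

```python
import statistics

# Hypothetical readings, invented for illustration only.
original = [0.4, 1.1, 1.5, 1.9, 3.2]
# The same data with the smallest value made smaller and the largest made larger.
stretched = [0.1, 1.1, 1.5, 1.9, 9.0]

median_original = statistics.median(original)
median_stretched = statistics.median(stretched)
mean_original = statistics.mean(original)
mean_stretched = statistics.mean(stretched)

# The median depends only on the middle value; the mean uses every value.
print("medians:", median_original, median_stretched)
print("means:  ", mean_original, mean_stretched)
```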
As well as measures of location we need measures of how variable the data are. We met two of these measures, the range and interquartile range, in Chapter 1.
The range is an important measurement, for figures at the top and bottom of it denote the findings furthest removed from the generality. However, they do not give much indication of the spread of observations about the mean. This is where the standard deviation (SD) comes in.
The theoretical basis of the standard deviation is complex and need not trouble the ordinary user. We will discuss sampling and populations in Chapter 3. A practical point to note here is that, when the population from which the data arise has a distribution that is approximately "Normal" (or Gaussian), then the standard deviation provides a useful basis for interpreting the data in terms of probability.
The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population. The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population. However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.
Many biological characteristics conform to a Normal distribution
closely enough for it to be commonly used - for example, heights
of adult men and women, blood pressures in a healthy population,
random errors in many types of laboratory measurements and biochemical
data. Figure 2.1 shows a Normal curve calculated from the diastolic blood pressures
of 500 men, mean 82 mmHg, standard deviation 10 mmHg. The ranges
representing
Figure 2.1 Normal curve calculated from diastolic blood pressures of 500 men, mean 82 mmHg, standard deviation 10 mmHg.
The reason why the standard deviation is such a useful measure
of the scatter of the observations is this: if the observations
follow a Normal distribution, a range covered by one standard
deviation above the mean and one standard deviation below it (
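The proportion of a Normal population lying within a given number of standard deviations of the mean can be computed directly. The sketch below assumes an exactly Normal distribution with the parameters quoted for Figure 2.1 (mean 82 mmHg, standard deviation 10 mmHg); the helper name `share_within` is an illustrative choice. For an exactly Normal distribution the shares come to roughly 68%, 95% and 99.7% for one, two and three standard deviations.

```python
from statistics import NormalDist

# Parameters quoted for Figure 2.1: mean 82 mmHg, SD 10 mmHg.
# Assumption: the population is treated as exactly Normal.
pressure = NormalDist(mu=82, sigma=10)

def share_within(k):
    """Proportion of the Normal population within k SDs of the mean."""
    lower = pressure.mean - k * pressure.stdev
    upper = pressure.mean + k * pressure.stdev
    return pressure.cdf(upper) - pressure.cdf(lower)

for k in (1, 2, 3):
    low, high = 82 - 10 * k, 82 + 10 * k
    print(f"{low} to {high} mmHg: {share_within(k):.1%}")
```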
Standard deviation from ungrouped data
The standard deviation is a summary measure of the differences of each observation from the mean. If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero. Consequently the squares of the differences are added. The sum of the squares is then divided by the number of observations minus one to give the mean of the squares, and the square root is taken to bring the measurements back to the units we started with. (The division by the number of observations minus one instead of the number of observations itself to obtain the mean square is because "degrees of freedom" must be used. In these circumstances they are one less than the total. The theoretical justification for this need not trouble the user in practice.)
To gain an intuitive feel for degrees of freedom, consider choosing a chocolate from a box of n chocolates. Every time we come to choose a chocolate we have a choice, until we come to the last one (normally one with a nut in it!), and then we have no choice. Thus we have n-1 choices, or "degrees of freedom".
The calculation of the variance is illustrated in Table 2.1 with the 15 readings in the preliminary study of urinary lead concentrations (Table 1.2). The readings are set out in column (1). In column (2) the difference between each reading and the mean is recorded. The sum of the differences is 0. In column (3) the differences are squared, and the sum of those squares is given at the bottom of the column.
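The sum of the squares of the differences (or deviations) from the mean, 9.96, is now divided by the total number of observations minus one to give the variance. Thus, with 15 readings, the variance is 9.96/14 = 0.71, and the standard deviation is its square root, 0.84.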
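The column-by-column procedure can be sketched in a few lines. The readings here are hypothetical (the values of Table 2.1 are not used), and the function name `sample_sd` is an illustrative choice. The last lines redo the arithmetic with the sum of squares quoted in the text, 9.96 from 15 readings.

```python
import math
import statistics

def sample_sd(values):
    """Return mean, sum of squared deviations, variance and SD (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]       # column (2): reading minus mean
    sum_sq = sum(d * d for d in deviations)       # foot of column (3)
    variance = sum_sq / (n - 1)                   # divide by degrees of freedom
    return mean, sum_sq, variance, math.sqrt(variance)

# Hypothetical readings, invented for illustration only.
readings = [0.5, 1.0, 1.5, 2.0, 2.5]
mean, sum_sq, variance, sd = sample_sd(readings)
print(mean, sum_sq, variance, sd)

# Cross-check against the standard library's sample standard deviation.
library_sd = statistics.stdev(readings)

# Arithmetic from the text: sum of squares 9.96 from 15 readings.
variance_from_text = 9.96 / (15 - 1)
sd_from_text = math.sqrt(variance_from_text)
print(round(variance_from_text, 2), round(sd_from_text, 2))
```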
Calculator procedure
Mullee (1) provides advice on choosing and using a calculator. The calculator formulas use the relationship
The right hand expression can be easily memorised by the expression
"mean of the squares minus the mean square". The sample variance
The above equation can be seen to be true in Table 2.1, where the sum of the square of the observations,
the same value given for the total in column (3). Care should be taken because this formula involves subtracting two large numbers to get a small one, and can lead to incorrect results if the numbers are very large. For example, try finding the standard deviation of 100001, 100002, 100003 on a calculator. The correct answer is 1, but many calculators will give 0 because of rounding error. The solution is to subtract a large number from each of the observations (say 100000) and calculate the standard deviation on the remainders, namely 1, 2 and 3.
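The rounding problem can be reproduced with a minimal sketch. It uses the shortcut form of the sum of squares, the sum of the squared observations minus the square of their total divided by n, and it assumes a calculator that keeps 8 significant digits after every operation. That precision is an assumption made for illustration, simulated here with Python's `decimal` module; ordinary Python floats carry enough digits that this small example would not fail. The function name `calculator_sd` is hypothetical.

```python
from decimal import Decimal, localcontext

def calculator_sd(values, digits=8):
    """SD by the shortcut formula, rounding every arithmetic result to
    `digits` significant figures (an assumed calculator precision)."""
    with localcontext() as ctx:
        ctx.prec = digits
        xs = [Decimal(v) for v in values]
        n = Decimal(len(xs))
        sum_x = sum(xs, Decimal(0))
        sum_x2 = sum((x * x for x in xs), Decimal(0))   # each square is rounded
        variance = (sum_x2 - sum_x * sum_x / n) / (n - 1)
        variance = max(variance, Decimal(0))            # guard against a negative rounding artefact
        return float(variance.sqrt())

raw = [100001, 100002, 100003]
shifted = [x - 100000 for x in raw]   # subtract a large number first: 1, 2, 3

sd_raw = calculator_sd(raw)           # the digits that matter are rounded away
sd_shifted = calculator_sd(shifted)   # small numbers survive the rounding
print(sd_raw, sd_shifted)
```

Subtracting a constant from every observation does not change the deviations from the mean, so the standard deviation of the remainders is the standard deviation of the original values.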
Standard deviation from grouped data
We can also calculate a standard deviation for discrete quantitative variables. For example, in addition to studying the lead concentration in the urine of 140 children, the paediatrician asked how often each of them had been examined by a doctor during the year. After collecting the information he tabulated the data shown in Table 2.2 columns (1) and (2). The mean is calculated by multiplying column (1) by column (2), adding the products, and dividing by the total number of observations.
As we did for continuous data, to calculate the standard deviation we square each of the observations in turn. In this case the observation is the number of visits, but because we have several children in each class, shown in column (2), each squared number (column (4)), must be multiplied by the number of children. The sum of squares is given at the foot of column (5), namely 1697. We then use the calculator formula to find the variance:
and
Note that although the number of visits is not Normally distributed, the distribution is reasonably symmetrical about the mean. The approximate 95% range is given by
This excludes two children with no visits and six children with six or more visits. Thus there are eight of 140 = 5.7% outside the theoretical 95% range.
Note that it is common for discrete quantitative variables to have what is known as skewed distributions, that is they are not symmetrical. One clue to lack of symmetry from derived statistics is when the mean and the median differ considerably. Another is when the standard deviation is of the same order of magnitude as the mean, but the observations must be non-negative. Sometimes a transformation will convert a skewed distribution into a symmetrical one. When the data are counts, such as number of visits to a doctor, often the square root transformation will help, and if there are no zero or negative values a logarithmic transformation will render the distribution more symmetrical.
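The grouped-data calculation described in this section, in which each value and its square are weighted by the number of children in the class, can be sketched as follows. The frequency table is hypothetical and much smaller than Table 2.2, and the function name `grouped_summary` is an illustrative choice. The variance uses the shortcut form: the weighted sum of squares minus the square of the weighted total divided by n, all divided by n - 1.

```python
import math
import statistics

def grouped_summary(freq):
    """Mean, variance and SD from a {value: count} frequency table."""
    n = sum(freq.values())                               # total number of observations
    sum_fx = sum(x * f for x, f in freq.items())         # value times count
    sum_fx2 = sum(x * x * f for x, f in freq.items())    # squared value times count
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return n, mean, variance, math.sqrt(variance)

# Hypothetical table: number of visits -> number of children.
visits = {0: 1, 1: 3, 2: 5, 3: 4, 4: 2}
n, mean, variance, sd = grouped_summary(visits)
print(n, mean, variance, sd)

# Cross-check by expanding the table back into individual observations.
expanded = [x for x, f in visits.items() for _ in range(f)]
check_variance = statistics.variance(expanded)
```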
Data transformation
An anaesthetist measures the pain of a procedure using a 100 mm visual analogue scale on seven patients. The results are given in Table 2.3, together with the log_e transformation (the ln button on a calculator).
The data are plotted in Figure 2.2, which shows that the outlier does not appear so extreme in the logged data. The mean and median are 10.29 and 3, respectively, for the original data, with a standard deviation of 20.22. Where the mean is bigger than the median, the distribution is positively skewed. For the logged data the mean and median are 1.24 and 1.10 respectively, indicating that the logged data have a more symmetrical distribution. Thus it would be better to analyse the log-transformed data in statistical tests than to use the original scale.
Figure 2.2 Dot plots of original and logged data from pain scores.
In reporting these results, the median of the raw data would be given, but it should be explained that the statistical test was carried out on the transformed data. Note that the median of the logged data is the same as the log of the median of the raw data; however, this is not true for the mean. The mean of the logged data is not necessarily equal to the log of the mean of the raw data.
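The behaviour of the median and the mean under a log transformation can be checked with a minimal sketch. The scores are hypothetical (the values of Table 2.3 are not used). An odd number of observations is assumed, so that the median is a single middle value; with an even number the median is the average of two middle values and the equality holds only approximately.

```python
import math
import statistics

# Hypothetical positively skewed scores; an odd count gives a single middle value.
scores = [2, 3, 5, 9, 70]
logged = [math.log(s) for s in scores]

median_of_logs = statistics.median(logged)
log_of_median = math.log(statistics.median(scores))

mean_of_logs = statistics.mean(logged)
log_of_mean = math.log(statistics.mean(scores))

# Taking logs preserves the ordering, so the middle value stays the middle value.
print(median_of_logs, log_of_median)
# The mean does not carry over: the mean of the logs is below the log of the mean.
print(mean_of_logs, log_of_mean)
```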
The antilog (exp or
Between subjects and within subjects standard deviation
If repeated measurements are made of, say, blood pressure on an individual, these measurements are likely to vary. This is within subject, or intrasubject, variability and we can calculate a standard deviation of these observations. If the observations are close together in time, this standard deviation is often described as the measurement error.
Measurements made on different subjects vary according to between subject, or intersubject, variability. If many observations were made on each individual, and the average taken, then we can assume that the intrasubject variability has been averaged out and the variation in the average values is due solely to the intersubject variability. Single observations on individuals clearly contain a mixture of intersubject and intrasubject variation.
The coefficient of variation (CV%) is the intrasubject standard deviation divided by the mean, expressed as a percentage. It is often quoted as a measure of repeatability for biochemical assays, when an assay is carried out on several occasions on the same sample. It has the advantage of being independent of the units of measurement, but also numerous theoretical disadvantages. It is usually nonsensical to use the coefficient of variation as a measure of between subject variability.
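A minimal sketch of the coefficient of variation for repeat assays of a single sample. The assay values are hypothetical and the function name `cv_percent` is an illustrative choice. The second calculation re-expresses the same results in units 1000 times smaller to show that the percentage does not depend on the units.

```python
import statistics

def cv_percent(values):
    """Coefficient of variation: sample SD divided by the mean, as a percentage."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical repeat assays of one sample (arbitrary units).
assay = [4.8, 5.1, 5.0, 4.9, 5.2]
cv = cv_percent(assay)

# The same results in units 1000 times smaller: SD and mean scale together.
cv_rescaled = cv_percent([1000 * x for x in assay])

print(f"CV = {cv:.2f}%, rescaled CV = {cv_rescaled:.2f}%")
```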
Common questions
When should I use the mean and when should I use the median to describe my data?
It is a commonly held misapprehension that for Normally distributed data one uses the mean, and for non-Normally distributed data one uses the median. Alas this is not so: if the data are Normally distributed the mean and the median will be close; if the data are not Normally distributed then both the mean and the median may give useful information. Consider a variable that takes the value 1 for males and 0 for females. This is clearly not Normally distributed. However, the mean gives the proportion of males in the group, whereas the median merely tells us which group contained more than 50% of the people. Similarly, the mean from ordered categorical variables can be more useful than the median, if the ordered categories can be given meaningful scores. For example, a lecture might be rated as 1 (poor) to 5 (excellent). The usual statistic for summarising the result would be the mean. In the situation where there is a small group at one extreme of a distribution (for example, annual income) then the median will be more "representative" of the distribution.
My data must have values greater than zero and yet the mean and
standard deviation are about the same size. How does this happen?
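As noted in the section on grouped data, a standard deviation of the same order of magnitude as the mean, in data that cannot be negative, is a clue to a skewed distribution. A minimal sketch with hypothetical positive values shows the pattern: a few large observations pull the mean above the median and inflate the standard deviation, so that a symmetric range of two standard deviations either side of the mean (the usual approximation to a 95% range) would reach below zero.

```python
import statistics

# Hypothetical positive, right-skewed observations, invented for illustration.
values = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]

mean = statistics.mean(values)
median = statistics.median(values)
sd = statistics.stdev(values)

# Lower end of a symmetric mean +/- 2 SD range.
lower_limit = mean - 2 * sd

print(f"mean {mean:.2f}, median {median:.2f}, SD {sd:.2f}")
print(f"mean - 2 SD = {lower_limit:.2f}")  # negative, although no value can be
```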
References
Exercises
Exercise 2.1 Exercise 2.2 Exercise 2.3