Commentary: Classification and cluster analysis

BMJ 1995;311:535 (Published 26 August 1995)
  B S Everitt, professor of statistics in behavioural science
  Institute of Psychiatry, London SE5 8AL

    One of the most basic abilities of living creatures involves the grouping of similar objects to produce a classification. As well as being a basic human conceptual activity, classification is also fundamental to most branches of science. In chemistry, for example, the classification of the elements in the periodic table has had a profound impact on the understanding of the structure of the atom. Classification in medicine is equally important, with the classification of diseases being of primary concern as the basis for investigations of aetiology and treatment.

    Statistical techniques for classification are essentially of two types. Members of the first type are used to construct a (hopefully) sensible and informative classification of an initially unclassified set of data; these are known as cluster analysis methods. The information on which the derived classification is based is generally a set of variable values recorded for each patient or individual in the investigation, and clusters are constructed so that individuals within clusters are similar with respect to their variable values and different from individuals in other clusters. Paykel and Rassaby, for example, studied 236 people who had attempted suicide presenting at the main emergency service of one American city.1 Each patient was described by 14 variables including age, number of previous suicide attempts, and severity of depression. A number of clustering methods were applied to the data and a final classification with three groups was produced which appeared potentially valuable as a basis for future studies into the causes and possible treatment of attempted suicide. (The second set of statistical techniques concerned with classification is known as discriminant or assignment methods. Here the classification scheme is known a priori and the problem is how to devise rules for allocating unclassified individuals to one or other of the known classes.)
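
    The two types of technique can be made concrete with a small sketch. It assumes k-means as the clustering method, which is only one of many possibilities and not necessarily one of those applied by Paykel and Rassaby, and it uses invented data on two variables. The function names `kmeans` and `allocate` are hypothetical, chosen for illustration only.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster analysis: derive k groups from unclassified data."""
    # Assumption: start from the first k points; real analyses use better starts.
    centres = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every individual to its nearest cluster centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        # Move each centre to the mean of the individuals assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centres

def allocate(individual, centres):
    """Assignment: allocate a new individual to an already known class."""
    return min(range(len(centres)),
               key=lambda c: math.dist(individual, centres[c]))

# Invented data: (age, number of previous suicide attempts).
patients = [(19, 0), (45, 3), (22, 1), (48, 4), (21, 0), (51, 2)]
labels, centres = kmeans(patients, k=2)
new_class = allocate((47, 3), centres)
```

    In this sketch age dominates the distance because it is measured on a larger scale; in practice the variables would usually be standardised before clustering.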

    Many methods have been suggested for constructing a set of clusters from numerical data, ranging from the visual examination of simple scatterplots of data (fig) to the fitting of complex statistical models. A comprehensive review is given in Everitt.2 Among these methods, no single method is best in all situations and for all types of data. Different methods of cluster analysis can (and probably will) produce different solutions when applied to the same set of data. As a consequence, clustering techniques require great care in their practical application if misleading and artificial solutions are to be avoided.
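
    The point that different methods can give different solutions for the same data is easy to demonstrate. The sketch below assumes two standard agglomerative methods, single linkage and complete linkage, and applies both to six invented values on a single variable. The function name `agglomerate` is hypothetical.

```python
def agglomerate(values, n_clusters, linkage):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    `linkage` turns all between-cluster distances into one number:
    min gives single linkage, max gives complete linkage.
    """
    clusters = [[v] for v in values]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gaps = [abs(a - b) for a in clusters[i] for b in clusters[j]]
                d = linkage(gaps)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Invented data: the gaps between neighbouring values widen steadily.
data = [0, 10, 21, 33, 46, 60]

single = agglomerate(data, 2, min)    # [[0, 10, 21, 33, 46], [60]]
complete = agglomerate(data, 2, max)  # [[0, 10, 21, 33], [46, 60]]
```

    Single linkage chains neighbouring values together and leaves one observation on its own, whereas complete linkage produces two more compact groups. Neither solution is "correct"; which is more useful depends on the data and the purpose of the classification.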


    Fig: Data can sometimes be clustered simply by visual examination

    The set of clusters produced by a clustering method can serve a variety of purposes. At one level they may simply represent a convenient method for organising a large dataset so that information may be retrieved more efficiently. In general, however, investigators are likely to be looking for their solutions to represent some more fundamental feature of the data, which is why the caveats offered above are necessary to prevent artefact becoming theory.

    Peacock, Bland, and Anderson use cluster analysis in an attempt to identify women at risk of preterm birth. Seven binary variables are used to characterise each woman in the study who spontaneously delivered preterm. The method of cluster analysis used, association analysis, is monothetic; that is, division of the data into clusters is based on the possession or otherwise of a single specified characteristic. Division variables are chosen in a particular way, as described in the paper. The authors report a solution with three clusters, although the first two of these seem to be similar.
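
    A single monothetic division can be sketched as follows. The sketch assumes the classical criterion for association analysis, in which the division variable is the one with the largest sum of chi-square statistics with all the other variables; the rule actually used by Peacock, Bland, and Anderson is described in their paper and may differ. The data are invented, with three unnamed binary variables rather than seven, and the function names are hypothetical.

```python
def chi_square(x, y):
    """Pearson chi-square for the 2x2 table of two binary variables."""
    n = len(x)
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def association_scores(rows):
    """Sum, for each variable, its chi-square with every other variable."""
    n_vars = len(rows[0])
    cols = [[r[v] for r in rows] for v in range(n_vars)]
    return [sum(chi_square(cols[v], cols[w])
                for w in range(n_vars) if w != v)
            for v in range(n_vars)]

def monothetic_split(rows):
    """Divide the data on possession or otherwise of one characteristic."""
    scores = association_scores(rows)
    variable = max(range(len(scores)), key=scores.__getitem__)
    present = [r for r in rows if r[variable] == 1]
    absent = [r for r in rows if r[variable] == 0]
    return variable, present, absent

# Invented data: one row per woman, one column per binary characteristic.
women = [(1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 1),
         (0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 1, 0)]

scores = association_scores(women)
variable, present, absent = monothetic_split(women)
```

    Repeating the division within each resulting subgroup gives a hierarchy of clusters, each defined by a sequence of single characteristics.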

    The use of cluster analysis in such a study is certainly a novel way of attacking the problem of identifying subsets of risk factors among highly correlated explanatory variables. It is unfortunate that the authors used a rather outdated clustering method, one based on divisions involving a single variable at each stage. It would be interesting to compare the results obtained with those given by a method such as latent class analysis,2 which is also specifically designed for binary variables, but is underpinned by a relatively sensible statistical model. Nevertheless, Peacock, Bland, and Anderson are to be congratulated for alerting other investigators to the advantages of using cluster analysis in this type of study.
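
    For comparison, the statistical model behind latent class analysis can be sketched in a few lines. The sketch assumes the usual formulation: each individual belongs to one of a small number of unobserved classes, the binary variables are independent within a class, and the parameters are estimated by the EM algorithm. The choice of two classes, the starting values, the invented data, and the function name `latent_class_em` are all assumptions made for illustration.

```python
import math

def latent_class_em(rows, n_classes=2, iterations=50):
    """Fit a latent class model to binary data by EM (illustrative only)."""
    n, m = len(rows), len(rows[0])
    # Assumption: equal class sizes and deliberately unequal starting
    # probabilities, so that the classes can separate.
    weights = [1.0 / n_classes] * n_classes
    probs = [[(k + 1) / (n_classes + 1)] * m for k in range(n_classes)]
    resp = []
    for _ in range(iterations):
        # E step: posterior probability of each class for each individual.
        resp = []
        for r in rows:
            like = [weights[k] * math.prod(p if x else 1 - p
                                           for x, p in zip(r, probs[k]))
                    for k in range(n_classes)]
            total = sum(like)
            resp.append([value / total for value in like])
        # M step: update class sizes and within-class response probabilities.
        for k in range(n_classes):
            nk = sum(rr[k] for rr in resp)
            weights[k] = nk / n
            for j in range(m):
                p = sum(rr[k] * r[j] for rr, r in zip(resp, rows)) / nk
                # Keep estimates away from exactly 0 or 1 for stability.
                probs[k][j] = min(max(p, 1e-6), 1 - 1e-6)
    return weights, probs, resp

# Invented data: four binary variables for eight individuals.
rows = [(1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1), (1, 1, 1, 1),
        (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 0, 0)]

weights, probs, resp = latent_class_em(rows)
# Allocate each individual to the class with the highest posterior probability.
classes = [max(range(2), key=r.__getitem__) for r in resp]
```

    Unlike a monothetic division, every variable contributes to each individual's class membership, and the posterior probabilities indicate how firmly each individual is allocated.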

