Learning In Practice

Detecting cheating in written medical examinations by statistical analysis of similarity of answers: pilot study

BMJ 2005; 330 doi: https://doi.org/10.1136/bmj.330.7499.1064 (Published 05 May 2005) Cite this as: BMJ 2005;330:1064
I C McManus, professor of psychology and medical education (i.mcmanus{at}ucl.ac.uk),1 Tom Lissauer, officer for examinations,2 S E Williams, psychometrician2

1 Department of Psychology, University College London, London WC1E 6BT
2 Examinations Department, Royal College of Paediatrics and Child Health, London W1W 6DE

Correspondence to: I C McManus

Accepted 22 March 2005

Abstract

Objective To assess whether a computer program using a variant of Angoff's method can detect anomalous behaviour indicative of cheating in multiple choice medical examinations.

Design Statistical analysis of 11 examinations held by the Royal College of Paediatrics and Child Health.

Setting UK postgraduate medical examination.

Participants Examination candidates.

Main outcome measures Detection of anomalous candidate pairs by regression of similarity of correct answers in all possible pairs of candidates on the overall proportion of correct answers. Anomalous pairs were subsequently assessed in terms of examination centres and the seating plan of candidates, to assess adjacency.

Results The 11 examinations were taken by a total of 11 518 candidates, and Acinonyx examined 6 178 628 pairs of candidates. Two examinations showed no anomalies, and one examination showed an anomaly resulting from a scanning error. The other eight examinations showed 13 anomalies compatible with cheating. In each of these pairs the two candidates had sat the examination at the same centre, and in the six examinations with seating plans the candidates in the anomalous pairs had been seated side by side. The raw probabilities of the anomalies varied from 3.9×10⁻¹¹ to 9.3×10⁻³⁰ (median = 1.1×10⁻¹⁷), with Bonferroni-corrected probabilities in the range 2.4×10⁻⁵ to 4.1×10⁻²⁴ (median = 1.6×10⁻¹¹). This suggests that one anomalous pair is found for every 1000 or so candidates taking this postgraduate examination.

Conclusions This statistical technique identified a small proportion of candidates who had very similar patterns of correctly answered questions. The likelihood is that one candidate has copied from the other, or that there was collusion, or that a technical error occurred in the exams department (as happened in a single case). Analysis of similarities can be used to identify cheating and as part of the quality assurance process of postgraduate medical examinations.

Introduction

“Ninety-two coins spun consecutively have come down heads ninety-two consecutive times… One, probability is a factor which operates within natural forces. Two, probability is not operating as a force. Three, we are now within un-, sub- or super-natural forces. Discuss.”

Tom Stoppard, Rosencrantz and Guildenstern are Dead

Cheating occurs at all levels of education,1–3 in medicine and elsewhere,4–8 and postgraduate examinations are unlikely to be exempt. Cheating threatens examination validity and thereby health care. However, conventional invigilation is only partially effective in preventing cheating.1 This paper describes Acinonyx, a computer program which adapts Angoff's validated method for identifying unduly similar answers from pairs of candidates.9 Reasons for excessive similarity include copying and spontaneous or premeditated collusion between candidates, perhaps supported by communications technology. Acinonyx cannot distinguish these processes, or determine which candidate has copied from which.

Method

Software—Acinonyx is written in C++ and also uses the REGRESSION program of SPSS to implement a version of Angoff's A index.9 It is applicable to any objectively marked examination (multiple true-false with or without negative marking; best of five; extended matching; etc), requiring only a knowledge of the questions answered correctly by each candidate.

Statistical method—Let candidate I answer Ri questions correctly in an exam, candidate J answer Rj questions correctly, and Rij be the number of correct answers shared by the two candidates. Rij alone is not a good measure of similarity because the number of shared correct answers increases with examinee knowledge. Acinonyx follows Angoff9 in examining Rij in relation to Ri and Rj, but assesses the unusualness of Rij by calculating the residual of Rij after regression on [expression missing in source] and [expression missing in source]. Residuals are normally distributed and are expressed as probabilities.
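The residual approach described above can be sketched in Python. This is a minimal illustration, not the Acinonyx implementation (which is written in C++ and uses SPSS). The two regression predictors are not recoverable from this text, so the sketch ASSUMES they are the product Ri×Rj and its square root; the function names `ols_residuals` and `rank_pairs` are hypothetical.

```python
from itertools import combinations
from math import erfc, sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ols_residuals(y, predictors):
    """Residuals of y after least-squares regression on an intercept plus the
    predictor columns, using modified Gram-Schmidt orthogonalisation.
    Returns (residuals, number of fitted terms)."""
    basis = []
    for col in [[1.0] * len(y)] + [[float(x) for x in p] for p in predictors]:
        scale = sqrt(dot(col, col))
        v = list(col)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = sqrt(dot(v, v))
        if norm > 1e-9 * scale:  # skip columns that add no new information
            basis.append([vi / norm for vi in v])
    r = [float(yi) for yi in y]
    for q in basis:
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return r, len(basis)


def rank_pairs(correct):
    """correct[i][q] is truthy if candidate i answered question q correctly.
    Returns (i, j, standardised residual, raw one-tailed p), most anomalous first.
    Needs more candidate pairs than fitted regression terms."""
    totals = [sum(1 for x in row if x) for row in correct]            # Ri
    pairs = list(combinations(range(len(correct)), 2))
    r_ij = [sum(1 for a, b in zip(correct[i], correct[j]) if a and b)
            for i, j in pairs]                                        # Rij
    prod = [totals[i] * totals[j] for i, j in pairs]
    # ASSUMPTION: predictors are sqrt(Ri*Rj) and Ri*Rj (not confirmed by the text).
    resid, k = ols_residuals(r_ij, [[sqrt(p) for p in prod], prod])
    sd = sqrt(dot(resid, resid) / (len(resid) - k))
    ranked = [(i, j, e / sd, 0.5 * erfc(e / sd / sqrt(2)))
              for (i, j), e in zip(pairs, resid)]
    return sorted(ranked, key=lambda t: -t[2])
```

Calling `rank_pairs` on a list of per-candidate correctness vectors returns every pair ordered by standardised residual, so the head of the list holds the pairs that share more correct answers than their individual scores would predict.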

Significance testing—With N candidates there are N × (N − 1)/2 pairs of candidates (that is, 1 999 000 pairs when N = 2000), making necessary a correction for alpha inflation (multiple significance testing). Acinonyx calculates a raw, uncorrected probability, Praw, which is adjusted for multiple testing by a Bonferroni correction, giving a corrected probability, Pcorrected:

\[ P_{\text{corrected}} = P_{\text{raw}} \times \frac{N(N-1)}{2} \]

Alpha is set conservatively at P < 0.001, and a pair is regarded as anomalous if Pcorrected is < 0.001. Therefore, for 2000 candidates a pair is anomalous if Praw is < 0.001/1 999 000 = 5.0×10⁻¹⁰.
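The threshold arithmetic above can be checked with a short Python sketch. The function names are illustrative and not part of Acinonyx. The correction is taken to be the raw probability multiplied by the number of pairs, which is what the worked figure for 2000 candidates implies.

```python
def n_pairs(n_candidates):
    """Number of distinct candidate pairs, N x (N - 1) / 2."""
    return n_candidates * (n_candidates - 1) // 2


def bonferroni_corrected(p_raw, n_candidates):
    """Raw probability multiplied by the number of pairs tested, capped at 1."""
    return min(1.0, p_raw * n_pairs(n_candidates))


def raw_threshold(n_candidates, alpha=0.001):
    """Largest raw probability that still gives a corrected value below alpha."""
    return alpha / n_pairs(n_candidates)


if __name__ == "__main__":
    print(n_pairs(2000))                 # 1999000
    print(f"{raw_threshold(2000):.1e}")  # 5.0e-10
```

Because the number of pairs grows roughly with the square of the number of candidates, the raw probability needed to flag a pair falls quickly as the examination gets larger.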

Centres and seating plans—Acinonyx does not know the seating of candidates (and many are sitting in different centres). Seating plans are used to check the validity of apparently anomalous pairings.

Examination—Until December 2003, Part 1 of the Membership Examination of the Royal College of Paediatrics and Child Health (the MRCPCH) consisted of a single paper. Since 2004 there have been two papers, Paper One A (basic child health) and Paper One B (extended paediatrics). All examinations are marked by computer and do not use negative marking.

Results

Here we describe three examinations; further examples can be found on bmj.com.

Examination I—MRCPCH Part 1, 2003/2, had 300 questions and was taken by 1099 candidates. The 603 351 pairs of candidates are plotted in the figure, with Rij plotted vertically and [expression missing in source] horizontally. The regression explains 95% of the variance (R = 0.975), and residuals are normally distributed (see statistical appendix on bmj.com), with a range from −5.400 to 5.528, corresponding to raw, one tailed probabilities of 3.3×10⁻⁸ and 1.6×10⁻⁸ and corrected probabilities of 0.02 and 0.01, which do not reach the criterion of Pcorrected < 0.001. This examination therefore showed no anomalies, and it illustrates that residuals are normally distributed.
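As a hedged check of the figures quoted for Examination I, the sketch below converts a standardised residual to a one tailed probability under a standard normal distribution and applies the pairwise correction. It uses only `math.erfc` from the Python standard library. The function names are illustrative, and the pair count is computed from the 1099 candidates rather than taken from the text.

```python
from math import erfc, sqrt


def one_tailed_p(z):
    """Probability of a standard normal value at least as extreme as |z|, in one tail."""
    return 0.5 * erfc(abs(z) / sqrt(2))


def corrected_p(z, n_candidates):
    """One-tailed probability multiplied by the number of candidate pairs, capped at 1."""
    pairs = n_candidates * (n_candidates - 1) // 2
    return min(1.0, one_tailed_p(z) * pairs)


if __name__ == "__main__":
    for z in (-5.400, 5.528):
        print(f"z = {z:+.3f}  raw = {one_tailed_p(z):.1e}  "
              f"corrected = {corrected_p(z, 1099):.2f}")
```

For the two extreme residuals this gives raw probabilities of about 3.3×10⁻⁸ and 1.6×10⁻⁸ and corrected probabilities of about 0.02 and 0.01, in line with the values reported for this examination.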

Examination II—MRCPCH 2004/2 Paper One A had 244 questions and was taken by 1298 candidates. The figure shows the 841 753 pairs of candidates; one pair, shown in red, has a standardised residual of 8.6, a raw probability of 1.1×10⁻¹⁷, and a corrected probability of 9.0×10⁻¹², and hence Pcorrected < 0.001. These two candidates, who answered 170 and 178 items correctly, with 164 shared answers, were found on the seating plan to have been seated side by side; one passed and one failed. The latter subsequently took Parts 1A and 1B of the 2004/3 diet (941 and 1084 candidates), and was in the only anomalous pair in each of these examinations (Pcorrected = 4.1×10⁻²⁴ and 3.7×10⁻²¹; see bmj.com).

Examination III—Paper One B of MRCPCH 2004/2 had 244 questions and 1251 candidates. One of the 781 875 candidate pairs (figure) had a standardised residual of 7.8 (Pcorrected = 7.1×10⁻⁹). The computer file showed 177 and 180 correct items, with 172 shared answers. However, the candidates sat the examination in different cities. Questions are answered on a single response sheet with the 200 multiple true-false questions and 44 other questions scanned separately and the data sets then merged. The first 200 answers were identical. The actual answer sheets showed a scanning error had resulted in one answer sheet inadvertently being entered twice.
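A duplicated scan of this kind can also be caught by a direct check for identical response blocks. The sketch below is an illustrative quality-control step and is not described as part of Acinonyx. The function name, the data layout (a mapping from candidate identifier to a sequence of raw responses), and the block boundaries are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def identical_block_pairs(responses, start=0, stop=200):
    """Return sorted pairs of candidate ids whose responses to questions
    start..stop-1 are exactly identical (hypothetical helper)."""
    groups = defaultdict(list)
    for cand_id, answers in responses.items():
        groups[tuple(answers[start:stop])].append(cand_id)
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


if __name__ == "__main__":
    demo = {"A": "TFTFT", "B": "TFTFF", "C": "TTTTT"}
    print(identical_block_pairs(demo, 0, 4))  # [('A', 'B')]
```

Identical long blocks of raw responses, wrong answers included, point more naturally to a data-entry problem than to copying, so a check like this is best read as a prompt to inspect the original answer sheets.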

Overall results—We analysed 11 consecutive MRCPCH Part 1 examinations (2002/2 to 2004/3 A and B), which were taken by 11 518 candidates, comprising 6 178 628 candidate pairs. Seating plans were available only for the year 2004. One anomalous pair resulted from an administrative error, whereas 13 anomalous pairs were compatible with cheating (one pair for every 886 candidates), although two anomalous pairs consisted of the same two candidates. Pcorrected values for anomalies were in the range 2.4×10⁻⁵ to 4.1×10⁻²⁴ (N = 13; median = 1.6×10⁻¹¹). In the six exams where seating plans were available the candidates in each anomalous pair had been seated side by side. Of the 12 independent pairs, both candidates failed in seven cases, one passed in three cases, and both passed in two cases.

Discussion

Acinonyx identifies anomalous pairs of candidates which require investigating (and meet standard forensic requirements for scientific evidence10). Action requires other evidence. Seating plans, notes in question booklets, changed answers, information from invigilators, other surveillance, and interviews with candidates may show culpability.

Examiners raise a number of questions and objections about Acinonyx that are worth considering (see also bmj.com).

Statistical issues—Although “the evidence is only statistical,” statistics are facts and are widely used to guide actions throughout medicine. Although rare events do occur by chance, particularly with large numbers of candidates, Examination I shows that the method effectively eliminates type I errors. The extremely small probabilities are sometimes difficult to interpret, and are better expressed in terms of games of chance: 10⁻²⁰, for instance, is the likelihood of tossing 64 successive heads, or of winning the UK National Lottery in three successive weeks. Additional statistical support also comes from seating plans: for examination II, with 1298 candidates, the probability that the second member of an anomalous pair was one of the eight seated adjacent to the first is only 1 in (1297/8) = 1 in 162, P = 0.006.
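The coin-tossing and seating figures above can be reproduced in a few lines of Python. This is an illustrative check with hypothetical function names; the seating calculation assumes, as the text does, that each candidate has eight adjacent seats and that any other candidate is equally likely to be the second member of the pair.

```python
def prob_all_heads(n_tosses):
    """Probability that a fair coin lands heads on every one of n_tosses."""
    return 0.5 ** n_tosses


def prob_adjacent_by_chance(n_candidates, n_adjacent_seats=8):
    """Chance that a randomly chosen other candidate sits in an adjacent seat."""
    return n_adjacent_seats / (n_candidates - 1)


if __name__ == "__main__":
    print(f"{prob_all_heads(64):.1e}")        # 5.4e-20
    p = prob_adjacent_by_chance(1298)
    print(f"1 in {round(1 / p)}, P = {p:.3f}")  # 1 in 162, P = 0.006
```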

Candidates may give similar answers because they have studied together—If so then anomalous pairs would be found in candidates sitting in different centres, but they are not, here or elsewhere.3

The evidence is only circumstantial— “The rule of probability”11 means that circumstantial evidence can be highly probative, particularly when corroborated by seating plans, coincidences in wrong answers in best of five and extended matching questions, answers erased in favour of another answer, annotation of question booklets, performance in previous examinations, and evidence from invigilators and other candidates.

The sensitivity, specificity, and validity of the technique are not known—Angoff demonstrated that his indices were substantially raised in 50 “known and admitted copiers.”9 Monte Carlo analysis confirms the sensitivity and specificity of Acinonyx (see bmj.com).

Postgraduate examinations should take other steps to prevent cheating—Measures to minimise cheating by close investigation, avoiding tiered lecture theatres or closely placed desks, and other methods should be taken. Acinonyx can itself be used to monitor the effectiveness of prevention.

It is a “victimless crime”—The victims are patients treated inappropriately by improperly qualified doctors. In the United Kingdom, cheating violates the guidelines in Good Medical Practice (“as a doctor you must be honest and trustworthy”),12 and the General Medical Council has already disciplined a doctor for cheating, partly on the basis of statistical evidence.13

Conclusions

Acinonyx identifies anomalous pairs of candidates who are probably cheating, and also acts as a quality control, as shown by its detection of a scanning error, so far the only one found in this and other examinations. Acinonyx does not provide direct evidence about which candidate did the copying, and further investigation is required. Although examining bodies dislike investigating innocent candidates, their obligations to other candidates, the profession, and patients require them to protect the integrity of examinations. Examining bodies should remind candidates of the importance of keeping answers concealed, and bodies that adopt such statistical methods should inform candidates about their use and should have appropriate investigative and disciplinary procedures in place.14

What is already known on this topic

Cheating is common in examinations

Angoff's method is a validated technique for detecting copying in multiple choice examinations

What this study adds

Acinonyx, a computer program incorporating a modified Angoff's method, finds undue similarity (“anomalies”) between pairs of candidates taking a postgraduate medical examination

Anomalous pairs of candidates are seated adjacent to one another, and the similarity probably results from copying

About one anomalous pair is found for every 1000 candidates taking postgraduate examinations

An extended version of the paper and information on power calculations and Monte Carlo simulation are on bmj.com

Acknowledgments

We are grateful to a number of colleagues who have discussed the ideas and analyses presented in this paper, and particularly to Mr Bertie Leigh for his detailed comments on the manuscript. The software is available free of charge to non-commercial and educational organisations from i.mcmanus{at}ucl.ac.uk

Footnotes

  • Contributors ICM developed the statistical analysis. Applying it to this examination was initiated by TL and SEW, with SEW assisting with statistical analysis. ICM wrote the first draft of the paper, and all authors contributed to the final draft. ICM is the guarantor for the paper.

  • Funding None.

  • Conflict of interest ICM has given unpaid educational advice to several postgraduate medical examination boards. TL is the honorary officer for examinations for the Royal College of Paediatrics and Child Health. SEW is a paid employee of the RCPCH examinations department.

  • Ethical approval Not required.
