Re: Revalidation seems to add little to the current appraisal process
Although revalidation may improve standards by encouraging reflection and CPD activities, Nigel Hawkes’ article leaves the reader with the view that the revalidation process will be ineffective because ‘doc does not eat doc’. Of greater concern, however, and by far the biggest threat to individual doctors, the NHS and the taxpayer, is the possibility that ‘big doc will inadvertently eat little doc’.
The use of validated and reliable tools to collect data such as multisource feedback should prevent unfair play, but incorrect application and analysis can render these tools unsafe.
Multisource feedback (MSF) assessment refers to the collection of data from colleagues or patients regarding clinical performance using a questionnaire, and is used internationally as part of both revalidation and remediation assessment procedures [1,2]. The design of MSF tools has been the subject of several studies to determine validity and reliability.
These studies have focused firstly on whether the questions map onto the generic competences across the domains of Good Medical Practice, and secondly on the number of questions asked and the number of responses required to give a reliable result. In addition, practical aspects such as the time taken to complete the questionnaire and the cost of administration have been investigated. However, very little information is available in the literature on what is arguably the most important aspect: the way in which the data are analysed and reported. For the purposes of revalidation, MSF results are intended, at least initially, to be formative in nature, and guidance advises that caution is necessary when interpreting the results [3]. The application of MSF in current remediation performance assessments by the National Clinical Assessment Service (NCAS) in the UK, however, uses the data to calculate a numerical score. Serious concerns have come to light regarding the application of MSF in these assessments, particularly with regard to data analysis.
Multisource feedback assessment scoring system
The Sheffield Peer Rating Assessment Tool (SPRAT), as used in remediation performance assessments, asks colleague assessors to rate the practitioner as 1 (very poor), 2 (poor), 3 (needs development), 4 (satisfactory), 5 (good) or 6 (very good) for each of 30 questions across the domains of Good Medical Practice. The responses are used to calculate a mean score for each question, expressed to two decimal places, which implies a degree of precision that does not in fact exist. This mean score is given in the report and also translated back to a category describing the doctor, using the scoring system shown in Box 1.
The scoring system may seem reasonable at first sight but on further scrutiny is plainly incorrect. For example, if 10 people were asked to rate performance, 9 gave a rating of 5 (good) and one gave a rating of 4 (satisfactory), the mean would be 4.9, which under the scoring system shown in Box 1 translates to ‘satisfactory’, when in fact 9 out of 10 assessors had described performance as ‘good’. This is clearly not a satisfactory representation of the information collected.
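The arithmetic can be sketched in a few lines of Python. The exact Box 1 cut-offs are not reproduced in the text, so the rule used here (the mean is truncated to the category below it, i.e. any mean short of 5.0 is reported as ‘satisfactory’) is an assumption for illustration only:

```python
from statistics import mean
from collections import Counter

LABELS = {1: "very poor", 2: "poor", 3: "needs development",
          4: "satisfactory", 5: "good", 6: "very good"}

# 9 colleagues rate the doctor "good" (5), one rates "satisfactory" (4).
ratings = [5] * 9 + [4]

m = mean(ratings)                                      # 4.9
reported = LABELS[int(m)]                              # assumed Box 1 rule
modal = LABELS[Counter(ratings).most_common(1)[0][0]]  # what most assessors said

print(m, reported, modal)  # 4.9 satisfactory good
```

The modal (most common) response, ‘good’, describes what 9 of the 10 assessors actually said, yet the reported category under the assumed cut-off is ‘satisfactory’.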
In addition to the problems of devising a scoring system, the use of a mean score is suspect and problematic for other reasons. A mean value should only be used where the data show a normal distribution. Although this might usually be expected, under the conditions of a poor performance assessment, where problems may have arisen for various reasons, a normal distribution is not guaranteed. Box 2 shows data that might have been collected using MSF for an individual doctor in a performance assessment. The graph is a frequency plot and in this case shows that the majority of colleagues rate the doctor as ‘good’ or ‘very good’ on the question ‘Gather relevant data to make a sound clinical judgement’, yet under the scoring system given in Box 1 the mean value translates to ‘satisfactory’ and therefore clearly does not describe the data well. In fact, where the distribution of responses is skewed, as in this example, the data may hold very important information about the circumstances surrounding the doctor’s position; however, this does not appear to be investigated further in current performance assessments for remediation purposes [2].
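The sensitivity of the mean to a skewed distribution can be sketched with a hypothetical set of responses (the figures below are illustrative, not taken from Box 2, and the 5.0 cut-off for ‘good’ is likewise an assumption):

```python
from statistics import mean, median

# Hypothetical skewed MSF responses: most colleagues rate the doctor
# "good" (5) or "very good" (6), but two outliers rate "poor" (2).
ratings = [6, 6, 5, 5, 5, 5, 5, 5, 2, 2]

print(mean(ratings))    # 4.6 -> below 5.0, so reported as "satisfactory"
print(median(ratings))  # 5.0 -> the typical colleague actually said "good"
```

Two low outliers drag the mean below the typical response; the median (or the full frequency plot, as in Box 2) preserves the information that most colleagues rated the doctor ‘good’ or better, and the outliers themselves may merit investigation rather than averaging away.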
Choice of colleague assessors
The choice of assessors is an area that needs careful consideration and gives further cause for concern. In the SPRAT tool, as used currently in UK remediation performance assessment, the practitioner is asked to select 10 colleagues and the referring body is asked to select a further 10. All the assessors know that they are completing the form as part of a ‘potentially poor performance’ assessment, and it is reasonable to suspect that some may be subconsciously biased, as the doctor has already been labelled as ‘poor’ by the regulatory authorities. Apart from identifying duplicate nominations, no other exclusion criteria appear to be used. However, those who are not impartial to the outcome should not be included, for example the colleague or colleagues who made the original complaint against the practitioner, as their judgement is also being assessed by the process.
A recent publication by Archer and McAvoy suggested that practitioners should not be allowed to self-select assessors, as this may lead to more favourable results [2]. Interestingly, the paper did not comment on the fact that the referring body is likely to select colleagues who will agree with its prior judgement of poor performance. It may be inevitable and unavoidable that the practitioner will tend to select assessors they think will be generous towards them, and that the assessors selected by the referring body will tend to score lower than the true position. In this situation it is important that there is as much potential to influence the score positively as there is to influence it negatively. For this to be the case, the number of categories below ‘satisfactory’ should equal the number above. In the present form of the SPRAT there are 3 categories below ‘satisfactory’ but only 2 above, allowing greater capacity to lower the mean score (Box 1).
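The asymmetry is easy to quantify: with ‘satisfactory’ at 4 on a 1–6 scale, a single hostile rating can sit up to 3 points below the midpoint, while a single generous rating can sit at most 2 points above it. A minimal sketch:

```python
from statistics import mean

baseline = [4] * 9  # nine "satisfactory" ratings

# Add one extra rater, as hostile or as generous as the scale allows:
with_hostile = mean(baseline + [1])   # 3.7 -> mean pulled 0.3 below 4
with_generous = mean(baseline + [6])  # 4.2 -> mean pushed only 0.2 above 4

print(with_hostile, with_generous)  # 3.7 4.2
```

One maximally hostile assessor moves the mean half again as far as one maximally generous assessor can, so the two nominating parties do not have equal leverage over the score.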
Reporting of patient feedback data
The same scientifically flawed system is used to analyse the patient feedback data. A further concern with the patient feedback results is that the categories the patient is given to describe the doctor are not the same as the categories used to report the results (see Box 3). For example, if 59 patients thought that the doctor was ‘same as most doctors’ and one patient scored the doctor as 2, the performance assessment report would conclude that patients had rated the doctor as ‘poor/needs development’.
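Box 3 is not reproduced in the text, so the mapping below is purely hypothetical: it assumes that ‘same as most doctors’ corresponds to a score of 3, and that a mean falling below 3.0 is reported as ‘poor/needs development’. Under those assumptions, a single low rating among 60 responses is enough to tip the reported category:

```python
from statistics import mean

# Hypothetical mapping (Box 3 is not reproduced in the text):
# 59 patients answer "same as most doctors" (assumed score 3),
# one patient scores the doctor as 2.
ratings = [3] * 59 + [2]

m = mean(ratings)  # 179/60, just under 3.0
label = "poor/needs development" if m < 3.0 else "same as most doctors"

print(round(m, 2), label)  # 2.98 poor/needs development
```

A single dissenting patient out of 60 flips the reported category, even though 98% of respondents gave the same middle answer.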
Given the ease with which MSF numerical data can be misrepresented, it is important that responsible officers, and indeed all doctors, are aware of the potential pitfalls. Furthermore, in the light of these serious flaws, a thorough review of current remediation performance assessment processes is required as a matter of extreme urgency. After all, we demand that our doctors practise with the highest standards of scientific rigour: shouldn’t our doctors expect that the same scientific standards will apply when they are assessed?
1. Campbell JL, Roberts M, Wright C, Hill J, Greco M, Taylor M, et al. Factors associated with variability in the assessment of UK doctors' professionalism: analysis of survey results. BMJ 2011;343:d6212.
2. Archer JC, McAvoy P. Factors that might undermine the validity of patient and multi-source feedback. Med Educ 2011;45(9):886-93.
3. Campbell J, Wright C. (ed.) GMC Multi-Source Feedback Questionnaires. Interpreting and handling multisource feedback results: Guidance for appraisers. General Medical Council. 2012. http://www.gmc-uk.org/Information_for_appraisers.pdf_48212170.pdf
Competing interests: No competing interests