- John L Campbell, professor1,
- Martin Roberts, research fellow1,
- Christine Wright, research fellow1,
- Jacqueline Hill, research fellow1,
- Michael Greco, associate professor2,
- Matthew Taylor, project manager2,
- Suzanne Richards, senior lecturer1
- 1Peninsula College of Medicine and Dentistry, Exeter EX1 2LU, UK
- 2Client Focused Evaluation Programme (UK), Exeter
- Correspondence to: J L Campbell
- Accepted 26 August 2011
Objectives To investigate potential sources of systematic bias arising in the assessment of doctors’ professionalism.
Design Linear regression modelling of cross sectional questionnaire survey data.
Setting 11 clinical practices in England and Wales.
Participants 1065 non-training grade doctors from various clinical specialties and settings, 17 031 of their colleagues, and 30 333 of their patients.
Main outcome measures Two measures of a doctor’s professional performance using patient and colleague questionnaires from the United Kingdom’s General Medical Council (GMC). We selected potential predictor variables from the characteristics of the doctors and of their patient and colleague assessors.
Results After we adjusted for characteristics of the doctor as well as characteristics of the patient sample, less favourable scores from patient feedback were independently predicted by doctors having obtained their primary medical degree from any non-European country; doctors practising as a psychiatrist; lower proportions of white patients providing feedback; lower proportions of patients rating their consultation as being very important; and lower proportions of patients reporting that they were seeing their usual doctor. Lower scores from colleague feedback were independently predicted by doctors having obtained their primary medical degree from countries outside the UK and South Asia; currently employed in a locum capacity; working as a general practitioner or psychiatrist; being employed in a staff grade, associate specialist, or other equivalent role; and with a lower proportion of colleagues reporting they had daily or weekly professional contact with the doctor. In fully adjusted models, the doctor’s age, sex, and ethnic group were not independent predictors of patient or colleague feedback. Neither the age nor the sex profiles of the patient or colleague samples were independent predictors of doctors’ feedback scores, nor was the ethnic group of colleague samples.
Conclusions Caution is necessary when considering patient and colleague feedback regarding doctors’ professionalism. Multisource feedback undertaken for revalidation using the GMC patient and colleague questionnaires should, at least initially, be principally formative in nature.
In recent years, multisource feedback—the process of obtaining feedback from subordinates, peers, and supervisors—has been increasingly used in the business and health sectors to provide valuable information for workers about their performance, and as a means by which managers might stimulate improved performance. Previous research has suggested a complementary role for multisource feedback in performance appraisal.1
Regulatory bodies have the responsibility of monitoring the performance of doctors within their jurisdiction. In the United Kingdom, the General Medical Council (GMC) has proposed that doctors should undergo a process of revalidation, in which a clinician on the GMC register secures a continuing licence to practise on the grounds that they have demonstrated that they are “up to date and fit to practise” medicine.2 All doctors on the GMC register were first issued with licences in 2009; revalidation is expected to be required from late 2012.3
Multisource feedback is seen as a source of valuable evidence to support or refute a doctor’s application to revalidate. Multisource feedback is proposed as a central component in virtually all models of revalidation currently being considered by authoritative bodies in the UK2 3 4 5 and elsewhere.6 7 For doctors who see patients, obtaining patient feedback is envisaged as part of the multisource feedback process.
In 2004, the GMC developed two survey instruments, proposed to support doctors in obtaining feedback from their patients and colleagues. Such approaches seek to assess whether a doctor actually does8 deliver a high standard of professional practice by capturing information from workplace based assessors. We have previously provided evidence9 regarding these instruments. After minor modifications to both instruments, we have also reported on the performance of those instruments in a large sample of doctors practising in various clinical settings in the UK.10 The content of these instruments reflects the principles and values of medical professionalism as set out in the GMC’s authoritative guidance for UK doctors.11
Statistical modelling of feedback about doctors’ professionalism provides an opportunity to examine determinants of professional behaviour, inform processes of data collection, and explore potential predictors of the effect of assessors’ and assessees’ characteristics on performance scores. Many studies from the UK and elsewhere, including our own research, have used regression models to investigate the association between an individual doctor’s performance and various characteristics of both the doctor being assessed and of those individuals providing assessment data.10 12 13 14 These studies have modelled the scores provided by individual raters, giving insight into how assessor and assessee characteristics might affect the ratings. Such studies have highlighted, for example, that less than 15% of the variance in doctors’ scores is accounted for by the extent of familiarity between observer and assessee.6 15
We aimed to investigate potential patient, colleague, and doctor related sources of systematic bias arising in the assessment of doctors’ professionalism. In contrast with approaches based on the analysis of individual rater assessments, we have modelled the scores obtained by doctors as an average across a sample of their patient or colleague raters, allowing us to examine the effect on these average scores arising from variations in the profile of the rater groups and of the doctors themselves.
Detailed methods have been reported elsewhere.10 In summary, all non-training grade doctors from 11 sites in England and Wales were invited to take part between March 2008 and January 2011. The settings included four acute hospital trusts, an anaesthetics department from one acute trust, one mental health trust, four primary care organisations, and one independent sector organisation (that is, not part of the UK National Health Service). We aimed to recruit about 1000-1250 doctors across various practice settings and clinical specialties. We did not base this number on a formal sample size calculation, but rather aimed to obtain a sufficiently large sample to allow psychometric assessment of the data collection instruments.10 We staged doctor recruitment and data collection at each site to avoid overburdening individual departments or practices. An internal communication was sent from the medical director or chief executive encouraging the doctors’ participation. Doctors then received an information pack, containing a reply slip to indicate whether they wished to take part. We issued up to two reminders to non-responders.
Participating doctors were invited to identify up to 20 of their colleagues (half of whom were to be medically qualified) to take part in a secure online survey regarding the professionalism of the doctor. A paper alternative was available for colleague participants. Doctors were also invited to distribute, using administrative support if available, a paper based post-consultation questionnaire and prepaid return envelope to 45 consecutive patients. The patient survey (web appendix 1) comprised nine core items relating to the doctor’s performance, each scored using a five point scale. The colleague survey (web appendix 2) comprised 18 core items, which were also scored with a five point scale, using response options from “poor” (1) to “very good” (5) or from “strongly disagree” (1) to “strongly agree” (5) with higher scores indicating more positive ratings. All items included “don’t know” or “not applicable” as relevant options. The personal characteristics of doctor participants were determined on the basis of self reports of a range of characteristics (table 1).
After data return and cleaning, we calculated a summary patient score for each doctor, provided that at least 22 patient questionnaires had been returned, in line with our original instructions to participants to ensure adequate reliability.9 We obtained the patient summary score by first calculating a mean score for each core item across patients where at least six patients had returned a valid score, and then calculating the mean of these item means where at least five of the possible nine core item means were available. We used a similar approach for feedback from colleagues, where at least eight colleagues had completed a questionnaire about the doctor’s performance and more than half of the possible 18 core item means were available.
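The two stage averaging described above can be illustrated with a short sketch. It is not the study's own code: the function name, the data layout (one list of item scores per returned questionnaire, with None standing for “don’t know”, “not applicable”, or a missing response), and the parameter names are assumptions made for illustration. The default thresholds are those stated for the patient survey; the per-item minimum for the colleague survey is not stated in the text and would need to be supplied.

```python
from statistics import mean


def summary_score(questionnaires, n_items=9, min_questionnaires=22,
                  min_valid_per_item=6, min_item_means=5):
    """Return a doctor's summary score, or None if the data are insufficient.

    questionnaires: list of lists, one per returned questionnaire, each
    holding n_items scores (1-5) or None where no valid score was given.
    Defaults follow the thresholds described for the patient survey.
    """
    # Stage 0: enough questionnaires returned for this doctor?
    if len(questionnaires) < min_questionnaires:
        return None

    # Stage 1: mean score per core item, only where enough valid scores exist.
    item_means = []
    for item in range(n_items):
        valid = [q[item] for q in questionnaires if q[item] is not None]
        if len(valid) >= min_valid_per_item:
            item_means.append(mean(valid))

    # Stage 2: mean of the item means, only where enough item means exist.
    if len(item_means) < min_item_means:
        return None
    return mean(item_means)
```

Averaging item means, rather than pooling every individual response, gives each core item equal weight in the summary score even when some items attract more “don’t know” responses than others.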
Predicting doctor’s patient and colleague scores
We used separate linear regression models to examine the association between a doctor’s summary scores and a range of characteristics of the doctor, and of their patient and colleague samples—one model for patient scores, the other for colleague scores. Table 1 summarises the characteristics tested. We selected characteristics that had been identified as potentially important in pre-existing scientific literature.12 13 14 16 17 18 19 20 21 22 We entered the identified predictor variables into the separate regression models for the patient and colleague scores. We used a significance threshold of P≤0.10 to decide which characteristics of the doctors and rater groups should be included as potential independent predictors of the two mean summary scores in multiple regression models. If small subgroup sizes risked breaching anonymity (for example, in relation to the doctor’s ethnic group), we combined categories of the relevant variables (table 1).
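The screening step, in which a characteristic enters the multiple regression only if its univariate association with the summary score reaches P≤0.10, can be sketched as follows. This is a simplified illustration rather than the analysis code used in the study: it handles continuous predictors only (such as a proportion of the rater sample), it approximates the t distribution by a normal distribution, which is reasonable only for samples of the size reported here, and the function and variable names are hypothetical.

```python
import math
from statistics import NormalDist, mean


def univariate_p_value(x, y):
    """Two sided P value for the slope of y regressed on a single predictor x.

    Uses a normal approximation to the t distribution (an assumption that
    is adequate only when the number of doctors is large).
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares and the standard error of the slope.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def screen_predictors(candidates, y, threshold=0.10):
    """Return names of candidate predictors with univariate P <= threshold.

    candidates: dict mapping a (hypothetical) variable name to its values,
    one value per doctor, in the same order as the summary scores y.
    """
    return [name for name, x in candidates.items()
            if univariate_p_value(x, y) <= threshold]
```

A deliberately liberal entry threshold such as 0.10 reduces the chance of discarding a variable that only becomes an independent predictor once other characteristics are adjusted for.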
We regarded variables as significant independent predictors of the summary score if, after correcting for other variables in the model, the resulting P value was less than 0.05. We used bootstrapping to check the validity of the Wald based 95% confidence intervals, in view of the non-normality in the residuals, and we checked the regression models for sensitivity to the P≤0.10 threshold for entering potential predictors. We calculated effect sizes for independent predictors in relation to the magnitude of the standard deviation of the respective patient or colleague score.23
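As an illustration of how a bootstrap check on a regression coefficient might look, the sketch below resamples doctors with replacement and takes a percentile interval for the slope. Several details are assumptions rather than the study's method: the text does not say which bootstrap variant was used, so case resampling with a percentile interval is shown; the model is reduced to a single predictor to keep the example self contained; and the number of replicates and the seed are arbitrary.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Least squares slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slope_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the slope.

    Resamples (x, y) pairs, that is doctors, with replacement, so no
    normality assumption is placed on the residuals.
    """
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # skip a degenerate resample with no variation in x
        slopes.append(ols_slope(xs, ys))
    slopes.sort()
    lower = slopes[int((1 - level) / 2 * len(slopes))]
    upper = slopes[int((1 + level) / 2 * len(slopes)) - 1]
    return lower, upper
```

If the bootstrap interval is close to the Wald based interval from the fitted model, non-normal residuals are unlikely to have distorted the reported confidence intervals.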
Of 2454 invited doctors, 1065 (43%) agreed to take part, returning 30 333 patient questionnaires (mean 32.9 (standard deviation 10.8) per doctor) and 17 031 colleague questionnaires (16.1 (2.7)). Consent from patients and colleagues was indicated by the returning of the questionnaires. For 780 doctors who returned enough questionnaires to derive a patient score, the mean score was 4.80 (standard deviation 0.12, range 3.96-4.99); for 1050 doctors returning enough questionnaires to derive a colleague score, the corresponding score was 4.63 (0.19, 3.57-4.96).
In univariate models, the doctor’s sex, ethnic group, region of primary medical qualification, specialty group, and locum status were significantly associated with variation in patient scores (table 2). The same variables, together with the doctor’s age and current contractual role, were significantly associated with the colleague score. We therefore included these two sets of variables as potential predictors in the respective regression models.
In univariate models, seven patient related variables were significantly associated with the patient score (table 3). Eight colleague related variables were significantly associated with the colleague score (table 4). We also included these two sets of variables as potential predictors in the respective patient-doctor and colleague-doctor regression models.
Predicting the patient summary mean score
Table 5 presents results for the final regression model for the patient summary score, based on data from 718 doctors who provided complete data on all relevant variables. The doctor’s specialty group and region of primary medical qualification, together with the proportions of patients who were white, who regarded their visit as very important, or who were seeing their usual doctor were all independent predictors of patient scores. These predictors explained 21.0% of the variation in those scores. Doctors who had trained in South Asia or in jurisdictions outside the European Economic Area were likely to score lower on patient feedback than doctors trained in the UK. Psychiatrists were predicted to score lower than the general practitioner reference group. Increases in the proportions of patients who reported themselves as white, who regarded their visit as very important, or who reported seeing their usual doctor were all associated with increases in patient summary scores.
A large effect on patient feedback (effect >0.8×patient score standard deviation of 0.120)23 was evident for doctors from the psychiatry specialty group. After controlling for other variables in the analysis, psychiatrists were predicted to score 0.123 points lower than general practitioners and 0.143 points lower than doctors from other medical specialties. A large effect on patient score would also be expected to result from a 64% increase in the proportion of white patients in the sample, and from a 53% increase in the proportion of patients who regarded their visit as very important. Medium effects on patient scores were predicted for doctors who obtained their primary medical qualification in South Asia, and for a 63% increase in the proportion of patients reporting that they were seeing their usual doctor. Other effect sizes in respect of patient scores were small or not significant.
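The scaling of a regression coefficient against the standard deviation of the summary score can be made concrete with a small worked sketch. The coefficient (0.123 points) and standard deviation (0.120) are taken from the text above; the cut-offs of 0.2, 0.5, and 0.8 standard deviations are the conventional values for small, medium, and large effects and are an assumption here, exposed as parameters so that other thresholds can be substituted. The function names are hypothetical.

```python
def standardised_effect(coefficient, score_sd):
    """Size of a regression coefficient in units of the score's standard deviation."""
    return abs(coefficient) / score_sd


def classify_effect(ratio, small=0.2, medium=0.5, large=0.8):
    """Label an effect using assumed conventional cut-offs (in SD units)."""
    if ratio > large:
        return "large"
    if ratio > medium:
        return "medium"
    if ratio > small:
        return "small"
    return "negligible"


# Psychiatrists compared with general practitioners on the patient score:
# predicted difference of 0.123 points, patient score SD of 0.120.
psychiatry_ratio = standardised_effect(-0.123, 0.120)
psychiatry_label = classify_effect(psychiatry_ratio)
```

On these figures the predicted difference for psychiatrists is slightly more than one standard deviation of the patient score, which is why it is described as a large effect even though the absolute difference is only about a tenth of a point on a five point scale.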
Predicting the colleague related mean score
Table 6 shows results of the regression modelling for the colleague summary score, based on data from 949 doctors who provided complete data on all relevant variables. The doctor’s specialty group, region of primary medical qualification, current contractual role, and locum status, together with the proportion of colleagues who reported daily or weekly contact with the doctor, were all independent predictors of the colleague summary score, together explaining 16.7% of the variation in those scores.
After controlling for other variables in the analysis, doctors trained outside the UK, except for those trained in South Asia, were likely to score lower than UK trained doctors. Consultants and general practitioners were likely to score 0.074 points higher than doctors in other contractual roles, whereas doctors in locum posts were likely to score 0.093 points lower than those in permanent positions. Doctors in medical, surgical, and other specialty groups were predicted to score higher than the general practitioner reference group (by 0.091, 0.063, and 0.064 points, respectively). An increase in the proportion of colleagues reporting familiarity with the doctor’s performance, based on daily or weekly contact with the doctor, was associated with an increase in colleague scores.
We did not see any large effects on colleague score arising from any of the variables examined. However, medium effects (effect >0.5×colleague score standard deviation of 0.194)23 were evident for doctors from medical, surgical, and other specialties compared with psychiatrists, and for a 70% increase in the proportion of colleagues reporting daily or weekly contact with the doctor during their period of familiarity.
Summary of main findings
Using information obtained from the patients and colleagues of participating doctors, we found systematic variation in results of professionalism assessments among doctors working in a range of clinical settings and drawn from different clinical specialties. Some of the differences in doctors’ scores after feedback from their patients and colleagues were attributable to differences between participating doctors in their personal and occupational characteristics. In addition, some of the differences in doctors’ scores were attributable to variation between doctors in the characteristics and sociodemographic mix of their patients or colleagues in the feedback sample. These findings suggest that some doctors could be at risk of obtaining lower or higher scores based on sampling bias, rather than on the true variation between doctors in respect of their professional performance.
Strengths and limitations
The research had several strengths. Firstly, our findings were based on a large sample of doctors with varying personal characteristics, drawn from several clinical settings and specialties. Furthermore, the patients and colleagues providing feedback varied widely in respect of their sociodemographic characteristics and in the nature of their relationship with the participating doctor. We have reported elsewhere10 on the apparent acceptability of the multisource feedback process, as suggested by low levels of missing questionnaire data and high levels of assessor participation, and by the similar distribution in age and sex between doctors who were participants and those who were not. Finally, using regression models, we have identified a range of variables which independently predict doctors’ scores after taking account of other variables in statistical models of doctors’ professionalism. We have undertaken comprehensive modelling of the professionalism of fully trained doctors, taking account of both the characteristics of the doctor being assessed, and the characteristics of the sample of patient or colleague assessors.
In view of the current status of revalidation proposals in the UK, the study was, inevitably, based on a volunteer sample of doctors. We were reassured by the observed participation rate among all invited doctors (43%); although this rate was in excess of some other national level studies of doctors volunteering for multisource feedback,6 24 we recognise that we might not have captured the full range of performance with respect to professionalism. In addition, to protect the anonymity of doctor participants, we incorporated data relating to some doctors from small groups into larger groups before analysis. This was done, for example, for the small number of doctors reporting black ethnic status, whose feedback was incorporated with doctors from “other” ethnic groupings.
Our models accounted for nine characteristics of the doctor whose professionalism was being assessed. Only two characteristics—the region of primary medical qualification and clinical specialty—were independent predictors of scores after patient feedback after also accounting for the mix of the patients providing feedback. In particular, doctors qualifying outside of Europe had lower patient feedback scores, as did psychiatrists.
Four doctor characteristics predicted colleague feedback: the region of primary medical qualification, clinical specialty, current contractual role, and locum status. Doctors who received lower feedback scores from their colleagues were those qualifying outside of the UK or South Asia, those working in locum posts, and those not working as a general practitioner or in a consultant role (such as doctors in associate specialist or staff grade roles). General practitioners and psychiatrists received reduced scores overall from their colleagues, compared with hospital based doctors.
It is perhaps gratifying that in modern day Britain with its tradition of equality legislation, the age, sex, and ethnic group of the doctor were not independent predictors of feedback scores from patients or colleagues. However, we found weak evidence suggesting that a doctor’s age and ethnic group were predictive of colleague feedback. Older doctors tended to have lower colleague feedback scores than younger doctors, and doctors of Asian ethnic origin had lower scores than those from white or other ethnic groups. To what extent these observations relate to true differences in performance as opposed to systematic variation in assessments based on non-clinical considerations is a matter of importance, and one which we cannot address in this study.
Patient and colleague samples
We assessed the contribution of six characteristics of the patient feedback sample as potential predictors of overall patient summary scores in a model which also adjusted for the characteristics of the doctor being assessed. The proportions of white patient participants, patients identifying the reason for their consultation with the doctor as being very important, and patients reporting that they were seeing their usual doctor were independent predictors of more favourable patient scores. Neither the age nor the sex profile of patient respondents, nor the proportion of respondents providing feedback as a proxy for the patient (for example, as a carer, or parent of a child), was a predictor of patient feedback.
Using these data, we have been able to predict the effect of changes in the sociodemographic profile of the patient sample on doctors’ professionalism scores that might occur in doctors with a proportion of non-white patients that is higher than average. Our data identified that some doctors had no non-white patients in their sample, whereas for others, all of the patients providing feedback were from non-white ethnic groups. In addition, although many patients prefer continuity of care from their doctor,25 26 fewer achieve this aspiration.27 The dissonance between patients’ aspirations for continuity of care and their experience of care could, at least partly, be reflected in the reduced scores for professionalism attributed to doctors by patients who judged that they were not seeing their usual doctor.
Of eight characteristics of the colleague sample investigated as potential predictors of colleague scores, only one characteristic—the proportion of colleagues reporting that they had daily or weekly contact with the doctor being assessed during their period of familiarity with the doctor’s clinical practice—was a predictor of more favourable colleague feedback. Although this observation accords with findings reported by others,14 16 17 18 Hall and colleagues6 observed a negative effect of familiarity on ratings.
Seven other characteristics were not independent predictors of colleagues’ feedback scores, including the age, sex, and ethnic profile of the colleague sample; the proportion of colleague respondents who were in medical, other clinical, or administrative or managerial roles; the proportion of medically qualified colleagues who were in training grades; the proportion of colleagues who currently worked with the index doctor; and the proportion of colleagues returning their feedback using a paper questionnaire.
Policy and practice implications
The UK regulator of medical practice, the GMC, has proposed major changes to the regulation of doctors, which are the most important changes to be introduced since the establishment of the GMC in 1858. Central to the proposed model for the revalidation of doctors are strengthened systems of appraisal, the appointment of “responsible officers” with a statutory role in “overseeing the evaluation of fitness to practise, and monitoring the conduct and performance of doctors,”28 and the need for doctors to present evidence that they are “up to date and fit to practise.” Multisource feedback from colleagues, and, where appropriate, from patients, is seen as an important potential source of such evidence.
Although various clinical specialty groups could propose a range of evidence for an appraisal portfolio,29 30 many doctors will probably seek, or be required to incorporate, feedback from patients and colleagues. Clinical specialty guidance should be based on authoritative evidence, recognising both the strengths and limitations of various approaches to providing evidence in relation to a doctor’s professionalism. The GMC has committed itself to issuing guidance to doctors on the use of questionnaires, and has noted the importance of using questionnaires that link to authoritative guidance on appropriate modern medical practice and that meet predetermined psychometric standards.2
Our data highlight the need for guidance for doctors in respect of identifying appropriate samples of colleagues and patients, and, importantly, the need for guidance for responsible officers in interpreting and responding to feedback on doctors’ professionalism. In particular, our data suggest that systematic bias might be responsible for at least some of the differences in the assessment of doctors’ performances, but this observation can only be confirmed by use of an objective measurement of professionalism. Adjusting scores to take account of the case mix might be an appropriate and potentially important response to these observations, facilitating interpretation of a doctor’s scores. Therefore, we advise careful consideration of the evidence which doctors might submit relating to their professionalism, and caution in developing judicious and appropriate responses to evidence which suggests a doctor’s performance to be unusual. Use of multisource feedback to support revalidation should at least initially be largely formative in nature and intent, and undertaken within the context of strengthened systems of appraisal.
What is already known on this topic
The GMC has proposed that UK doctors undergo revalidation to secure a continuing licence by demonstrating that they are “up to date and fit to practise” medicine
Multisource feedback from patients and colleagues is seen as an important source of evidence to support or refute a doctor’s application to revalidate
What this study adds
Systematic bias may exist in the assessment of doctors’ professionalism arising from the characteristics of the assessors giving feedback, and from the personal characteristics of the doctor being assessed
In the absence of a standardised measure of professionalism, doctors’ assessment scores from multisource feedback should be interpreted carefully
Multisource feedback, for the purposes of supporting revalidation, should at least initially be largely formative in nature and intent, and undertaken within the context of strengthened systems of appraisal
Cite this as: BMJ 2011;343:d6212
We thank the doctors who contributed to this study, along with their patients, colleagues, and supporting administrative staff, for their cooperation and support; Tina Bealing and Louise Coleman (both of the Client Focused Evaluation Programme UK (CFEP-UK)) for their support of the project; Professor Martin Roland and Dr Obi Ukoumunne for their comments on earlier drafts of the research; and Helen Forster for proofreading the manuscript.
Contributorship: JC and MG conceived the study, and with SR and CW, developed the study design. JC is guarantor of the data. MG and MT oversaw data collection and initial processing. CW and JH monitored data collection. MR and JC undertook analysis. JC drafted the paper; all authors contributed to interpretation of the data and to revising drafts of the text.
Funding: The study was funded by the UK GMC as an unrestricted research award. JC is an adviser to the GMC and has received only direct costs associated with presentation of this work. MG is a director of CFEP-UK and provided survey administration in respect of this research; MT was an employee of CFEP-UK at the time the research was undertaken. The study sponsor did not have any role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.
Competing interests: All authors have completed the Unified Competing Interest form at http://www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: the study was funded by the UK GMC as an unrestricted research award; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years, no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: The study was considered by the Devon and Torbay NHS research ethics committee but judged not to require a formal ethics submission.
Submission: The submission of this paper conforms with the STROBE guidelines for cross sectional research studies (http://www.strobe-statement.org/index.php?id=available-checklists).
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.