Do general practitioners act consistently in real practice when they meet the same patient twice? Examination of intradoctor variation using standardised (simulated) patients
BMJ 1997;314:1170 doi: https://doi.org/10.1136/bmj.314.7088.1170 (Published 19 April 1997)
- a Centre for Research on Quality Assurance in General Practice, Department of General Practice, University of Limburg, PO Box 616, 6200 MD Maastricht, Netherlands
- b Department of General Practice, University of Trondheim, Norway
- Correspondence to: Dr Rethans
- Accepted 17 February 1997
Objective: To assess the variation within individual general practitioners facing the same problem twice in actual practice under unbiased conditions.
Design: General practitioners were consulted during normal surgery hours by a standardised patient portraying a patient with angina pectoris. Six weeks later the same general practitioners were consulted again by a similar standardised patient portraying a similar case. The patients reported on the consultations.
Setting: Trondheim, Norway.
Subjects: Of 87 general practitioners invited by letter, 28 (32%) agreed to participate without hesitation; nine others (10%) wanted more information before consenting. From these, 24 were selected and visited.
Main outcome measures: Number of actions undertaken from a guideline in both rounds of consultations. Duration of consultations.
Results: The mean (range, interquartile range) guideline score, total score, and duration of consultation were not significantly different between the first and second patient encounters for the group as a whole. For individual doctors the mean (SD) difference was −0.09 (3.36) for the guideline score, 0.30 (8.1) for the total score, and −0.87 (9.01) for consultation time.
Conclusions: The study shows that assessment of performance in real practice for a group of general practitioners is consistent from the first round of consultations to the second round. However, significant variation occurs in performance of individual physicians.
Variation in the performance of doctors is a potential problem in ensuring patients receive agreed best standards of care
This study assesses the intradoctor variation in treating two standardised patients presenting with similar conditions in real practice
For a group of general practitioners performance in the two consultations was consistent
The performance of individual doctors differed when facing the same problem twice
Variation between doctors is a reflection of the individual's art of medicine but may also be a threat to the scientific basis of practice.1 Variation in performance may be studied between countries,2 regions,3 hospitals,4 practices, and doctors.5 6 To try to minimise the variation between doctors national bodies have produced guidelines for good medical practice, both for medical specialties and general practice.7
Variation of performance is an important consideration in assessment of competence of general practitioners. The performance of doctors varies across different medical problems.8 For example, a doctor's performance in dealing with a patient with a urinary tract infection does not predict his or her performance with a patient with diarrhoea. This phenomenon has been labelled content specificity9 and is one of the main reasons why doctors are examined on different areas of medicine and with different problems.10
When assessing doctors' management of a single problem we need to know whether the doctor consistently performs to the assessed standard. Intradoctor (or intraobserver) variation may lead to different results when a doctor is faced with an identical problem twice. Few studies have addressed this problem, and their results are ambiguous. When medical students and specialists were presented with a clinical problem twice by standardised patients the correlation was only 0.60 between the two presentations.11 With medical students test-retest reliability on the same station of an objective structured clinical examination was 0.66-0.88.8 In a study with two independent clinical assessments by a single clinician (three months apart) of the same set of 100 fundus photographs, 88 of 100 patients received identical assessments.12 Repetition of identical tasks by medical students within the same exam did not improve their scores.13 However, these studies were run in examination laboratory settings and may be biased since the subjects knew they were being tested and were likely to recognise the second presentation. In addition, performance under examination circumstances may differ from performance in practice.14 To overcome these problems we did a study to find out whether, and to what extent, intradoctor variation (that is, the variation within doctors facing a similar problem twice) exists in real life general practice under unbiased conditions.
Subjects and methods
We used standardised patients for this study because this method has proved to be reliable, valid, feasible, and acceptable in general practice.15 16 A standardised role of an elderly patient with angina pectoris was constructed. The role focused on the medical history with no abnormal physical signs and normal laboratory and electrocardiographic findings. Two healthy women, aged 69 and 70, were selected as standardised patients and paid to participate. They signed written consent to keep all medical and personal information about the general practitioners in the project strictly for research purposes.
The patients were trained to present a standardised complaint and to score history taking, physical and laboratory examination, instructions given to the patient, treatment, and follow up against a guideline on managing angina pectoris. This guideline was based on relevant general practice literature (such as the guidelines of the Dutch College of General Practitioners) and discussed with two experienced general practitioners and an experienced cardiologist.17 The guideline contained only items considered necessary to manage angina pectoris as presented by the standardised patients.
To ensure the reliability and consistency of scoring by the standardised patients we used standard procedures.18 19 In brief, reports of standardised patients during training (before and between the first and second round) were compared with reports of a panel of doctors about the same consultation. These reliability and consistency κ scores were 0.85 (maximum κ=1.0). Several scores were used to assess the performance of the general practitioners. Firstly, a guideline score–that is, the number of items of the guideline performed by the general practitioner in a consultation. Secondly, a total score–that is, all items (guideline plus non-guideline items) performed by a general practitioner in a consultation. Patients also recorded the duration of visits in minutes using a wristwatch with stopwatch facilities.
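The κ statistic used here measures agreement beyond chance between a standardised patient's checklist report and a reference panel's report of the same consultation. A minimal sketch of Cohen's κ for two binary raters, using illustrative item-level data rather than the study's:

```python
# Cohen's kappa for agreement between two binary raters, e.g. a
# standardised patient's checklist report versus a reference panel's
# report of the same consultation (illustrative data only).

def cohens_kappa(a, b):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal frequencies.
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (observed - expected) / (1 - expected)

# 1 = item recorded as performed, 0 = not performed.
patient_report = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
panel_report = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
kappa = cohens_kappa(patient_report, panel_report)
print(f"kappa = {kappa:.2f}")
```

A κ of 1.0 indicates perfect agreement; values around 0.85, as reported in the study, indicate very high consistency between the standardised patients and the panel.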
One year before the actual visits all 87 general practitioners in Trondheim, Norway, were informed by letter about the objectives of the study and invited to give written acceptance of standardised patients into their practices. The dates, number, and content of the visits were not mentioned. For budgetary reasons it was decided beforehand that 24 general practitioners would participate.
Patients took their original health insurance identifying papers and enlisted in the practices of the selected general practitioners by using techniques reported earlier.16 20 The general practitioners were visited by the standardised patients in two rounds in March and May 1994. Patient A visited 12 of them in the first round and the other 12 in the second round, while patient B visited the doctors in the reverse order. All participating general practitioners were presented with similar standardised presentations twice.
The Wilcoxon signed rank test (paired design) was used to look for differences in the doctors' performances in the first and second round. To assess intradoctor variation, the scores of individual doctors on the two rounds were analysed by the Bland and Altman method.21 The Wilcoxon signed rank test (paired design) was used to assess whether the two standardised patients showed any consistent difference in the way they scored for consultations for the guideline score (the most important score).
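The two analyses described above can be sketched in a few lines. This is a minimal illustration with invented scores, not the study's data: the Wilcoxon signed rank test checks for a systematic group-level shift between rounds, while the Bland and Altman approach summarises intradoctor variation as the mean within-doctor difference and its 95% limits of agreement:

```python
import numpy as np
from scipy import stats

# Hypothetical guideline scores for ten doctors in each round
# (illustrative values only, not the study's data).
round1 = np.array([15, 17, 14, 19, 16, 13, 18, 16, 15, 17])
round2 = np.array([16, 15, 15, 18, 17, 14, 16, 17, 14, 18])

# Wilcoxon signed rank test (paired design): systematic difference
# between rounds for the group as a whole.
stat, p = stats.wilcoxon(round1, round2)

# Bland and Altman analysis: within-doctor differences and their
# 95% limits of agreement (mean difference +/- 1.96 SD).
diff = round1 - round2
mean_diff = diff.mean()
sd_diff = diff.std(ddof=1)
limits = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

print(f"Wilcoxon p = {p:.3f}")
print(f"mean difference = {mean_diff:.2f}, "
      f"limits of agreement = ({limits[0]:.2f}, {limits[1]:.2f})")
```

A mean difference near zero with wide limits of agreement reproduces the study's pattern: no systematic group-level shift, but appreciable variation within individual doctors.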
Of the 87 doctors asked to participate, 53 (61%) replied. Twenty-eight (32%) answered yes without requesting any further information; nine others asked for more information before agreeing. We selected 24 doctors from those that agreed. After a visit in the second round one general practitioner reported having detected the patient. This left 23 general practitioners and 46 visits for analysis.
Table 1 shows the performance of general practitioners for each item of the guideline in each consultation. Table 2 gives the guideline and total scores and consultation times in the two rounds. We found no significant difference between the first and second round for any of the items or scores assessed. However, to assess intradoctor variation the scores of individual physicians during the first round have to be compared with their individual scores during the second round. This is indicated by the standard deviations in table 2. For example, the standard deviation of the guideline score is 3.36, suggesting that the average within doctor difference in the number of guideline items scored is around 3; the average inconsistency in total score is around 8 and the average difference in length of consultation around 9 minutes. These data indicate substantial intradoctor variation between the two rounds. Means (interquartile range) of the guideline scores for the two standardised patients were 16.22 (14 to 19) and 16.04 (14 to 18). These were not significantly different by Wilcoxon signed rank test (paired design), suggesting the two patients showed no consistent differences.
We believe that this is the first study of intradoctor variation in real practice using standardised patients presenting similar problems. This design is the only way to ensure subjects do not know they are being observed, thus removing an important source of bias. In examination or test settings subjects would easily spot the second presentation.
Clearly, this study has some limitations. There were only 23 general practitioners and only one standardised problem was presented twice, resulting in 46 consultations. However, the few studies set in examination conditions that have used more comparisons have produced ambiguous results. Getting funding for a larger study incorporating more patients and comparisons would be difficult until a pilot study such as this one has been done. Only 32% of the doctors approached agreed to participate without further hesitation, which may mean that the participants reflect a more competent sample of general practitioners.
We believe, however, that our results are valid as the doctors were unaware that they were being assessed. The results show that the assessment of performance was consistent from the first round of consultations to the second round. This means that anyone wanting to give feedback to a group of practitioners on their management of a particular problem would probably need to do only one assessment. However, for assessment of performance of a single physician the results are quite different. We found appreciable intradoctor variation in the management of the two patients. Analysis showed that the personality of the two standardised patients had no effect on the results. A further study using more problems and more presentations of the same problem would give a better indication of whether intradoctor variation is a problem. This may in turn lead to reassessment of the way cases are sampled for examination and licensing of doctors and for quality assessment.
Does the variation matter?
A further question is to what extent the intradoctor variation found in this study is a problem. Different scoring methods (for example, a weighted score) might have produced different results. The panel which constructed the guideline thought all guideline items were essential and therefore distinguished only between these and non-guideline items. Earlier studies with standardised patients that used more differentiated scores (obligatory, intermediate, and superfluous items) found no differences between these scores.14 Our data should act as a stimulus for careful thinking about differentiated scoring of guidelines. Some may argue that only evidence based items are important to record in this type of study, but in general practice this might result in only one or two items per case. All other items are then reflections of the individual performance of a doctor.
To try to find an explanation for the differences in the results of individual general practitioners between the two consultations we carried out some secondary analyses–for example, to determine whether there were different outcomes for visits before or after lunch. These analyses all gave negative results. Our data included two consultations of 40 minutes, which is unusually long. Although we do not know exactly what happened in these consultations, it seems likely that the doctor received a telephone call during these visits. Since these 40 minute consultations could have a relatively large effect on the inconsistency in the duration of consultation, we repeated the calculations for the duration of visits both omitting these consultations and substituting 30 minutes (the second longest consultation) for the 40 minutes. Although the standard deviations were reduced to 5.78 (without these visits) and to 6.99 (with 30 minutes substituted), the conclusions remained the same. We discussed our results with several groups of general practitioners and received reactions such as “this is just real practice and so it should be” or “on Monday after a sleepless night doctors perform differently from Tuesdays after a good rest.”
In conclusion this study shows that intradoctor variation occurs in day to day practice. The implications of this variation remain undetermined, and documentation of what is really going on in doctors' surgeries remains a great challenge.
We thank Arnold Kester (department of biostatistics, University of Limburg) and the General Practitioners Writers Association (in particular Professor Robin Hull) for their help with this paper.
Funding: Norwegian Fund for Quality Assurance (Kvalitetssikringsfondet), grant number 93007.
Conflict of interest: None.