Review of instruments for peer assessment of physicians
BMJ 2004; 328 doi: http://dx.doi.org/10.1136/bmj.328.7450.1240 (Published 20 May 2004) Cite this as: BMJ 2004;328:1240
- Correspondence to: R Evans
- Accepted 30 January 2004
Objectives To identify existing instruments for rating peers (professional colleagues) in medical practice and to evaluate them in terms of how they have been developed, their validity and reliability, and their appropriateness for use in clinical settings, including primary care.
Design Systematic literature review.
Data sources Electronic search techniques, snowball sampling, and correspondence with specialists.
Study selection The peer assessment instruments identified were evaluated in terms of how they were developed and to what extent, if relevant, their psychometric properties had been determined.
Results A search of six electronic databases identified 4566 possible articles. After appraisal of the abstracts and in depth assessment of 42 articles, three rating scales fulfilled the inclusion criteria and were fully appraised. The three instruments did not meet established standards of instrument development, as no reference was made to a theoretical framework and the published psychometric data omitted essential work on construct and criterion validity. Rater training was absent, and guidance consisted of short written instructions. Two instruments were developed for a hospital setting in the United States and one for a primary care setting in Canada.
Conclusions The instruments developed to date for physicians to evaluate characteristics of colleagues need further assessment of validity before their widespread use is merited.
It is no longer enough to do a job to the best of one's ability. Other people have to be assured that professionals can be trusted, and interest is growing in the concept that colleagues might be well placed to make these judgments. We live in a society in which we are held to account for our performance, especially if we perform professional functions, and doubly so if these are funded for the common good, as is the case in medicine and education.1 Interest therefore exists in how to measure the performance of doctors and other healthcare professionals. The recent focus on appraisal systems, recertification, revalidation, and continuous professional development bears witness to the interest in how to assess the ability of clinicians to maintain and sustain their competence and to exhibit the qualities deemed to be necessary in their professional role.
Imbalances in knowledge between lay people and professionals make it difficult for lay people to assess doctors' ability and competence. Thus the idea of asking peers to assess professional performance, particularly the humanistic non-cognitive aspects (for example, qualities such as integrity, compassion, and responsibility to others) that are less accessible to conventional means of assessment such as written and clinical examinations, has been increasingly explored in the literature.2–5 But doubts remain about the validity of peer ratings of these aspects,5–7 where “high reliability, together with greater ease of use, may distract from concerns about validity when considering peer ratings as a measure of actual quality.”8
The need for a measure of the humanistic aspects of a physician's practice is increasingly accepted, but the validity of using peers for this measure remains uncertain, which calls for a systematic evaluation of the existing instruments. We aimed to identify all existing instruments for rating peers in medical practice and to evaluate them in terms of how they were developed, their validity and reliability, and their appropriateness for use in clinical practice, including the primary care setting.
We did a systematic search of the world literature, not limited to English, for references to instruments for the rating of physicians by peers. Preliminary searches suggested that this is an emerging area that does not have an extensive literature and is likely to be poorly indexed on electronic databases. We therefore designed a broad but systematic search process. The search strategy included keyword combinations such as physician review, peer evaluation, and colleague assessment. The combination of “peer” and “review” returned too many false positive hits relating to peer review of published literature (for example, 9000 on Medline alone). We searched the following databases: Medline, 1966 to present; Embase, 1980 to present; PsycINFO, 1972 to present; ASSIA for Health; CINAHL; and the Cochrane Database of Systematic Reviews. Two reviewers independently scrutinised all citations and abstracts (all databases by RE, three databases each by GE and AE); any disagreements were resolved by discussion or reference to a third research colleague. We appraised the references of identified relevant papers and review articles, and we contacted key authors identified frequently from electronic searches.
We included instruments if they had been specifically developed for use by physicians for the review or assessment of a peer or colleague in practice. We required articles to have some data on either the way the instruments were developed or their validation using psychometric methods. We excluded instruments that were designed primarily for self completion, that were not completed by physician peers, or that were designed for use in purely educational settings.
We examined each instrument identified in terms of its purpose (explicit or implicit), whether it had a developmental aim (formative), or whether the assessment was intended to identify a standard or judgment (summative). We assessed the theoretical underpinning, if specified, and how the tool had been developed and evaluated. We compared the samples, including the ratio of peers to index physician, and the total number of questionnaires considered in the psychometric analyses. We examined the method of identifying peers, how anonymity (or otherwise) of ratings was managed, the existence of benchmarks, and whether instruction or training was provided for the raters.
Two reviewers extracted data on to a template covering country of origin, physician group (secondary or primary care, generalist or specialist), sample size, quality of methods, purpose of rating, response rate, nature of psychometric analysis, developmental pathway (nature and quality of qualitative research informing the individual items on the scale), relevance to primary care, and relevance to the United Kingdom.
The search identified 4566 articles (Medline 1087, Embase 795, PsycINFO 2258, Cochrane 239, CINAHL 144, ASSIA for Health 43). A total of 42 articles were identified (on the basis of a reading of the titles and abstracts) by at least one of two reviewers, and we obtained full papers for all of these. We found two papers to be irrelevant. Thirteen papers were of background interest on performance of physicians. Nine papers were from the literature on peer appraisal outside health care. Eighteen papers related to instruments for the rating of physicians by peers. Of these 18 papers, five related to instruments that failed to meet our inclusion criteria (see bmj.com),6 9–12 eight papers related to the professional associate rating developed by Ramsey for the American Board of Internal Medicine,3 5 13–18 one separate instrument based on the American board recommendations was from a research group independent of Ramsey,19 and four papers related to the Canadian instrument, the peer assessment questionnaire.2 20–22 All studies were on voluntary participants.
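The counts reported above can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the figures given in the text; the dictionary and variable names are ours, chosen for illustration.

```python
# Database hits as reported in the text.
hits = {
    "Medline": 1087,
    "Embase": 795,
    "PsycINFO": 2258,
    "Cochrane": 239,
    "CINAHL": 144,
    "ASSIA for Health": 43,
}
total_hits = sum(hits.values())

# Breakdown of the full papers obtained, as reported in the text.
full_papers = {
    "irrelevant": 2,
    "background on performance of physicians": 13,
    "peer appraisal outside health care": 9,
    "instruments for rating of physicians by peers": 18,
}
total_full_papers = sum(full_papers.values())

# Breakdown of the papers on peer rating instruments, as reported in the text.
instrument_papers = {
    "failed inclusion criteria": 5,
    "professional associate rating": 8,
    "peer review evaluation form": 1,
    "peer assessment questionnaire": 4,
}
total_instrument_papers = sum(instrument_papers.values())

print(total_hits, total_full_papers, total_instrument_papers)
```

The three totals agree with the 4566 articles, 42 full papers, and 18 instrument papers stated in the text.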
We included three instruments: the professional associate rating,16 17 the peer assessment questionnaire,2 21 and the peer review evaluation form.19 All three were developed in the United States or Canada; we identified no equivalent instruments from the United Kingdom.
Professional associate rating—This is a questionnaire from the United States that consists of a scale for rating fellow physicians on a range of parameters based on American Board of Internal Medicine recommendations and encompassing clinical competence, communication skills, and humanistic qualities. The board uses the professional associate rating as part of its continuous professional development programme.23 This programme has three components: self evaluation, a secure examination (single best answer questions), and verification of credentials. The self evaluation component includes an elective “patient and peer assessment module,” which includes the professional associate rating and also patient ratings, self ratings, and a quality improvement plan. The professional associate rating instrument derives from the work of Ramsey and colleagues,3 4 13–17 who developed the scale in response to concerns about the inability of the certification board examinations to assess the full spectrum of physicians' competence, in particular the humanistic qualities, professionalism, and communication skills.17 The implied purpose of the measure initially was to “identify outlying physicians” (that is, a measure of performance).3 Yet the later literature is more explicit that the purpose was formative and to “stimulate self-reflection… not to identify ‘problem physicians’.”17 The American board also noted the need for further research before feedback to participants and stated that the link, if any, between feedback and improved performance needs further research.
Peer assessment questionnaire—This Canadian instrument (developed by Violato and colleagues) uses a rating scale covering the dimensions of clinical competency, professional management, humanistic communication, and psychosocial management.2 21 It was used alongside other instruments, completed by patients, coworkers (non-physicians), and the physicians themselves, to produce a multisource assessment known as 360 degree feedback. The development of the questionnaire was based on a grid of competences derived from a professional committee, with further development through focus groups. The focus group work was reported as clarification of wording and deletion of inappropriate items, but assessment of construct validity was not reported.
Peer review evaluation form—This instrument from the United States (developed by Thomas) consists of a scale for rating along dimensions derived from the American Board of Internal Medicine recommendations, including technical skills (obtaining history, examining, investigating) and interpersonal skills (demonstrating integrity, empathy, and compassion).19 The authors noted that specific training in the use of such an instrument would require residents and faculty to mutually define terms such as integrity, empathy, and compassion. It is also worth noting that linking multiple assessment factors such as integrity, empathy, and compassion in a single item poses potential dilemmas for the rater.
In summary, several instruments are described in the literature, but only three have psychometric data about either their development or their validity and reliability. The table shows the essential characteristics of these three instruments (shown in more detail in tables A and B on bmj.com). None of the identified instruments refers to a theoretical framework. Other than factor analysis performed on the empirical results, little psychometric assessment has been undertaken, and an important omission is the lack of attention given to construct and criterion validity.24 Explicit purposes of the instruments are either unmentioned or evolve over time. For the professional associate rating, a generalisability coefficient of 0.7 is quoted, suggesting good levels of reliability, whereas standard psychometric texts recommend coefficients of 0.75 as “a fairly minimal requirement for a useful instrument.”24 Moreover, concentrating on reliability and feasibility is premature when concerns exist regarding validity. None of the developers of the three instruments had addressed how to guide or train the assessors, other than by providing written instructions.
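To put the gap between the quoted coefficient of 0.7 and the recommended 0.75 in concrete terms, the sketch below applies the Spearman-Brown prophecy formula. This is an illustration only: the formula is our choice, not a method used in the reviewed studies, and it assumes that raters are interchangeable. The function name and the baseline of 10 raters are hypothetical.

```python
import math


def spearman_brown_factor(observed: float, target: float) -> float:
    """Factor by which the number of raters must be multiplied to move
    reliability from `observed` to `target`, assuming interchangeable raters."""
    if not (0 < observed < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1 - observed) / (observed * (1 - target))


# Coefficients taken from the text: 0.7 quoted, 0.75 recommended.
factor = spearman_brown_factor(0.70, 0.75)
print(round(factor, 2))  # 1.29

# Hypothetical baseline: if 0.7 had been obtained with 10 raters per physician.
hypothetical_raters = 10
print(math.ceil(hypothetical_raters * factor))  # 13
```

Under these assumptions, roughly 29% more raters per physician would be needed to reach 0.75. This says nothing about validity, which remains the prior concern.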
Considerable interest exists in the concept that physicians can assess each other across a range of qualities (for example, integrity, compassion, respect, and responsibility), but this review shows that the instruments developed for peer assessment have not been developed in accordance with best practice. The principles of instrument design involve giving attention to theoretical frameworks and construct clarification in order to establish validity as the basis for reliability studies. These steps are not described for the instruments we identified.
As far as we are aware this is the first review of instruments for peer appraisal of practising physicians. We followed standard methods, with reliance on key author contacts and secondary references. The emerging nature of the field limits the effectiveness of the database searches.
Caution is needed when developing quantitative measures that use peers to rate complex humanistic qualities, and the complex nature of this field should be acknowledged.25 A common theme in the assessment literature is the question of self evaluation versus external evaluation and whether “others” can form judgments on differing facets of professional practice.26 “Social comparison theory” acknowledges the drive to self evaluate, using similar others as a benchmark,27–29 and recognises the construct of “managerial self awareness” as a process of self reflection using feedback,30 allowing us to “see as others see.”26 The validity of “others” as appropriate rater groups remains a challenge for research, because criteria and frames of reference, even if defined explicitly, will vary with each individual.6 11 18 In other words, how many “true” peers do professionals have? How many peer colleagues are in positions of having accurate knowledge about an individual's performance in terms of compassion, responsibility, or respect, so that they can make informed judgments?
The wider literature also draws a distinction between “task performance” and “contextual performance.”31 This dichotomy seems to parallel the distinction between performance ratings (as task) and 360 degree or multisource feedback (as context); thus one author feels that multisource feedback and performance ratings may be separate constructs.32 The point here is that the instruments developed for peer rating of physicians have not explicitly allowed for the distinction that can be identified between “task” and “contextual” performance and their effects on ratings.
The other key issue is the perceived fairness of the peer appraisal process. Procedural justice theory suggests that people naturally make judgments on how decisions are arrived at (procedural justice) quite separately from judgments on outcomes of decisions (distributive justice).33 Procedural justice is seen to be more important than outcome in terms of overall acceptability and an essential element of validity. This initial judgment on fairness also sets a frame of reference for interpreting subsequent events that has a crucial and enduring influence.33 Doubts about the face validity of arriving at a judgment on a peer's compassion or integrity risk jeopardising the peer appraisal process through negative perceptions, which could be difficult to overcome subsequently. Evidence seems to exist that an appraisal process, once underway, enters a feedback loop of success that quickly becomes positive or negative with no safe middle ground.25 The identified instruments would need to consider procedural justice by demonstrating their validity through clearly defined criteria and constructs relevant to the rater groups.
Face validity—indicates whether an instrument “seems” to either the users or designers to be assessing the correct qualities. It is essentially a subjective judgment.
Content validity—a judgment by one or more “experts” as to whether the instrument samples the relevant or important “content” or “domains” within the concept to be measured. An explicit statement by an expert panel should be a minimum requirement for any instrument. However, to ensure that the instrument is measuring what is intended, methods that go beyond peer judgments are usually needed.
Criterion validity—usually defined as the correlation of a scale with some other measures of the trait or disorder under study (ideally a “gold standard” in the field).
Construct validity—refers to the ability of the instrument to measure the “hypothetical construct,” which is at the heart of what is being measured. Where a gold standard does not exist (as is the case for measuring humanistic qualities such as compassion, integrity, responsibility, and respect), construct validity is determined by designing experiments that explore the ability of the instrument to “measure” the construct in question. This is often done by applying the scale to different populations, which are known to have differing amounts of the property to be assessed. By conducting a series of converging studies, the construct validity of the new instrument can be determined.
Internal consistency—assumes that the instrument is assessing one dimension or concept and that the scores in individual items would be correlated with scores in all other items. These correlations are usually calculated by comparing items.
Stability—an assessment of the ability of the instrument to produce similar results when used by different observers (inter-rater reliability) or by the same observer on different occasions (intrarater reliability). Test-retest reliability assesses whether the instrument produces the same result if used on the same sample on two separate occasions.
Where measurements are undertaken in complex interactions by multiple raters, the production of reliability coefficients by using generalisability theory is advocated.
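As an illustration of the internal consistency check described above, the following is a minimal sketch of Cronbach's alpha, one common statistic for this purpose. The source does not state which statistic the reviewed instruments used, so the choice is ours; the function name and the ratings are hypothetical and are not data from any of the instruments.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows, one row per completed
    questionnaire and one column per item on the scale."""
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("at least two items are needed")
    # Sample variance of each item across questionnaires.
    item_variances = [
        variance([row[j] for row in scores]) for j in range(n_items)
    ]
    # Sample variance of the total score per questionnaire.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: five questionnaires, three items, scored 1 to 9.
ratings = [
    [7, 8, 7],
    [5, 6, 5],
    [9, 9, 8],
    [6, 6, 7],
    [8, 7, 8],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 2))
```

A high alpha indicates only that the items vary together. It does not show that the items measure the intended construct, which is the question that construct and criterion validity address.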
What is already known on this topic
The range of professional competences and qualities now recognised as necessary in a good physician is not adequately assessed by conventional examinations and assessments
Suitable methods are needed to assess the broader range of competences, including “humanistic” qualities and professionalism
Peers are one potential source of assessment of these aspects of physicians' practice
What this study adds
Very few instruments designed for peer assessment of physicians exist, and their development so far has focused on reliability and feasibility
The available instruments lack theoretical frameworks, and their validity remains questionable
Clarity of purpose is a key determinant of the subsequent “success” of peer appraisal but may be lost by confounding summative and formative aims
Concern has been voiced about the validity of peer evaluation. If the validity of peer ratings remains unclear, then, as Saturno reminds us, reliability and feasibility are no substitute.3 8 15 16 A possible approach has been initiated in Finland, where qualitative methods have been used to begin to characterise some of the concepts and constructs relevant to peer appraisal that are needed before quantitative tools are developed.34 When peers have attempted to rate humanistic qualities, the validity has not been well supported by empirical findings. The poor agreement between observers of the same events is shown by several studies.5–8 35 An argument is emerging that the most valid sources of ratings for humanistic dimensions are patients,5 6 10 30 because only they have experienced certain qualities, such as “a level of intimacy,” not available to other raters such as peers.30
Implications for policy
Quality “improvement” using formative developmental appraisal and quality “assurance” using methods to identify underperformance are separate aims. The importance of being clear about the purpose has been emphasised repeatedly.25 31 Combining these separate aims may compromise such clarity. This problem seems to have confounded the development of peer assessment methods. Peers may in effect be asked to make two judgments at once, one on “quality” and one on “adequacy for purpose.” Making a judgment on adequacy presupposes knowledge about acceptable ranges for the criteria, which must be defined.35 The validity of rating items in assessing aspects such as the compassion, integrity, respect, or responsibility of a peer remains highly suspect. To have any validity or reliability, such qualities would need to be expressed as observable behaviours. In the absence of clearly defined constructs derived from a bottom-up empirical approach, and lacking a coherent theoretical framework, what is being measured here, if anything, is unclear.
Implications for research
Interest exists in using peers to assess the humanistic qualities of physicians, but the theoretical underpinning is lacking. Clarity of purpose is vital, and more attention needs to be given to the underlying constructs of interest. It must be recognised that judgments can be made only by those people who experience the qualities in question. In the meantime, peer assessment methods should be used with caution.
Extra tables and details of excluded instruments are on bmj.com
We acknowledge the support of the Department of Postgraduate Education for General Practice, University of Wales Cardiff.
Contributors RE did the literature review and drafted the article. GE and AE instigated the study, took part in the review, and contributed to the writing.
Competing interests None declared.