Clinical Review State of the Art Review

The role of pathogen genomics in assessing disease transmission

BMJ 2015; 350 doi: (Published 11 May 2015) Cite this as: BMJ 2015;350:h1314
  1. Vitali Sintchenko, associate professor, director (affiliations 1, 2)
  2. Edward C Holmes, professor (affiliations 1, 3)
  1. Marie Bashir Institute for Infectious Diseases and Biosecurity and Sydney Medical School, University of Sydney, Sydney, Australia
  2. Centre for Infectious Diseases and Microbiology-Public Health, Institute of Clinical Pathology and Medical Research-Pathology West, Westmead Hospital, Sydney, NSW 2145, Australia
  3. School of Biological Sciences, Charles Perkins Centre, University of Sydney, Sydney, Australia
  1. Correspondence to: V Sintchenko vitali.sintchenko{at}


Whole genome sequencing (WGS) of pathogens enables the sources and patterns of transmission to be identified during specific disease outbreaks and promises to transform epidemiological research on communicable diseases. This review discusses new insights into disease spread and transmission that have come from the use of WGS, particularly when combined with genomic scale phylogenetic analyses. These include elucidation of the mechanisms of cross species transmission, the potential modes of pathogen transmission, and which people in the population contribute most to transmission. Particular attention is paid to the ability of WGS to resolve individual patient to patient transmission events. Importantly, WGS data seem to be sufficiently discriminatory to target cases linked to community or hospital contacts and hence prevent further spread, and to investigate genetically related cases without a clear epidemiological link. Approaches to combine evidence from epidemiological with genomic sequencing observations are summarised. Ongoing genomic surveillance can identify determinants of transmission, monitor pathogen evolution and adaptation, ensure the accurate and timely diagnosis of infections with epidemic potential, and refine strategies for their control.


  • Backward mutation: Change in a mutated gene that restores the original sequence

  • Branching events: Lineage splitting events that produce two or more separate genotypes

  • Core genome: Genes that are conserved among all strains of a pathogen

  • Dispensable genome: Partially shared or unique strain specific genes

  • Homoplasy: Similarity between sequences that is not due to their shared ancestry, in contrast to homology, which is similarity derived from a common ancestor

  • Host range: Collection of hosts that a pathogen can infect

  • Lateral gene transfer: Exchange of genetic material between bacteria not associated with reproduction

  • Molecular clock dating: Approximation of the dates of branching events using the “molecular clock” hypothesis that changes in the amino acid sequences, which can occur during evolution, take place at a regular rate

  • Nodes: Connection points in a network

  • Pathogen lineage: A subpopulation of microbial species with a defined virulence or ecological niche that differs from other subpopulations

  • Pathogen phylogeography: Spatial distribution of different phylogenetic lineages of pathogens

  • Pathogenicity island: Distinct genetic element on the chromosome of a pathogen that is responsible for its capacity to cause disease

  • Phylogenetic distance: Genetic distance between genomes of related pathogens represented as the number of mutations or evolutionary events on a phylogenetic tree

  • Phylogenetic resolution: Capacity to identify distinct lineages on a phylogenetic tree

  • Phylogenetic tree: Branching diagram that shows the evolutionary inter-relations of a group of pathogens usually derived from a common ancestor

  • Reassortment: Mixing of the genetic material into new combinations in different strains of the same pathogen

  • Recombination: Process of transfer and incorporation of genetic material from a donor cell to a recipient cell to increase diversity and adaptation potential

  • Selective sweeps: The reduction of genetic variation around the mutation site as the result of strong positive selection

  • Statistical machine learning: A field of computer science/artificial intelligence that enables computers to discover new associations between data without being explicitly programmed

  • Super spreaders: Highly infectious people who spread the pathogen to many other susceptible people

  • Systems science: An interdisciplinary field that studies the nature of systems—from simple to complex—in nature, society, and science itself

  • Variant calling: Identification of sites of difference (such as nucleotide polymorphisms), usually using computational algorithms


Recent advances in nucleic acid sequencing technology have made rapid whole genome sequencing (WGS) of pathogens technically and economically feasible.1 2 3 4 DNA sequencing has advantages over other methods of pathogen identification and characterisation used in microbiology laboratories. Firstly, it provides a universal solution with high throughput, speed, and quality and can be applied to any micro-organism.3 4 Secondly, it produces data that can be compared at national and international levels. Finally, its usefulness has been augmented by the rapid growth of public databases containing reference genomes,2 3 5 6 which can be linked to equivalent databases that contain additional clinical and epidemiological metadata (for example, the influenza research database).

The ability to focus on pathogen genomes at the scale of individual outbreaks is a major leap forward in biomedical science.3 7 It follows two previous breakthroughs in the investigation and understanding of communicable diseases. The first was the foundation of spatial epidemiology by John Snow, whose quasi case-control study identified the Broad Street water pump as the source of a cholera outbreak in London in 1854 (fig 1). An array of methods for the spatial and temporal analyses of infectious diseases is now available.


Fig 1 Three major breakthroughs that have enhanced the control of communicable diseases

The second breakthrough was the invention of solid culture media by Robert Koch three decades later, which enabled the identification of numerous bacterial pathogens as agents of human and animal disease. This revolutionised our understanding of the causes and pathogenesis of infectious diseases and facilitated the development of laboratory diagnostics, antimicrobials, and vaccines.

The ability to analyse the genomes of pathogens enables more rapid and precise identification than ever before as well as assessment of their virulence and drug resistance potential. In addition, phylogenetic and related methods of evolutionary analysis can be used to infer the origin and emergence of pathogens. Advances in pathogen genomics open a new frontier in biomedical science by moving the study of disease spread and transmission from the population level to the (individual) patient level, and from the estimation of potential sources of disease to the accurate identification of transmission chains.

Next generation sequencing refers to high throughput sequencing methods that allow the process to be performed in parallel, producing thousands or millions of sequences at once. This encompasses several different technologies that include but are not limited to sequence by synthesis, terminator based sequencing, and ligation based sequencing. The sequence by synthesis approach, notably pyrosequencing, means that nucleotide sequence data are generated during DNA synthesis rather than from the analysis of amplicons after synthesis, as is the case with traditional Sanger sequencing. Another technology based on the sequence by synthesis approach is semiconductor sequencing, which is used by the Ion Torrent system, where parallel sequencing reactions are carried out in 1.2 million microwells on the surface of a semiconductor chip. By contrast, sequencing by Illumina (Solexa) systems relies on reversible dye terminators. DNA molecules are first attached to primers on a slide and amplified so that local clonal colonies are formed. Four types of reversible terminator bases are added and non-incorporated nucleotides are washed away, allowing the fluorescently labelled nucleotides to be captured by a camera.8 More technical details can be found in recent reviews and will not be discussed here.2 4 5 8

This review summarises recent advances in WGS in relation to communicable disease transmission, developments that have the potential to substantially improve the detection and control of disease. It evaluates the added value of next generation sequencing in the control of communicable diseases and in associated translational research, and discusses new insights into the mechanisms of spread and transmission of bacterial and viral diseases. Emerging areas of research and clinical translation are also briefly discussed.

Sources and selection criteria

We searched PubMed and Google Scholar from January 2005 to August 2014 using the terms “whole genome sequencing” and “next generation sequencing”, with the filters “infection”, “infectious disease”, “communicable disease”, “transmission”, “spread”, “molecular epidemiology”, and “disease emergence”. The evidence was appraised for validity and relevance to public health practice. Priority was given to human and clinical studies over experimental, infection transmission modelling or animal studies, and to original research about pathogens with epidemic potential over review type publications. We also searched bibliographies of articles for relevant studies and selected translational medicine studies where possible. We also examined abstracts from the Wellcome Trust conference “Applied Bioinformatics and Public Health” (Cambridge, May 2013).

How has clinical medicine benefited from pathogen genomics?

Old and new definitions of pathogen diversity

Genomic studies have shown that microbial populations carry more genetic diversity, often in more complex forms, than previously thought. So how should the diversity of pathogens be defined?

A commonly used term in this context is “strain,” which has been defined as a group of isolates that share a particular set of phenotypic traits, although usually it simply refers to any phylogenetically distinct entity. However, with the increase in resolution provided by WGS, strains can now be further subdivided into genotypes, clones, lineages, and variants.

Classic microbiology has often viewed pathogen diversity in the guise of fixed or static “types,” whereas evolutionary and epidemiological studies depict pathogen genomic variation as a dynamic process,9 10 in which genetic diversity changes in time and space. Given our ability to sequence pathogen genomes in “real time” during disease outbreaks, the dynamic view of pathogen diversity is likely to be more appropriate.

The dynamic view of pathogen genomes is also supported by the growing body of literature showing that pathogen populations can experience rapid and profound changes in genetic structure that reflect and may affect their epidemiology.11 12 For example, selective sweeps (see Glossary) of advantageous mutations, such as those mediating escape from population immunity or those that confer antimicrobial resistance, lead to wide ranging (and sometimes genomic scale) reductions in genetic diversity.13 A good example of this process would be the gradual development of drug resistance in Mycobacterium tuberculosis during treatment. In the case of bacteria, population diversity can be represented by the “pathogenome” concept. This concept combines a “core genome,” which depicts a set of genes conserved among all strains, with a “dispensable genome,” which consists of partially shared or unique, strain specific genes (see Glossary).14 The larger the dispensable genome the greater the pathogen’s capacity to survive in hostile ecological niches and to be effectively transmitted to susceptible hosts.15

Genomic differences in pathogen virulence and transmissibility

Evidence suggests that pathogen lineages (see Glossary) may differ in virulence or transmission potential, or both.15 For example, the Beijing lineage of Mycobacterium tuberculosis seems to be more transmissible than other lineages and to be associated with pulmonary tuberculosis in younger patients.16 Although such differences have generally been harder to pinpoint in viruses,17 partly because of the speed with which genetic diversity is generated in these organisms, some examples have been identified,18 19 20 and more are likely to be found in the future with improved links between genotypic and phenotypic data.

The chikungunya virus is interesting because a single mutation enables human transmission through the highly successful anthropophilic vector Aedes albopictus, thereby increasing epidemic potential.21 Ultimately, analyses of this type may enable a form of “genomic risk assessment” involving the surveillance of mutations that affect virulence, transmission, or both, which in turn may guide intervention strategies. A high profile example is the identification of those mutations that facilitate the human to human transmission of highly pathogenic influenza A H5N1 virus,22 23 and similar risk assessment tools can probably be applied to other emerging pathogens.

Investigating the origin and spread of high impact pathogens

Pathogen genomics has provided new and more detailed explanations for the origin, patterns, and dynamics of spread of several important human pathogens.24 For example, the virtual absence of the meticillin resistant Staphylococcus aureus (MRSA) II sequence type 5 bacteria (which carry staphylococcal toxic shock toxin) from Germany indicated that most of these clinically important micro-organisms originated recently (within the past 15 years) from a very small imported population.25

The origin of and factors responsible for the emergence of the devastating influenza pandemic of 1918-19, as well as its relatively high mortality in young adults are still unclear,26 whereas the origins of the pandemic A/H1N1 2009 influenza outbreak were rapidly reconstructed through genomics.27 Furthermore, the genome-wide analysis of Streptococcus pneumoniae identified 147 genes needed for survival in human saliva and transmission by droplet spread, including those involved in cell envelope synthesis and cell transport.28

On an entirely different time scale, the sequencing of Yersinia pestis bacteria isolated from teeth disinterred from the East Smithfield Black Death burial ground in London dating to 1348 showed that the plague strains associated with this pandemic were similar to those currently circulating, although associated mortality was far greater.29 This is compatible with the idea that the high mortality seen in the medieval plagues (and the sixth century Plague of Justinian30) was more likely to be associated with epidemiological circumstances (overcrowding, poor general health and living conditions) than with bacterial encoded virulence.

New insights into disease spread and transmission provided by genome sequencing

The table lists the main applications of genome sequencing, with specific examples of the added value that it brings to communicable disease control. It shows that studies undertaken so far have focused on infections that are directly transmitted through contact, food, or water. More complex microbial transmission pathways involving zoonotic and environmental reservoirs and intermediate hosts still await systematic examination at the genomic scale.

Applications of pathogen genome sequencing to communicable disease control

Cross species transmission and host adaptation

Advances in genomics have been instrumental in identifying the factors that facilitate the successful cross species transmission of emerging pathogens. One of the main observations is that more host adaptive mutations are needed, often in multiple genes, as the phylogenetic distance (see Glossary) between the donor and recipient species increases.100 Hence, despite the rapidity of microbial evolution, it may be a major adaptive challenge to acquire the multiple changes needed to increase host range (see Glossary), and recombination (see Glossary) (or reassortment (see Glossary) in the case of influenza virus) might be a more efficient way to place host adaptive mutations in the same genotype.

Once cross species transmission of a pathogen has occurred various outcomes are possible. These range from infections controlled by a host’s immune system to lethal disease, both of which are likely to result in “spill over” (or dead end) infections with no subsequent transmission, as well as those in which the pathogen is able to evolve sustained (epidemic) transmission in the human population.100 Clearly, to assess the risks of the emergence of new strains and to design effective control strategies, it is important to identify the factors that drive the efficacy of cross species transmission. However, the genomic basis of the pathogen host range has been resolved in only a small number of cases, which indicates that further research is needed.10 17

Work in this area has also confirmed the role of co-infection (simultaneous infection) and superinfection (second infection superimposed on an earlier one) as facilitators of lateral gene transfer (see Glossary).101 For example, individual patients can be colonised with multiple strains of Acinetobacter baumannii, which are then capable of recombining. The movement of patients and staff between healthcare facilities also contributes to strain mixing and diversification,48 and it has been an important factor in the rise of antimicrobial resistance.

Intra-host and inter-host pathogen evolution and transmission

The genetic and phenotypic variation present in bacteria and viruses is generated within individual hosts and, in the case of some bacteria, in the environment. The rate at which genetic and phenotypic diversity is generated is central to understanding the ability of pathogens to adapt to and spread within host populations.78 87 102 This rate depends on four major factors: the population size of the pathogen; its mutation rate; the frequency of replication (assuming that mutations occur during replication); and the fitness of the mutations produced, with advantageous mutations fixed faster than neutral ones.

Replication can occur for extended periods in immunocompromised people, and such people may represent an important reservoir for the emergence of genetically and phenotypically distinct variants.64 103 About one mutation occurs in every replication cycle in RNA viruses, so these viruses are expected to be particularly rapid generators of genetic variation, even though most mutations will be deleterious.10

Despite this propensity to generate genetic variation, population bottlenecks are probably common during inter-host transmission and this will greatly restrict genetic diversity. In some cases, such as certain modes of HIV-1 transmission, infections can be initiated by a single viable virus particle,104 and this will put a strong brake on adaptive evolution at the epidemiological scale, although wider bottlenecks are seen in other viruses such as influenza virus and foot and mouth disease virus.105 106 Also, rates of pathogen evolution may vary within and among hosts, reflecting the different selection pressures in these circumstances. For example, evolutionary rates in HIV-1 are consistently lower at the epidemiological (inter-host) scale than within individual hosts.9 Similarly, although inter-host transmission is often initiated by a randomly generated variant in the donor host, it is possible that the variant transmitted may not be representative of the donor’s viral population or that specific variants may preferentially outgrow in the new host.

Non-invasive bacterial disease and colonisation as enablers of transmission

One of the most striking features of disease transmission is the varying infectiousness of individual hosts, which has a major impact at the epidemiological scale. Indeed, the roles of chance and genetics in the progression from carriage to invasive disease are yet to be fully determined in most cases.107 Recent epidemiological studies using comparative genomics have emphasised the relative contribution of asymptomatic colonisation in disease transmission. For example, comparison of Clostridium difficile genomes suggested that many infections were not caused by recent transmission from a symptomatic person, highlighting the potential importance of asymptomatic carriage or multiple introductions from an environmental source.60 The same is true of many blood borne viral infections, such as HIV, hepatitis B virus, and hepatitis C virus.

Intra-host heterogeneity and modes of transmission

It is also possible that the analysis of intra-host heterogeneity during outbreaks may provide important clues to potential modes of pathogen transmission. For example, exposure to a large inoculum of a pathogen (such as uncooked food heavily contaminated with salmonella or hepatitis A virus in raw sewage) is expected to result in productive infection by a large and potentially heterogeneous microbial population.108 109

By contrast, exposure to a relatively small infectious dose of micro-organisms, such as that transmitted through an aerosol or insect vector, would probably lead to infection by a smaller and more homogeneous microbial population.108 109 Thus, the sequencing of the genomes of pathogens obtained from patients during outbreaks may offer additional clues to the precise mode of pathogen transmission, especially when microbial cultures of implicated micro-organisms from environmental samples are not available.

As another example, in a 3.6 year study of C difficile isolates from 1200 patients residing in a defined geographical area, 75% of infections were not transmitted from symptomatic patients, indicating that numerous sources of infection should be targeted to prevent exposure.55 WGS data also seemed to be sufficiently discriminatory to target cases linked to community or hospital contacts and prevent further spread, and to investigate genetically related cases without a clear epidemiological link to uncover novel routes of transmission.55 Clearly, however, additional studies are needed to fully resolve the association between intra-host microbial diversity and mode of transmission, and how much inferential power this provides. WGS is likely to be particularly informative in the case of RNA viruses, in which genetic diversity accumulates so rapidly.

Case study: the evolution and transmission pathways of hospital acquired MRSA

New methods of genome sequencing have helped answer old questions in the clinical diagnosis and epidemiology of MRSA, one of the most important public health problems in developed countries.38 For example, some studies have used the “molecular clock” analysis of genome sequence data (based on the assumption that nucleotide substitutions accumulate at a constant rate) to estimate the time to the most common recent ancestor (TMCRA) of specific MRSA lineages and hence the dates of presumed transmission events.110 111 Accordingly, the mean substitution rate is estimated at 3.3×10⁻⁶ to 7.6×10⁻⁵ substitutions per site per year, which corresponds to about one new single nucleotide polymorphism (SNP) in the core MRSA genome every six weeks.62
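
The arithmetic linking a per-site substitution rate to an interval between SNPs can be checked with a short sketch. It assumes a core genome of about 2.8 million sites (an illustrative value that is not stated in the text) and uses the lower bound of the quoted rate; the function name `weeks_per_snp` is hypothetical.

```python
def weeks_per_snp(rate_per_site_per_year: float, genome_sites: int) -> float:
    """Expected number of weeks between new SNPs across the whole genome."""
    snps_per_year = rate_per_site_per_year * genome_sites
    return 52.0 / snps_per_year


# Assumed core genome length (illustrative; not taken from the text).
CORE_GENOME_SITES = 2_800_000

# Lower bound of the quoted rate, in substitutions per site per year.
interval = weeks_per_snp(3.3e-6, CORE_GENOME_SITES)
print(f"about one new SNP every {interval:.1f} weeks")
```

Under these assumptions the lower bound of the range gives an interval close to the six weeks quoted; higher rates within the range would imply a correspondingly shorter interval.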

Figure 2 shows the information that can be derived from analyses of this kind. In this example, the TMCRA for isolates 3 and 4 had to be after infection of patient 3 but before transmission to patient 4. Similarly, because nosocomial MRSA infections probably occur on a time scale of weeks rather than months, direct transmission links can be excluded if the TMCRA of a pair of isolates is estimated to occur over long time scales.58 59
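
The reasoning used to exclude direct links can be illustrated with a minimal strict clock calculation. It assumes that two isolates sampled at about the same time differ by d = 2 × rate × sites × t SNPs on average, because mutations accumulate on both branches since the common ancestor. The rate and genome length are the same illustrative assumptions as before, the function name is hypothetical, and the result is a point estimate only, not the Bayesian dating used in the cited studies.

```python
def years_since_common_ancestor(
    snp_distance: int, rate_per_site_per_year: float, genome_sites: int
) -> float:
    """Strict-clock point estimate: solve d = 2 * rate * sites * t for t."""
    if snp_distance < 0:
        raise ValueError("snp_distance must be non-negative")
    return snp_distance / (2.0 * rate_per_site_per_year * genome_sites)


# Illustrative assumptions, not values prescribed by the text.
RATE = 3.3e-6                  # substitutions per site per year
CORE_GENOME_SITES = 2_800_000  # assumed core genome length

for snps in (0, 3, 20):
    years = years_since_common_ancestor(snps, RATE, CORE_GENOME_SITES)
    print(f"{snps} SNPs apart: common ancestor roughly {years * 52:.0f} weeks before sampling")
```

A pair of isolates whose estimated common ancestor lies many months before sampling is hard to reconcile with direct transmission on a time scale of weeks, which is the logic of the exclusion described above.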


Fig 2 Putative relationships between patients and their pathogen genomes. (A) Each infection is caused by a population of genetic variants within an individual host. Transmission events between patients are indicated by dotted black arrows. A colonising population may evolve between the time of infection and the onset of symptoms (in the same patient), when a strain is usually isolated (blue dot). The time to the most common recent ancestor (TMCRA) of two isolates is shown as a green dot. (B) Most probable transmission pathways—size of the patient reflects the heterogeneity of the population (observed within host diversity based on sampled isolates) and the thickness of the arrow represents the likelihood of the link (inversely proportional to the number of single nucleotide polymorphisms in sequenced genomes)

This approach was verified in investigations of MRSA outbreaks, which confirmed an important role for asymptomatic carriers.59 Although these types of analyses are potentially powerful they make several simplifying assumptions:

  • That a single genome is the founder of each new infection (which is probably true for MRSA)58 59 75

  • No recombination has occurred

  • Neutral evolution (an absence of natural selection on fixed mutations)

  • The rate of nucleotide substitution is constant, particularly in the core genome regions.

Obviously, these assumptions will not be valid in all cases, and this may lead to erroneous estimates. For example, it has been postulated that the evolutionary rate of MRSA differs between patients with systemic infection and asymptomatic carriers. In support of this notion, rapidly evolving “hyper-mutating” MRSA-15 strains have been described during an outbreak in a neonatal intensive care unit.110

WGS has also provided important insights into the origin and regional spread of a healthcare associated epidemic MRSA-15 clone belonging to sequence type 22 (ST22), which was highly transmissible and produced sustained infection.38 Phylogenetic analysis of 193 sequenced isolates showed that the currently circulating MRSA-15 clone is descended from an MRSA epidemic in English hospitals, which emerged from a community associated meticillin sensitive population of S aureus. Modelling of the spread of a fluoroquinolone resistant variant of MRSA ST22 suggested that it originated in the Midlands in the 1980s, and was initially restricted to this geographical region. It then spread rapidly to the north, reaching Scotland, and also to the south, arriving in London five years later.38 The epidemic had spread across the United Kingdom through multiple routes by 2000, and then globally (fig 3).


Fig 3 Visualisation of micro-evolution and regional spread of successful pathogens. (A) Phylogenetic tree based on whole genome sequencing of meticillin resistant Staphylococcus aureus (MRSA) isolates associated with the outbreak in an intensive care unit. Sequences of isolates obtained from 14 patients (P1-14) show low levels of genomic variation (as measured by single nucleotide polymorphisms; SNPs) of MRSA within the outbreak that lasted around 220 days (adapted, with permission, from Harris and colleagues).56 (B) Reconstruction of the spread of sequence type 22 (ST22) in the UK. A continuous spatial diffusion model was used to reconstruct the finer scale geographical dispersal of ST22-A2 within the UK and to predict the origin of fluoroquinolone resistance. Lines indicate the inferred routes of spread with 80% Bayesian credible intervals for the latitude and longitude of spread shown as green ovals. The timing of transmission events is represented by red (oldest) or black (more recent) lines and light to dark green oval shading (adapted, with permission, from Holden and colleagues)38

Importantly, the use of MRSA genome sequencing in infection control enables comprehensive and rapid identification of transmission pathways in hospital and community settings.10 53 Its advantages include:

  • Resolution is high enough to assess transmission events indicated by conventional methods of MRSA typing and to identify otherwise unsuspected transmission events58 59

  • Hospital acquired outbreaks can be identified months earlier than when identification is based on epidemiological clustering of cases

  • It can benchmark the accuracy of infection control investigations of MRSA outbreaks

  • It can identify or confirm carriage of MRSA that allows the outbreak to persist.59

In one hospital based study, 26 MRSA isolates were successfully sequenced and analysed within five days of culture, leading to the identification of two outbreaks.53 In both outbreaks most sequences were indistinguishable and the others had only three mutations, while epidemiologically unrelated strains were genetically distinct (>20 SNPs).53 An evaluation of nosocomial transmission of S aureus in an endemic setting of a critical care unit using WGS identified 44 acquisition events, with only a minority explained by patient to patient transmission.59 Finally, genome sequencing has helped to define the transmission of MRSA within hospitals for use in clinical trials (existing definitions relied on the recovery of MRSA from two or more patients within 10 days, two weeks, or a three week transmission period in the same ward).47

Analysis of the spread of pathogens at different scales

Real time outbreak analysis

Bench top sequencers allow a rapid proactive approach to pathogen surveillance that can identify recently accumulated genetic variation. Such studies look at pathogen “microevolution,” which occurs over weeks or months of transmission,112 as opposed to long term “macroevolution,” which reflects the process of microbial speciation that occurs over thousands or millions of years. Because of its short time frame, the analysis of pathogen microevolution requires high resolution genomic data—for example, data that can discern differences in several nucleotides between two bacterial genomes of more than 3 Mb in length. This new capacity to provide (near) real time data on the origin and transmission dynamics of pathogens could provide a major public health benefit.113

The 2014-15 outbreak of Ebola virus in west Africa provides a recent and high profile example. WGS has shown that this outbreak derives from a single transmission from a natural zoonotic (bat) reservoir that probably occurred in early 2014 (fig 4A).71 It is currently unclear whether the genotype of this particular Ebola virus facilitates more extensive human to human transmission or whether the scale of the outbreak reflects an early seeding of the virus in urban populations in west Africa. However, the rapid production of genome sequence data will enable this question to be answered more quickly. These data may also provide important information on the major sources of transmission, which will in turn inform disease control strategies.


Fig 4 Global and regional spread of successful epidemic lineages. (A) Phylogenetic tree depicting the ongoing regional spread of Ebola virus in west Africa in 2014, and the molecular clock dating of the time to the most common recent ancestor of the 2014 outbreak (95% credible intervals of 27 January to 14 March 2014) and that of the Sierra Leone viral lineages (95% credible intervals of 2 April to 13 May 2014). Posterior probability distributions of the estimated times to the most common recent ancestor are overlaid below (adapted, with permission, from Gire and colleagues).71 (B) Worldwide spread of Salmonella enterica serovar Kentucky ST198 CIPR clone. The clone originated in Egypt (adapted, with permission, from Le Hello and colleagues)114

Similar real time analysis of the recently emerged Middle East respiratory syndrome coronavirus in Saudi Arabia identified the patterns of spread.36 With respect to bacteria, the study of the global spread of Salmonella enterica serotype Kentucky ST198 clone serves as another example of how the analysis of pathogen genomes provides data of international public health importance.115 In particular, this clone originated in Egypt (fig 4B) and has been associated with fluoroquinolone resistance.114 115

Analysis of person to person transmission networks

One of the most important benefits stemming from the use of WGS in an epidemiological context is the ability to resolve individual patient to patient transmission events. Recent pathogen transmission between patients in densely sampled outbreaks can be inferred when they cluster together on phylogenetic trees (see Glossary) (if a sufficiently large sample of background lineages has been obtained) or from the genomic distances between genomes of pathogens isolated from them.116 117

Genetic distance between strains increases with time (as the number of individual transmission events also increases).80 This forms the basis of molecular contact tracing: for example, the recovery of identical sequences in different patients is compatible with (although not confirmation of) direct transmission between them.
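The notion of genetic distance used in molecular contact tracing can be illustrated with a minimal sketch in Python. It assumes that the genomes have already been aligned to a common reference and that distance is counted simply as the number of differing sites, with ambiguous bases and gaps ignored. The function names, isolate labels, and toy sequences are hypothetical and do not come from any study cited here.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count sites at which two aligned sequences differ.

    Sites where either sequence has an ambiguous base or a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_distances(alignment):
    """Return {(name_a, name_b): SNP distance} for every pair of isolates."""
    return {
        (a, b): snp_distance(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (illustrative only, not real data)
alignment = {
    "patient_1": "ACGTACGTAC",
    "patient_2": "ACGTACGTAC",
    "patient_3": "ACGTTCGTAA",
    "patient_4": "ACNTACGTAC",
}
distances = pairwise_distances(alignment)
```

In this toy example, patients 1 and 2 carry identical sequences, which is compatible with direct transmission but does not establish it. Patient 4 is also at distance zero from both, but only because the ambiguous site is ignored.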

The reconstruction of transmission networks is plausible for Gram positive bacteria with the relative rarity of backward mutations and homoplasy (see Glossary).70 92 96 By contrast, the extreme rapidity of RNA virus evolution potentially allows individual transmission events (who infected whom) to be determined with accuracy, particularly if the intra-host variation (a common output of next generation sequencing at high coverage) can be incorporated into an analysis to improve phylogenetic resolution (see Glossary).117 118

Some studies have defined thresholds for the number of SNPs shared by independent isolates needed to infer involvement in the same transmission cluster.4 45 53 64 However, because the mutation rate of different lineages of the same species may differ,75 and threshold values could be affected by the times of infection and sampling, these “rules” should be used with caution.
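How a SNP threshold translates into putative transmission clusters, and why the result is sensitive to the value chosen, can be sketched as follows. This is an illustrative single linkage grouping under an assumed threshold, not the method of any particular study cited above. The function name, the isolate labels, and the distances are hypothetical.

```python
def cluster_by_threshold(distances, threshold):
    """Single linkage clustering of isolates.

    Two isolates fall in the same cluster if they are connected by a chain
    of pairs whose SNP distance is at or below the threshold.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        root_a, root_b = find(a), find(b)
        if d <= threshold and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for isolate in parent:
        clusters.setdefault(find(isolate), set()).add(isolate)
    # Largest clusters first, then alphabetical, for a stable output order
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical pairwise SNP distances between four isolates
distances = {
    ("A", "B"): 1,
    ("A", "C"): 4,
    ("B", "C"): 3,
    ("A", "D"): 40,
    ("B", "D"): 41,
    ("C", "D"): 38,
}
strict = cluster_by_threshold(distances, threshold=2)
relaxed = cluster_by_threshold(distances, threshold=5)
```

With the stricter threshold isolate C is excluded from the cluster formed by A and B, whereas the relaxed threshold includes it. This illustrates why fixed cut-off values need to be interpreted in light of mutation rates and sampling times.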

Several studies have inferred disease transmission from rich WGS datasets that identify microbial diversity not captured by traditional typing methods.117 118 119 These studies highlight important differences in the interpretation and characteristics of phylogenetic and transmission trees. For example, the timing of nodes (see Glossary) in the transmission tree corresponds to the point of transmission, whereas the timing of nodes in phylogenetic trees of genomic data reflects branching events (see Glossary) that may have taken place before transmission. Hence, intra-host and inter-host evolution change the association between transmission and phylogenetic trees.120 Although Bayesian methods have been developed that jointly estimate transmission and phylogenetic trees and capture within host pathogen dynamics,117 118 119 in many cases there may be too much uncertainty to resolve person to person transmission networks, even with WGS data.


In summary, pathogen genomics provides a new line of evidence that complements existing epidemiological tools in the reconstruction of transmission networks, including the identification of previously unrecognised epidemiological links. Figure 5 shows how genomic data can be mapped to epidemiological curves to make inferences in this area. It combines two ways of organising information from outbreaks: the number of cases over time that satisfy the epidemiological case definition of the outbreak, and phylogenetic trees that show the evolutionary and putative transmission relationships between pathogen genomes obtained from outbreak cases. Through such genomic analyses it is possible to determine the origin of disease outbreaks and transmission dynamics in hospital and community settings; estimate the timing of patient to patient transmission events; and differentiate cases of recurrent or relapsing infection (failure of treatment) from reinfection (failure of public health).44


Fig 5 Synthesis of epidemiological curves with genomic scale phylogenetic trees. Cases associated with a specific outbreak are ordered in the matrix according to the phylogeny of genomes recovered from culture confirmed cases and displayed along the left. (A) The shape of the epidemic curve (histogram) suggests a point source outbreak in which all cases are exposed within one incubation period. (B) The epidemic curve shows an example of a propagated (ongoing) outbreak, in which secondary person to person spread occurs with successive peaks, distanced one incubation period apart. The phylogenetic tree suggests that case X may not be part of this outbreak because it is more closely related to the reference isolate than those from the outbreak cases

Clinical applications of developments in pathogen genomics

Active high resolution public health laboratory surveillance

The effectiveness of public health interventions for the control of communicable disease is limited by the low resolution of current surveillance methods and an incomplete understanding of disease spread, which is largely based on retrospective outbreak investigation.

Importantly, WGS enhanced surveillance allows cases that are misclassified by other surveillance methods to be implicated or ruled out of an outbreak. For example, a comparison of WGS with conventional typing methods in the investigation of an outbreak of Shigella sonnei in the Orthodox Jewish community in the UK showed that the strains originally implicated in the outbreak formed three phylogenetically distinct clusters. One cluster represented cases associated with recent exposure to a single strain, whereas the other two represented distinct (although related) strains of S sonnei circulating in the UK. These observations informed infection control measures within local schools and allowed a stronger public health message to be passed to the local community.82

Genomics enhanced surveillance with radically improved resolution appeals to public health professionals dealing with increasingly complex outbreaks where trace back is complicated and labour intensive.121 Genomics enhanced surveillance relies on the assumption that the epidemiological link between patients can be reliably inferred from the similarity between pathogen genomes. Although this is feasible, particularly when combined with phylogenetic methods that enable transmission pathways to be inferred in detail,116 117 118 119 the identification of epidemiological links is a process of statistical inference, and hence comes with an associated sampling error.

Studies in this area depend heavily on the quality of genome sequencing, assembly, and the choice of reference genomes, but they have been instructive in the deciphering of different outbreaks.6 In addition, frequent recombination can greatly complicate the inference of evolutionary history and estimates of times of origin, so that all such estimates should be treated with caution.96 122 However, when the assumptions are upheld and sampling is representative, phylogenetic analyses help identify the temporal and geographical origin of epidemics and the dominant transmission routes responsible for the global dissemination of pathogens.7 13 This knowledge is important for risk assessment of future pandemic variants, public health interventions, and the design of effective vaccines.

Proactive disease control guided by the identification of transmission pathways

WGS guided surveillance promises the rapid and precise identification of bacterial transmission pathways in hospital49 53 113 and community settings,51 68 70 76 with concomitant reductions in infections, morbidity, and costs.59 91 Because WGS offers unprecedented resolution for determining degrees of relatedness among bacterial and viral isolates, it complements existing epidemiological tools by allowing reconstruction of recent transmission chains and identification of sequential acquisitions and otherwise unrecognised epidemiological links.45 76

For example, investigations of hospital outbreaks of MRSA and C difficile by WGS have allowed discrimination between apparently similar isolates collected within a short time frame.53 110 Recent studies have also shown that WGS can detect super spreaders (see Glossary), predict the existence of undiagnosed cases and intermediates in transmission chains,4 46 48 58 60 suggest likely directionality of transmission, and identify unrecognised risk factors for onward transmission.58 61 Such data can help stop or minimise outbreaks, inform the design and evaluation of intervention programs, and optimise the allocation of public health resources.121

A growing body of statistical methods is aimed at inferring transmission networks and contact structures from pathogen genomic data, with and without contextual epidemiological information.116 117 118 119 120 These methodological developments provide new opportunities for proactive laboratory surveillance and a better understanding of the epidemiology of high burden infectious diseases.3 12 65 122 123

For example, examination of the genomic diversity of Salmonella typhimurium in co-located human and animal populations in the UK showed that a large proportion of transmissions occurred between humans, challenging the current belief that human to human transmission is uncommon in the developed world.92 122 WGS enabled reconstruction of transmission networks also suggested that branched transmissions—where one case causes more than one secondary case—are more common than linear stepwise case to case transmissions.92 It also identified more potential super spreaders than previously assumed.77 78

Genomics based estimation of likely transmission pathways can greatly improve the practice of tracking transmissions,108 and it reduces dependency on epidemiological data, which are more difficult to collect and often incomplete. For example, phylogenetic analysis has been integrated into forensic-style analyses to identify transmission events and the sources of specific outbreaks. Phylogenetic analysis combined with molecular clock dating (see Glossary) of hepatitis C virus was used to reconstruct a large scale outbreak of hepatitis that stemmed from a single anaesthetist in Spain. Notably, dates of infection predicted from the molecular clock analysis correlated well with dates of infection estimated from patient medical records.124
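The principle behind molecular clock dating can be illustrated with a deliberately simplified sketch: a least squares regression of root to tip genetic distance against sampling date, in which the slope estimates the substitution rate and the intercept on the time axis estimates the date of the most recent common ancestor. This is a simpler technique than the formal phylogenetic methods used in the studies cited, and it assumes a strict clock and a correctly rooted tree. The function name and all numbers are hypothetical.

```python
def root_to_tip_regression(sampling_times, root_to_tip):
    """Ordinary least squares fit of root-to-tip distance on sampling time.

    Returns (rate, tmrca): the slope in substitutions per site per unit of
    time, and the time at which the fitted line crosses zero distance.
    """
    n = len(sampling_times)
    if n != len(root_to_tip) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_t = sum(sampling_times) / n
    mean_d = sum(root_to_tip) / n
    sxx = sum((t - mean_t) ** 2 for t in sampling_times)
    sxy = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(sampling_times, root_to_tip))
    rate = sxy / sxx
    if rate <= 0:
        raise ValueError("no positive temporal signal in these data")
    intercept = mean_d - rate * mean_t
    tmrca = -intercept / rate
    return rate, tmrca


# Hypothetical sampling dates (decimal years) and root-to-tip distances
# (substitutions per site); illustrative values only
times = [2010.0, 2011.0, 2012.0, 2013.0]
dists = [0.002, 0.003, 0.004, 0.005]
rate, tmrca = root_to_tip_regression(times, dists)
```

Real datasets rarely fit a straight line this well, and formal analyses report credible intervals around such estimates rather than single values.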

Genomics enhanced clinical risk assessment

The genomes of viruses, bacteria, fungi, and parasites can be rapidly identified and characterised directly in clinical specimens or from laboratory cultures.31 32 WGS of these pathogens can also inform management decisions in complex cases. For example, patients requiring prosthetic devices who are chronically colonised with Shiga-like toxin producing Escherichia coli (STEC) may be denied function saving surgery because of the potential risk for the development of STEC associated haemolytic uraemic syndrome caused by the perioperative use of broad spectrum antimicrobial prophylaxis.125 In such circumstances, analysis of the core and especially dispensable genomes of isolates from individual patients can help ascertain the patient’s risk of progression from long term colonisation to haemolytic uraemic syndrome. For example, patients with an uncommon serotype and sequence type of STEC identified by genetic diversity within the somatic antigen encoding operon, flagellin genes, and MLST genes are at lower risk. The presence or absence of marker genes at the LEE locus or in pathogenicity islands (see Glossary), which are associated with a high virulence of STEC in humans, can also be useful in such assessments.125

Research is accumulating on radically new bioengineering approaches to controlling microbial infections with virulence altering drugs that can disrupt genes and gene complexes.126 Genes and mobile genetic elements can be readily shared among microbial clones, and the identification of genomic markers of drug resistance represents another important element of clinical risk assessment.127

Pathogen genome sequencing

The challenges of data analysis and integration

Although pathogen genomics offers much to medical science, it also presents serious challenges for data analysis, storage, and sharing, as well as the interpretation and management of data by clinicians and public health authorities. The full power of genomic sequence data will not be realised until the data are combined with clinical and epidemiological metadata, and the linking of these data types presents several important technical challenges.128

Particular attention has to be paid to the time of the sampling (a patient might have been infected with different strains over time) and the completeness of the sampling frame (undiagnosed cases might be responsible for transmission). In addition, several aspects of quality control are important for the harmonisation of sequencing data analysis. Firstly, the accuracy of identifying variants depends on the depth of sequence coverage. Increased coverage improves variant calling (see Glossary), whereas low coverage increases the risk of missing variants (false negatives) and assigning incorrect allelic states (false positives). Although relatively low sequencing coverage and the analyses of only the core genome might be acceptable for the identification of pathogens, higher coverage is expected for the WGS of bacterial pathogens of public health concern as part of ongoing laboratory surveillance or outbreak investigation.121 129 130 Higher coverage is also needed for the reliable detection of genomic sequences from potentially mixed cultures, including those that contain drug susceptible and resistant versions of the same strain.
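The relation between sequencing depth and the chance of detecting a minority variant, such as a resistant subpopulation in a mixed culture, can be approximated with a simple binomial calculation. The sketch below assumes that reads sample the population independently and ignores sequencing error, so it gives an optimistic estimate. The function name, the 10% variant fraction, and the requirement of four supporting reads are hypothetical choices for illustration.

```python
from math import comb


def detection_probability(depth, variant_fraction, min_supporting_reads):
    """Probability that at least `min_supporting_reads` of `depth` reads
    carry a variant present at `variant_fraction` in the sample.

    Assumes independent reads and no sequencing error (binomial model).
    """
    p_too_few = sum(
        comb(depth, k)
        * variant_fraction ** k
        * (1 - variant_fraction) ** (depth - k)
        for k in range(min_supporting_reads)
    )
    return 1.0 - p_too_few


# Hypothetical scenario: a variant carried by 10% of the population,
# called only if at least four reads support it
low_coverage = detection_probability(20, 0.10, 4)
high_coverage = detection_probability(100, 0.10, 4)
```

Under these assumptions the variant is far more likely to be detected at 100-fold than at 20-fold coverage, which is consistent with the expectation of higher coverage for mixed cultures.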

Secondly, an increasing number of commercial and “open source” programs with different performance characteristics are available for the mapping and assembly of short reads.131 However, it is not certain that public health laboratories will converge towards a few “validated” pipelines. The (often complex) assembly, alignment, filtering, and SNP calling processes must be fully disclosed for a study to be reproducible and investigation methods comparable between laboratories.

Thirdly, the choice of reference genomes for sequence alignment can greatly affect interpretation.129 Indeed, phylogenetic and other analyses may need to be repeated in light of initial results to select references that provide the highest resolution for an outbreak cluster. Fortunately, several international initiatives support proficiency testing in microbial genome sequencing for public health.122 132

Sharing genome sequence data to “crowd source” epidemic analysis

The speed at which genomic data are being generated and increased access to public databases have aided global responses to newly emerging diseases. Equally importantly, this has shifted the bottleneck in bioscience capacity from the generation of data to its analysis. Major advances in the development and provision of online resources for storing and sharing genome sequencing data, including rapid publication avenues such as PLoS Currents Outbreaks, will be of great importance in epidemic situations. Despite this, sequencing data alone may not be sufficient to identify transmission events and pathways accurately, and genomic data should routinely be analysed in conjunction with other types of epidemiological and clinical evidence.7 11 128 133 Such data synthesis will require stronger collaboration between epidemiologists, microbiologists, and specialist bioinformaticians.

System science’s (see Glossary) view of hosts and pathogens and their interactions has been replacing the reductionism that dominated biomedicine for several centuries. There is a growing awareness that diseases are often caused by multiple pathogens, in contrast to the paradigm, dominant in microbiology since the time of Robert Koch, that a single disease is caused by a single microbial species or strain.

It is also possible that statistical machine learning (see Glossary) and network approaches to studying technological, social, and biological systems and their quantifiable organising principles may offer new opportunities to examine processes of disease transmission.129 130 134 135 Similarly, recent advances in phylogenetic methodology allow increasingly complex characteristics to be inferred from the analysis of pathogen genomes, enable explicit links to be made between pathogen genotype and phenotype, and have greatly improved pathogen phylogeography (see Glossary).128 136 137 138


The availability of bench top WGS analysers has facilitated combined genomic and epidemiological approaches to investigating outbreaks and infection control. Ongoing genomic surveillance can identify determinants of transmission, monitor pathogen evolution and adaptation, ensure accurate diagnosis of infections with epidemic potential, and refine strategies for their control. Crucially, the evolutionary analysis of pathogen genome sequence data allows epidemiological hypotheses to be tested, potentially in real time, and this will greatly enhance the management and control of communicable diseases.

Pathogen genomics has revolutionised our understanding of the mechanisms of disease transmission. We have moved from the view that all forms of disease spread are alike in that they are chiefly mechanical extensions of human contact or of contact between a sick animal and a human body. Instead, we now understand that transmission is an evolutionary event that can have a major impact on the extent and structure of genetic diversity as it flows through the population. Overall, the insights provided by pathogen genomics have fundamentally changed our collective understanding of long term global and short term local spread of communicable diseases and have opened up potential new strategies of disease control and prevention.

Ongoing research questions

  • How does identification of the factors that drive cross species transmission and establishment in new hosts influence the emergence of new pathogens and the design of effective control strategies?

  • Does phylogenetics have sufficient resolution to accurately reconstruct transmission pathways and to date transmission events within community outbreaks and the transmission of newly emerged pathogens?

  • Can phylogenetic and molecular clock dating studies discover unsuspected epidemiological links and identify the time scales of disease outbreaks?

  • Is the extent of intra-host diversity in microbial populations affected by modes of transmission?

  • How can next generation sequencing data further enhance models that can infer the environmental source of food borne bacteria (that is, source attribution models for foodborne diseases)?

  • What are the most appropriate models and techniques to infer patient to patient transmission networks from whole genome sequencing data?

  • What are the ethical and medico-legal implications of identifying sources of disease outbreaks among asymptomatic carriers and healthcare workers?




  • Both authors were supported by the Australian National Health and Medical Research Council.

  • Contributors: Both authors had full access to the content of this review, wrote the manuscript, and are guarantors.

  • Competing interests: We have read and understood BMJ policy on declaration of interests and declare the following interests: none.

  • Provenance and peer review: Commissioned; externally peer reviewed.

