Pathogen discovery through deep sequencing

By: John Brunstein   

Immunocompetent humans can potentially be infected by a large number of bacterial or viral pathogen species. Through long association with most of these common pathogenic organisms, clinicians know their routes of infection, symptoms, disease courses, and treatments. “Common,” in fact, isn’t really the right adjective in the preceding sentence, as there are many pathogenic organisms that occur quite infrequently in human disease but are still described in the medical literature.

In spite of this depth of knowledge, novel pathogens (or possible pathogens, as we’ll see below) continue to be found in the context of human disease. Our topic this month is where these arise from, and how the molecular diagnostic approach of deep sequencing can help in speeding up our identification of as-yet unknown organisms.

Discovering novel pathogens

The discovery of a novel pathogen often occurs in one of two scenarios. The first is the known or suspected “jump across the species barrier.” An established pathogen in one species gains the capacity to productively infect humans: that is, to infect, replicate to produce viable infectious progeny, and have a viable mode of onward transmission. This jump requires both opportunity, in the form of humans in transmission proximity to the original pathogen, and suitable genetic diversity in the pathogenic species, so that a small fraction of its progeny carry genetic changes that allow successful adaptation to a new host.

Not surprisingly, many pathogens appearing through this scenario are RNA viruses with arthropod vectors. RNA replication is intrinsically error-prone, giving rise to the concept of a “quasi-species swarm” of many small variations of the infecting sequence. That means there’s a fertile testbed of genetic variants from which to draw adaptation, and the arthropod, such as a mosquito, provides the proximity by biting both host and human, thus serving as vector. Introduction of novel pathogens to humans by this route has been rationally proposed to be accelerated by human ingress to previously remote areas which harbor pathogen and host populations to which we as a species have not had significant prior exposure.

While the number of viruses available to make such a species jump is unknown, studies such as that by Anthony and coworkers (1) provide rough estimates on the order of 320,000 currently circulating viral species in worldwide mammalian populations alone. While estimates such as this are fraught with uncertainties, the scale of the estimates, coupled with the realization that non-mammalian hosts (such as avian) are also possible, makes it abundantly clear that the total pool from which to have statistically rare events occur is very large. Such emergences of novel pathogens should therefore be expected to occur for the foreseeable future, and are a particular cause for concern as they enter a human host population with no herd immunity. Readers may think, for example, of SARS coronavirus as one such instance.

A second scenario of novel pathogen discovery is the case of orphan diseases, those which occur frequently enough and with distinct enough presentation to be identified as uniform diseases, but with as-yet unknown etiology. Application of discovery techniques may in these cases help to identify statistical correlation to the presence of novel—or, at least, novel in context—pathogens. This case differs from our first scenario in that widespread (and presumably, long-term) human exposure has already occurred, and herd immunity is likely even if the pathogen is previously unidentified.

Consider, then, either the appearance of an infectious disease outbreak of consistent presentation that cannot be attributed to a known pathogen, or an orphan disease that is suspected to have an infectious component. Which molecular diagnostic technique shows the most promise in helping detect the presence and identity of an infectious agent in these cases? The answer is deep sequencing.

Applications of deep sequencing

Deep sequencing can be performed on any of a number of different platforms. Readers may recall from earlier Primer installments that “depth” is a concept applied to next generation sequencing (NGS) applications, and here it refers to the number of replicate reads per input sequence material. While specifics differ among platforms, NGS methods as a whole work by fragmenting an input total nucleic acid sample and then running enormous numbers of individual short sequence reading reactions in parallel. The choice of template material (that is, the nucleic acid being read out) in each individual micro-reaction is essentially stochastic; more highly abundant sequences in the sample yield more templates being read out, and thus more data replicates. Conversely, low abundance template sequences are sampled rarely, and yield much lower numbers of sequence reads in the ensuing data set.
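The stochastic nature of template selection is easy to see in a small simulation. The sketch below is illustrative only: the function name `simulate_reads`, the source labels, and the abundance figures are all assumptions invented for the example, and each read is treated as an independent draw weighted by abundance, which simplifies what happens on any real platform.

```python
import random
from collections import Counter


def simulate_reads(abundances, n_reads, seed=0):
    """Draw n_reads templates at random, weighted by relative abundance.

    `abundances` maps a source label to its share of the input nucleic acid.
    Every read is an independent draw, a simplification of real NGS chemistry.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sources = list(abundances)
    weights = [abundances[s] for s in sources]
    return Counter(rng.choices(sources, weights=weights, k=n_reads))


# Hypothetical sample: one fragment in ten thousand is non-host.
sample = {"host": 0.9999, "unknown_agent": 0.0001}

for n_reads in (1_000, 100_000, 1_000_000):
    counts = simulate_reads(sample, n_reads)
    print(f"{n_reads:>9,} reads: host={counts['host']:,} "
          f"unknown_agent={counts['unknown_agent']:,}")
```

At the smallest read count the rare template will usually be sampled only a handful of times, if at all; it appears reliably only as the total number of reads grows.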

The concept of depth becomes important here because it’s quite possible that an etiologic (or at least, significantly associated) pathogen may occur at quite low abundance in a sample, compared to host nucleic acids. This could be true even if the extraneous agent is at high copy numbers in particular infected cells; we are dealing with an unknown here and may not be selectively capturing the appropriate cells in our sample collection. We would therefore like to run this NGS process exhaustively, using as a benchmark something like the number of times a single copy human genomic marker such as a particular gene is visible in the data. To make the point, in an extreme example, if this host genome marker only shows up a couple of times in the entire NGS data set, then the probability that our data set also successfully captured some uncommon mystery pathogen is vanishingly small. The point of deep sequencing is to go well past even the usual NGS depths (which can be roughly on the order of 30 replicates observed for known single copy targets); deep and ultra-deep sequencing can push this number to 100 or higher. Doing so can greatly improve our chances for detecting low-abundance suspects.
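A little arithmetic makes the case for depth concrete. As a minimal sketch, assume each read is an independent draw and that the unknown agent contributes a fraction f of the sequenceable fragments; the chance of seeing at least one of its reads among N total reads is then 1 - (1 - f)^N. The function names and the one-in-a-million fraction below are assumptions chosen for illustration, not measured values.

```python
import math


def detection_probability(total_reads, pathogen_fraction):
    """Chance that at least one read derives from the low-abundance agent.

    Assumes independent reads and a fixed pathogen share of the fragments.
    """
    if not 0.0 <= pathogen_fraction < 1.0:
        raise ValueError("pathogen_fraction must be in [0, 1)")
    # 1 - (1 - f)^N, computed through logs so tiny fractions stay accurate
    return -math.expm1(total_reads * math.log1p(-pathogen_fraction))


def reads_needed(pathogen_fraction, target_probability=0.95):
    """Smallest read count giving at least the target detection probability."""
    if not (0.0 < pathogen_fraction < 1.0 and 0.0 < target_probability < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(
        math.log1p(-target_probability) / math.log1p(-pathogen_fraction)
    )


# Hypothetical agent at one fragment per million.
for total_reads in (100_000, 1_000_000, 10_000_000):
    p = detection_probability(total_reads, 1e-6)
    print(f"{total_reads:>10,} reads -> P(at least one agent read) = {p:.3f}")
print(f"reads for a 95% chance: {reads_needed(1e-6):,}")
```

Note that a single read is a very low bar. Building any confidence in a finding needs several reads per sample, so the practical read counts are higher than this calculation suggests.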

We proceed, then, by taking a sample of interest (associated with our mystery presentation or orphan disease) and subjecting it to deep or ultra-deep sequencing. The result, after appropriate wet lab and bioinformatic steps, is a huge amount of sequence data. Most of that data is, of course, derived from the host genome, and thus is of no interest to us in this application; we therefore proceed to remove it from consideration. This is done by further bioinformatic methods; essentially, each individual sequence is checked against known human sequences, and if it’s a close match, it’s removed from further consideration. (Why just a close match, and not identical? Individual genomes will vary slightly from any reference genome they are searched against, so we accept some variation as still within bounds of identifying a read as human genomic in origin.) What this process does, in effect, is filter the massive set of data and screen out any sequences with significant similarity to host sequences, leaving only those without it. By extension, we now assume that some extraneous organism (pathogenic, commensal, or sample contaminant in nature) is the source of the remaining non-host nucleic acid.
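The host-subtraction idea, including the tolerance for near matches, can be sketched in a few lines. This is a toy stand-in rather than how production pipelines work (those align reads against a full reference genome with dedicated software). Here a read counts as host-like when at least half of its short subsequences (k-mers) occur in the reference. The function names, the value of k, the 50% threshold, and the sequences are all assumptions invented for the example, and reverse-complement matching is ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def host_like_fraction(read, host_index, k):
    """Fraction of the read's k-mers that also occur in the host reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:  # read shorter than k: nothing to compare
        return 0.0
    return len(read_kmers & host_index) / len(read_kmers)


def remove_host_reads(reads, host_reference, k=6, min_shared=0.5):
    """Keep only the reads that do NOT closely match the host reference."""
    host_index = kmers(host_reference, k)
    return [r for r in reads
            if host_like_fraction(r, host_index, k) < min_shared]


# Toy data invented for illustration (not real genomic sequence).
host_reference = "ACGTACGGTCATGCTAGCTAGGATCCGATCGTTAGCAAGT"
exact_host_read = "CGGTCATGCTAGCTAGGATC"    # identical to part of the reference
variant_host_read = "CGGTCATGCTCGCTAGGATC"  # same region with one base changed
non_host_read = "TTTTTGGGGGCCCCCAAAAA"      # shares no 6-mers with the reference

remaining = remove_host_reads(
    [exact_host_read, variant_host_read, non_host_read], host_reference
)
print(remaining)  # only the non-host read is left
```

The single-base variant is still discarded as host, because most of its k-mers match the reference. This mirrors the "close match, not identical" rule described above.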

Of course, we’ve expressed above that these non-host sequences might be fairly rare, so that even in our deeply sequenced sample, they show up only a few times. How do we know this is significant? The answer is, we really don’t until we have more numbers. Ideally, you’d like to sample many unrelated patient samples with the same mystery condition or orphan disease, and find the same (or very similar) non-host sequence appearing in a large proportion of the samples. At this point, we at least have evidence for association between the non-host sequence and the condition.
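Tallying recurrence across patients is simple bookkeeping once each sample's non-host sequences have been reduced to comparable labels. The sketch below assumes that step has already been done, so that identical labels mean "the same or very similar sequence." The function names, sample names, labels, and the 50% cutoff are hypothetical.

```python
from collections import Counter


def recurrence_across_samples(non_host_by_sample):
    """Map each non-host sequence label to the fraction of samples containing it."""
    n_samples = len(non_host_by_sample)
    if n_samples == 0:
        return {}
    counts = Counter()
    for labels in non_host_by_sample.values():
        counts.update(set(labels))  # count each label once per sample
    return {label: c / n_samples for label, c in counts.items()}


def recurrent_candidates(non_host_by_sample, min_fraction=0.5):
    """Labels seen in at least min_fraction of samples, most frequent first."""
    fractions = recurrence_across_samples(non_host_by_sample)
    hits = [label for label, f in fractions.items() if f >= min_fraction]
    return sorted(hits, key=lambda label: (-fractions[label], label))


# Hypothetical results from four unrelated patients with the same condition.
non_host_by_sample = {
    "patient_1": ["seq_A", "seq_B"],
    "patient_2": ["seq_A"],
    "patient_3": ["seq_A", "seq_C", "seq_C"],
    "patient_4": ["seq_B"],
}
print(recurrence_across_samples(non_host_by_sample))
print(recurrent_candidates(non_host_by_sample))
```

Counting each label once per sample matters here: a sequence read many times in one patient is weaker evidence than the same sequence read a few times in many patients.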

We can then consider what the non-host sequence derives from. This is also most easily done by simple bioinformatic searches against known organisms. While we might have initially been excited to find that a particular non-host sequence was repeated in a large number of our samples, if a search shows that it arises from S. epidermidis (a common skin commensal) and our original samples were all venipuncture peripheral blood, we might be less excited, as the sequence could readily be a contaminant. The data gets more exciting when a truly novel sequence element is found in association with a particular presentation, or, at least, one that is novel in that context (such as deriving from a known pathogen, but one not readily explained as a likely contaminant, and not otherwise associated with the condition). At this point, we may be on to something in our search.

The payoff—possibly

Suitably deep sequencing is quite expensive, both in direct costs and in time and labor; however, the good news is that if we have done this on a set of samples and gotten an interesting lead, we can now switch from a search for the unknown to a much cheaper, faster, directed search for the known. In this case, “known” may only refer to a segment of non-host nucleic acid, but that’s enough to design target-specific polymerase chain reaction (PCR) primers against. We can now go out and cheaply screen very large sample sets explicitly for this target: does it occur in association with our condition, to acceptable statistical rigor?
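No particular statistical test is specified here. As one familiar option for a two-by-two table of PCR results (cases versus controls, target detected versus not), the sketch below computes a one-sided Fisher's exact test from first principles. The function name and all counts are hypothetical, and a real study would also need to consider sample selection, multiple testing, and assay performance.

```python
from math import comb


def fisher_one_sided(case_pos, case_neg, ctrl_pos, ctrl_neg):
    """One-sided Fisher's exact p-value for enrichment of positives in cases.

    Sums the hypergeometric probabilities of seeing `case_pos` or more
    positive cases, given the observed row and column totals.
    """
    n_cases = case_pos + case_neg
    n_pos = case_pos + ctrl_pos
    n_neg = case_neg + ctrl_neg
    all_tables = comb(n_pos + n_neg, n_cases)
    most_extreme = min(n_cases, n_pos)
    tail = sum(
        comb(n_pos, x) * comb(n_neg, n_cases - x)
        for x in range(case_pos, most_extreme + 1)
    )
    return tail / all_tables


# Hypothetical PCR screen: target found in 18 of 20 cases and 2 of 40 controls.
p_value = fisher_one_sided(case_pos=18, case_neg=2, ctrl_pos=2, ctrl_neg=38)
print(f"one-sided p = {p_value:.3g}")
```

A small p-value here supports association only; it says nothing yet about whether the sequence's source causes the condition.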

We can also apply bioinformatic analysis to the interesting non-host sequence. Even if it’s a truly novel organism, chances are it shares detectable genetic similarity with known organisms. This sort of analysis can quickly help place our unknown in a family context. If this sort of data is available, it can help add circumstantial evidence to the significance of the finding; for example, if our mystery disease was acute respiratory in nature, and we recover a sequence that shows a “family resemblance” to known coronaviruses, the likelihood that we’re on to something meaningful improves.

From this stage onward, more classical methods of determining causal linkage need to be applied. While additional purely associative data is helpful, fulfillment of Koch’s postulates (the four criteria designed to establish a causative relationship between a microbe and a disease) either in humans or in an animal model would be the preferred method to bring our voyage of pathogen discovery to a close. Hopefully, it will have led to a better understanding of the causes and thus prevention and/or treatment for our mystery or (ex) orphan disease.

Finally, note that our discussion has used the generic term “nucleic acids” throughout. Pathogen discovery through deep sequencing is amenable to either DNA or RNA targets, as long as suitable sample handling and pre-processing steps (such as random cDNA generation) are applied. The method is potentially very powerful, and we should expect to see it applied with increasing frequency in elucidating infectious disease contributions to both novel emerging diseases and orphan diseases.


  1. Anthony SJ, Epstein JH, Murray KA, et al. A strategy to estimate unknown viral diversity in mammals. mBio. 2013;4(5):e00598-13.



John Brunstein, PhD, is a member of the MLO Editorial Advisory Board. He serves as President and Chief Science Officer for British Columbia-based PathoID, Inc., which provides consulting for development and validation of molecular assays.