Next generation sequencing and the search for emerging or unrecognized pathogens

Dec. 14, 2014

“The single biggest threat to man’s continued dominance on this planet is the virus.” So said the late Nobel Prize-winning molecular biologist Joshua Lederberg, and the remark seems particularly significant at the time of this writing, only days after the first recorded case of Ebola virus transmission within North America. While Ebola specifically would seem highly unlikely to achieve significant penetration in countries with good medical and infection control systems, the truth remains that readily transmissible infectious agents, bacterial or viral in nature, have the potential to cause serious harm to human health. This is particularly true of novel or emerging pathogens, ones for which no effective treatments have been established and, worse, for which no herd immunity exists. That such novel agents will continue to be encountered seems highly likely, given factors such as increasing human population densities, large-scale encroachment by humans into prime geographic areas for arboviral vectors, and the ease and speed of world travel. The net impacts and risks of all this have been dealt with by numerous authors and thus shall not be the direct focus of this month’s column. Rather, we’ll continue our examination of next generation sequencing (NGS) clinical applications, with a view to how NGS can be applied to pathogen discovery, both in the case of novel emerging pathogens and in uncovering correlations, possibly causal ones, between known pathogens and diseases of unknown etiology.

Idiopathic diseases and close relatives

The most obvious similarity between these two situations arises because both give rise to idiopathic diseases—that is, ones without a known cause. A search for “idiopathic disease” will turn up long lists of conditions grouped symptomatically as possibly being from a single unidentified cause; infectious agents are one possible suspect in the etiology, although perhaps not as immediate a guess as in, for example, the case of an explorer returning from some remote insect-infested jungle with symptoms of a hemorrhagic fever. In either case, the first steps towards specific (as opposed to purely symptomatic) treatment lie in the identification of the causative agent. 

For cases in which the condition has obvious similarities to a known infectious disease, low-specificity molecular techniques aimed at detecting “close relatives” of known agents based on genetic similarities have proven successful. These methods have included approaches such as degenerate PCR (basically, broad consensus primers to a pathogen family, run under conditions which allow amplification despite partial primer mismatches) or large-scale array methods in which the array probes cover significant “sequence space” around known sequences. By their nature, however, these approaches can only help in the identification of pathogens with significant genetic similarity to known ones. How then can one go about determining, in an unbiased manner, whether a truly novel pathogen is associated with some presentation?
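To make the degenerate primer idea concrete, here is a minimal Python sketch of how a consensus primer written with standard IUPAC ambiguity codes can "anneal" to a template while tolerating a set number of mismatches. The primer, the template, and the function names are invented for illustration, and a simple mismatch count is only a crude stand-in for real annealing behavior under relaxed PCR conditions.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def mismatches(primer, site):
    """Count positions where the site base is not allowed by the primer code."""
    return sum(base not in IUPAC[code] for code, base in zip(primer, site))

def find_binding_sites(primer, template, max_mismatches=1):
    """Return (position, mismatch count) for each window the primer tolerates."""
    hits = []
    for i in range(len(template) - len(primer) + 1):
        n = mismatches(primer, template[i:i + len(primer)])
        if n <= max_mismatches:
            hits.append((i, n))
    return hits

# Hypothetical consensus primer and template, for illustration only.
primer = "ACRTYGN"
template = "TTACGTCGATTACATTGCAA"
sites = find_binding_sites(primer, template, max_mismatches=1)
print(sites)
```

Raising `max_mismatches` plays the role of loosening the reaction conditions: more distant relatives are picked up, at the cost of more spurious sites. A sequence with no resemblance to the primer family is never found, which is exactly the limitation described above.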

One powerful answer to this question lies in current NGS techniques, when coupled with effective ways to identify and access suitably sized numbers of samples well characterized as positive or negative for the underlying condition in question. Ideally, the condition should have a unique enough presentation to not appreciably contaminate the presentation-based “positive samples” set with similar-appearing diseases from other causes. At the same time, the pathogen must have a low rate of generating uncharacteristic or asymptomatic infections, which would otherwise potentially give rise to its appearance in the presentation-based “negative” or control samples. That an unknown pathogen will meet both of these requirements is far from assured; if it does, however, the process is relatively straightforward.

Positive and negative samples

The “presentation positive” and “presentation negative” or control sets of samples are each individually subjected to some form of NGS. Either an RNA-based expression library or a DNA-based library may be employed, although RNA-based approaches would seem more promising a priori. (RNA viruses may not ever exist in a DNA state, and since our downstream bioinformatics will be used to sift out and discard host genetic sequences, occupying many of our sequence reads with noncoding human DNA would seem inefficient.) One important decision which must be made at this juncture is what sample type or specimen material to use in preparing the libraries; for lack of better knowledge, specimen types thought to represent the tissue(s) or organ(s) impacted by the presentation are likely to be selected as the most promising material to examine. A fairly heavy depth of sequencing is required, as pathogen sequences may be a rather small fraction of the total input nucleic acids analyzed. Once sequence reads are obtained and tiled, they are bioinformatically screened against a reference human genome, and the obviously human sequences are removed from consideration. What’s left, in theory, is a collection of non-host (non-human) nucleic acid sequences which were present in our sample.
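The host subtraction step can be sketched in a few lines of Python. This is a toy illustration only: it assumes a short made-up "host" string in place of a reference human genome and uses shared k-mers as a stand-in for a real read aligner. The function names, the k-mer size, and the threshold are all hypothetical choices.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def host_fraction(read, host_index, k):
    """Fraction of the read's k-mers that also occur in the host reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0  # read shorter than k: nothing to compare
    return len(read_kmers & host_index) / len(read_kmers)

def subtract_host(reads, host_reference, k=8, threshold=0.5):
    """Keep only reads whose host k-mer fraction is below the threshold."""
    host_index = kmers(host_reference, k)
    return [r for r in reads if host_fraction(r, host_index, k) < threshold]

# Made-up sequences for illustration; a real screen would use a human
# reference genome and millions of reads.
host_reference = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
reads = [
    host_reference[5:25],          # read derived from the host
    "TTTTCACACAGGAAACAGCTATGACC",  # read with no host k-mers
]
non_host = subtract_host(reads, host_reference)
print(non_host)
```

The threshold sets how aggressively "obviously human" reads are discarded. Set it too loosely and host sequence swamps the remainder; set it too strictly and a pathogen with some similarity to host sequence could be thrown out along with it.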

Further analysis is now a question of statistical inference and correlative evidence. In an ideal scenario, such an experiment would uncover novel genetic sequences recognizable through distant homologies as related to a known bacterial or viral species, present in a high proportion of the “presentation positive” samples and few or none of the “presentation negative” samples. Such evidence, while exciting, is a long way from being anything other than correlative, however. An immediate next step would possibly be to develop a targeted PCR (or RT-PCR, as the genetic material dictates) assay for the novel sequence(s), allowing for rapid and cheap screening of a much larger and thus more statistically relevant set of paired “positive” and “negative/control population” pools. If the results from this larger sample study still bear out the observed statistically relevant correlation between presentation positivity and novel pathogen presence, then the case for suspicion of a causal link is improved. 
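As a rough sketch of the statistical inference involved, the following Python computes a one-sided Fisher's exact test on how often a candidate sequence turns up in each sample set. Fisher's test is one reasonable choice for small counts, not a prescribed method, and the sample counts below are invented for illustration.

```python
from math import comb

def fisher_one_sided(pos_with, pos_total, neg_with, neg_total):
    """One-sided Fisher's exact p-value for enrichment in the positive set.

    Gives the probability, if the sequence were unrelated to presentation,
    of finding it in at least `pos_with` of the presentation-positive samples.
    """
    n = pos_total + neg_total   # all samples
    k = pos_with + neg_with     # all samples carrying the sequence
    upper = min(k, pos_total)
    tail = sum(
        comb(k, x) * comb(n - k, pos_total - x)
        for x in range(pos_with, upper + 1)
    )
    return tail / comb(n, pos_total)

# Hypothetical pilot study: the sequence is found in 9 of 10
# "presentation positive" samples and 1 of 10 controls.
p_value = fisher_one_sided(pos_with=9, pos_total=10, neg_with=1, neg_total=10)
print(f"one-sided p = {p_value:.2e}")
```

A small p-value from a pilot set of this size says only that the association is unlikely to be chance. It says nothing about the direction of cause, which is why the larger targeted follow-up study matters.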

Causation and correlation

Such data is a far cry from fulfilling Koch’s postulates, which would require, among other things, demonstrating induction of the presentation following intentional controlled infection with the agent. But as it is augmented with an understanding of the biology of the pathogen, our determination of whether the correlation is a causal one (the pathogen causes the presentation) or an opportunistic one (the condition is still idiopathic, and creates a selective environment for the organism) can become more solidly based. A classic example of this remains the association of human papillomaviruses (HPV) with cervical cancers; while, strictly speaking, the data for this remains correlative, the very near 100% concordance of viral presence in cancer cases, and our knowledge that the viruses are transforming in cell culture, work together to give us a very high degree of confidence that the association is a causal one. In this particular example, future data on cervical cancer rates in HPV-vaccinated versus unvaccinated populations will provide more evidence as to the nature of this association.

The importance of understanding that the data is correlative is highlighted by a recent paper (1) whose authors report on the origin of a novel parvovirus-like organism, “parvovirus-like hybrid virus” (PHV), detected through NGS screening of clinical samples. Detection of what was essentially a single PHV sequence across a wide range of clinical specimens was one observation which raised a warning flag. Further careful study demonstrated the sequences to be real and not a laboratory contaminant per se, with the source eventually traced to residual nucleic acids in the silica matrix of commercial spin columns being used to purify nucleic acids from the clinical samples. This silica in turn is derived from marine diatoms, and metagenomic analysis of the PHV sequence demonstrated that it is detectable in marine water samples along the coast of North America, in the end providing an intellectually satisfying closure to how and where this viral sequence came to be represented in the NGS libraries examined.
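The warning sign described here, essentially one sequence turning up across unrelated specimen sets, lends itself to a simple automated check. The Python sketch below is a heuristic of our own devising, not the method used in the cited paper. The cohort names, sequences, and cut-offs are hypothetical, and the sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of matching positions between two aligned sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def looks_like_reagent_contaminant(hits, min_cohorts=3, min_identity=0.99):
    """Flag a candidate seen, nearly unchanged, in many unrelated cohorts.

    `hits` maps a cohort name to the candidate sequence recovered from it.
    """
    if len(hits) < min_cohorts:
        return False
    return all(
        identity(a, b) >= min_identity
        for a, b in combinations(hits.values(), 2)
    )

# Hypothetical example: the same 20-base fragment recovered from three
# unrelated specimen collections.
hits = {
    "hepatitis_sera": "ATGGCGTACCTGAAGTTCGA",
    "respiratory_swabs": "ATGGCGTACCTGAAGTTCGA",
    "csf_encephalitis": "ATGGCGTACCTGAAGTTCGA",
}
suspicious = looks_like_reagent_contaminant(hits)
print(suspicious)
```

A flag from a check like this is a prompt for the kind of careful follow-up described above, not proof of contamination. A genuinely widespread agent could in principle trip it as well.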

Despite the challenges (possible admixture of underlying causes in “positive” sample pools, possible asymptomatic cases in the “negative” pools, and data that show a correlative but not strictly causal link), the methods outlined here have been applied with multiple successes in recent years. Examples in which NGS strategies have identified likely causal agents in idiopathic conditions (not all in humans) include Heartland Bunyavirus, titi monkey adenovirus, Bas-Congo virus, Theiler’s disease-associated virus, Lujo arenavirus, and many others (2). As prices for NGS drop and instrumentation becomes available to more clinical researchers, this will become an increasingly attractive approach for at least pilot studies on many idiopathic conditions. Effective strategies for selecting positive and negative pools, together with detailed, systematic follow-up on initial evidence of correlation, will be the critical factors in using this tool to begin identifying causal agents in some of the long list of idiopathic illnesses, a necessary first step toward more than symptomatic treatment.

John Brunstein, PhD, a member of the MLO Editorial Advisory Board, is President and CSO of British Columbia-based PathoID, Inc.


  1. Naccache SN, Greninger AL, Lee D, et al. The perils of pathogen discovery: origin of a novel parvovirus-like hybrid genome traced to nucleic acid extraction spin columns. J Virol. 2013;87(22):11966-11977.
  2. Chiu CY. Viral pathogen discovery. Curr Opin Microbiol. 2013;16(4):468-478.