Metagenomic surveillance for emerging diseases: An idea whose time has come?

Dec. 21, 2022

At their root, human beings seem an eternally optimistic lot. While historians, epidemiologists, and to some extent the rest of us are at least vaguely aware of such things as the Black Death (Europe and North Africa, 1346–1353, with sporadic regional recurrences up to 1665/66 in London); the Spanish Flu (worldwide, 1918–1920) and subsequent Influenza A pandemics (1957, 1968, 2009); or SARS (2003), we’d like to think that vast strides in medical knowledge and practice have reduced the risk of emerging infectious diseases to ourselves as individuals and to our civilization as a whole. While that’s undoubtedly true to some extent, the COVID-19 pandemic has served as a forceful reminder of how rapidly spreading and broadly disruptive an emerging pathogen can be, even with the tools at our disposal. (Readers with a penchant for history are encouraged to find a copy of Daniel Defoe’s “A Journal of the Plague Year”; its observations of human behaviors and infection control measures, and their varied success, during the 1665/66 London plague read shockingly like the modern day and reinforce how little we have actually progressed in our reactions to these events.) Optimists or not, we must as a species face the reality that novel emerging and zoonotic pathogens, and perhaps ones of lab-derived origin, constitute a major threat to humanity, and one which will be visited upon us with sporadic yet unceasing regularity.

Novel avian-via-swine-to-human reassortments of Influenza A have been recognized as a major potential source of such pathogens, and since 1952 they have been the target of a worldwide surveillance network. Now known as the Global Influenza Surveillance and Response System (GISRS) and consisting of more than 140 National Influenza Centers plus other facilities, this loose organization works to collect, sequence, and analyze circulating Influenza A strains to provide (among other things) an early warning of the appearance of novel, potentially pandemic strains.1 Such information is critical if we wish to be any better off at all than Defoe’s narrator of 1665; so the question arises, “What if there were a way to do this with the capability to detect any new pathogen, not just one likely suspect?”

Such a capability is not only possible, it is technically achievable at relatively low cost and complexity with tools in hand at this very moment. At its front end, such an approach could rely on widely distributed, sample-anonymized metagenomics. This is the molecular biology technique of taking a total nucleic acid sample (in this instance, from a person); using next generation sequencing methods to randomly sample and sequence elements of all DNA and RNA present; then bioinformatically removing all of the expected host sequences and sieving through the leftover non-host material for whatever is present. The technique is agnostic as to relatedness to other known pathogens and can provide population-wide prevalence for any definable known or newly identified target. Below, let’s consider some of the stages and processes involved, along with the challenges to implementation.
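The host-subtraction step at the heart of this technique can be sketched in a few lines. The version below is a toy illustration only: it classifies reads by their k-mer overlap with a host reference, whereas production pipelines use dedicated tools (alignment against a reference genome or k-mer classifiers), and the k size and threshold here are illustrative assumptions, not recommended parameters.

```python
def kmers(seq, k=4):
    """Return the set of overlapping k-mers in a read (k=4 only for this toy example)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def subtract_host(reads, host_kmers, threshold=0.5):
    """Keep reads that share fewer than `threshold` of their k-mers with the host set."""
    kept = []
    for read in reads:
        ks = kmers(read)
        overlap = len(ks & host_kmers) / len(ks)
        if overlap < threshold:
            kept.append(read)
    return kept

# Toy data: one host-like read (filtered out), one "unknown" read (retained for analysis).
host_kmers = kmers("ACGTACGTACGTACGT")
reads = ["ACGTACGTACGT", "TTGGCCAATTGG"]
print(subtract_host(reads, host_kmers))  # → ['TTGGCCAATTGG']
```

Whatever survives this sieve is the non-host material that the rest of the pipeline searches for known or novel pathogen signatures.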

Sample collection

Respiratory pathogens, due to their mode and ease of transmission, are the first priority here; so any such system should be integrated into collection from respiratory disease testing streams distributed around the world. Rather than reinvent the wheel, integration with the existing Influenza A network, but with a focus on samples testing negative for Influenza A, would seem to be an obvious starting point to build from as needed.

Challenges in this step include:

  • Ensuring sample anonymization, bearing in mind that the unfiltered raw metagenomic data will include information traceable to source identity — so strong biosecurity measures to ensure removal of human sequences prior to data availability for analysis are essential.
  • Ensuring enough geographically diverse sampling sites and unfettered access to a random sample of specimens from all sites. Put bluntly, political interference in the form of hiding or diverting anything from a broad and random sample stream will destroy the integrity of this approach. Among all the challenges considered in this article, this constitutes probably the biggest and hardest to address; only through an appreciation that collection of such information is, in the long run, in the best interests of all peoples and nations can this be successful.

Technical process of raw data collection

DNA and RNA extractions, followed by library preparation as needed and next generation sequencing (NGS), can be done on a number of platforms.

Challenges in this step include:

  • NGS devices, availability, and cost. At first blush, low-pass sequencing with relatively low per-base accuracy is likely “good enough” for this application; combining this lowest-cost approach with sample pooling (which, incidentally, would also assist in ensuring sample anonymity) keeps per-sample costs down. The exact mechanism and system for raw data collection is unlikely to be critical, meaning any designated collection center could use available NGS infrastructure; where such is not available, the lowest-cost systems, such as those currently embodied in nanopore-based technologies, present a low barrier to entry.
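The pooling trade-off is easy to quantify in rough terms. The sketch below assumes, purely for illustration, a fixed per-run read yield split evenly across a pool, and that roughly 99% of reads from a respiratory specimen are host-derived; neither figure is a platform specification.

```python
def reads_per_sample(run_reads, pool_size, nonhost_fraction=0.01):
    """Expected non-host reads per sample in an evenly pooled sequencing run.
    Assumes ~99% of raw reads are host-derived (illustrative figure)."""
    return run_reads / pool_size * nonhost_fraction

# e.g. a hypothetical 20-million-read run split across pools of different sizes
for pool in (8, 24, 96):
    print(pool, reads_per_sample(20_000_000, pool))
```

Deeper pooling lowers cost per sample but thins the non-host signal available from each specimen, which is why the viable pooling level depends on the depth of coverage the downstream analysis actually needs.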

Bioinformatic, computational, and databasing steps

The process suggested here is clearly computationally intense, and requisite hardware and expertise are likely out of scope for many smaller data collection sites. In addition, maintaining a single uniform process and data flow across multiple sites is desirable.

Challenges (and likely solutions) for this part include:

  • Computational capacity. This would seem readily addressed by dynamic, scalable, off-site cloud computing capacity. This was demonstrated effective for both cost and purpose in analogous workflows more than a decade ago.2 An additional benefit of a cloud computing–based approach would be enforced bioinformatics pipeline uniformity between data source sites.
  • Ability to accept input from multiple systems, including both short-read and long-read NGS technologies, into a single cohesive data type. Strategies such as automated assembly of short reads into longer contiguous sequences are already normal practice, and this point would not appear to be a significant hurdle with existing bioinformatics pipelines.
  • Searching through masses of data for meaningful sequence signatures, such as a new sequence variant of a known pathogenic organism and/or a spike in general population prevalence of a family of related sequence variants. This sounds like a perfect application for AI (artificial intelligence) routines: searching for and assessing sequence novelty, similarity to known pathogenic organisms, and abnormal statistical variations in prevalence, all potential early warning signs of a wider and more dangerous emergence.
  • Redundancy, data integrity, and accessibility of data: technically, this could be addressed through multiple mirrored database sites with appropriate access controls. The challenges here are less technical than again, political in nature.
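Even before any AI routine is brought to bear, the prevalence-spike half of this search can be served by a simple statistical baseline. The sketch below, a stand-in rather than a proposed production method, flags a sequence signature whose latest weekly count deviates sharply from its own history; a real system would additionally model seasonality and week-to-week sampling effort.

```python
from statistics import mean, stdev

def flag_spike(weekly_counts, z_cutoff=3.0):
    """Flag the most recent week if its count exceeds the historical mean
    by more than z_cutoff standard deviations (simple z-score baseline)."""
    history, latest = weekly_counts[:-1], weekly_counts[-1]
    mu, sigma = mean(history), stdev(history)
    z = (latest - mu) / sigma if sigma > 0 else float("inf")
    return z > z_cutoff, z

# Weekly counts of one hypothetical sequence signature; the final week spikes.
counts = [4, 6, 5, 7, 5, 6, 4, 21]
flagged, z = flag_spike(counts)
print(flagged)  # → True
```

Anything a crude baseline like this would catch, a trained model should catch earlier and with fewer false alarms; the point is that the flagging logic itself is not the hard part of the system.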


Technical solutions to medical problems are great, but what does it cost to do something like this? A number of complex factors, including the depth of coverage needed for the process to be meaningful, would require thorough analysis to inform things such as the level of pooling possible; but we can at least make some rough estimates. Presently, Illumina targeted 16S-based metagenomics, suitable for bacteria only, is estimated at $18/sample,3 while another source suggests that whole genome shotgun sequencing (WGS), a nontargeted, wide-spectrum metagenomic approach, can be done commercially at $150/sample.4

Taking these as lower and upper bounds and knowing that GISRS processes ~1 million samples per year, the cost of this program would be somewhere between $18 million and $150 million per year. Considering that the upper bound is for a commercial service (including markup), and that some degree of sample pooling is likely viable, a true value towards the middle of this range or even a bit lower seems most plausible. A midpoint of $84 million per year, put in perspective against other medical and research expenditures, for a worldwide target-agnostic early warning system for new, emerging, or known but suddenly expanding pathogens sounds like a bargain. In fact, even a ten-fold increase on this (posited below with regard to detection scale) sounds like a very defensible public health cost.
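The arithmetic behind these bounds is simple enough to set down explicitly, using only the per-sample figures and sample volume cited above:

```python
samples_per_year = 1_000_000   # approximate current GISRS volume
low, high = 18, 150            # $/sample: targeted 16S vs. commercial shotgun WGS

annual_low = samples_per_year * low      # $18 million
annual_high = samples_per_year * high    # $150 million
midpoint = (annual_low + annual_high) / 2

print(f"${annual_low/1e6:.0f}M–${annual_high/1e6:.0f}M, midpoint ${midpoint/1e6:.0f}M")
```

Any realized cost would of course shift with pooling level, negotiated pricing, and sequencing depth, but the order of magnitude is what matters for the argument.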

Scale of detection and actionable responses

At an estimated 1 billion human Influenza A cases worldwide per year, the 1 million GISRS samples represent 0.1% of total cases, so something on the order of 1,000 cases presenting similarly to Influenza A would be needed before one would likely be inducted into this process. This could, however, be improved upon by including a nonrandom sampling component; that is, preferential induction of samples from clusters, and from cases where a causal agent is not readily identified, could be employed to bias these odds more favorably. A ten-fold increase in total sampling numbers would obviously improve on these numbers as well and seems within a reasonable cost scale.
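The detection odds above can be made concrete with a simple binomial model. This is a deliberate simplification, assuming every case is sampled independently and at random, which the nonrandom cluster-biased sampling just described would improve upon.

```python
def p_detect(cases, sampling_fraction=0.001):
    """Probability that at least one of `cases` infections enters the
    sequencing stream, assuming independent random sampling at the
    given fraction (0.1% matches the GISRS estimate above)."""
    return 1 - (1 - sampling_fraction) ** cases

# Detection probability as an outbreak grows, at 0.1% sampling
for n in (100, 1_000, 10_000):
    print(n, round(p_detect(n), 3))
```

At 0.1% sampling an outbreak of 1,000 cases has roughly a 63% chance of contributing at least one sample, which is why "on the order of 1,000 cases" is the natural scale of first detection under purely random induction.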

Any expenditure is only justified if it can lead to some positive actionable outcome. In this system, the appearance of a significantly novel sequence entity, almost certainly with recognizable similarity to known pathogenic organisms at least at the inferred protein sequence level, could be detected and flagged. An immediate increase in sampling density in the geographic region could be undertaken, with aggregated sequence data used to generate a more complete picture of the full pathogen genome. Directed (RT)-PCR testing could be developed and deployed by reference centers, albeit with incomplete validation, yet still useful for rapid bulk screening of populations within weeks. The pathway of proceeding from this through aggregation of sequence, epidemiologic, and clinical presentation data is one well understood and recently practiced by the medical community worldwide for SARS, MERS, and COVID-19, to name just a few recent examples. In other words, the use of widespread metagenomic surveillance would not change the pathway of response on detection of a novel agent; but it could speed it up by precious weeks to months at early stages. Notably, COVID-19 itself was initially identified by metagenomic techniques.5

Expansion beyond respiratory pathogens

While respiratory diseases are the most immediate application of this approach due to inherent transmission risk and a pre-existing surveillance network, other classes of infectious disease can be targeted through metagenomic surveillance as well. For those readers looking for more depth on application in other contexts, a recent review can be found in Nature Microbiology.6

Final thoughts

As a world, we have been using limited surveillance in the context of Influenza A and have reaped the benefits of this in early detection of novel pandemic potential strains and for guidance on vaccine composition. Expansion of this pre-existing network and inclusion of broader range detection technologies in the form of NGS is technically feasible and not cost prohibitive. Major hurdles exist but these are almost exclusively in the political will to allow transparency and unfettered random sample access. Overcoming these will require dialogue and resolve by entities at national and supranational levels. A recognition that diseases, particularly ones of an emergent pandemic nature, observe no national boundaries in today’s highly mobile and connected world should help to convince all potential contributors to such a network that this form of surveillance is in their best interests. While this lesson of COVID-19 is fresh in our minds, it is an ideal time to consider whether adoption of generalized metagenomic surveillance might not be humanity’s most immediately cost-effective defense against our next scourge — whatever it may be.


  1. Ziegler T, Mamahit A, Cox NJ. 65 years of influenza surveillance by a World Health Organization-coordinated global network. Influenza Other Respi Viruses. 2018;12(5):558-565. doi:10.1111/irv.12570.
  2. Angiuoli SV, White JR, Matalka M, White O, Fricke WF. Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing. PLoS One. 2011;6(10):e26624. doi:10.1371/journal.pone.0026624.
  3. Cost of NGS. Accessed November 18, 2022.
  4. Robertson R. 16S rRNA gene sequencing vs. Shotgun metagenomic sequencing. Published July 20, 2020. Accessed November 18, 2022.
  5. Zhu N, Zhang D, Wang W, et al. A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med. 2020;382(8):727-733. doi:10.1056/NEJMoa2001017.
  6. Ko KKK, Chng KR, Nagarajan N. Metagenomics-enabled microbial surveillance. Nat Microbiol. 2022;7(4):486-496. doi:10.1038/s41564-022-01089-w.