Next generation sequencing and library types

Oct. 19, 2014

Next generation sequencing (NGS) is slowly but steadily making its way from research labs to the clinical setting, and it will undoubtedly continue to do so as associated costs and turnaround times decrease and well-validated applications increase. With this in mind, we’re going to spend the next three installments of The Primer dealing with aspects of setting up an NGS assay that today’s molecular laboratorian will find useful. If (or, more likely, when) opportunities arise for the introduction of NGS testing in the reader’s laboratory, this should help demystify some of the variations in NGS methods and assist the laboratorian in selecting the approaches best suited to the intended clinical application. Bear in mind that this is a rapidly developing field, so many of the finer details may differ between what’s described here and what will actually be available in 12 or 18 months. The broad concepts we’ll review, however, are likely to remain unchanged and relevant.

This month’s discussion, then, will center on the form and utility of different NGS library types. The reader may have heard a lot about sequencing whole patient genomes and may have the impression that all NGS applications do just that: sequence all six billion or so nucleotides (approximately 3.2 billion per haploid genome, doubled because we’re diploid and carry two copies of all autosomes). While this has been done, it remains a daunting, expensive, and time-consuming task, and the relevance of much of the resulting data is as yet poorly understood. And that is not what we mean by an NGS library in this column’s context.

The concept of “library” here is this: regardless of platform, NGS applications invariably have a library preparation step between the raw input biological sample and the actual sequence-determining chemistry. The aim of library generation is to select out the subset of the input sample’s genetic material that is meaningful in the context of a particular biological question, thereby simplifying and streamlining both the NGS process and the subsequent bioinformatics. This can provide very significant savings in cost and time and is, with current approaches, all but essential for any clinical or near-clinical application.

A library is, however, by nature exclusive of some material, and thus thought must be given to NGS library preparation methods so that the intended biological questions can be focused on and answered most effectively. Note that our discussion here will of necessity be somewhat generalized and platform-agnostic, and it will differ in some specifics from one actual implementation to another. The reader with a particular NGS system in mind may thus need to consider some nuances beyond our discussion here, but this should at least serve as a generally true framework on which to build.

Whole genome sequencing

Let’s start with a brief overview of how a “non-exclusive” whole genome library is prepared, as a basis for comparison with the more focused approaches. A tissue sample of some form is obtained from the subject of interest. Since our goal here is unbiased whole genome sequencing (WGS), the tissue type should not matter much; to a first approximation, all cells should have the same nuclear DNA content. The process then essentially flows as follows:

  • DNA is extracted.
  • The purified DNA is broken by mechanical or enzymatic means into small fragments in a narrow size range (usually 300–500 base pairs). This is done to accommodate the “read length,” or the length of any single piece of DNA that an NGS instrument can accurately sequence. By taking the DNA from many identical cells and breaking it randomly, one assumes there is some overlap between fragments, so that “tiling” can reassemble the short pieces into longer contiguous sequences. (See the October 2013 installment of The Primer for a review of tiling: Brunstein J. High throughput sequencing: next generation methods. MLO. 2013;45(10):36-39.)
  • The fragmented DNA pieces are ligated at either end to short oligonucleotide adapters of known sequence.
  • The user should now, in effect, have a sample consisting of millions of short pieces of DNA, each carrying a random short fragment of the original tissue DNA flanked by ends of known sequence.
  • A limited amount of PCR replication is now done, using primers which match the known end sequences. This amplifies the fragments, ensuring there are many copies of each one; since each cycle roughly doubles the material, even eight cycles yields on the order of a 250-fold amplification. Keeping the cycle number low helps avoid excessive bias (over-representation of fragments which amplify faster than others).
  • Often, an additional size selection step is done here to ensure the material going forward is still in the desired 300–500 base pair range. This can be accomplished by techniques such as agarose gel electrophoresis followed by excision and recovery of DNA from the desired size range. This step helps remove very short DNA molecules created by ligation of the end adapters to each other, which would carry no meaningful (sample-derived) sequence information.
  • At this point, the somewhat amplified, end-tagged, size-selected collection of “random” DNA fragments constitutes the WGS library. The library is dispersed into tiny partitions, such that each partition contains on average just one template DNA molecule, and is sequenced by a platform-specific variation on methods which measure the base-by-base growth of new strands complementary to the template.
  • Computational methods (under the collective heading of “bioinformatics”) then take all of these short reads, tile them together, and output a “whole” genome sequence from the starting material. “Whole” is in quotes because certain sequence elements, particularly highly repetitive regions, are not effectively captured by this approach, so there are usually some incomplete areas relative to a truly complete genome; but it’s close to whole. A “resequencing” approach (discussed below) is also possible as an alternative to pure tiling.
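
For readers who find a concrete illustration helpful, the following deliberately simplified Python sketch shows the core idea of tiling: repeatedly merging reads by their longest suffix-to-prefix overlap. This is a toy, not how production assemblers work; real assemblers use graph-based algorithms and must tolerate sequencing errors, and all sequences and function names here are invented for illustration.

```python
# Toy "tiling" (greedy overlap assembly) sketch. Real assemblers use
# graph-based methods and tolerate errors; this assumes perfect reads.

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)  # candidate anchor point
        if start == -1:
            return 0
        if b.startswith(a[start:]):         # full suffix/prefix match?
            return len(a) - start
        start += 1

def assemble(reads: list[str]) -> str:
    """Greedily merge the pair of reads with the longest overlap."""
    reads = reads[:]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > best_len:
                    best_len, best_i, best_j = overlap(a, b), i, j
        if best_len == 0:                   # no overlaps left to merge
            break
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return max(reads, key=len)

# Three overlapping "reads" tile back into one contiguous sequence.
print(assemble(["GGCTATCGGA", "ATCGGAGCTT", "AGCTTACCAA"]))
# -> GGCTATCGGAGCTTACCAA
```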

The question then is, do we really need to look at the whole genome for a given sample? Recall that only about 1% of the human genome codes for expressed proteins; the other 99% is made up of things such as introns, chromatin structural regions like centromeres and telomeres, non-protein-coding expressed RNAs, and a lot of other sequence material whose utility we don’t yet understand well. (This used to be commonly referred to as “junk DNA,” but the term is falling out of favor as we come to appreciate that it’s likely to have useful or even critical functions not yet fully elucidated.) This 1% “exome” is by far the most informative portion of the genome, with changes in every base pair being immediately analyzable in silico as silent, missense, or nonsense mutations (along with insertions, deletions, translocations, and other defect classes).

The exome approach

All other things being equal, if we could prepare an NGS library directly off the exome, we would save about 99% of the cost and effort compared to WGS while retaining most of the useful information; it’s pretty hard not to find that attractive. How then to go about doing this?

The answer is to start from messenger RNA (mRNA) and perform a type of “RNA sequencing.” Essentially, this would then proceed as follows:

  • Select a tissue sample of interest. Note, however, that here, unlike with WGS, selection of tissue type is very important. Not all genes are expressed in all tissues, so if particular genes or gene isoforms are of interest, you need to ensure you start with a tissue type where they’re expected. 
  • mRNA extraction is done; often this includes an affinity purification step which takes advantage of the fact that mature eukaryotic mRNAs carry a 3’ poly-A tail, while most non-coding RNAs do not.
  • Once mRNA is isolated, it’s converted to DNA in vitro by the use of reverse transcriptase enzymes (virally derived DNA polymerases which can make a DNA strand based on an RNA template). Additional molecular manipulations are then used to degrade the RNA and generate DNA “second strands,” ending up with a set of dsDNA molecules which represent the same sequences (and, usefully, in approximately the same relative abundance) as the mRNA starting material; a toy illustration follows this list.
  • This DNA is then treated as if it were extracted DNA and put through the same steps we covered for WGS above (shearing, adapter ligation, possibly limited amplification, size selection, and then dispersion and actual sequencing).
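
As promised above, here is a simplified in silico picture of the reverse transcription step: the sketch derives a first-strand cDNA as the reverse complement of an invented mRNA sequence, then a second strand that recapitulates the mRNA in DNA form. The mRNA sequence is hypothetical, and the real chemistry (primers, RNase H digestion, polymerase choice) is deliberately ignored.

```python
# Toy in silico "reverse transcription": the first-strand cDNA is the
# reverse complement of the mRNA (with U -> T), and the second strand
# recapitulates the original mRNA sequence in the DNA alphabet.

def first_strand_cDNA(mrna: str) -> str:
    """Reverse-complement the mRNA into first-strand cDNA (DNA alphabet)."""
    dna = mrna.replace("U", "T")
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def second_strand(cdna: str) -> str:
    """Reverse-complement the first strand; this matches the mRNA (U -> T)."""
    return cdna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

mrna = "AUGGCUAGCUUACGAA"          # invented mRNA sequence
fs = first_strand_cDNA(mrna)
print(fs)                          # first-strand cDNA
print(second_strand(fs))           # same sequence as the mRNA, in DNA
assert second_strand(fs) == mrna.replace("U", "T")
```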

While tiling can also be used here, much of exome sequencing is done as “resequencing.” That is, each short sequencer read is aligned back against a human reference sequence, with mismatches flagged as candidate mutations; these are then examined against various databases to assess their significance. It’s worth noting that in some meaningful instances this method can detect, if not a mutation occurring in a non-coding region itself, at least its impact. Imagine, for instance, an intronic mutation which influences splice site selection; because of the semi-quantitative nature of this method, a shift in isoform ratios away from the normal value for the tissue type tested is detectable (as, of course, are more dramatic impacts such as exon skipping).
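The resequencing idea can likewise be shown in miniature. The Python sketch below places each read at its best gapless position along a stand-in reference and reports mismatches as candidate variants; production pipelines use indexed aligners, base quality scores, and depth-aware statistical calling, and all sequences and names here are invented.

```python
# Toy "resequencing" sketch: place each short read at its best gapless
# position on a stand-in reference and report mismatches as candidate
# variants. Everything here is invented for illustration.

REFERENCE = "GGCTATCGGAGCTTACCAAGTTACA"   # stand-in "reference genome"

def best_alignment(read: str, ref: str) -> tuple[int, int]:
    """Return (offset, mismatch_count) of the best gapless placement."""
    best = (0, len(read) + 1)
    for offset in range(len(ref) - len(read) + 1):
        mm = sum(r != g for r, g in zip(read, ref[offset:offset + len(read)]))
        if mm < best[1]:
            best = (offset, mm)
    return best

def call_variants(reads: list[str], ref: str, max_mm: int = 2):
    """Yield (position, ref_base, read_base) for each aligned mismatch."""
    for read in reads:
        offset, mm = best_alignment(read, ref)
        if mm > max_mm:                     # read doesn't align; skip it
            continue
        for i, base in enumerate(read):
            if base != ref[offset + i]:
                yield offset + i, ref[offset + i], base

# The first read carries a single G->T change relative to the reference.
for pos, ref_base, alt in call_variants(["ATCGGATCTT", "TTACCAAGTT"], REFERENCE):
    print(f"position {pos}: {ref_base}>{alt}")
# -> position 10: G>T
```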

Overall, the exome approach is both experimentally and computationally much simpler (and thus faster and cheaper) than WGS but retains much of the useful information. For now, it provides the most cost-effective approach for generating a broad picture of the functional genome in a specimen, and it is used both in many early-adopter clinical settings and in many popular, public “sequence my genome” services.

NGS panels

Last, let us consider cases where there is interest in a large but limited subset of particular genes: not the whole genome, or even the whole exome, but more than just one or two genes. This sort of situation frequently arises in oncology, where characterizing a set of oncogenes across relevant pathways can help stratify cases and select the best therapeutic options. Such panels may cover 30–150 particular target genes, with a desire for high throughput achieved by analyzing multiple different specimens within a single NGS run. Generally referred to as “NGS panels,” this third form of library may, depending on design, start either with genomic DNA extracted from a test sample or with mRNA from a relevant tissue subtype.

In either case, some form of selection for the targets of interest is performed. This can be done by gene-specific PCR, yielding a pool of amplicons (already of the desired length, though in this case with defined endpoints), or by hybridization capture during the initial nucleic acid purification steps, selectively retaining only the mRNAs or genomic DNA coding for the particular genes of interest. The result is a very focused subset of the source genome from which to prepare the library material for dispersion and sequencing, following either of the paths above as appropriate to the sample type. (Note that for a panel built by direct PCR amplification of genomic DNA, the shearing and adapter ligation steps may be dispensed with, as these are effectively accomplished in the PCR step.)
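
To illustrate the amplicon idea, the following toy “in silico PCR” sketch in Python pulls out just the region bracketed by a primer pair from a larger sequence, mimicking how gene-specific PCR focuses a panel on targets with defined endpoints. The primer and template sequences are invented, and real primer design of course weighs melting temperature, specificity, and much else.

```python
# Toy "in silico PCR" sketch of amplicon-based target selection: extract
# the region bracketed by a primer pair from a larger sequence. All
# sequences are invented for illustration.

from typing import Optional

def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def amplicon(template: str, fwd: str, rev: str) -> Optional[str]:
    """Return the product between a forward primer site and the binding
    site of the reverse primer (its reverse complement on this strand)."""
    start = template.find(fwd)
    end = template.find(reverse_complement(rev))
    if start == -1 or end == -1 or end < start:
        return None                          # one or both sites absent
    return template[start:end + len(rev)]

GENOMIC = "TTGGACGTAGCAATCGGAGCTTACCAAGTCCAATT"
# A primer pair bracketing a hypothetical target region of interest:
print(amplicon(GENOMIC, fwd="ACGTAGCA", rev="TTGGACTT"))
# -> ACGTAGCAATCGGAGCTTACCAAGTCCAA
```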

A particularly clever aspect of NGS panels is that it is possible, either in the direct PCR stage for genomic DNA-based panels or in the adapter ligation step for exome-based panels, to use PCR primers or adapters, respectively, which contain an internal sequence element (commonly referred to as a “barcode”) that is distinct for each sample prepared. Multiple panel libraries from different samples can then be pooled prior to the dispersion and actual sequencing steps: each individual sequence read will begin with a sample-unique barcode, allowing it to be associated back to its sample of origin, as sketched below. Pooling many unrelated sample libraries into one dispersion and sequencing run takes full advantage of the massively parallel nature of NGS technology, allowing high throughput with respect to the number of samples per run and making panels highly cost-effective and relatively low in labor input on a per-sample basis.
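
Demultiplexing by barcode is simple enough to show in full. The minimal Python sketch below bins pooled reads by their leading barcode and trims it off; real demultiplexers typically also tolerate a sequencing error or two within the barcode. The barcode sequences and sample names are invented.

```python
# Minimal demultiplexing sketch: assign pooled reads back to their source
# samples by the barcode at the start of each read, then trim it off.
# Exact matching keeps the idea clear; all sequences are invented.

from collections import defaultdict

# Hypothetical per-sample barcodes assigned at library preparation.
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B", "GATC": "sample_C"}
BARCODE_LEN = 4

def demultiplex(reads: list[str]) -> dict[str, list[str]]:
    """Bin reads by leading barcode; unknown barcodes go to 'undetermined'."""
    bins: dict[str, list[str]] = defaultdict(list)
    for read in reads:
        sample = BARCODES.get(read[:BARCODE_LEN], "undetermined")
        bins[sample].append(read[BARCODE_LEN:])  # strip barcode before analysis
    return bins

pooled = ["ACGTGGCTATCGGA", "TGCAATCGGAGCTT", "ACGTAGCTTACCAA", "NNNNGGCTATCGGA"]
for sample, sample_reads in demultiplex(pooled).items():
    print(sample, sample_reads)
```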

WGS, exome sequencing (“RNA-seq”), and NGS panels thus represent three forms of NGS that differ primarily in how the library material is prepared. Each has particular strengths and weaknesses, and matching the method to the research or clinical question an assay is meant to address helps make the results cost-effective and most directly meaningful.

John Brunstein, PhD, a member of the MLO Editorial Advisory Board, is President and CSO of British Columbia-based PathoID, Inc.